Custom Evaluations

FastRouter’s Custom Evaluations lets you benchmark and compare AI models on your own data, using LLM-based judges to automatically score accuracy alongside latency and cost metrics.

Introduction

FastRouter's Custom Evaluations feature allows you to assess and compare the performance of AI models on your datasets. By importing chat completion logs or datasets, generating model outputs (runs), and defining evaluation criteria with LLM-based judges, you can quantitatively measure aspects like accuracy, relevance, latency, and cost. This is ideal for benchmarking models, optimizing prompts, and ensuring high-quality responses in production.

Evaluations are managed through the FastRouter dashboard, where you can create, run, and analyze evaluations asynchronously. Results include detailed metrics and scores, helping you make data-driven decisions.


Key Benefits

  • Automated judging using LLM evaluators for scalable assessments.

  • Support for custom criteria tailored to your use case (e.g., factual accuracy, creativity, conciseness).

  • Integration with your API keys for secure, cost-effective processing.

  • Visual dashboards for easy comparison of runs and metrics.


Creating a New Evaluation

To start, navigate to the "Evaluations" section in your FastRouter dashboard and click "New Evaluation."

Custom Evaluations: Create Evaluation
  1. Name Your Evaluation: Provide a descriptive name (e.g., "Math Query Benchmark").

  2. Import Test Data: Upload or import chat completion logs or datasets (an illustrative row format is sketched after this step). You can:

    • Select a project and model (e.g., "Anthropic Claude 4.5").

    • Filter by date range.

    • Choose input/output text to filter rows.

    • Set a sample size (e.g., 10%) to evaluate a subset of your data for efficiency.

Custom Evaluations: Import Test Data
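
For reference, each imported row becomes an "item" that your judge prompts can later reference via {{item.<column_name>}} variables. The sketch below is purely illustrative: the column names ("input", "expected_output", "category") and the 10% sampling arithmetic are hypothetical examples, not a required schema.

  # Hypothetical rows from an imported chat-completion dataset. The column
  # names ("input", "expected_output", "category") are illustrative; whatever
  # columns your data actually has become {{item.<column_name>}} variables in
  # judge prompts later on.
  sample_rows = [
      {
          "input": "What is the derivative of x^2?",
          "expected_output": "2x",
          "category": "math",
      },
      {
          "input": "Summarize the causes of World War I in two sentences.",
          "expected_output": None,  # open-ended; judged on quality, not exact match
          "category": "history",
      },
  ]

  # Sampling a fraction (e.g., 10%) of a large dataset keeps runs fast and cheap.
  sample_size = max(1, int(len(sample_rows) * 0.10))
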
  3. Add Runs: Select the models that will generate outputs (e.g., "anthropic/claude-4.5"). You can add multiple runs for side-by-side comparison.

Custom Evaluations: Add Run
  4. Add Test Criteria: Create one or more evaluation criteria using an LLM judge (an example judge configuration is sketched after this step).

    • Click "Add Test Criteria" and choose a type (e.g., Model Scorer for quantitative scoring).

    • Configure the LLM judge: select a model (e.g., "openai/gpt-5"), a system prompt (e.g., "You are an expert AI response evaluator..."), and a user prompt template (e.g., "Rate the response on [criteria] from 1-10"), including any template variables you need.

    • To reference values from the imported data or the generated output in your judge prompts, use the template variables {{item.input}}, {{item.column_name}}, or {{sample.output}}.

    • Define scoring rubrics, such as pass/fail thresholds or numeric scales.

Custom Evaluations: Add Test Criteria
Custom Evaluations: Edit Test Criteria
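
As a concrete illustration, a Model Scorer criterion might be configured along the lines of the sketch below. The prompts, the {{item.expected_output}} column, the 1-10 scale, and the pass threshold of 7 are all assumptions made for the example; the real configuration is entered in the dashboard form.

  # Illustrative "Model Scorer" criterion configuration. All field names and
  # values here are examples to show the shape of a judge setup, not defaults.
  judge_criterion = {
      "name": "Factual accuracy",
      "judge_model": "openai/gpt-5",
      "system_prompt": (
          "You are an expert AI response evaluator. Score responses strictly "
          "and explain your reasoning."
      ),
      # {{item.<column>}} pulls values from the imported data (column names are
      # whatever your dataset defines); {{sample.output}} is the output
      # generated by the run under evaluation.
      "user_prompt_template": (
          "Question: {{item.input}}\n"
          "Reference answer (if any): {{item.expected_output}}\n"
          "Model response: {{sample.output}}\n\n"
          "Rate the factual accuracy of the model response from 1 to 10 and "
          "explain briefly."
      ),
      # Example rubric: a 1-10 numeric scale with a pass threshold.
      "scale": {"min": 1, "max": 10},
      "pass_threshold": 7,
  }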

  5. Select an Evaluation Key: Choose an API key from your account to handle generation and evaluation requests. This key will be used for all API calls during the evaluation (a quick way to verify the key is sketched below).

Custom Evaluations: Configure Evaluation Key
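
If you want to sanity-check the key before starting, you can send a single request through it outside the dashboard. The sketch below assumes FastRouter exposes an OpenAI-compatible chat completions endpoint and uses placeholder values for the base URL and key; substitute the values shown in your dashboard.

  # Minimal sanity check of the key you plan to select as the evaluation key.
  # Assumes an OpenAI-compatible chat completions endpoint; the base URL and
  # key below are placeholders to replace with the values from your dashboard.
  from openai import OpenAI

  client = OpenAI(
      base_url="https://<your-fastrouter-base-url>/v1",  # placeholder
      api_key="<your-fastrouter-api-key>",               # key you will select
  )

  resp = client.chat.completions.create(
      model="anthropic/claude-4.5",  # any model you plan to include in a run
      messages=[{"role": "user", "content": "Reply with the word 'ok'."}],
  )
  print(resp.choices[0].message.content)
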
  6. Run the Evaluation: Click "Run" to start processing. The evaluation will generate outputs for each run and apply the judges asynchronously.


Viewing Evaluation Results

Once an evaluation completes, open it from the Evaluations listing page to view its results.

  • Overview: See a summary of runs, including model names, request IDs, input samples, generated outputs, testing criteria scores (e.g., Latency, Cost), and overall scores.

Custom Evaluations: Evaluation Runs Overview
  • Comparison and Analysis: Compare multiple runs side-by-side. Metrics include:

    • Score: Aggregated from your criteria (e.g., 7/10 for accuracy).

    • Latency: Time to generate responses.

    • Cost: Token-based billing.

    • Custom metrics based on your judges.

  • Detailed Metrics: For each run, view aggregated stats like average score, latency (in ms), cost, and pass rate. Drill down into individual responses for judge reasoning. (The underlying aggregation arithmetic is sketched after this list.)

Custom Evaluations: Evaluation Run Details
  • Judge Reasoning: For each test criterion and score, drill down into the individual responses to see the details of the judge's reasoning.

Custom Evaluations: Judge Reasoning
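
If you export or retrieve per-response results, the aggregated stats shown on the run details page can be reproduced with straightforward arithmetic, as sketched below. The field names and values are hypothetical; adapt them to the shape of your actual results.

  # Rough sketch of the per-run aggregates, derived from per-response results.
  # Field names and values are hypothetical; adapt them to the shape of your
  # actual exported or retrieved results.
  results = [
      {"score": 8, "passed": True,  "latency_ms": 640, "cost_usd": 0.0021},
      {"score": 5, "passed": False, "latency_ms": 910, "cost_usd": 0.0034},
      {"score": 9, "passed": True,  "latency_ms": 580, "cost_usd": 0.0019},
  ]

  n = len(results)
  avg_score   = sum(r["score"] for r in results) / n       # e.g., 7.3 on a 1-10 scale
  pass_rate   = sum(r["passed"] for r in results) / n      # e.g., 0.67
  avg_latency = sum(r["latency_ms"] for r in results) / n  # milliseconds
  total_cost  = sum(r["cost_usd"] for r in results)        # token-based billing

  print(f"avg score {avg_score:.1f}, pass rate {pass_rate:.0%}, "
        f"avg latency {avg_latency:.0f} ms, total cost ${total_cost:.4f}")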

Tips & Best Practices

  • Start Small: Begin with a small sample size (e.g., 10-50 rows) to test your setup before scaling to larger datasets.

  • Diverse Criteria: Use multiple judges for comprehensive evaluations (e.g., one for factual accuracy, another for response conciseness).

  • Judge Calibration: Test your LLM judge prompts on sample data to ensure unbiased and consistent scoring (see the calibration sketch after this list).

  • Cost Management: Monitor estimated costs in the setup phase. Use efficient models for judges to minimize expenses.
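
For judge calibration in particular, a lightweight approach is to score a handful of hand-labeled examples with your judge prompt and compare the judge's scores against your own before launching a full evaluation. The sketch below reuses the assumed OpenAI-compatible endpoint and placeholder credentials from the key check above; the prompts, model name, and expected scores are illustrative.

  # Lightweight judge-calibration sketch: score a few hand-labeled examples
  # with your judge prompt and compare against the scores you would give.
  # The endpoint, key, prompts, model name, and expected scores are all
  # illustrative assumptions, not FastRouter defaults.
  from openai import OpenAI

  client = OpenAI(
      base_url="https://<your-fastrouter-base-url>/v1",  # placeholder
      api_key="<your-fastrouter-api-key>",
  )

  calibration_set = [
      {"input": "What is 2 + 2?", "output": "4", "expected_score": 10},
      {"input": "What is 2 + 2?", "output": "5", "expected_score": 1},
  ]

  for example in calibration_set:
      user_prompt = (
          f"Question: {example['input']}\n"
          f"Model response: {example['output']}\n\n"
          "Rate the factual accuracy from 1 to 10. Reply with the number only."
      )
      resp = client.chat.completions.create(
          model="openai/gpt-5",  # the judge model configured in your test criteria
          messages=[
              {"role": "system", "content": "You are an expert AI response evaluator."},
              {"role": "user", "content": user_prompt},
          ],
      )
      print(f"judge: {resp.choices[0].message.content.strip()} "
            f"| expected: {example['expected_score']}")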
