Getting Started with Fiddler Evals

Building reliable LLM and agentic applications requires more than just deploying models; it requires systematic evaluation to ensure quality, safety, and consistent performance. Fiddler Evals provides an evaluation framework that helps you test, measure, and improve your AI applications. Whether you're comparing different prompts, testing model updates, or ensuring quality standards, Fiddler Evals provides the tools to quantify and validate changes.

What Is Fiddler Evals?

Fiddler Evals is an evaluation platform that helps you measure and improve the quality of your LLM applications. It provides built-in evaluators, custom evaluation support, and a comparison interface to:

  • Test systematically: Create comprehensive test suites with real-world scenarios

  • Measure objectively: Use built-in and custom evaluators to assess quality

  • Compare confidently: Analyze experiments side-by-side to make data-driven decisions

  • Improve continuously: Track progress over time and identify areas for enhancement

Core Concepts

Understanding three key concepts will help you get the most from Fiddler Evals:

Diagram showing three connected components: Datasets containing test cases, Experiments running evaluations, and Results providing insights with data flow arrows
Fiddler Evals core concepts: Datasets contain test cases, Experiments run evaluations, and Results provide insights
  1. Datasets: Collections of test cases with inputs and expected outputs

  2. Experiments: Evaluation runs that test your application against a dataset

  3. Evaluators: Metrics that assess specific aspects of your application's performance

New to evaluation terminology? See our Evaluations Glossary for definitions of key terms like evaluator, metric, score, and experiment.

Powered by Fiddler Trust Service

Fiddler Evals evaluators run on Fiddler Trust Models that operate entirely within your environment—no external API calls, zero hidden costs, and enterprise-grade security. Learn more about Trust Service.

Why Choose Fiddler Evals?

Fiddler Evals stands apart from fragmented evaluation tools by providing an integrated approach to AI quality assurance:

Unified Development-to-Production Workflow

Unlike tools that separate pre-production testing from production monitoring, Fiddler Evals integrates seamlessly with Fiddler Agentic Observability. This unified workflow means:

  • Consistent metrics: The same evaluators you use in development run in production monitoring

  • Continuous learning: Production insights feed back into evaluation datasets

  • Seamless transition: Deploy with confidence knowing your production monitoring matches your testing

Cost-Effective with Trust Service

Powered by the Fiddler Trust Service, Fiddler Evals evaluators run on purpose-built Trust Models:

  • Zero Hidden Costs: No external API calls, no per-request fees, no token charges

  • High Performance: <100ms response times enable real-time evaluation

  • Enterprise Security: Your data never leaves your environment—no third-party API exposure

  • Superior Accuracy: 50% more accurate than generic models on LLM evaluation benchmarks

Enterprise-Grade Reliability

  • Scalable: Evaluate thousands of test cases in parallel

  • Collaborative: Team access controls and shared evaluation libraries

  • Auditable: Complete traceability for compliance and debugging

  • Framework-Agnostic: Works with any LLM provider or agentic framework

Why Systematic Evaluation Matters

LLM and agentic applications face unique quality challenges that make systematic evaluation essential:

The Challenge of Variability

LLMs and agentic applications are non-deterministic, meaning they can produce different outputs for the same input, making quality assessment difficult. Without systematic evaluation:

  • You can't reliably detect quality degradation

  • Improvements are based on anecdotal evidence rather than data

  • Edge cases and failure modes go unnoticed until production

The Need for Objectivity

Human evaluation is valuable but subjective and doesn't scale. Automated evaluators provide:

  • Consistent, repeatable measurements

  • Scalable evaluation across thousands of test cases

  • Objective metrics for decision-making

The Power of Comparison

Understanding relative performance is crucial for improvement. Side-by-side comparison helps you:

  • Validate that changes actually improve performance

  • Choose between different approaches with confidence

  • Track progress toward quality goals

The Fiddler Evals interface provides search, filtering, and side-by-side experiment comparison. Let's explore the key areas you'll use.

Experiments Dashboard

The main dashboard provides an overview of all your evaluation experiments, making it easy to track progress and identify trends.

Fiddler Evals experiments dashboard showing searchable table of evaluation runs with status indicators (completed, in progress, failed), dataset names, metadata tags, and action buttons for viewing and comparing experiments
The Experiments dashboard shows all your evaluation runs with status, datasets, and metadata at a glance

Key features of the dashboard:

  • Search and filter: Quickly find experiments by name, application, or dataset

  • Status indicators: See which experiments are completed, in progress, or failed

  • Metadata display: View custom metadata to understand experiment context

  • Quick actions: Access experiment details or start comparisons directly

Viewing Experiment Details

Click on any experiment to explore the results in depth and understand your application's performance.

Detailed experiment view displaying test case inputs, expected outputs, actual outputs, and multiple evaluator scores (faithfulness, relevance, toxicity) with aggregate statistics panel and export functionality
Experiment details show individual test case results with all evaluator scores

The experiment details view provides:

  • Test case results: See inputs, outputs, and expected outputs for each item

  • Evaluator scores: View all metrics calculated for each test case

  • Experiment metadata: View details and labels that describe the experiment

Comparing Experiments

The comparison view shows performance differences between experiments, helping you validate whether changes improve your application.

Side-by-side comparison view showing two experiments and metric selectors
Compare experiments side-by-side to understand how changes affect performance

Comparison features include:

  • Side-by-side metrics: See how each experiment performs on the same test cases

  • Flexible metric selection: Choose which evaluators to compare

Core Workflow

The typical Fiddler Evals workflow follows a simple pattern that scales from quick tests to comprehensive evaluation suites:

TL;DR: Create dataset with test cases → Configure evaluators → Run evaluation → Analyze results. Takes ~15 minutes for first experiment.

The following walk-through is a high-level overview of a basic evaluation workflow. For a fully functional example, refer to our Quick Start Guide and Notebook.

1

Create Your Dataset

Refer to the Fiddler Evals SDK Technical Reference for instructions on installing and initializing the fiddler-evals-sdk Python package.

Start by creating a dataset that represents the scenarios your application needs to handle. This example enters test cases inline in the code, but you may also use a CSV file, JSONL file, or pandas DataFrame to load test cases:

from fiddler_evals import Dataset, NewDatasetItem

# Create dataset
dataset = Dataset.create(
    name="customer-support-qa",
    application_id=app.id,
    description="Common customer support questions"
)

# Add test cases
items = [
    NewDatasetItem(
        inputs={"question": "How do I reset my password?"},
        expected_outputs={"answer": "To reset your password, click 'Forgot Password' on the login page..."},
        metadata={"category": "account"}
    ),
    # Add more test cases...
]
dataset.insert(items)
2

Configure Your Evaluators

Choose evaluators that measure what matters for your use case:

from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity

evaluators = [
    AnswerRelevance(),  # Is the answer relevant to the question?
    Conciseness(),      # Is the response appropriately brief?
    Toxicity(),         # Is the content safe and appropriate?
]
3

Run Your Experiment

Evaluation Pattern: Fiddler's built-in evaluators use the LLM-as-a-Judge pattern, where language models assess quality dimensions that are difficult to measure with rule-based systems. This provides automated quality assessment that approximates human evaluation patterns while maintaining consistency across thousands of test cases.

Grid showing four evaluator categories: Safety (toxicity, PII detection, prompt injection), Quality (faithfulness, coherence, conciseness), Relevance (answer relevance, context relevance), and Custom (domain-specific logic) with color-coded cards
Fiddler provides evaluators across four main categories to assess different aspects of your application

Execute your evaluation to see how your application performs:

from fiddler_evals import evaluate

def my_application(inputs, extras, metadata):
    # Your LLM application logic
    response = generate_answer(inputs["question"])
    return {"answer": response}

results = evaluate(
    dataset=dataset,
    task=my_application,
    evaluators=evaluators,
    name_prefix="v1.0-baseline"
)
4

Analyze and Compare

Use the Fiddler UI to understand your results and identify improvements:

  1. Review individual scores to find problem areas

  2. Compare experiments to validate improvements

  3. Export data for deeper analysis

  4. Iterate based on insights

Understanding Your Evaluation Results

Interpreting evaluation results effectively helps you make informed decisions about your application.

Reading Score Cards

Each evaluator produces scores that help you understand specific aspects of performance:

  • Binary scores (0 or 1): Pass/fail metrics like relevance or correctness

  • Continuous scores (0.0 to 1.0): Gradual metrics like similarity or confidence

  • Categorical scores: Classifications like sentiment (positive/neutral/negative)

Visual guide showing three score types: Binary scores with pass/fail indicators, Continuous scores with a gradient bar from 0.0-1.0 showing quality zones (needs work, acceptable, excellent), and Categorical scores with positive/neutral/negative tags
Understanding how to interpret different evaluation score types

Identifying Patterns

Look for patterns across your test cases:

  • Consistent failures: Indicate systemic issues that need addressing

  • Category-specific problems: Suggest areas needing specialized handling

  • Score correlations: Reveal trade-offs between different metrics

Making Improvements

Use evaluation insights to guide your optimization efforts:

  1. Focus on lowest scores: Address the most significant quality issues first

  2. Test hypotheses: Use experiments to validate that changes improve metrics

  3. Monitor trade-offs: Ensure improvements don't degrade other aspects

Common Use Cases

Fiddler Evals supports various evaluation scenarios across the LLM application lifecycle:

A/B Testing Prompts

Compare different prompt strategies to find what works best:

# Baseline prompt
baseline_results = evaluate(
    dataset=dataset,
    task=baseline_prompt_app,
    evaluators=evaluators,
    name_prefix="prompt-baseline"
)

# Improved prompt
improved_results = evaluate(
    dataset=dataset,
    task=improved_prompt_app,
    evaluators=evaluators,
    name_prefix="prompt-improved"
)

# Compare in UI to see which performs better

Model Version Comparison

Validate that model updates improve performance:

  • Test the same dataset against different model versions

  • Compare quality metrics side-by-side

  • Ensure no regression in critical capabilities

Regression Testing

Protect against quality degradation as you develop:

  • Run standard test suites before deployments

  • Set quality thresholds that must be met

  • Track performance trends over time

Safety Validation

Ensure your application meets safety standards:

  • Test with adversarial inputs

  • Measure toxicity and bias metrics

  • Validate content filtering effectiveness

Agentic Application Evaluation

Evaluate AI agents and multi-step workflows with specialized patterns:

  • Trajectory Evaluation: Assess agent decision-making sequences and tool selection paths

  • Reasoning Coherence: Validate logical flow from planning through execution

  • Tool Usage Quality: Measure appropriateness and effectiveness of tool calls

  • Multi-Agent Coordination: Track information flow and task delegation patterns

Connect to Production: Use Fiddler Evals during development, then monitor agent behavior in production with Agentic Monitoring.

Best Practices

Follow these practices to get the most value from Fiddler Evals:

Building Representative Datasets

Create test sets that reflect real-world usage:

  • Include edge cases: Don't just test the happy path—use dataset metadata to tag edge cases for focused analysis

  • Balance categories: Ensure coverage across different scenarios, then use experiment comparison to validate your test distribution matches production patterns

  • Use production data: Incorporate actual user inputs when possible (anonymized and sanitized)

  • Update regularly: Keep test cases current with evolving requirements—track dataset versions in metadata

Choosing Appropriate Evaluators

Select metrics that align with your goals:

  • Start with basics: Answer relevance and safety evaluators (toxicity, PII) are essential for most applications

  • Add domain-specific metrics: Build custom evaluators for specialized needs

  • Avoid metric overload: Focus on 3-5 key metrics that actually drive decisions rather than tracking everything

  • Validate with humans: Spot-check evaluator scores against human judgment to ensure they align with your quality standards

Setting Up Evaluation Cycles

Make evaluation a routine part of development:

  • Pre-deployment testing: Always evaluate before production changes

  • Regular benchmarking: Schedule periodic comprehensive evaluations

  • Continuous monitoring: Track key metrics in production

  • Iterative improvement: Use insights to guide development priorities

Getting Started Checklist

Ready to start evaluating? Follow this simple checklist:

Troubleshooting Common Issues

Experiments Not Appearing

If your experiments don't show in the dashboard:

  • Verify the experiment was completed successfully

  • Check that you're viewing the correct project/application

  • Refresh the page to load the latest data

Unexpected Scores

If evaluation scores seem incorrect:

  • Review the evaluator documentation to understand scoring logic

  • Check that inputs/outputs are formatted correctly

  • Validate that the correct evaluator parameters are used

Comparison Not Working

If you can't compare experiments:

  • Ensure both experiments use the same dataset

  • At least one metric/evaluator should be in common to compare the experiments

  • Verify experiments have completed successfully

  • Check that you have permissions to view both experiments

Next Steps

From Evaluation to Production: The Complete Lifecycle

Fiddler Evals is your Test phase in Fiddler's complete end-to-end agentic AI lifecycle:

1. Build → Design and instrument your LLM applications and agents 2. Test → Evaluate systematically with Fiddler Evals (you are here) 3. Monitor → Track production performance with Agentic Monitoring 4. Improve → Use insights to enhance quality and refine your agents

This unified approach ensures your evaluation criteria in development become your monitoring standards in production—no fragmentation, no tool switching.


Choose your path based on your role and goals:

For Developers 🔧

  1. Evaluations SDK Quick Start - Hands-on tutorial with code

  2. Advanced Patterns - Production-ready configurations

  3. Fiddler Evals SDK - Complete technical docs

For Teams Scaling AI 📈

  1. Agentic Monitoring - Monitor agents in production

  2. LLM Monitoring - Production observability

For Product & Business 💼

  1. Review sample dashboards in your Fiddler instance

  2. Schedule a workshop with your Fiddler team

  3. Explore case studies and best practices on the Fiddler blog

Summary

Fiddler Evals adds systematic measurement to LLM application development, replacing ad-hoc testing with quantified assessment. By evaluating your applications, you can:

  • Compare experiments quantitatively: Use side-by-side metrics to validate that changes improve performance

  • Track evaluation trends: Monitor quality over time through the experiments dashboard

  • Establish quality baselines: Define acceptable score thresholds for your use case

  • Reuse test suites: Ensure consistency by testing model versions against the same datasets

Start with 10-20 test cases and gradually expand your evaluation coverage. The metrics you track will help you make informed decisions about model changes and deployment.


Questions? Talk to a product expert or request a demo.

💡 Need help? Contact us at [email protected].