Getting Started with Fiddler Evals

Building reliable LLM and agentic applications requires more than just deploying models; it requires systematic evaluation to ensure quality, safety, and consistent performance. Fiddler Evals provides an evaluation framework that helps you test, measure, and improve your AI applications. Whether you're comparing different prompts, testing model updates, or ensuring quality standards, Fiddler Evals provides the tools to quantify and validate changes.

Private Preview Notice

Fiddler Evals is currently in private preview. This means:

API interfaces may change before general availability
Some features are still under active development
We welcome your feedback to help shape the final product

Please refer to our product maturity definitions for more details.

What Is Fiddler Evals?

Fiddler Evals is an evaluation platform that helps you measure and improve the quality of your LLM applications. It provides built-in evaluators, custom evaluation support, and a comparison interface to:

Test systematically: Create comprehensive test suites with real-world scenarios
Measure objectively: Use built-in and custom evaluators to assess quality
Compare confidently: Analyze experiments side-by-side to make data-driven decisions
Improve continuously: Track progress over time and identify areas for enhancement

Core Concepts

Understanding three key concepts will help you get the most from Fiddler Evals:

Diagram showing three connected components: Datasets containing test cases, Experiments running evaluations, and Results providing insights with data flow arrows

Datasets: Collections of test cases with inputs and expected outputs
Experiments: Evaluation runs that test your application against a dataset
Evaluators: Metrics that assess specific aspects of your application's performance

New to evaluation terminology? See our Evaluations Glossary for definitions of key terms like evaluator, metric, score, and experiment.

Powered by Fiddler Trust Service

Fiddler Evals evaluators run on Fiddler Trust Models that operate entirely within your environment—no external API calls, zero hidden costs, and enterprise-grade security. Learn more about Trust Service.

Why Choose Fiddler Evals?

Fiddler Evals stands apart from fragmented evaluation tools by providing an integrated approach to AI quality assurance:

Unified Development-to-Production Workflow

Unlike tools that separate pre-production testing from production monitoring, Fiddler Evals integrates seamlessly with Fiddler Agentic Observability. This unified workflow means:

Consistent metrics: The same evaluators you use in development run in production monitoring
Continuous learning: Production insights feed back into evaluation datasets
Seamless transition: Deploy with confidence knowing your production monitoring matches your testing

Cost-Effective with Trust Service

Zero Hidden Costs: No external API calls, no per-request fees, no token charges
High Performance: <100ms response times enable real-time evaluation
Enterprise Security: Your data never leaves your environment—no third-party API exposure
Superior Accuracy: 50% more accurate than generic models on LLM evaluation benchmarks

Enterprise-Grade Reliability

Scalable: Evaluate thousands of test cases in parallel
Collaborative: Team access controls and shared evaluation libraries
Auditable: Complete traceability for compliance and debugging
Framework-Agnostic: Works with any LLM provider or agentic framework

Why Systematic Evaluation Matters

LLM and agentic applications face unique quality challenges that make systematic evaluation essential:

The Challenge of Variability

LLMs and agentic applications are non-deterministic, meaning they can produce different outputs for the same input, making quality assessment difficult. Without systematic evaluation:

You can't reliably detect quality degradation
Improvements are based on anecdotal evidence rather than data
Edge cases and failure modes go unnoticed until production

The Need for Objectivity

Human evaluation is valuable but subjective and doesn't scale. Automated evaluators provide:

Consistent, repeatable measurements
Scalable evaluation across thousands of test cases
Objective metrics for decision-making

The Power of Comparison

Understanding relative performance is crucial for improvement. Side-by-side comparison helps you:

Validate that changes actually improve performance
Choose between different approaches with confidence
Track progress toward quality goals

Navigating the Fiddler Evals Interface

The Fiddler Evals interface provides search, filtering, and side-by-side experiment comparison. Let's explore the key areas you'll use.

Experiments Dashboard

The main dashboard provides an overview of all your evaluation experiments, making it easy to track progress and identify trends.

Key features of the dashboard:

Search and filter: Quickly find experiments by name, application, or dataset
Status indicators: See which experiments are completed, in progress, or failed
Metadata display: View custom metadata to understand experiment context
Quick actions: Access experiment details or start comparisons directly

Viewing Experiment Details

Click on any experiment to explore the results in depth and understand your application's performance.

Detailed experiment view displaying test case inputs, expected outputs, actual outputs, and multiple evaluator scores (faithfulness, relevance, toxicity) with aggregate statistics panel and export functionality

The experiment details view provides:

Test case results: See inputs, outputs, and expected outputs for each item
Evaluator scores: View all metrics calculated for each test case
Experiment metadata: View details and labels that describe the experiment

Comparing Experiments

The comparison view shows performance differences between experiments, helping you validate whether changes improve your application.

Side-by-side comparison view showing two experiments and metric selectors

Comparison features include:

Side-by-side metrics: See how each experiment performs on the same test cases
Flexible metric selection: Choose which evaluators to compare

Core Workflow

The typical Fiddler Evals workflow follows a simple pattern that scales from quick tests to comprehensive evaluation suites:

TL;DR: Create dataset with test cases → Configure evaluators → Run evaluation → Analyze results. Takes ~15 minutes for first experiment.

The following walk-through is a high-level overview of a basic evaluation workflow. For a fully functional example, refer to our Quick Start Guide and Notebook.

Create Your Dataset

Refer to the Fiddler Evals SDK Technical Reference for instructions on installing and initializing the fiddler-evals-sdk Python package.

Start by creating a dataset that represents the scenarios your application needs to handle. This example enters test cases inline in the code, but you may also use a CSV file, JSONL file, or pandas DataFrame to load test cases:

from fiddler_evals import Dataset, NewDatasetItem

# Create dataset
dataset = Dataset.create(
    name="customer-support-qa",
    application_id=app.id,
    description="Common customer support questions"
)

# Add test cases
items = [
    NewDatasetItem(
        inputs={"question": "How do I reset my password?"},
        expected_outputs={"answer": "To reset your password, click 'Forgot Password' on the login page..."},
        metadata={"category": "account"}
    ),
    # Add more test cases...
]
dataset.insert(items)

Configure Your Evaluators

Choose evaluators that measure what matters for your use case:

from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity

evaluators = [
    AnswerRelevance(),  # Is the answer relevant to the question?
    Conciseness(),      # Is the response appropriately brief?
    Toxicity(),         # Is the content safe and appropriate?
]

Run Your Experiment

Evaluation Pattern: Fiddler's built-in evaluators use the LLM-as-a-Judge pattern, where language models assess quality dimensions that are difficult to measure with rule-based systems. This provides automated quality assessment that approximates human evaluation patterns while maintaining consistency across thousands of test cases.

Grid showing four evaluator categories: Safety (toxicity, PII detection, prompt injection), Quality (faithfulness, coherence, conciseness), Relevance (answer relevance, context relevance), and Custom (domain-specific logic) with color-coded cards

Execute your evaluation to see how your application performs:

from fiddler_evals import evaluate

def my_application(inputs, extras, metadata):
    # Your LLM application logic
    response = generate_answer(inputs["question"])
    return {"answer": response}

results = evaluate(
    dataset=dataset,
    task=my_application,
    evaluators=evaluators,
    name_prefix="v1.0-baseline"
)

Analyze and Compare

Use the Fiddler UI to understand your results and identify improvements:

Review individual scores to find problem areas
Compare experiments to validate improvements
Export data for deeper analysis
Iterate based on insights

Understanding Your Evaluation Results

Interpreting evaluation results effectively helps you make informed decisions about your application.

Reading Score Cards

Each evaluator produces scores that help you understand specific aspects of performance:

Binary scores (0 or 1): Pass/fail metrics like relevance or correctness
Continuous scores (0.0 to 1.0): Gradual metrics like similarity or confidence
Categorical scores: Classifications like sentiment (positive/neutral/negative)

Visual guide showing three score types: Binary scores with pass/fail indicators, Continuous scores with a gradient bar from 0.0-1.0 showing quality zones (needs work, acceptable, excellent), and Categorical scores with positive/neutral/negative tags

Identifying Patterns

Look for patterns across your test cases:

Consistent failures: Indicate systemic issues that need addressing
Category-specific problems: Suggest areas needing specialized handling
Score correlations: Reveal trade-offs between different metrics

Making Improvements

Use evaluation insights to guide your optimization efforts:

Focus on lowest scores: Address the most significant quality issues first
Test hypotheses: Use experiments to validate that changes improve metrics
Monitor trade-offs: Ensure improvements don't degrade other aspects

Common Use Cases

Fiddler Evals supports various evaluation scenarios across the LLM application lifecycle:

A/B Testing Prompts

Compare different prompt strategies to find what works best:

# Baseline prompt
baseline_results = evaluate(
    dataset=dataset,
    task=baseline_prompt_app,
    evaluators=evaluators,
    name_prefix="prompt-baseline"
)

# Improved prompt
improved_results = evaluate(
    dataset=dataset,
    task=improved_prompt_app,
    evaluators=evaluators,
    name_prefix="prompt-improved"
)

# Compare in UI to see which performs better

Model Version Comparison

Validate that model updates improve performance:

Test the same dataset against different model versions
Compare quality metrics side-by-side
Ensure no regression in critical capabilities

Regression Testing

Protect against quality degradation as you develop:

Run standard test suites before deployments
Set quality thresholds that must be met
Track performance trends over time

Safety Validation

Ensure your application meets safety standards:

Test with adversarial inputs
Measure toxicity and bias metrics
Validate content filtering effectiveness

Agentic Application Evaluation

Evaluate AI agents and multi-step workflows with specialized patterns:

Trajectory Evaluation: Assess agent decision-making sequences and tool selection paths
Reasoning Coherence: Validate logical flow from planning through execution
Tool Usage Quality: Measure appropriateness and effectiveness of tool calls
Multi-Agent Coordination: Track information flow and task delegation patterns

Connect to Production: Use Fiddler Evals during development, then monitor agent behavior in production with Agentic Monitoring.

Best Practices

Follow these practices to get the most value from Fiddler Evals:

Building Representative Datasets

Create test sets that reflect real-world usage:

Include edge cases: Don't just test the happy path—use dataset metadata to tag edge cases for focused analysis
Balance categories: Ensure coverage across different scenarios, then use experiment comparison to validate your test distribution matches production patterns
Use production data: Incorporate actual user inputs when possible (anonymized and sanitized)
Update regularly: Keep test cases current with evolving requirements—track dataset versions in metadata

Choosing Appropriate Evaluators

Select metrics that align with your goals:

Start with basics: Answer relevance and safety evaluators (toxicity, PII) are essential for most applications
Add domain-specific metrics: Build custom evaluators for specialized needs
Avoid metric overload: Focus on 3-5 key metrics that actually drive decisions rather than tracking everything
Validate with humans: Spot-check evaluator scores against human judgment to ensure they align with your quality standards

Setting Up Evaluation Cycles

Make evaluation a routine part of development:

Pre-deployment testing: Always evaluate before production changes
Regular benchmarking: Schedule periodic comprehensive evaluations
Continuous monitoring: Track key metrics in production
Iterative improvement: Use insights to guide development priorities

Getting Started Checklist

Ready to start evaluating? Follow this simple checklist:

Set up your environment: Install the Fiddler Evals SDK
Create your first dataset: Start with 10-20 representative test cases
Run a baseline experiment: Establish current performance levels
Review results in the UI: Understand your application's strengths and weaknesses
Improve and compare: Validate that changes have a positive impact

Troubleshooting Common Issues

Experiments Not Appearing

If your experiments don't show in the dashboard:

Verify the experiment was completed successfully
Check that you're viewing the correct project/application
Refresh the page to load the latest data

Unexpected Scores

If evaluation scores seem incorrect:

Review the evaluator documentation to understand scoring logic
Check that inputs/outputs are formatted correctly
Validate that the correct evaluator parameters are used

Comparison Not Working

If you can't compare experiments:

Ensure both experiments use the same dataset
At least one metric/evaluator should be in common to compare the experiments
Verify experiments have completed successfully
Check that you have permissions to view both experiments

Next Steps

From Evaluation to Production: The Complete Lifecycle

Fiddler Evals is your Test phase in Fiddler's complete end-to-end agentic AI lifecycle:

1. Build → Design and instrument your LLM applications and agents 2. Test → Evaluate systematically with Fiddler Evals (you are here) 3. Monitor → Track production performance with Agentic Monitoring 4. Improve → Use insights to enhance quality and refine your agents

This unified approach ensures your evaluation criteria in development become your monitoring standards in production—no fragmentation, no tool switching.

Choose your path based on your role and goals:

For Developers 🔧

Evaluations SDK Quick Start - Hands-on tutorial with code
Advanced Patterns - Production-ready configurations
Fiddler Evals SDK - Complete technical docs

For Teams Scaling AI 📈

Agentic Monitoring - Monitor agents in production
LLM Monitoring - Production observability

For Product & Business 💼

Review sample dashboards in your Fiddler instance
Schedule a workshop with your Fiddler team
Explore case studies and best practices on the Fiddler blog

Summary

Fiddler Evals adds systematic measurement to LLM application development, replacing ad-hoc testing with quantified assessment. By evaluating your applications, you can:

Compare experiments quantitatively: Use side-by-side metrics to validate that changes improve performance
Track evaluation trends: Monitor quality over time through the experiments dashboard
Establish quality baselines: Define acceptable score thresholds for your use case
Reuse test suites: Ensure consistency by testing model versions against the same datasets

Start with 10-20 test cases and gradually expand your evaluation coverage. The metrics you track will help you make informed decisions about model changes and deployment.

❓ Questions? Talk to a product expert or request a demo.

💡 Need help? Contact us at [email protected].

PreviousGetting Started with Agentic Monitoring NextGetting Started with Fiddler Guardrails