Getting Started with Fiddler Evals
Building reliable LLM and agentic applications requires more than just deploying models; it requires systematic evaluation to ensure quality, safety, and consistent performance. Fiddler Evals provides an evaluation framework that helps you test, measure, and improve your AI applications. Whether you're comparing different prompts, testing model updates, or ensuring quality standards, Fiddler Evals provides the tools to quantify and validate changes.
Private Preview Notice
Fiddler Evals is currently in private preview. This means:
API interfaces may change before general availability
Some features are still under active development
We welcome your feedback to help shape the final product
Please refer to our product maturity definitions for more details.
What Is Fiddler Evals?
Fiddler Evals is an evaluation platform that helps you measure and improve the quality of your LLM applications. It provides built-in evaluators, custom evaluation support, and a comparison interface to:
Test systematically: Create comprehensive test suites with real-world scenarios
Measure objectively: Use built-in and custom evaluators to assess quality
Compare confidently: Analyze experiments side-by-side to make data-driven decisions
Improve continuously: Track progress over time and identify areas for enhancement
Core Concepts
Understanding three key concepts will help you get the most from Fiddler Evals:

Datasets: Collections of test cases with inputs and expected outputs
Experiments: Evaluation runs that test your application against a dataset
Evaluators: Metrics that assess specific aspects of your application's performance
Why Choose Fiddler Evals?
Fiddler Evals stands apart from fragmented evaluation tools by providing an integrated approach to AI quality assurance:
Unified Development-to-Production Workflow
Unlike tools that separate pre-production testing from production monitoring, Fiddler Evals integrates seamlessly with Fiddler Agentic Observability. This unified workflow means:
Consistent metrics: The same evaluators you use in development run in production monitoring
Continuous learning: Production insights feed back into evaluation datasets
Seamless transition: Deploy with confidence knowing your production monitoring matches your testing
Cost-Effective with Trust Service
Powered by the Fiddler Trust Service, Fiddler Evals evaluators run on purpose-built Trust Models:
Zero Hidden Costs: No external API calls, no per-request fees, no token charges
High Performance: <100ms response times enable real-time evaluation
Enterprise Security: Your data never leaves your environment—no third-party API exposure
Superior Accuracy: 50% more accurate than generic models on LLM evaluation benchmarks
Enterprise-Grade Reliability
Scalable: Evaluate thousands of test cases in parallel
Collaborative: Team access controls and shared evaluation libraries
Auditable: Complete traceability for compliance and debugging
Framework-Agnostic: Works with any LLM provider or agentic framework
Why Systematic Evaluation Matters
LLM and agentic applications face unique quality challenges that make systematic evaluation essential:
The Challenge of Variability
LLMs and agentic applications are non-deterministic, meaning they can produce different outputs for the same input, making quality assessment difficult. Without systematic evaluation:
You can't reliably detect quality degradation
Improvements are based on anecdotal evidence rather than data
Edge cases and failure modes go unnoticed until production
The Need for Objectivity
Human evaluation is valuable but subjective and doesn't scale. Automated evaluators provide:
Consistent, repeatable measurements
Scalable evaluation across thousands of test cases
Objective metrics for decision-making
The Power of Comparison
Understanding relative performance is crucial for improvement. Side-by-side comparison helps you:
Validate that changes actually improve performance
Choose between different approaches with confidence
Track progress toward quality goals
Navigating the Fiddler Evals Interface
The Fiddler Evals interface provides search, filtering, and side-by-side experiment comparison. Let's explore the key areas you'll use.
Experiments Dashboard
The main dashboard provides an overview of all your evaluation experiments, making it easy to track progress and identify trends.

Key features of the dashboard:
Search and filter: Quickly find experiments by name, application, or dataset
Status indicators: See which experiments are completed, in progress, or failed
Metadata display: View custom metadata to understand experiment context
Quick actions: Access experiment details or start comparisons directly
Viewing Experiment Details
Click on any experiment to explore the results in depth and understand your application's performance.

The experiment details view provides:
Test case results: See inputs, outputs, and expected outputs for each item
Evaluator scores: View all metrics calculated for each test case
Experiment metadata: View details and labels that describe the experiment
Comparing Experiments
The comparison view shows performance differences between experiments, helping you validate whether changes improve your application.

Comparison features include:
Side-by-side metrics: See how each experiment performs on the same test cases
Flexible metric selection: Choose which evaluators to compare
Core Workflow
The typical Fiddler Evals workflow follows a simple pattern that scales from quick tests to comprehensive evaluation suites:
Create Your Dataset
Start by creating a dataset that represents the scenarios your application needs to handle. This example enters test cases inline in the code, but you may also use a CSV file, JSONL file, or pandas DataFrame to load test cases:
from fiddler_evals import Dataset, NewDatasetItem
# Create dataset
dataset = Dataset.create(
name="customer-support-qa",
application_id=app.id,
description="Common customer support questions"
)
# Add test cases
items = [
NewDatasetItem(
inputs={"question": "How do I reset my password?"},
expected_outputs={"answer": "To reset your password, click 'Forgot Password' on the login page..."},
metadata={"category": "account"}
),
# Add more test cases...
]
dataset.insert(items)
Configure Your Evaluators
Choose evaluators that measure what matters for your use case:
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity
evaluators = [
AnswerRelevance(), # Is the answer relevant to the question?
Conciseness(), # Is the response appropriately brief?
Toxicity(), # Is the content safe and appropriate?
]
Run Your Experiment
Evaluation Pattern: Fiddler's built-in evaluators use the LLM-as-a-Judge pattern, where language models assess quality dimensions that are difficult to measure with rule-based systems. This provides automated quality assessment that approximates human evaluation patterns while maintaining consistency across thousands of test cases.

Execute your evaluation to see how your application performs:
from fiddler_evals import evaluate
def my_application(inputs, extras, metadata):
# Your LLM application logic
response = generate_answer(inputs["question"])
return {"answer": response}
results = evaluate(
dataset=dataset,
task=my_application,
evaluators=evaluators,
name_prefix="v1.0-baseline"
)
Analyze and Compare
Use the Fiddler UI to understand your results and identify improvements:
Review individual scores to find problem areas
Compare experiments to validate improvements
Export data for deeper analysis
Iterate based on insights
Understanding Your Evaluation Results
Interpreting evaluation results effectively helps you make informed decisions about your application.
Reading Score Cards
Each evaluator produces scores that help you understand specific aspects of performance:
Binary scores (0 or 1): Pass/fail metrics like relevance or correctness
Continuous scores (0.0 to 1.0): Gradual metrics like similarity or confidence
Categorical scores: Classifications like sentiment (positive/neutral/negative)

Identifying Patterns
Look for patterns across your test cases:
Consistent failures: Indicate systemic issues that need addressing
Category-specific problems: Suggest areas needing specialized handling
Score correlations: Reveal trade-offs between different metrics
Making Improvements
Use evaluation insights to guide your optimization efforts:
Focus on lowest scores: Address the most significant quality issues first
Test hypotheses: Use experiments to validate that changes improve metrics
Monitor trade-offs: Ensure improvements don't degrade other aspects
Common Use Cases
Fiddler Evals supports various evaluation scenarios across the LLM application lifecycle:
A/B Testing Prompts
Compare different prompt strategies to find what works best:
# Baseline prompt
baseline_results = evaluate(
dataset=dataset,
task=baseline_prompt_app,
evaluators=evaluators,
name_prefix="prompt-baseline"
)
# Improved prompt
improved_results = evaluate(
dataset=dataset,
task=improved_prompt_app,
evaluators=evaluators,
name_prefix="prompt-improved"
)
# Compare in UI to see which performs better
Model Version Comparison
Validate that model updates improve performance:
Test the same dataset against different model versions
Compare quality metrics side-by-side
Ensure no regression in critical capabilities
Regression Testing
Protect against quality degradation as you develop:
Run standard test suites before deployments
Set quality thresholds that must be met
Track performance trends over time
Safety Validation
Ensure your application meets safety standards:
Test with adversarial inputs
Measure toxicity and bias metrics
Validate content filtering effectiveness
Agentic Application Evaluation
Evaluate AI agents and multi-step workflows with specialized patterns:
Trajectory Evaluation: Assess agent decision-making sequences and tool selection paths
Reasoning Coherence: Validate logical flow from planning through execution
Tool Usage Quality: Measure appropriateness and effectiveness of tool calls
Multi-Agent Coordination: Track information flow and task delegation patterns
Connect to Production: Use Fiddler Evals during development, then monitor agent behavior in production with Agentic Monitoring.
Best Practices
Follow these practices to get the most value from Fiddler Evals:
Building Representative Datasets
Create test sets that reflect real-world usage:
Include edge cases: Don't just test the happy path—use dataset metadata to tag edge cases for focused analysis
Balance categories: Ensure coverage across different scenarios, then use experiment comparison to validate your test distribution matches production patterns
Use production data: Incorporate actual user inputs when possible (anonymized and sanitized)
Update regularly: Keep test cases current with evolving requirements—track dataset versions in metadata
Choosing Appropriate Evaluators
Select metrics that align with your goals:
Start with basics: Answer relevance and safety evaluators (toxicity, PII) are essential for most applications
Add domain-specific metrics: Build custom evaluators for specialized needs
Avoid metric overload: Focus on 3-5 key metrics that actually drive decisions rather than tracking everything
Validate with humans: Spot-check evaluator scores against human judgment to ensure they align with your quality standards
Setting Up Evaluation Cycles
Make evaluation a routine part of development:
Pre-deployment testing: Always evaluate before production changes
Regular benchmarking: Schedule periodic comprehensive evaluations
Continuous monitoring: Track key metrics in production
Iterative improvement: Use insights to guide development priorities
Getting Started Checklist
Ready to start evaluating? Follow this simple checklist:
Troubleshooting Common Issues
Experiments Not Appearing
If your experiments don't show in the dashboard:
Verify the experiment was completed successfully
Check that you're viewing the correct project/application
Refresh the page to load the latest data
Unexpected Scores
If evaluation scores seem incorrect:
Review the evaluator documentation to understand scoring logic
Check that inputs/outputs are formatted correctly
Validate that the correct evaluator parameters are used
Comparison Not Working
If you can't compare experiments:
Ensure both experiments use the same dataset
At least one metric/evaluator should be in common to compare the experiments
Verify experiments have completed successfully
Check that you have permissions to view both experiments
Next Steps
From Evaluation to Production: The Complete Lifecycle
Fiddler Evals is your Test phase in Fiddler's complete end-to-end agentic AI lifecycle:
1. Build → Design and instrument your LLM applications and agents 2. Test → Evaluate systematically with Fiddler Evals (you are here) 3. Monitor → Track production performance with Agentic Monitoring 4. Improve → Use insights to enhance quality and refine your agents
This unified approach ensures your evaluation criteria in development become your monitoring standards in production—no fragmentation, no tool switching.
Choose your path based on your role and goals:
For Developers 🔧
Evaluations SDK Quick Start - Hands-on tutorial with code
Advanced Patterns - Production-ready configurations
Fiddler Evals SDK - Complete technical docs
For Teams Scaling AI 📈
Agentic Monitoring - Monitor agents in production
LLM Monitoring - Production observability
For Product & Business 💼
Review sample dashboards in your Fiddler instance
Schedule a workshop with your Fiddler team
Explore case studies and best practices on the Fiddler blog
Summary
Fiddler Evals adds systematic measurement to LLM application development, replacing ad-hoc testing with quantified assessment. By evaluating your applications, you can:
Compare experiments quantitatively: Use side-by-side metrics to validate that changes improve performance
Track evaluation trends: Monitor quality over time through the experiments dashboard
Establish quality baselines: Define acceptable score thresholds for your use case
Reuse test suites: Ensure consistency by testing model versions against the same datasets
Start with 10-20 test cases and gradually expand your evaluation coverage. The metrics you track will help you make informed decisions about model changes and deployment.
❓ Questions? Talk to a product expert or request a demo.
💡 Need help? Contact us at [email protected].