LLM Evaluation Quick Start
Systematically evaluate your LLM applications, RAG systems, and AI agents using the Fiddler Evals SDK with built-in evaluators and custom metrics.
Time to complete: ~20 minutes
What You'll Learn
Initialize the Fiddler Evals SDK and organize your evaluations
Create evaluation datasets with test cases
Use built-in evaluators (faithfulness, toxicity, PII, coherence, etc.)
Create custom evaluators for domain-specific requirements
Run evaluation experiments and analyze results
Prerequisites
Fiddler Account: Active account with API access
Python 3.10+
Fiddler Evals SDK: pip install fiddler-evals
Access Token: From Settings > Credentials
Quick Start
Step 1: Connect to Fiddler
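A minimal connection sketch. It assumes the package installs as the fiddler_evals module and that init() (referenced later in this guide) accepts a deployment URL and an access token; check the Evals SDK Reference for the exact signature.

```python
# Sketch, not the exact API: module name and init() arguments are assumptions.
import os
import fiddler_evals as fdl_evals

fdl_evals.init(
    url="https://your_company.fiddler.ai",        # your Fiddler deployment URL
    token=os.environ["FIDDLER_ACCESS_TOKEN"],     # token from Settings > Credentials
)
```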
Step 2: Create Project and Application
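Evaluations are organized in a Projects > Applications > Datasets hierarchy. The sketch below assumes Project and Application classes with a get-or-create style helper; the real SDK may expose different constructors, so treat the names as placeholders.

```python
# Assumed class names and methods, shown only to illustrate the hierarchy.
from fiddler_evals import Project, Application  # names assumed

project = Project.get_or_create(name="llm-eval-quickstart")
application = Application.get_or_create(
    name="support-chatbot",
    project_id=project.id,
)
```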
Step 3: Add Test Cases
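Test cases pair an input with any reference data your evaluators need (for example an expected answer for AnswerCorrectness). The Dataset class, its add() method, and the field names below are assumptions used for illustration.

```python
# Assumed Dataset API; field names ("question", "expected_answer") are illustrative.
from fiddler_evals import Dataset  # name assumed

test_cases = [
    {
        "question": "What is your refund policy?",
        "expected_answer": "Refunds are available within 30 days of purchase.",
    },
    {
        "question": "How do I reset my password?",
        "expected_answer": "Use the 'Forgot password' link on the login page.",
    },
]

dataset = Dataset.get_or_create(name="support-faq-v1", application_id=application.id)
dataset.add(test_cases)  # assumed method for appending test cases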
Step 4: Define Your LLM Task
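The task is ordinary Python: a callable that takes one test case and returns your application's output. Replace the placeholder call_llm() with your own model or RAG call; the returned field names are assumptions and should match whatever your chosen evaluators expect.

```python
# A plain Python task function; call_llm() is a placeholder for your provider call.
def call_llm(prompt: str) -> str:
    # e.g. OpenAI, Bedrock, or a local model
    return "Refunds are available within 30 days of purchase."

def llm_task(test_case: dict) -> dict:
    answer = call_llm(test_case["question"])
    # Return the fields your evaluators read (add "context" for RAG metrics).
    return {"output": answer}
```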
Step 5: Run Evaluation with Built-In Evaluators
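A sketch of running an experiment with evaluate() and a few of the built-in evaluators listed below. The evaluate() function, name_prefix parameter, and evaluator names come from this guide; the import path and argument layout are assumptions.

```python
# Assumed import path and argument names; evaluate() and name_prefix are from this guide.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Coherence, Toxicity

results = evaluate(
    dataset=dataset,
    task=llm_task,
    evaluators=[AnswerRelevance(), Coherence(), Toxicity()],
    name_prefix="quickstart",  # versions this experiment run
)
```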
Step 6: Analyze Results
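Results can be reviewed programmatically or in the Fiddler UI. The attribute names below (rows, scores, and the "answer_relevance" score key) are assumptions about the results object, shown only to illustrate per-case and aggregate analysis.

```python
# Assumed results structure: per-test-case rows, each with a dict of scores.
for row in results.rows:
    print(row.input["question"], row.scores)

# Simple aggregate check, assuming numeric scores in [0, 1]:
relevance = [row.scores["answer_relevance"] for row in results.rows]
print("Mean answer relevance:", sum(relevance) / len(relevance))
```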
Built-In Evaluators
The Fiddler Evals SDK includes 14+ pre-built evaluators:
Quality & Accuracy
AnswerRelevance - Measures response relevance to the question
Coherence - Evaluates logical flow and consistency
Conciseness - Checks for unnecessary verbosity
AnswerCorrectness - Compares output to expected answer
Safety & Ethics
Toxicity - Detects harmful or offensive content
PIIDetection - Identifies personally identifiable information
Bias - Detects potential biases in responses
RAG-Specific
Faithfulness - Checks if response is supported by context
ContextRelevance - Evaluates relevance of retrieved context
GroundedAnswerRelevance - Combines faithfulness and relevance
Example: RAG Evaluation
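A sketch of a RAG evaluation: the task returns both the generated answer and the retrieved context so Faithfulness, ContextRelevance, and GroundedAnswerRelevance can score grounding. The retriever and generator are stubs, and the field names ("output", "context") are assumptions.

```python
# Evaluator names are from this guide; field names and signatures are assumed.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import Faithfulness, ContextRelevance, GroundedAnswerRelevance

def retrieve(question: str) -> list[str]:
    # Placeholder for your retriever (vector store, search API, etc.)
    return ["Refunds are available within 30 days of purchase."]

def generate(question: str, context: list[str]) -> str:
    # Placeholder for your generator
    return f"According to our policy: {context[0]}"

def rag_task(test_case: dict) -> dict:
    context = retrieve(test_case["question"])
    answer = generate(test_case["question"], context)
    return {"output": answer, "context": context}

rag_results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[Faithfulness(), ContextRelevance(), GroundedAnswerRelevance()],
    name_prefix="rag-baseline",
)
```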
Custom Evaluators
Create domain-specific evaluators for your use case:
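A sketch of a custom evaluator for a domain-specific rule, here checking that a support answer cites a policy section. The Evaluator base class, the evaluate() method contract, and the float return type are assumptions about the SDK's extension interface.

```python
# Base class name and scoring contract are assumed; the rule itself is plain Python.
import re
from fiddler_evals.evaluators import Evaluator  # assumed base class

class CitesPolicySection(Evaluator):
    """Scores 1.0 if the answer references a policy section like 'Section 4.2'."""

    def evaluate(self, *, output: str, **kwargs) -> float:
        return 1.0 if re.search(r"Section \d+(\.\d+)?", output) else 0.0
```

Custom evaluators can then be passed to evaluate() alongside the built-in ones.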
Advanced Features
Batch Evaluation with Parallel Processing
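The max_workers setting mentioned under Best Practices parallelizes evaluation across test cases. Its exact placement on evaluate() is assumed; this sketch reuses the dataset and llm_task from the steps above.

```python
# Assumed parameter placement; reuses dataset and llm_task from earlier steps.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Toxicity

results = evaluate(
    dataset=dataset,
    task=llm_task,
    evaluators=[AnswerRelevance(), Toxicity()],
    max_workers=8,  # run up to 8 test cases concurrently
)
```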
Import Datasets from Files
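Test cases can also be loaded from files. This sketch uses pandas to read a CSV and append the rows to the dataset created earlier; it does not assume any file-import helper in the SDK itself, and the dataset.add() call is the same assumed method as in Step 3.

```python
# Load test cases from a CSV with columns: question, expected_answer
import pandas as pd

df = pd.read_csv("test_cases.csv")
dataset.add(df.to_dict(orient="records"))  # dataset.add() assumed, as in Step 3
```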
Track Experiment Metadata
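Attaching metadata such as the model and prompt version makes experiment runs comparable over time. A metadata argument on evaluate() is an assumption; if the SDK exposes metadata differently, record the same fields there.

```python
# metadata= on evaluate() is assumed; name_prefix is from this guide.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerCorrectness

results = evaluate(
    dataset=dataset,
    task=llm_task,
    evaluators=[AnswerCorrectness()],
    name_prefix="gpt-4o-prompt-v3",
    metadata={"model": "gpt-4o", "prompt_version": "v3", "temperature": 0.2},
)
```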
Complete Example: RAG Evaluation Pipeline
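An end-to-end sketch that strings the steps together for a RAG application. All module paths, class names, and signatures (init, Project, Application, Dataset, evaluate, the evaluators module) are assumptions consistent with the earlier snippets; treat it as a template rather than the exact API.

```python
# End-to-end template under the same assumptions as the step-by-step sketches.
import os
import fiddler_evals as fdl_evals
from fiddler_evals import Project, Application, Dataset, evaluate
from fiddler_evals.evaluators import Faithfulness, ContextRelevance, AnswerRelevance, Toxicity

# 1. Connect
fdl_evals.init(
    url="https://your_company.fiddler.ai",
    token=os.environ["FIDDLER_ACCESS_TOKEN"],
)

# 2. Organize: Project > Application > Dataset
project = Project.get_or_create(name="rag-evaluation")
application = Application.get_or_create(name="docs-assistant", project_id=project.id)
dataset = Dataset.get_or_create(name="docs-faq-v1", application_id=application.id)
dataset.add([
    {
        "question": "How do I rotate my API key?",
        "expected_answer": "Generate a new key under Settings > Credentials, then revoke the old one.",
    },
])

# 3. Define the RAG task (replace the stubs with your retriever and generator)
def retrieve(question: str) -> list[str]:
    return ["API keys can be rotated under Settings > Credentials."]

def generate(question: str, context: list[str]) -> str:
    return f"Based on our docs: {context[0]}"

def rag_task(test_case: dict) -> dict:
    context = retrieve(test_case["question"])
    return {"output": generate(test_case["question"], context), "context": context}

# 4. Evaluate with quality, safety, and RAG-specific metrics
results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[Faithfulness(), ContextRelevance(), AnswerRelevance(), Toxicity()],
    name_prefix="rag-pipeline",
    max_workers=4,
)

# 5. Inspect scores (attribute names assumed); full results also appear in the Fiddler UI
for row in results.rows:
    print(row.scores)
```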
Best Practices
Start Small: Begin with 10-20 test cases to validate your setup
Use Multiple Evaluators: Combine quality, safety, and domain-specific evaluators
Version Your Experiments: Use name_prefix to track different experiment runs
Monitor Over Time: Run evaluations regularly to catch regressions
Custom Evaluators: Create domain-specific evaluators for specialized needs
Leverage Parallelization: Use max_workers for faster evaluation of large datasets
Organize Hierarchically: Use the Projects > Applications > Datasets structure
Next Steps
Complete Guides
Evals SDK Quick Start - Full tutorial with detailed examples
Evals SDK Reference - Complete API documentation
Concepts & Background
Evaluations Overview - Why and when to evaluate
Trust Service - Fiddler's evaluation platform
Integration Guides
Evals SDK Integration - Integration patterns and examples
LangGraph SDK - Monitor LangGraph agents
Strands Agents SDK - Monitor Strands agents
Summary
You've learned how to:
✅ Initialize the Fiddler Evals SDK with init()
✅ Create Projects, Applications, and Datasets for organization
✅ Build evaluation datasets with test cases
✅ Use 14+ built-in evaluators for quality, safety, and RAG metrics
✅ Create custom evaluators for domain-specific requirements
✅ Run evaluations with the evaluate() function
✅ Analyze results programmatically and in the Fiddler UI
The Fiddler Evals SDK provides a comprehensive framework for systematic LLM evaluation, enabling you to ensure quality, safety, and accuracy before deploying your AI applications.