Evaluations
Master LLM and AI application evaluation with comprehensive tutorials covering the Fiddler Evals SDK, custom evaluators, model comparison, and prompt-based evaluation creation.
Private Preview Notice
Fiddler Evals is currently in private preview. This means:
API interfaces may change before general availability
Some features are still under active development
We welcome your feedback to help shape the final product
Please refer to our product maturity definitions for more details.
Building reliable AI applications requires systematic evaluation to ensure quality, safety, and consistent performance. This section provides comprehensive tutorials and quick starts to help you evaluate your LLM applications, RAG systems, and AI agents using Fiddler's evaluation platform.
What You'll Learn
These tutorials cover the full spectrum of evaluation capabilities in Fiddler, from your first quick start to advanced production patterns, model comparison, and custom evaluator creation.
Recommended Learning Path
New to Fiddler Evals? Follow this progression:
Getting Started with Fiddler Evals - Understand the why and what (15 min read)
Evals SDK Quick Start - Build your first evaluation (20 min hands-on)
Advanced Patterns - Master production patterns (45 min hands-on)
API Reference - Complete SDK documentation (reference)
Already familiar with LLM evaluation? Jump to the Fiddler Evals SDK Reference for API details.
Fiddler Evals SDK Quick Start
Get hands-on with the Fiddler Evals SDK in 20 minutes. Learn to create evaluation datasets, use built-in evaluators (Answer Relevance, Coherence, Toxicity), build custom evaluators, and run comprehensive experiments with detailed analysis.
Perfect for: Developers new to the Fiddler Evals SDK who want to understand evaluation workflows quickly.
Fiddler Evals SDK Advanced Guide
Master advanced evaluation patterns for production LLM applications. Explore complex data import strategies, context-aware evaluators for RAG systems, multi-score evaluators, lambda-based parameter mapping, and production-ready experiment patterns with 11+ evaluators.
Perfect for: Teams building production evaluation pipelines with sophisticated requirements.
LLM Evaluation - Compare Outputs
Learn how to systematically compare outputs from different LLM models (like GPT-3.5 and Claude) to determine the most suitable choice for your application. This guide demonstrates side-by-side model comparison workflows using Fiddler's evaluation framework.
Perfect for: Teams evaluating multiple models or prompt variations to make data-driven decisions.
LLM Evaluation - Prompt Specs Quick Start
Create custom LLM-as-a-Judge evaluations in minutes using Prompt Specs. Learn to define evaluation schemas using JSON, validate and test your evaluations, and deploy custom evaluators to production monitoring—all without manual prompt engineering.
Perfect for: Teams needing domain-specific evaluation logic without extensive prompt tuning effort.
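To give a rough feel for what a schema-based judge definition involves, here is an illustrative sketch written as a Python dict. The field names and structure below are assumptions made for illustration only; the actual Prompt Specs format is documented in the quick start linked above.

```python
# Illustrative only: field names and structure are assumptions, not the
# actual Prompt Specs format (see the Prompt Specs Quick Start for that).
import json

prompt_spec = {
    "name": "policy_accuracy",
    "description": "Judge whether the response accurately reflects company policy for the question.",
    "inputs": ["question", "response"],
    "output": {
        "type": "integer",
        "min": 1,
        "max": 5,
        "description": "1 = clearly inaccurate, 5 = fully accurate",
    },
}

# A schema like this is validated and tested before being deployed as an evaluator.
print(json.dumps(prompt_spec, indent=2))
```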
Key Evaluation Capabilities
Comprehensive Test Suites
Create datasets with test cases covering real-world scenarios, edge cases, and expected behaviors. Import data from CSV, JSONL, or pandas DataFrames with flexible column mapping.
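The quick start covers the exact SDK calls for dataset creation; as a rough illustration of the column-mapping idea, test cases can be shaped with plain pandas before import. The file name and column names below are hypothetical.

```python
# Minimal sketch of preparing test cases for an evaluation dataset.
# File and column names are illustrative; the actual Fiddler Evals SDK
# import calls and parameter names may differ.
import pandas as pd

# Load raw test cases from CSV (JSONL works similarly via pd.read_json(..., lines=True)).
df = pd.read_csv("test_cases.csv")

# Flexible column mapping: rename whatever columns you have onto the
# field names your evaluation dataset expects.
column_mapping = {
    "user_query": "question",
    "gold_answer": "expected_answer",
}
df = df.rename(columns=column_mapping)

# Each row becomes one test case; extra columns travel along as metadata.
test_cases = df.to_dict(orient="records")
print(f"Prepared {len(test_cases)} test cases")
```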
Built-in Evaluators
Access production-ready evaluators for common evaluation tasks:
Quality: Answer Relevance, Coherence, Conciseness, Completeness
Safety: Toxicity Detection, Prompt Injection, PII Detection
Context-Aware: Faithfulness for RAG systems
Sentiment: Multi-score sentiment and topic classification
Pattern Matching: Regex-based format validation (sketched in plain Python below)
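As a concrete illustration of the simplest of these, the idea behind regex-based format validation is shown below. This is a conceptual sketch in plain Python, not the Fiddler evaluator itself.

```python
# Conceptual sketch of regex-based format validation, not the Fiddler implementation.
import re

def matches_format(output: str, pattern: str) -> float:
    """Return 1.0 if the model output matches the required format, else 0.0."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0

# Example: require an ISO-8601 date as the answer.
print(matches_format("2024-06-30", r"\d{4}-\d{2}-\d{2}"))        # 1.0
print(matches_format("June 30th, 2024", r"\d{4}-\d{2}-\d{2}"))   # 0.0
```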
Custom Evaluation Logic
Build evaluators tailored to your domain using:
Python-based evaluators with the Evaluator base class (see the sketch after this list)
Prompt Specs for schema-based LLM-as-a-Judge evaluation
Multi-score evaluators returning multiple metrics per test case
Function wrappers for existing evaluation functions
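A rough sketch of the Python-based evaluator pattern follows. The stand-in Evaluator base class and its method signature are assumptions made only to keep the example self-contained; the real base class and interface come from the Fiddler Evals SDK.

```python
# Sketch of a custom, multi-score evaluator. The stand-in base class below
# only exists to keep this example self-contained; the real Evaluator base
# class and its signature come from the Fiddler Evals SDK and may differ.
from abc import ABC, abstractmethod

class Evaluator(ABC):  # stand-in for the SDK's base class
    @abstractmethod
    def evaluate(self, output: str, expected: str | None = None) -> dict:
        ...

class KeywordCoverage(Evaluator):
    """Scores how many required keywords appear in the model output."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, output: str, expected: str | None = None) -> dict:
        text = output.lower()
        hits = sum(1 for k in self.keywords if k in text)
        # Multi-score evaluators return several metrics per test case.
        return {
            "keyword_coverage": hits / len(self.keywords),
            "keyword_hits": hits,
        }

evaluator = KeywordCoverage(["refund", "policy", "30 days"])
print(evaluator.evaluate("Refunds are accepted within 30 days per our policy."))
```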
Experiment Management
Run comprehensive evaluation experiments with the following; a conceptual sketch follows this list:
Parallel processing for faster evaluation across large datasets
Detailed results tracking with scores, timing, and error handling
Metadata tagging for experiment organization and filtering
Side-by-side comparison to validate improvements
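Conceptually, an experiment run looks like the sketch below: evaluators are applied to every test case in parallel, and scores, timing, errors, and metadata are collected per case. The SDK provides this machinery for you; the code here is plain Python with made-up test cases, purely for illustration.

```python
# Conceptual sketch of an experiment run: evaluate every test case in
# parallel and collect scores, timing, errors, and metadata per case.
# The Fiddler Evals SDK handles this for you; names here are illustrative.
import time
from concurrent.futures import ThreadPoolExecutor

test_cases = [
    {"question": "What is the refund window?", "output": "30 days.", "category": "policy"},
    {"question": "Summarize the doc.", "output": "It covers returns.", "category": "summarization"},
]

def score_conciseness(output: str) -> float:
    # Toy stand-in for a real built-in evaluator.
    return 1.0 if len(output.split()) <= 20 else 0.5

def run_one(case: dict) -> dict:
    start = time.perf_counter()
    try:
        score = score_conciseness(case["output"])
        error = None
    except Exception as exc:  # keep errors in the results instead of failing the run
        score, error = None, str(exc)
    return {
        **case,  # metadata (e.g. category) travels with the result for later filtering
        "conciseness": score,
        "latency_s": time.perf_counter() - start,
        "error": error,
    }

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_one, test_cases))

for r in results:
    print(r["category"], r["conciseness"], f"{r['latency_s']:.4f}s")
```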
Production Integration
Deploy evaluations to production monitoring:
Enrichment pipeline integration for real-time evaluation
Automated alerting based on evaluation metrics
Dashboard visualization for tracking quality trends
Historical tracking to monitor improvements over time
Enterprise Evaluation Features
Team Collaboration
Shared evaluation libraries: Reuse datasets and evaluators across teams
Access control: Project-level and application-level permissions
Experiment tracking: Compare evaluations across team members and versions
Production Integration
CI/CD pipelines: Automated evaluation before deployment
Quality gates: Set score thresholds that must be met for deployment (a minimal gate check is sketched after this list)
Regression detection: Alert when evaluation scores drop
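One way to picture a quality gate is a CI step that compares the latest experiment's aggregate scores against agreed limits and fails the build on regression. The metric names and thresholds below are examples, not Fiddler defaults.

```python
# Minimal sketch of a CI quality gate: fail the pipeline if aggregate
# evaluation scores fall below agreed thresholds. Metric names and
# thresholds are examples, not Fiddler defaults.
import sys

thresholds = {"faithfulness": 0.85, "answer_relevance": 0.80, "toxicity": 0.02}

# In practice these aggregates would come from your latest experiment run.
aggregates = {"faithfulness": 0.91, "answer_relevance": 0.78, "toxicity": 0.01}

failures = []
for metric, limit in thresholds.items():
    value = aggregates[metric]
    # Toxicity is "lower is better"; the other metrics are "higher is better".
    ok = value <= limit if metric == "toxicity" else value >= limit
    if not ok:
        failures.append(f"{metric}: {value} (threshold {limit})")

if failures:
    print("Quality gate failed:\n  " + "\n  ".join(failures))
    sys.exit(1)
print("Quality gate passed")
```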
Compliance & Auditing
Evaluation history: Complete audit trail of all experiments
Reproducibility: Frozen datasets and evaluators for regulatory compliance
Export capabilities: Download results for external analysis and reporting
Evaluation Use Cases
Single-Turn Q&A Systems
Evaluate direct question-answering applications with relevance, correctness, and conciseness metrics.
RAG Applications
Assess context-grounded responses by checking for faithfulness, relevance, and completeness.
Multi-Turn Conversations
Evaluate dialogue systems with coherence, context retention, and conversation quality metrics.
Agentic Workflows
Test tool-using agents with trajectory evaluation, tool selection accuracy, and task completion metrics.
Getting Started Paths
For SDK Users:
Start with Evals SDK Quick Start
Progress to Evals SDK Advanced Guide
Review the Fiddler Evals SDK Reference
For Custom Evaluation Needs:
Understand LLM Evaluation Prompt Specs concepts
Follow the Prompt Specs Quick Start
For Model Selection:
Set up comparison experiments with your candidate models
Use evaluation metrics to make data-driven decisions
Best Practices
Start Small, Scale Systematically
Begin with a focused test suite covering critical functionality. Gradually expand coverage as you understand your application's failure modes.
Use Multiple Evaluators
Combine different evaluator types (quality, safety, domain-specific) for a comprehensive assessment. No single metric captures all aspects of AI application quality.
Track Over Time
Establish baselines and monitor evaluation metrics as your application evolves. Systematic tracking reveals degradation before it impacts users.
Leverage Metadata
Tag test cases with categories, difficulty levels, and business context. Rich metadata enables targeted analysis and root cause investigation.
Automate Evaluation
Integrate evaluations into CI/CD pipelines and deploy to production monitoring. Continuous evaluation prevents quality regressions and maintains user trust.
❓ Questions? Talk to a product expert or request a demo.
💡 Need help? Contact us at [email protected].