What You’ll Learn
This interactive notebook demonstrates advanced evaluation patterns for production LLM applications through comprehensive testing with the TruthfulQA benchmark dataset.
Key Topics Covered:
- Advanced data import with CSV/JSONL and complex column mapping
- Real LLM integration with production-ready task functions
- Context-aware evaluators for RAG and knowledge-grounded applications
- Multi-score evaluators and advanced evaluation patterns
- Complex parameter mapping with lambda functions
- Production experiments with 11+ evaluators and complete analysis
Interactive Tutorial
The notebook guides you through building a comprehensive experiment pipeline for any LLM application, from single-turn Q&A to multi-turn conversations, RAG systems, and agentic workflows.
Open the Advanced Evaluations Notebook in Google Colab →
Or download the notebook directly from GitHub →
Prerequisites
- Fiddler account with API credentials
- Basic familiarity with the Evals SDK Quick Start
- Optional: OpenAI API key for real LLM examples (mock responses available)
Time Required
- Complete tutorial: 45-60 minutes
- Quick overview: 15-20 minutes
Tutorial Highlights
Key Takeaways from the Advanced Tutorial
Even if you prefer to run the notebook, here are the critical patterns you’ll learn:
1. Complex Data Import Strategies
CSV Import with Column Mapping
2. Context-Aware Evaluation for RAG Systems
Faithfulness Checking: Fiddler provides two faithfulness evaluators: RAGFaithfulness (LLM-as-a-Judge, part of the RAG Health Metrics triad) for comprehensive diagnostics, and FTLResponseFaithfulness (Centor Faithfulness model) for low-latency guardrails.
Use RAGFaithfulness together with the rest of the RAG Health Metrics triad (Answer Relevance and Context Relevance) for root cause diagnosis. Use FTLResponseFaithfulness for real-time guardrails where latency matters.
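To make "context-aware" concrete, here is a minimal sketch of an evaluator that takes both the response and the retrieved context. It is not RAGFaithfulness or FTLResponseFaithfulness: the function names are hypothetical, and a crude token-overlap ratio stands in for a real faithfulness model purely to show the shape of the inputs.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_groundedness(response, context):
    """Fraction of response tokens that also appear in the retrieved context.

    A crude lexical proxy (an assumption for illustration only): it shows that
    the evaluator needs the context as well as the response.
    """
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    return len(response_tokens & tokenize(context)) / len(response_tokens)


context = "The Eiffel Tower is in Paris and was completed in 1889."
grounded = overlap_groundedness("The Eiffel Tower was completed in 1889.", context)
ungrounded = overlap_groundedness("It opened in London during 1950.", context)
print(grounded, ungrounded)
```

A lexical proxy like this misses paraphrases and contradictions, which is why a dedicated faithfulness evaluator is the better choice in practice.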
3. Multi-Score Evaluators
Sentiment with Probability Scores
4. Production Experiment Patterns
Multiple Evaluators in One Experiment
5. Advanced Parameter Mapping
Complex Data Structures
Advanced Data Import
Learn how to import complex experiment datasets with:
- CSV and JSONL file support with column mapping
- Separation of inputs, extras, expected outputs, and metadata
- Source tracking for test case provenance
- Support for RAG context and conversation history
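The import pattern above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Evals SDK's import API: the mapping layout, the `rows_to_test_cases` helper, the column names, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical column mapping: which CSV columns feed each part of a test case.
COLUMN_MAPPING = {
    "inputs": ["question"],
    "extras": ["context"],
    "expected_outputs": ["best_answer"],
    "metadata": ["category"],
}


def rows_to_test_cases(csv_file, mapping, source_name):
    """Split each CSV row into inputs, extras, expected outputs, and metadata."""
    test_cases = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        case = {
            part: {column: row[column] for column in columns}
            for part, columns in mapping.items()
        }
        # Record where the row came from so each test case stays traceable.
        case["source"] = {"name": source_name, "row": row_number}
        test_cases.append(case)
    return test_cases


# Illustrative data only; not taken from the TruthfulQA files.
SAMPLE_CSV = """question,context,best_answer,category
What is 2 + 2?,Basic arithmetic facts.,4,Math
What color is a clear daytime sky?,Sunlight scatters in the atmosphere.,Blue,Science
"""

cases = rows_to_test_cases(io.StringIO(SAMPLE_CSV), COLUMN_MAPPING, "sample.csv")
print(cases[0]["inputs"])
```

Keeping the mapping as data rather than code means the same loader can serve a JSONL file or a differently named CSV by swapping the mapping only.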
Production Evaluator Suite
Build a comprehensive evaluation with:
- Context-aware evaluators: Faithfulness checking for RAG systems
- Safety evaluators: Prompt safety and faithfulness detection
- Quality evaluators: Relevance, coherence, and conciseness
- Custom evaluators: Domain-specific metrics for complete customization
- Multi-score evaluators: Sentiment and topic classification
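As a rough illustration of the custom and multi-score bullets, the sketch below shows one evaluator returning several named scores for a single text. The `Score` structure, the word lists, and the keyword-counting logic are assumptions made for this example; they do not reflect how the SDK's sentiment evaluator works internally.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """One named result from an evaluator (hypothetical structure)."""
    name: str
    value: float
    label: str = ""


POSITIVE_WORDS = {"good", "great", "helpful", "clear"}
NEGATIVE_WORDS = {"bad", "wrong", "confusing", "unhelpful"}
SENTIMENT_VALUES = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}


def sentiment_scores(text):
    """Return two scores for one text: a sentiment label and its probability."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    total = positive + negative
    if total == 0:
        label, probability = "neutral", 1.0
    elif positive >= negative:
        label, probability = "positive", positive / total
    else:
        label, probability = "negative", negative / total
    return [
        Score(name="sentiment", value=SENTIMENT_VALUES[label], label=label),
        Score(name="sentiment_probability", value=probability),
    ]


for score in sentiment_scores("The answer was clear and helpful, not confusing."):
    print(score)
```

Returning a list of named scores lets one evaluator call feed several columns of the results table, which is the point of the multi-score pattern.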
Complex Parameter Mapping
Master advanced mapping techniques:
- Lambda-based parameter transformation
- Access to inputs, extras, outputs, and metadata
- Flexible mapping for any evaluator signature
- Production-ready patterns for all LLM use cases
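The mapping idea can be sketched as follows: each evaluator keeps its own parameter names, and a dictionary of lambdas pulls the matching values out of a record. The record layout, the two toy evaluators, and the `run_evaluators` helper are hypothetical and only illustrate the technique; the SDK's actual mapping syntax is covered in the notebook.

```python
def exact_match(output, expected):
    """Evaluator with its own parameter names: 1.0 on a case-insensitive match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def context_mentions_answer(response, context):
    """Evaluator that needs retrieved context in addition to the response."""
    return 1.0 if response.strip().lower() in context.lower() else 0.0


# Each mapping value is a lambda that pulls one argument out of the record.
# The record layout (inputs, extras, outputs, expected_outputs, metadata) is assumed.
EVALUATORS = {
    "exact_match": (exact_match, {
        "output": lambda r: r["outputs"]["answer"],
        "expected": lambda r: r["expected_outputs"]["best_answer"],
    }),
    "context_mentions_answer": (context_mentions_answer, {
        "response": lambda r: r["outputs"]["answer"],
        # Join a list of retrieved passages into the single string expected.
        "context": lambda r: " ".join(r["extras"]["documents"]),
    }),
}


def run_evaluators(record, evaluators):
    """Resolve each evaluator's arguments through its lambdas, then call it."""
    results = {}
    for name, (function, mapping) in evaluators.items():
        kwargs = {parameter: getter(record) for parameter, getter in mapping.items()}
        results[name] = function(**kwargs)
    return results


record = {
    "inputs": {"question": "What is the capital of France?"},
    "extras": {"documents": ["France is in Europe.", "Its capital is Paris."]},
    "outputs": {"answer": "Paris"},
    "expected_outputs": {"best_answer": "paris"},
    "metadata": {"category": "Geography"},
}
print(run_evaluators(record, EVALUATORS))
```

Because the lambdas do the reshaping, the evaluators themselves never need to know how the dataset or the task output is structured.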
Comprehensive Analysis
Extract insights from experiment results:
- Aggregate statistics by evaluator
- Performance breakdown by category
- DataFrame export for further analysis
- A/B testing and regression detection patterns
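A minimal sketch of the aggregation and regression-detection ideas, using only the standard library. The result rows, the `mean_by` and `regressions` helpers, and the 0.05 tolerance are illustrative assumptions; the scores are made up and are not experiment output.

```python
from collections import defaultdict
from statistics import mean

# Illustrative per-test-case results; not real experiment output.
results = [
    {"evaluator": "relevance", "category": "Health", "score": 0.9},
    {"evaluator": "relevance", "category": "Law", "score": 0.7},
    {"evaluator": "conciseness", "category": "Health", "score": 0.6},
    {"evaluator": "conciseness", "category": "Law", "score": 0.8},
]


def mean_by(rows, *keys):
    """Average the score over every distinct combination of the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row["score"])
    return {group: mean(scores) for group, scores in groups.items()}


def regressions(baseline, candidate, tolerance=0.05):
    """List evaluators whose mean dropped by more than the tolerance."""
    return sorted(
        key[0]
        for key, value in baseline.items()
        if key in candidate and value - candidate[key] > tolerance
    )


by_evaluator = mean_by(results, "evaluator")
by_category = mean_by(results, "evaluator", "category")
print(by_evaluator)
```

The same rows could be loaded into a DataFrame for richer analysis; the grouping logic stays the same either way.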
Who Should Use This
- AI engineers building production LLM applications
- ML engineers implementing systematic experiment pipelines
- Data scientists analyzing LLM performance and quality
- QA engineers setting up regression testing for AI systems
Use Case Flexibility
The patterns demonstrated work for all LLM application types:
- Single-turn Q&A: Direct question-answering without context
- RAG applications: Context-grounded responses with faithfulness checking
- Multi-turn conversations: Dialogue systems with conversation history
- Agentic workflows: Tool-using agents with intermediate outputs
- Multi-task models: Systems handling diverse request types
Centor Models Integration
All evaluators in the advanced tutorial run on Fiddler Centor Models, which means:
Cost Efficiency at Scale
Running multiple evaluators on 817 test cases (TruthfulQA dataset) would typically cost:
- External LLM API: $50-100+ in API calls (about $0.01 per evaluation × roughly 9,000 evaluations)
- Fiddler Centor Models: $0 (no per-request charges)
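The external-API estimate can be checked with a few lines of arithmetic. Two inputs here are assumptions rather than figures confirmed by the tutorial: 11 evaluators (taken from the "11+ evaluators" mentioned earlier) and a price of $0.01 per evaluation, which is the rate consistent with the $50-100 range.

```python
TEST_CASES = 817             # TruthfulQA test cases, from the text above
EVALUATORS = 11              # assumption: the "11+ evaluators" figure
PRICE_PER_EVALUATION = 0.01  # assumption: USD per external LLM evaluation

evaluations = TEST_CASES * EVALUATORS
external_cost = evaluations * PRICE_PER_EVALUATION
print(evaluations, round(external_cost, 2))
```

Under these assumptions the run comes to just under 9,000 evaluations and lands inside the quoted cost range; actual pricing depends on the provider, model, and token counts.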
Performance at Scale
- Parallel execution: 10 workers process 817 items in ~5 minutes
- Fast evaluators: <100ms per evaluation enables real-time feedback
- No rate limits: No API quota concerns for extensive batch experiments
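The parallel-execution bullet can be illustrated with a standard-library thread pool. This sketch assumes a placeholder `evaluate_item` function and synthetic items; it shows the 10-worker fan-out only and does not reproduce the timing figures above or the SDK's own worker configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_item(item):
    """Placeholder for running the task and evaluators on one test case."""
    return {"id": item["id"], "score": float(len(item["question"]) > 0)}


def run_parallel(items, max_workers=10):
    """Evaluate items concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_item, items))


# Synthetic stand-ins for the 817 test cases.
items = [{"id": i, "question": f"Question {i}"} for i in range(817)]
results = run_parallel(items)
print(len(results))
```

Threads suit this workload when each evaluation is dominated by network or service calls; `pool.map` preserves input order, so results line up with the original test cases.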
Security
- Data locality: All evaluations run within your Fiddler environment
- No external calls: Your prompts and responses never leave your infrastructure
- Audit trail: Complete traceability for compliance
Next Steps
After completing the tutorial:
- Technical Reference: Fiddler Evals SDK Documentation
- Basic Tutorial: Evals SDK Quick Start for fundamentals
- Getting Started Guide: Getting Started with Fiddler Experiments for UI overview