Evals SDK Advanced Guide

What You'll Learn

This interactive notebook demonstrates advanced evaluation patterns for production LLM applications through comprehensive testing with the TruthfulQA benchmark dataset.

Key Topics Covered:

  • Advanced data import with CSV/JSONL and complex column mapping

  • Real LLM integration with production-ready task functions

  • Context-aware evaluators for RAG and knowledge-grounded applications

  • Multi-score evaluators and advanced evaluation patterns

  • Complex parameter mapping with lambda functions

  • Production experiments with 11+ evaluators and complete analysis

Interactive Tutorial

The notebook guides you through building a comprehensive evaluation pipeline for any LLM application, from single-turn Q&A to multi-turn conversations, RAG systems, and agentic workflows.

Open the Advanced Evaluations Notebook in Google Colab →

Or download the notebook directly from GitHub →

Prerequisites

  • Fiddler account with API credentials

  • Basic familiarity with the Evaluations SDK Quick Start

  • Optional: OpenAI API key for real LLM examples (mock responses available)

Time Required

  • Complete tutorial: 45-60 minutes

  • Quick overview: 15-20 minutes

Tutorial Highlights

Key Takeaways from the Advanced Tutorial

Even if you prefer to skim rather than run the notebook, here are the critical patterns you'll learn:

1. Complex Data Import Strategies

CSV Import with Column Mapping:
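
A minimal sketch of the pattern, using pandas and a plain mapping dict. The bucket names (`inputs`, `extras`, `expected`, `metadata`) follow the tutorial's vocabulary, but the mapping structure itself is illustrative, not the SDK's confirmed import API.

```python
import pandas as pd

# Production CSVs rarely match the schema evaluators expect, so route
# arbitrary source columns into the four buckets the tutorial uses.
# NOTE: this mapping shape is a hypothetical illustration of the idea.
column_mapping = {
    "inputs":   {"question": "Question"},
    "expected": {"answer": "Best Answer"},
    "extras":   {"context": "Source"},
    "metadata": {"category": "Category"},
}

df = pd.read_csv("truthfulqa.csv")

test_cases = [
    {
        bucket: {dest: row[src] for dest, src in fields.items()}
        for bucket, fields in column_mapping.items()
    }
    for _, row in df.iterrows()
]
```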

Why This Matters: Production datasets rarely have perfect column names. Column mapping lets you use any data source without reformatting files.

2. Context-Aware Evaluation for RAG Systems

Faithfulness Checking:
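
The snippet below is a toy stand-in for a faithfulness evaluator; Fiddler's Trust Model evaluators perform claim-level verification rather than the simple token overlap shown here.

```python
import re

def faithfulness(response: str, context: str) -> float:
    """Toy faithfulness score: fraction of response tokens grounded in
    the retrieved context. Real evaluators verify individual claims."""
    response_tokens = set(re.findall(r"\w+", response.lower()))
    context_tokens = set(re.findall(r"\w+", context.lower()))
    if not response_tokens:
        return 0.0
    return len(response_tokens & context_tokens) / len(response_tokens)

# Context-aware evaluators receive the retrieved documents alongside the
# model output, unlike plain Q&A evaluators that see only the answer.
score = faithfulness(
    response="Paris is the capital of France.",
    context="France's capital city is Paris, on the Seine.",
)
```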

Why This Matters: RAG systems must be evaluated differently from simple Q&A. Faithfulness evaluators catch hallucinations by verifying claims against the retrieved source documents.

3. Multi-Score Evaluators

Sentiment with Probability Scores:
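
A sketch of the multi-score shape, with toy keyword counting standing in for the Trust Model sentiment evaluator: one pass over the text returns several named scores instead of a single float.

```python
POSITIVE = {"great", "good", "helpful", "accurate"}
NEGATIVE = {"wrong", "bad", "misleading", "harmful"}

def sentiment_scores(text: str) -> dict[str, float]:
    """Return multiple related scores from a single evaluation pass."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = max(pos + neg, 1)
    return {
        "sentiment_positive_prob": pos / total,
        "sentiment_negative_prob": neg / total,
        "sentiment_is_positive": float(pos >= neg),
    }
```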

Why This Matters: Some quality dimensions have multiple facets. Multi-score evaluators capture nuanced assessments in a single pass.

4. Production Experiment Patterns

11+ Evaluators in One Experiment:
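
The loop below is a hypothetical rendering of the experiment shape (one task function, many evaluators); the real SDK call names differ, and execution is parallelized rather than sequential.

```python
def run_experiment(test_cases, task_fn, evaluators):
    """Run every registered evaluator against every task output.

    `evaluators` is a dict of name -> callable; with 11 evaluators and
    817 TruthfulQA cases this yields ~9,000 individual evaluations.
    """
    results = []
    for case in test_cases:
        output = task_fn(case)
        scores = {name: ev(output=output, case=case)
                  for name, ev in evaluators.items()}
        results.append({"case": case, "output": output, "scores": scores})
    return results
```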

Why This Matters: Production systems need comprehensive evaluation across multiple dimensions. This pattern shows how to run extensive evaluation suites efficiently.

5. Advanced Parameter Mapping

Complex Data Structures:
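
A minimal sketch of lambda-based mapping, assuming a row dict shaped like the import example above; each evaluator parameter is resolved from anywhere in the record, including nested structures.

```python
# Each parameter is produced by a lambda over the full test-case record,
# so nested inputs/extras/outputs need no reformatting up front.
param_mapping = {
    "question": lambda row: row["inputs"]["question"],
    "response": lambda row: row["output"],
    "context":  lambda row: "\n".join(
        doc["text"] for doc in row["extras"]["documents"]
    ),
}

def call_with_mapping(evaluator, row, mapping):
    """Resolve each parameter from the row, then invoke the evaluator."""
    kwargs = {param: extract(row) for param, extract in mapping.items()}
    return evaluator(**kwargs)
```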

Why This Matters: Real applications have complex data structures. Lambda-based mapping gives you the flexibility to extract any value an evaluator needs.


Advanced Data Import

Learn how to import complex evaluation datasets with:

  • CSV and JSONL file support with column mapping

  • Separation of inputs, extras, expected outputs, and metadata

  • Source tracking for test case provenance

  • Support for RAG context and conversation history

Production Evaluator Suite

Build a comprehensive evaluation with:

  • Context-aware evaluators: Faithfulness checking for RAG systems

  • Safety evaluators: Prompt injection and toxicity detection

  • Quality evaluators: Relevance, coherence, and conciseness

  • Custom evaluators: Domain-specific metrics for complete customization

  • Multi-score evaluators: Sentiment and topic classification

Complex Parameter Mapping

Master advanced mapping techniques:

  • Lambda-based parameter transformation

  • Access to inputs, extras, outputs, and metadata

  • Flexible mapping for any evaluator signature

  • Production-ready patterns for all LLM use cases

Comprehensive Analysis

Extract insights from evaluation results:

  • Aggregate statistics by evaluator

  • Performance breakdown by category

  • DataFrame export for further analysis (see the pandas sketch after this list)

  • A/B testing and regression detection patterns
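
A pandas sketch of the analysis step, assuming each result record carries `evaluator`, `category`, and `score` fields; adjust the names to whatever your export produces.

```python
import pandas as pd

# `results` is the list built by the experiment sketch above;
# assumes scalar scores (flatten multi-score dicts into rows first).
records = [
    {"category": r["case"]["metadata"]["category"],
     "evaluator": name, "score": score}
    for r in results
    for name, score in r["scores"].items()
]
df = pd.DataFrame(records)

# Aggregate statistics by evaluator.
print(df.groupby("evaluator")["score"].agg(["mean", "std", "count"]))

# Performance breakdown by test-case category.
print(df.pivot_table(index="category", columns="evaluator",
                     values="score", aggfunc="mean"))
```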

Who Should Use This

  • AI engineers building production LLM applications

  • ML engineers implementing systematic evaluation pipelines

  • Data scientists analyzing LLM performance and quality

  • QA engineers setting up regression testing for AI systems

Use Case Flexibility

The patterns demonstrated work for all LLM application types:

  • Single-turn Q&A: Direct question-answering without context

  • RAG applications: Context-grounded responses with faithfulness checking

  • Multi-turn conversations: Dialogue systems with conversation history

  • Agentic workflows: Tool-using agents with intermediate outputs

  • Multi-task models: Systems handling diverse request types

Trust Service Integration

All evaluators in the advanced tutorial run on Fiddler Trust Models, which means:

Cost Efficiency at Scale

Running 11+ evaluators on 817 test cases (TruthfulQA dataset) would typically cost:

  • External LLM API: $50-100+ in API calls (roughly $0.005-0.01 per evaluation × ~9,000 evaluations)

  • Fiddler Trust Service: $0 (no per-request charges)

Performance at Scale

  • Parallel execution: 10 workers process 817 items in ~5 minutes (sketched below)

  • Fast evaluators: <100ms per evaluation enables real-time feedback

  • No rate limits: No API quota concerns for extensive batch evaluations
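
A sketch of the worker pattern, for intuition only; the SDK manages parallelism itself. The names (`task_fn`, `evaluators`, `test_cases`) reuse the hypothetical experiment sketch above.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_case(case):
    output = task_fn(case)
    return {name: ev(output=output, case=case)
            for name, ev in evaluators.items()}

# 10 workers over 817 cases: at <100 ms per evaluation, the whole
# TruthfulQA run finishes in minutes rather than hours.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(evaluate_case, test_cases))
```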

Security

  • Data locality: All evaluations run within your Fiddler environment

  • No external calls: Your prompts and responses never leave your infrastructure

  • Audit trail: Complete traceability for compliance

This makes Fiddler Evals ideal for enterprise-scale evaluation pipelines.

Next Steps

After completing the tutorial: