Evals SDK Advanced Guide

What You'll Learn

This interactive notebook demonstrates advanced evaluation patterns for production LLM applications through comprehensive testing with the TruthfulQA benchmark dataset.

Key Topics Covered:

  • Advanced data import with CSV/JSONL and complex column mapping

  • Real LLM integration with production-ready task functions

  • Context-aware evaluators for RAG and knowledge-grounded applications

  • Multi-score evaluators and advanced evaluation patterns

  • Complex parameter mapping with lambda functions

  • Production experiments with 11+ evaluators and complete analysis

Interactive Tutorial

The notebook guides you through building a comprehensive evaluation pipeline for any LLM application, from single-turn Q&A to multi-turn conversations, RAG systems, and agentic workflows.

Open the Advanced Evaluations Notebook in Google Colab →

Or download the notebook directly from GitHub →

Prerequisites

  • Fiddler account with API credentials

  • Basic familiarity with the Evaluations SDK Quick Start

  • Optional: OpenAI API key for real LLM examples (mock responses available)

Time Required

  • Complete tutorial: 45-60 minutes

  • Quick overview: 15-20 minutes

Tutorial Highlights

Key Takeaways from the Advanced Tutorial

Even if you plan to run the notebook end to end, here is a preview of the critical patterns you'll learn:

1. Complex Data Import Strategies

CSV Import with Column Mapping:

dataset.insert_from_csv_file(
    file_path='truthfulqa.csv',
    input_columns=['question', 'category'],
    expected_output_columns=['best_answer'],
    extras_columns=['context'],  # Separate context for RAG evaluation
    metadata_columns=['type', 'difficulty']
)

Why This Matters: Production datasets rarely have perfect column names. Column mapping lets you use any data source without reformatting files.
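
If your benchmark ships as JSONL instead of CSV, the same column-mapping idea applies. The sketch below assumes a parallel insert_from_jsonl_file method with the same parameters; that method name is an assumption for illustration, so check the SDK reference for the exact import call.

# Hypothetical JSONL variant: assumes a method mirroring insert_from_csv_file
dataset.insert_from_jsonl_file(
    file_path='truthfulqa.jsonl',
    input_columns=['question', 'category'],
    expected_output_columns=['best_answer'],
    extras_columns=['context'],
    metadata_columns=['type', 'difficulty']
)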

2. Context-Aware Evaluation for RAG Systems

Faithfulness Checking:

from fiddler_evals.evaluators import FTLResponseFaithfulness

# Evaluator that checks if response is grounded in provided context
faithfulness = FTLResponseFaithfulness()

# The evaluator needs both the model's response and the context it should be
# grounded in; this mapping tells it where to find each (the notebook shows the full call)
score_fn_kwargs_mapping={
    "response": "answer",
    "context": lambda x: x["extras"]["context"],  # Extract from extras
}

Why This Matters: RAG systems must be evaluated differently than simple Q&A. Faithfulness evaluators prevent hallucination by verifying claims against source documents.
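
To make the mapping concrete, here is a minimal sketch (with made-up values) of the test-case shape those entries operate on, and how the lambda pulls the retrieved passage out of extras:

# Illustrative test-case shape; field names follow the mapping above, values are invented
item = {
    "inputs": {"question": "Who wrote The Old Man and the Sea?"},
    "outputs": {"answer": "Ernest Hemingway wrote The Old Man and the Sea."},
    "extras": {"context": "The Old Man and the Sea is a 1952 novella by Ernest Hemingway."},
}

context_fn = lambda x: x["extras"]["context"]  # same lambda as in the mapping above
print(context_fn(item))  # -> the retrieved passage the response must stay faithful to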

3. Multi-Score Evaluators

Sentiment with Probability Scores:

from fiddler_evals.evaluators import Sentiment

# Returns multiple scores: sentiment (categorical) and probability (float)
sentiment = Sentiment()

# Results include:
# - score.value: "positive" | "neutral" | "negative"
# - score.metadata["probability"]: confidence score 0.0-1.0

Why This Matters: Some quality dimensions have multiple facets. Multi-score evaluators capture nuanced assessments in a single pass.
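
As a sketch of how you might act on both facets downstream, assuming a score object that exposes .value and .metadata["probability"] as described in the comments above (the exact result shape may differ by SDK version):

# Flag responses whose sentiment is confidently negative for manual review
def needs_review(score, threshold=0.8):
    return score.value == "negative" and score.metadata["probability"] >= threshold

# sentiment_scores is a hypothetical list of Sentiment results from an experiment run
flagged = [s for s in sentiment_scores if needs_review(s)]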

4. Production Experiment Patterns

11+ Evaluators in One Experiment:

evaluators = [
    # Quality metrics
    AnswerRelevance(),
    Coherence(),
    Conciseness(),
    Completeness(),

    # Safety metrics
    Toxicity(),
    FTLPromptSafety(),

    # Context-aware metrics (for RAG)
    FTLResponseFaithfulness(),
    ContextRelevance(),

    # Domain-specific
    Sentiment(),
    RegexSearch(pattern=r'\b[A-Z][a-z]+\s[A-Z][a-z]+\b'),  # Proper nouns
    CustomDomainEvaluator(),
]

results = evaluate(
    dataset=large_dataset,
    task=production_task,
    evaluators=evaluators,
    max_workers=10,  # Parallel processing
    metadata={"version": "v2.1", "environment": "staging"}
)

Why This Matters: Production systems need comprehensive evaluation across multiple dimensions. This pattern shows how to run extensive evaluation suites efficiently.
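
The production_task passed to evaluate() is your application under test. A minimal sketch, assuming the SDK calls the task with each test case's inputs dict and expects an outputs dict back, with a mock fallback when no OpenAI key is configured (check the SDK reference for the exact task contract):

import os
from openai import OpenAI

def production_task(inputs: dict) -> dict:
    """Answer one question; fall back to a canned response without an API key."""
    # Assumed contract: the SDK supplies the test case's inputs and stores the returned dict as outputs.
    question = inputs["question"]
    if not os.environ.get("OPENAI_API_KEY"):
        return {"answer": f"[mock] A concise, factual answer to: {question}"}

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer truthfully and concisely."},
            {"role": "user", "content": question},
        ],
    )
    return {"answer": response.choices[0].message.content}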

5. Advanced Parameter Mapping

Complex Data Structures:

score_fn_kwargs_mapping={
    # Simple mappings
    "response": "answer",

    # Extract from nested dicts
    "prompt": lambda x: x["inputs"]["question"],
    "context": lambda x: x["extras"]["retrieved_docs"],

    # Compute values
    "output_length": lambda x: len(x["outputs"]["answer"]),

    # Combine values
    "full_conversation": lambda x: x["inputs"]["history"] + [x["outputs"]["answer"]],
}

Why This Matters: Real applications have complex data structures. Lambda-based mapping gives you the flexibility to extract any value an evaluator needs.
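
To see what the mapping buys you, the sketch below resolves one by hand against a sample row: plain strings are treated as output-field lookups and callables are invoked with the whole row. This is an illustration of the pattern only, not the SDK's actual resolution logic.

# Illustrative only: resolve a kwargs mapping against one made-up row
row = {
    "inputs": {"question": "What is the capital of France?", "history": ["Hi!", "Hello!"]},
    "outputs": {"answer": "Paris"},
    "extras": {"retrieved_docs": ["Paris is the capital of France."]},
}

def resolve(mapping, row):
    # Assumption for this sketch: bare strings name a field in the row's outputs,
    # while callables receive the entire row and can compute anything.
    return {
        arg: (spec(row) if callable(spec) else row["outputs"][spec])
        for arg, spec in mapping.items()
    }

kwargs = resolve(
    {
        "response": "answer",
        "context": lambda x: x["extras"]["retrieved_docs"],
        "output_length": lambda x: len(x["outputs"]["answer"]),
    },
    row,
)
# kwargs == {"response": "Paris", "context": ["Paris is ..."], "output_length": 5}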


Advanced Data Import

Learn how to import complex evaluation datasets with:

  • CSV and JSONL file support with column mapping

  • Separation of inputs, extras, expected outputs, and metadata (see the sketch after this list)

  • Source tracking for test case provenance

  • Support for RAG context and conversation history
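
Conceptually, each imported row becomes a test case that keeps those pieces separate. A rough sketch of that layout (field names are illustrative, not the SDK's exact schema):

# Illustrative layout of one imported test case (not the SDK's exact schema)
test_case = {
    "inputs": {"question": "...", "category": "..."},    # what the task receives
    "expected_output": {"best_answer": "..."},           # ground truth for comparison
    "extras": {"context": "..."},                        # RAG context, conversation history, etc.
    "metadata": {"type": "...", "difficulty": "..."},    # fields for filtering and slicing results
    "source": "truthfulqa.csv",                          # provenance of the test case
}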

Production Evaluator Suite

Build a comprehensive evaluation with:

  • Context-aware evaluators: Faithfulness checking for RAG systems

  • Safety evaluators: Prompt injection and toxicity detection

  • Quality evaluators: Relevance, coherence, and conciseness

  • Custom evaluators: Domain-specific metrics for complete customization (sketched below)

  • Multi-score evaluators: Sentiment and topic classification
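
The CustomDomainEvaluator used in the experiment above stands in for whatever metric matters in your domain. A shape-only sketch of such a metric as plain Python (wiring it into the SDK's custom-evaluator interface is covered in the reference docs):

# Shape-only sketch of a domain metric: required-term coverage for, say, a medical assistant
REQUIRED_TERMS = {"dosage", "interaction", "contraindication"}

def domain_coverage_score(response: str) -> float:
    """Fraction of required domain terms mentioned in the response (0.0-1.0)."""
    text = response.lower()
    hits = sum(1 for term in REQUIRED_TERMS if term in text)
    return hits / len(REQUIRED_TERMS)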

Complex Parameter Mapping

Master advanced mapping techniques:

  • Lambda-based parameter transformation

  • Access to inputs, extras, outputs, and metadata

  • Flexible mapping for any evaluator signature

  • Production-ready patterns for all LLM use cases

Comprehensive Analysis

Extract insights from evaluation results:

  • Aggregate statistics by evaluator

  • Performance breakdown by category

  • DataFrame export for further analysis (example below)

  • A/B testing and regression detection patterns
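
A sketch of the aggregation the notebook walks through, assuming results have been exported to a pandas DataFrame with evaluator, category, and score columns (the column names here are placeholders; use the ones your results object provides):

import pandas as pd

# Placeholder shape: one row per (test case, evaluator) with a numeric score
# and the category metadata carried over from the dataset.
df = pd.DataFrame({
    "evaluator": ["AnswerRelevance", "AnswerRelevance", "Toxicity", "Toxicity"],
    "category":  ["History", "Science", "History", "Science"],
    "score":     [0.91, 0.78, 0.02, 0.05],
})

# Aggregate statistics by evaluator
print(df.groupby("evaluator")["score"].agg(["mean", "std", "count"]))

# Performance breakdown by category for a single evaluator
relevance = df[df["evaluator"] == "AnswerRelevance"]
print(relevance.groupby("category")["score"].mean().sort_values())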

Who Should Use This

  • AI engineers building production LLM applications

  • ML engineers implementing systematic evaluation pipelines

  • Data scientists analyzing LLM performance and quality

  • QA engineers setting up regression testing for AI systems

Use Case Flexibility

The patterns demonstrated work for all LLM application types:

  • Single-turn Q&A: Direct question-answering without context

  • RAG applications: Context-grounded responses with faithfulness checking

  • Multi-turn conversations: Dialogue systems with conversation history

  • Agentic workflows: Tool-using agents with intermediate outputs

  • Multi-task models: Systems handling diverse request types

Trust Service Integration

All evaluators in the advanced tutorial run on Fiddler Trust Models through the Fiddler Trust Service, which means:

Cost Efficiency at Scale

Running 11+ evaluators on 817 test cases (TruthfulQA dataset) would typically cost:

  • External LLM API: $50-100+ in API calls (roughly $0.01 per evaluation × ~9,000 evaluations, worked out below)

  • Fiddler Trust Service: $0 (no per-request charges)
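
The arithmetic behind that estimate, with the per-call price as an assumption for illustration:

evaluators = 11
test_cases = 817
cost_per_eval = 0.01  # assumed ~$0.01 per external LLM call, for illustration only

total_evals = evaluators * test_cases        # 8,987 evaluations
external_cost = total_evals * cost_per_eval  # ~$90
print(f"{total_evals:,} evaluations -> ~${external_cost:.0f} via external APIs vs. $0 on the Trust Service")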

Performance at Scale

  • Parallel execution: 10 workers process 817 items in ~5 minutes

  • Fast evaluators: <100ms per evaluation enables real-time feedback

  • No rate limits: No API quota concerns for extensive batch evaluations

Security

  • Data locality: All evaluations run within your Fiddler environment

  • No external calls: Your prompts and responses never leave your infrastructure

  • Audit trail: Complete traceability for compliance

This makes Fiddler Evals ideal for enterprise-scale evaluation pipelines.

Next Steps

After completing the tutorial, apply these patterns to your own datasets and applications.

