Evals SDK Advanced Guide

What You'll Learn

This interactive notebook demonstrates advanced evaluation patterns for production LLM applications through comprehensive testing with the TruthfulQA benchmark dataset.

Key Topics Covered:

  • Advanced data import with CSV/JSONL and complex column mapping

  • Real LLM integration with production-ready task functions

  • Context-aware evaluators for RAG and knowledge-grounded applications

  • Multi-score evaluators and advanced evaluation patterns

  • Complex parameter mapping with lambda functions

  • Production experiments with 11+ evaluators and complete analysis

Interactive Tutorial

The notebook guides you through building a comprehensive evaluation pipeline for any LLM application, from single-turn Q&A to multi-turn conversations, RAG systems, and agentic workflows.

Open the Advanced Evaluations Notebook in Google Colab →

Or download the notebook directly from GitHub →

Prerequisites

  • Fiddler account with API credentials

  • Basic familiarity with the Evaluations SDK Quick Start

  • Optional: OpenAI API key for real LLM examples (mock responses available)

Time Required

  • Complete tutorial: 45-60 minutes

  • Quick overview: 15-20 minutes

Tutorial Highlights

Key Takeaways from the Advanced Tutorial

Even if you plan to run the notebook end to end, here is a preview of the critical patterns you'll learn:

1. Complex Data Import Strategies

CSV Import with Column Mapping:

dataset.insert_from_csv_file(
    file_path='truthfulqa.csv',
    input_columns=['question', 'category'],
    expected_output_columns=['best_answer'],
    extras_columns=['context'],  # Separate context for RAG evaluation
    metadata_columns=['type', 'difficulty']
)

Why This Matters: Production datasets rarely have perfect column names. Column mapping lets you use any data source without reformatting files.
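
If your benchmark ships as JSONL instead of CSV, the same column-mapping idea applies. The sketch below assumes a parallel insert_from_jsonl_file method with the same parameters; that method name is an assumption for illustration, so check the SDK reference for the exact import call.

# Hypothetical JSONL variant: assumes a method mirroring insert_from_csv_file
dataset.insert_from_jsonl_file(
    file_path='truthfulqa.jsonl',
    input_columns=['question', 'category'],
    expected_output_columns=['best_answer'],
    extras_columns=['context'],
    metadata_columns=['type', 'difficulty']
)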

2. Context-Aware Evaluation for RAG Systems

Faithfulness Checking:

from fiddler_evals.evaluators import FTLResponseFaithfulness

# Evaluator that checks if response is grounded in provided context
faithfulness = FTLResponseFaithfulness()

# The evaluator needs both the model's response and the context it should be
# grounded in; this mapping tells it where to find each (the notebook shows the full call)
score_fn_kwargs_mapping={
    "response": "answer",
    "context": lambda x: x["extras"]["context"],  # Extract from extras
}

Why This Matters: RAG systems must be evaluated differently than simple Q&A. Faithfulness evaluators prevent hallucination by verifying claims against source documents.
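
To make the mapping concrete, here is a minimal sketch (with made-up values) of the test-case shape those entries operate on, and how the lambda pulls the retrieved passage out of extras:

# Illustrative test-case shape; field names follow the mapping above, values are invented
item = {
    "inputs": {"question": "Who wrote The Old Man and the Sea?"},
    "outputs": {"answer": "Ernest Hemingway wrote The Old Man and the Sea."},
    "extras": {"context": "The Old Man and the Sea is a 1952 novella by Ernest Hemingway."},
}

context_fn = lambda x: x["extras"]["context"]  # same lambda as in the mapping above
print(context_fn(item))  # -> the retrieved passage the response must stay faithful to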

3. Multi-Score Evaluators

Sentiment with Probability Scores:

from fiddler_evals.evaluators import Sentiment

# Returns multiple scores: sentiment (categorical) and probability (float)
sentiment = Sentiment()

# Results include:
# - score.value: "positive" | "neutral" | "negative"
# - score.metadata["probability"]: confidence score 0.0-1.0

Why This Matters: Some quality dimensions have multiple facets. Multi-score evaluators capture nuanced assessments in a single pass.
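
As a sketch of how you might act on both facets downstream, assuming a score object that exposes .value and .metadata["probability"] as described in the comments above (the exact result shape may differ by SDK version):

# Flag responses whose sentiment is confidently negative for manual review
def needs_review(score, threshold=0.8):
    return score.value == "negative" and score.metadata["probability"] >= threshold

# sentiment_scores is a hypothetical list of Sentiment results from an experiment run
flagged = [s for s in sentiment_scores if needs_review(s)]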

4. Production Experiment Patterns

11+ Evaluators in One Experiment:

evaluators = [
    # Quality metrics
    AnswerRelevance(),
    Coherence(),
    Conciseness(),
    Completeness(),

    # Safety metrics
    Toxicity(),
    FTLPromptSafety(),

    # Context-aware metrics (for RAG)
    FTLResponseFaithfulness(),
    ContextRelevance(),

    # Domain-specific
    Sentiment(),
    RegexSearch(pattern=r'\b[A-Z][a-z]+\s[A-Z][a-z]+\b'),  # Proper nouns
    CustomDomainEvaluator(),
]

results = evaluate(
    dataset=large_dataset,
    task=production_task,
    evaluators=evaluators,
    max_workers=10,  # Parallel processing
    metadata={"version": "v2.1", "environment": "staging"}
)

Why This Matters: Production systems need comprehensive evaluation across multiple dimensions. This pattern shows how to run extensive evaluation suites efficiently.
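
The production_task passed to evaluate() is your application under test. A minimal sketch, assuming the SDK calls the task with each test case's inputs dict and expects an outputs dict back, with a mock fallback when no OpenAI key is configured (check the SDK reference for the exact task contract):

import os
from openai import OpenAI

def production_task(inputs: dict) -> dict:
    """Answer one question; fall back to a canned response without an API key."""
    # Assumed contract: the SDK supplies the test case's inputs and stores the returned dict as outputs.
    question = inputs["question"]
    if not os.environ.get("OPENAI_API_KEY"):
        return {"answer": f"[mock] A concise, factual answer to: {question}"}

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer truthfully and concisely."},
            {"role": "user", "content": question},
        ],
    )
    return {"answer": response.choices[0].message.content}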

5. Advanced Parameter Mapping

Complex Data Structures:

score_fn_kwargs_mapping={
    # Simple mappings
    "response": "answer",

    # Extract from nested dicts
    "prompt": lambda x: x["inputs"]["question"],
    "context": lambda x: x["extras"]["retrieved_docs"],

    # Compute values
    "output_length": lambda x: len(x["outputs"]["answer"]),

    # Combine values
    "full_conversation": lambda x: x["inputs"]["history"] + [x["outputs"]["answer"]],
}

Why This Matters: Real applications have complex data structures. Lambda-based mapping gives you the flexibility to extract any value an evaluator needs.
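
To see what the mapping buys you, the sketch below resolves one by hand against a sample row: plain strings are treated as output-field lookups and callables are invoked with the whole row. This is an illustration of the pattern only, not the SDK's actual resolution logic.

# Illustrative only: resolve a kwargs mapping against one made-up row
row = {
    "inputs": {"question": "What is the capital of France?", "history": ["Hi!", "Hello!"]},
    "outputs": {"answer": "Paris"},
    "extras": {"retrieved_docs": ["Paris is the capital of France."]},
}

def resolve(mapping, row):
    # Assumption for this sketch: bare strings name a field in the row's outputs,
    # while callables receive the entire row and can compute anything.
    return {
        arg: (spec(row) if callable(spec) else row["outputs"][spec])
        for arg, spec in mapping.items()
    }

kwargs = resolve(
    {
        "response": "answer",
        "context": lambda x: x["extras"]["retrieved_docs"],
        "output_length": lambda x: len(x["outputs"]["answer"]),
    },
    row,
)
# kwargs == {"response": "Paris", "context": ["Paris is ..."], "output_length": 5}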


Advanced Data Import

Learn how to import complex evaluation datasets with:

  • CSV and JSONL file support with column mapping

  • Separation of inputs, extras, expected outputs, and metadata (see the sketch after this list)

  • Source tracking for test case provenance

  • Support for RAG context and conversation history
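
Conceptually, each imported row becomes a test case that keeps those pieces separate. A rough sketch of that layout (field names are illustrative, not the SDK's exact schema):

# Illustrative layout of one imported test case (not the SDK's exact schema)
test_case = {
    "inputs": {"question": "...", "category": "..."},    # what the task receives
    "expected_output": {"best_answer": "..."},           # ground truth for comparison
    "extras": {"context": "..."},                        # RAG context, conversation history, etc.
    "metadata": {"type": "...", "difficulty": "..."},    # fields for filtering and slicing results
    "source": "truthfulqa.csv",                          # provenance of the test case
}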

Production Evaluator Suite

Build a comprehensive evaluation with:

  • Context-aware evaluators: Faithfulness checking for RAG systems

  • Safety evaluators: Prompt injection and toxicity detection

  • Quality evaluators: Relevance, coherence, and conciseness

  • Custom evaluators: Domain-specific metrics for complete customization (sketched below)

  • Multi-score evaluators: Sentiment and topic classification
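
The CustomDomainEvaluator used in the experiment above stands in for whatever metric matters in your domain. A shape-only sketch of such a metric as plain Python (wiring it into the SDK's custom-evaluator interface is covered in the reference docs):

# Shape-only sketch of a domain metric: required-term coverage for, say, a medical assistant
REQUIRED_TERMS = {"dosage", "interaction", "contraindication"}

def domain_coverage_score(response: str) -> float:
    """Fraction of required domain terms mentioned in the response (0.0-1.0)."""
    text = response.lower()
    hits = sum(1 for term in REQUIRED_TERMS if term in text)
    return hits / len(REQUIRED_TERMS)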

Complex Parameter Mapping

Master advanced mapping techniques:

  • Lambda-based parameter transformation

  • Access to inputs, extras, outputs, and metadata

  • Flexible mapping for any evaluator signature

  • Production-ready patterns for all LLM use cases

Comprehensive Analysis

Extract insights from evaluation results:

  • Aggregate statistics by evaluator

  • Performance breakdown by category

  • DataFrame export for further analysis (example below)

  • A/B testing and regression detection patterns
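
A sketch of the aggregation the notebook walks through, assuming results have been exported to a pandas DataFrame with evaluator, category, and score columns (the column names here are placeholders; use the ones your results object provides):

import pandas as pd

# Placeholder shape: one row per (test case, evaluator) with a numeric score
# and the category metadata carried over from the dataset.
df = pd.DataFrame({
    "evaluator": ["AnswerRelevance", "AnswerRelevance", "Toxicity", "Toxicity"],
    "category":  ["History", "Science", "History", "Science"],
    "score":     [0.91, 0.78, 0.02, 0.05],
})

# Aggregate statistics by evaluator
print(df.groupby("evaluator")["score"].agg(["mean", "std", "count"]))

# Performance breakdown by category for a single evaluator
relevance = df[df["evaluator"] == "AnswerRelevance"]
print(relevance.groupby("category")["score"].mean().sort_values())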

Who Should Use This

  • AI engineers building production LLM applications

  • ML engineers implementing systematic evaluation pipelines

  • Data scientists analyzing LLM performance and quality

  • QA engineers setting up regression testing for AI systems

Use Case Flexibility

The patterns demonstrated work for all LLM application types:

  • Single-turn Q&A: Direct question-answering without context

  • RAG applications: Context-grounded responses with faithfulness checking

  • Multi-turn conversations: Dialogue systems with conversation history

  • Agentic workflows: Tool-using agents with intermediate outputs

  • Multi-task models: Systems handling diverse request types

Trust Service Integration

All evaluators in the advanced tutorial run on Fiddler Trust Models through the Fiddler Trust Service, which means:

Cost Efficiency at Scale

Running 11+ evaluators on 817 test cases (TruthfulQA dataset) would typically cost:

  • External LLM API: $50-100+ in API calls (roughly $0.01 per evaluation × ~9,000 evaluations, worked out below)

  • Fiddler Trust Service: $0 (no per-request charges)
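
The arithmetic behind that estimate, with the per-call price as an assumption for illustration:

evaluators = 11
test_cases = 817
cost_per_eval = 0.01  # assumed ~$0.01 per external LLM call, for illustration only

total_evals = evaluators * test_cases        # 8,987 evaluations
external_cost = total_evals * cost_per_eval  # ~$90
print(f"{total_evals:,} evaluations -> ~${external_cost:.0f} via external APIs vs. $0 on the Trust Service")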

Performance at Scale

  • Parallel execution: 10 workers process 817 items in ~5 minutes

  • Fast evaluators: <100ms per evaluation enables real-time feedback

  • No rate limits: No API quota concerns for extensive batch evaluations

Security

  • Data locality: All evaluations run within your Fiddler environment

  • No external calls: Your prompts and responses never leave your infrastructure

  • Audit trail: Complete traceability for compliance

This makes Fiddler Evals ideal for enterprise-scale evaluation pipelines.

Next Steps

After completing the tutorial, apply these patterns to your own datasets and applications.

