Fiddler Evals SDK Quick Start

What You'll Learn

In this guide, you'll learn how to:

  • Connect to Fiddler and set up your experiment environment

  • Create projects, applications, and datasets for organizing experiments

  • Build experiment datasets with test cases

  • Use built-in evaluators for common AI evaluation tasks

  • Create custom evaluators for domain-specific requirements

  • Run comprehensive experiments

  • Analyze results with detailed metrics and insights

Time to complete: ~20 minutes

Prerequisites

Before you begin, ensure you have:

  • Fiddler Account: An active account with access to create applications

  • Python 3.10+

  • Fiddler Evals SDK:

    • pip install fiddler-evals

  • Fiddler Access Token: Get your access token from Settings > Credentials in your Fiddler instance

Note: If you prefer using a notebook, download a fully functional quick start directly from GitHub or open it in Google Colab to get started.

Step 1: Connect to Fiddler

First, establish a connection to your Fiddler instance using the Evals SDK.

Connection Setup:
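
The snippet below is a minimal connection sketch. The init-style entry point, the import alias, and the environment-variable names are assumptions for illustration; substitute the connection call, URL, and token handling used by your SDK version.

```python
import os

import fiddler_evals as fdl_evals  # module name per the SDK package

# Assumption: an init-style helper accepts the instance URL and access token.
# The URL is a placeholder; the token comes from Settings > Credentials and is
# read here from an environment variable rather than hard-coded.
fdl_evals.init(
    url=os.environ["FIDDLER_URL"],             # e.g. "https://your-org.fiddler.ai"
    token=os.environ["FIDDLER_ACCESS_TOKEN"],  # access token from Settings > Credentials
)
```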

Step 2: Create Project and Application

Fiddler Experiments uses a hierarchical structure to organize your experiments:

  • Projects provide organizational boundaries for related applications

  • Applications represent specific AI systems you want to evaluate

  • Datasets contain test cases for experiments

  • Experiments track individual evaluation runs

Create your organizational structure:
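
A sketch of the setup is shown below. The Project and Application classes and their get-or-create helpers are assumptions used for illustration; adjust the names to match your SDK version.

```python
from fiddler_evals import Project, Application  # hypothetical imports; adjust to your SDK version

# Assumption: get-or-create style helpers let the script be re-run without
# creating duplicate projects or applications.
project = Project.get_or_create(name="quickstart-experiments")
application = Application.get_or_create(name="support-chatbot", project_id=project.id)

# Persistent IDs let you attach later experiment runs to the same application.
print(f"Project ID: {project.id}")
print(f"Application ID: {application.id}")
```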

What This Creates:

  • A project to organize all your experiment work

  • An application representing your AI system under test

  • Persistent IDs for tracking results over time

Step 3: Build Your Experiment Dataset

Datasets contain the test cases you'll use to evaluate your AI applications. Each test case includes:

  • Inputs: Data passed to your AI application (questions, prompts, etc.)

  • Expected Outputs: What you expect the application to return

  • Metadata: Additional context (categories, types, tags)

Create a dataset and add test cases:
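
Here is a sketch of building the dataset. The Dataset class, its get-or-create helper, and the insert method are assumptions; the test-case fields (inputs, expected output, metadata) follow the structure described above.

```python
from fiddler_evals import Dataset  # hypothetical import; adjust to your SDK version

# Assumption: a dataset is created under an application and accepts test cases
# as dictionaries with inputs, an expected output, and metadata.
dataset = Dataset.get_or_create(name="faq-test-cases", application_id=application.id)

test_cases = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expected_output": "Use the 'Forgot password' link on the sign-in page.",
        "metadata": {"category": "account", "type": "faq"},
    },
    {
        "inputs": {"question": "What payment methods do you accept?"},
        "expected_output": "We accept credit cards and bank transfers.",
        "metadata": {"category": "billing", "type": "faq"},
    },
]

dataset.insert(test_cases)  # assumed bulk-insert method
```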

Data Import Options:
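
If your test cases already live in a file, you can load them with pandas and insert them in bulk. The CSV path and column names below are placeholders, and the insert call reuses the assumed dataset API from the previous sketch.

```python
import pandas as pd

# Load existing test cases from a CSV file (path and column names are placeholders).
df = pd.read_csv("test_cases.csv")

imported_cases = [
    {
        "inputs": {"question": row["question"]},
        "expected_output": row["expected_answer"],
        "metadata": {"category": row["category"]},
    }
    for _, row in df.iterrows()
]

dataset.insert(imported_cases)  # same assumed bulk-insert method as above
```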

Step 4: Use Built-in Evaluators

Fiddler Experiments provides production-ready evaluators for common AI evaluation tasks. Let's test some key evaluators:
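
The sketch below initializes a few evaluators from the table that follows. The LLM-as-a-Judge evaluators take model and credential parameters at initialization, as noted below the table; the import path and the RegexSearch constructor argument are assumptions.

```python
# Hypothetical import path; adjust to your SDK version.
from fiddler_evals.evaluators import AnswerRelevance, RAGFaithfulness, RegexSearch

# LLM-as-a-Judge evaluators take a model and credential at initialization.
answer_relevance = AnswerRelevance(model="openai/gpt-4o", credential="your-credential")
faithfulness = RAGFaithfulness(model="openai/gpt-4o", credential="your-credential")

# Deterministic evaluators need no LLM. Passing the pattern at initialization
# is an assumption; it checks the output against a regular expression.
email_format = RegexSearch(pattern=r"[\w.+-]+@[\w-]+\.[\w.]+")
```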

Available Built-in Evaluators:

| Evaluator | Purpose | Key Parameters |
| --- | --- | --- |
| AnswerRelevance | Checks if the response addresses the question (ordinal: High/Medium/Low) | user_query, rag_response |
| ContextRelevance | Measures retrieval quality (ordinal: High/Medium/Low) | user_query, retrieved_documents |
| RAGFaithfulness | Detects hallucinations in RAG responses (binary: Yes/No) | user_query, rag_response, retrieved_documents |
| Coherence | Evaluates logical flow and consistency | response, prompt |
| Conciseness | Measures response brevity and clarity | response |
| Sentiment | Analyzes emotional tone | text |
| RegexSearch | Pattern matching for specific formats | output, pattern |
| FTLPromptSafety | Computes safety scores for prompts | text |
| FTLResponseFaithfulness | Evaluates faithfulness of LLM responses (FTL model) | response, context |

Note: RAG Health Metrics: AnswerRelevance, ContextRelevance, and RAGFaithfulness form the RAG diagnostic triad. All LLM-as-a-Judge evaluators require model and credential parameters at initialization (e.g., AnswerRelevance(model="openai/gpt-4o", credential="your-credential")). See the RAG Health Metrics Tutorial for a complete walkthrough.

Step 5: Create Custom Evaluators

Build custom evaluation logic for your specific use cases by inheriting from the Evaluator base class:
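
Below is a sketch of a class-based custom evaluator. The Evaluator base class is the one named above; the import path, the evaluate method name, and the dictionary return shape are assumptions for illustration.

```python
from fiddler_evals.evaluators import Evaluator  # hypothetical import path


class KeywordCoverage(Evaluator):
    """Scores how many required keywords appear in the response (illustrative example)."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    # Assumption: custom evaluators implement an evaluate-style method that
    # receives named inputs and returns a score payload.
    def evaluate(self, response: str) -> dict:
        hits = sum(1 for keyword in self.keywords if keyword in response.lower())
        return {"score": hits / len(self.keywords), "hits": hits}
```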

Function-Based Evaluators:

You can also use simple functions:
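
For example, a plain function that compares the response to the expected output:

```python
def exact_match(response: str, expected: str) -> float:
    """Return 1.0 when the response matches the expected output exactly (case-insensitive)."""
    return float(response.strip().lower() == expected.strip().lower())
```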

Step 6: Run Experiments

Now run a comprehensive experiment. The evaluate() function:

  1. Runs your AI application task on each dataset item

  2. Executes all evaluators on the results

  3. Tracks the experiment in Fiddler

  4. Returns comprehensive results with scores and timing

Define your experiment task:
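
The task is the function that runs your AI application on a single dataset item and returns a dictionary. The signature below (receiving the item's inputs as a dict) is an assumption, and the placeholder answer stands in for your real model or RAG pipeline call.

```python
def answer_question(inputs: dict) -> dict:
    """Experiment task: call the AI system under test for one dataset item."""
    question = inputs["question"]

    # Replace this placeholder with your real application call (RAG pipeline, LLM API, etc.).
    answer = f"Placeholder answer to: {question}"

    # The task must return a dictionary; its keys are referenced by score_fn_kwargs_mapping.
    return {"answer": answer, "question": question}
```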

Score Function Mapping:

The score_fn_kwargs_mapping parameter connects your task outputs to evaluator inputs. This is necessary because evaluators expect specific parameter names (like response, prompt, text) but your task may use different names (like answer, question).

Simple String Mapping (use this for most cases):
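
With string mapping, each evaluator parameter name points at a key in your task's output dictionary:

```python
score_fn_kwargs_mapping = {
    "rag_response": "answer",   # evaluator parameter <- task output key
    "user_query": "question",
}
```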

Advanced Mapping with Lambda Functions (for nested values):
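
A sketch of lambda-based mapping is shown below. It assumes each lambda receives the task's output dictionary and returns the value to pass to the evaluator; the nested keys are placeholders.

```python
score_fn_kwargs_mapping = {
    # Assumption: a lambda receives the task's output dict and extracts a nested value.
    "rag_response": lambda output: output["result"]["answer"],
    "retrieved_documents": lambda output: [doc["text"] for doc in output["result"]["documents"]],
}
```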

How It Works:

  1. Your task returns a dict: {"answer": "Some response"}

  2. The mapping tells Fiddler: "When an evaluator needs rag_response, use the value from answer"

  3. Each evaluator gets the parameters it needs automatically

Complete Example:
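
The sketch below ties the pieces together using the objects defined in the earlier steps. The evaluate() function and score_fn_kwargs_mapping parameter are named in this guide; the remaining keyword arguments (dataset, task, evaluators, name) are assumptions about the call signature.

```python
from fiddler_evals import evaluate  # hypothetical import path for the evaluate() function

results = evaluate(
    dataset=dataset,                 # test cases from Step 3 (argument names are assumptions)
    task=answer_question,            # the task function defined above
    evaluators=[answer_relevance, faithfulness],
    score_fn_kwargs_mapping={
        "rag_response": "answer",    # evaluator parameter <- task output key
        "user_query": "question",
    },
    name="faq-baseline-run",
)
```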

This allows you to use any evaluator without changing your task function structure.

Step 7: Analyze Experiment Results

After running your experiment, analyze the comprehensive results in your notebook or the Fiddler UI:

[Screenshot: Fiddler Experiments results example page]

Export Results

To conduct further analysis, export the experiment results:
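
One way to export is to flatten the results into a pandas DataFrame and write a CSV. The shape of the object returned by evaluate() is an assumption here; the attribute names below are illustrative and should be adjusted to your SDK version.

```python
import pandas as pd

# Assumption: the results returned by evaluate() expose per-item score records
# (attribute names below are illustrative, not confirmed API).
rows = [
    {"item_id": row.item_id, "evaluator": row.evaluator_name, "score": row.score}
    for row in results.scores
]

pd.DataFrame(rows).to_csv("experiment_results.csv", index=False)
```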

Next Steps

Now that you have the Fiddler Evals SDK set up, explore the advanced capabilities in the rest of the Experiments documentation, such as the RAG Health Metrics Tutorial.

Troubleshooting

Connection Issues

Issue: Cannot connect to Fiddler instance.

Solutions:

  1. Verify credentials: confirm the URL points at your Fiddler instance and the access token was copied correctly from Settings > Credentials.

  2. Test network connectivity: check that the machine running the SDK can reach your Fiddler instance (see the sketch below).

  3. Validate token: make sure the token is current and was copied without extra whitespace.
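
A quick reachability check can be done with the requests library; the URL is a placeholder read from an environment variable.

```python
import os

import requests

# Quick reachability check for your Fiddler instance (URL is a placeholder).
response = requests.get(os.environ["FIDDLER_URL"], timeout=10)
print(response.status_code)  # a 2xx/3xx status means the host is reachable from this machine
```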

Import Errors

Issue: ModuleNotFoundError: No module named 'fiddler_evals'

Solutions:

  1. Verify installation: run pip show fiddler-evals to confirm the package is installed in the active environment.

  2. Reinstall the SDK: run pip install --upgrade fiddler-evals.

  3. Check Python version:

    • Requires Python 3.10 or higher

    • Run python --version to verify

Experiment Failures

Issue: Evaluators failing with errors.

Solutions:

  1. Check parameter mapping: confirm every parameter an evaluator expects is covered by score_fn_kwargs_mapping and that the mapped keys exist in your task's output dictionary.

  2. Verify task output format:

    • Task must return a dictionary

    • Keys must match those referenced in score_fn_kwargs_mapping

  3. Debug individual evaluators by running them directly on a single example (see the sketch below).
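
A sketch of isolating one evaluator before running a full experiment. The parameter names come from the evaluator table in Step 4; calling the evaluator through an evaluate-style method is an assumption.

```python
# Assumption: evaluators can be invoked directly with their named parameters,
# which makes it easy to isolate a failing one on a single example.
single_result = answer_relevance.evaluate(
    user_query="How do I reset my password?",
    rag_response="Use the 'Forgot password' link on the sign-in page.",
)
print(single_result)
```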

Performance Issues

Issue: Experiment is running slowly.

Solutions:

  1. Use parallel processing:

  2. Reduce dataset size for testing:

    • Start with a small subset

    • Scale up once the configuration is validated

  3. Optimize LLM calls:

    • Use caching for repeated queries

    • Implement batching where possible

Configuration Options

Basic Configuration

Advanced Configuration

Concurrent Processing:

Experiment Metadata:

Custom Evaluator Configuration: