Evals SDK Quick Start
What You'll Learn
In this guide, you'll learn how to:
Connect to Fiddler and set up your experiment environment
Create projects, applications, and datasets for organizing experiments
Build experiment datasets with test cases
Use built-in evaluators for common AI evaluation tasks
Create custom evaluators for domain-specific requirements
Run comprehensive experiments
Analyze results with detailed metrics and insights
Time to complete: ~20 minutes
Prerequisites
Before you begin, ensure you have:
Fiddler Account: An active account with access to create applications
Python 3.10+
Fiddler Evals SDK:
pip install fiddler-evals
Fiddler Access Token: Get your access token from Settings > Credentials in your Fiddler instance
If you prefer using a notebook, download a fully functional quick start directly from GitHub or open it in Google Colab to get started.
Connect to Fiddler
First, establish a connection to your Fiddler instance using the Evals SDK.
Connection Setup:
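A minimal connection sketch, assuming an init-style entrypoint that takes your instance URL and access token; the exact function name and signature may differ in your SDK version, so check the Fiddler Evals SDK reference.

```python
import os

import fiddler_evals as fe  # module name from the SDK; init() below is assumed

# Read connection details from the environment rather than hard-coding them.
FIDDLER_URL = os.environ["FIDDLER_URL"]      # e.g. https://your-org.fiddler.ai
FIDDLER_TOKEN = os.environ["FIDDLER_TOKEN"]  # from Settings > Credentials

# Hypothetical init-style call: establishes the session used by later SDK calls.
fe.init(url=FIDDLER_URL, token=FIDDLER_TOKEN)
```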
Create Project and Application
Fiddler Experiments uses a hierarchical structure to organize your experiments:
Projects provide organizational boundaries for related applications
Applications represent specific AI systems you want to evaluate
Datasets contain test cases for experiments
Experiments track individual evaluation runs
Create your organizational structure:
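A sketch of the setup, assuming get-or-create style helpers named Project and Application; the class and method names here are illustrative placeholders, not the authoritative API.

```python
import fiddler_evals as fe  # Project/Application helpers below are illustrative

# Hypothetical get-or-create helpers: reuse existing objects on reruns.
project = fe.Project.get_or_create(name="quickstart_experiments")
application = fe.Application.get_or_create(
    name="support_chatbot",  # the AI system under test
    project_id=project.id,
)

# Persistent IDs let later experiments attach to the same application.
print("Project ID:", project.id)
print("Application ID:", application.id)
```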
What This Creates:
A project to organize all your experiment work
An application representing your AI system under test
Persistent IDs for tracking results over time
Build Your Experiment Dataset
Datasets contain the test cases you'll use to evaluate your AI applications. Each test case includes:
Inputs: Data passed to your AI application (questions, prompts, etc.)
Expected Outputs: What you expect the application to return
Metadata: Additional context (categories, types, tags)
Create a dataset and add test cases:
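A sketch of dataset creation. The test-case fields follow the structure described above (inputs, expected output, metadata), but the Dataset.get_or_create and add_items names are hypothetical; verify them against the SDK reference.

```python
import fiddler_evals as fe  # Dataset/add_items names below are illustrative

dataset = fe.Dataset.get_or_create(
    name="faq_eval_set",
    application_id=application.id,  # application created in the previous step
)

# Each test case bundles inputs, an expected output, and metadata.
test_cases = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expected_output": "Use the 'Forgot password' link on the sign-in page.",
        "metadata": {"category": "account", "type": "faq"},
    },
    {
        "inputs": {"question": "What is your refund policy?"},
        "expected_output": "Refunds are available within 30 days of purchase.",
        "metadata": {"category": "billing", "type": "faq"},
    },
]

dataset.add_items(test_cases)  # hypothetical method name
```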
Data Import Options:
Use Built-in Evaluators
Fiddler Experiments provides production-ready evaluators for common AI evaluation tasks. Let's test some key evaluators:
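As a sketch (the import path and the RegexSearch pattern argument are assumptions; check the SDK reference), initializing a mix of LLM-as-a-Judge and Fiddler Trust Model evaluators might look like this:

```python
# Import path assumed; adjust to the actual fiddler_evals module layout.
from fiddler_evals.evaluators import (
    AnswerRelevance,
    RAGFaithfulness,
    RegexSearch,
    Sentiment,
)

# LLM-as-a-Judge evaluators require model and credential at initialization.
answer_relevance = AnswerRelevance(model="openai/gpt-4o", credential="your-credential")
rag_faithfulness = RAGFaithfulness(model="openai/gpt-4o", credential="your-credential")

# Fiddler Trust Model evaluators run in your environment with no parameters.
sentiment = Sentiment()

# RegexSearch checks outputs against a pattern (keyword name assumed here).
order_id_format = RegexSearch(pattern=r"ORD-\d{6}")

evaluators = [answer_relevance, rag_faithfulness, sentiment, order_id_format]
```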
Available Built-in Evaluators:

| Evaluator | Description | Required inputs |
| --- | --- | --- |
| AnswerRelevance | Checks if the response addresses the question (ordinal: High/Medium/Low) | user_query, rag_response |
| ContextRelevance | Measures retrieval quality (ordinal: High/Medium/Low) | user_query, retrieved_documents |
| RAGFaithfulness | Detects hallucinations in RAG responses (binary: Yes/No) | user_query, rag_response, retrieved_documents |
| Coherence | Evaluates logical flow and consistency | response, prompt |
| Conciseness | Measures response brevity and clarity | response |
| Sentiment | Analyzes emotional tone | text |
| RegexSearch | Pattern matching for specific formats | output, pattern |
| FTLPromptSafety | Computes safety scores for prompts | text |
| FTLResponseFaithfulness | Evaluates faithfulness of LLM responses (FTL model) | response, context |
RAG Health Metrics: AnswerRelevance, ContextRelevance, and RAGFaithfulness form the RAG diagnostic triad. All LLM-as-a-Judge evaluators require model and credential parameters at initialization (e.g., AnswerRelevance(model="openai/gpt-4o", credential="your-credential")). See the RAG Health Metrics Tutorial for a complete walkthrough.
Cost-Effective Experiments at Scale
Fiddler Trust Model evaluators (FTLPromptSafety, FTLResponseFaithfulness, Sentiment, TopicClassification) run within your environment with no external API costs and sub-100ms latency. Initialize them with no parameters (e.g., Sentiment()).
LLM-as-a-Judge evaluators (AnswerRelevance, ContextRelevance, RAGFaithfulness, Coherence, Conciseness) use external LLMs via LLM Gateway and require model and credential parameters at initialization.
Create Custom Evaluators
Build custom evaluation logic for your specific use cases by inheriting from the Evaluator base class:
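A minimal class-based sketch; the import path, the hook method to override (evaluate here), and the result shape are assumptions to verify against the Evaluator base class in your SDK version.

```python
from fiddler_evals.evaluators import Evaluator  # import path assumed


class ContainsDisclaimer(Evaluator):
    """Passes when the response includes a required compliance disclaimer."""

    def __init__(self, disclaimer: str = "This is not financial advice"):
        super().__init__()
        self.disclaimer = disclaimer

    # Hook method name and return shape are assumed; check the base class.
    def evaluate(self, response: str) -> dict:
        found = self.disclaimer.lower() in response.lower()
        return {"score": 1.0 if found else 0.0, "label": "Yes" if found else "No"}
```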
Function-Based Evaluators:
You can also use simple functions:
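For example, assuming the SDK accepts plain callables alongside Evaluator instances, a function-based check might look like this; its parameter names are matched to your task outputs via score_fn_kwargs_mapping (covered below).

```python
def response_length_ok(response: str, max_chars: int = 500) -> dict:
    """Flags responses that exceed a character budget."""
    return {"score": 1.0 if len(response) <= max_chars else 0.0}
```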
Run Experiments
Now run a comprehensive experiment. The evaluate() function:
Runs your AI application task on each dataset item
Executes all evaluators on the results
Tracks the experiment in Fiddler
Returns comprehensive results with scores and timing
Define your experiment task:
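A sketch of a task function: it receives one dataset item's inputs and returns a dict of outputs. The exact signature evaluate() expects may differ, and the stubbed answer below stands in for a call to your real application.

```python
def my_task(inputs: dict) -> dict:
    """Run the AI system under test on a single dataset item."""
    question = inputs["question"]

    # Replace this stub with a call to your LLM, RAG pipeline, or agent.
    answer = f"Stub answer for: {question}"
    context = ["Stub retrieved document."]

    # Key names are your own choice; score_fn_kwargs_mapping (below) connects
    # them to the parameter names evaluators expect.
    return {"question": question, "answer": answer, "context": context}
```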
Score Function Mapping:
The score_fn_kwargs_mapping parameter connects your task outputs to evaluator inputs. This is necessary because evaluators expect specific parameter names (like response, prompt, text) but your task may use different names (like answer, question).
Simple String Mapping (use this for most cases):
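A sketch, assuming keys are evaluator parameter names and values are your task output keys (consistent with the example in "How It Works" below):

```python
score_fn_kwargs_mapping = {
    "user_query": "question",       # evaluator parameter <- task output key
    "rag_response": "answer",
    "retrieved_documents": "context",
    "response": "answer",
    "text": "answer",
}
```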
Advanced Mapping with Lambda Functions (for nested values):
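A sketch assuming each lambda receives the full task output dict; confirm the exact callable signature in the SDK reference.

```python
score_fn_kwargs_mapping = {
    # Pull values out of nested task outputs (output structure is illustrative).
    "rag_response": lambda outputs: outputs["generation"]["answer"],
    "retrieved_documents": lambda outputs: [
        doc["text"] for doc in outputs["retrieval"]["documents"]
    ],
}
```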
How It Works:
Your task returns a dict, for example {"answer": "Some response"}
The mapping tells Fiddler: "When an evaluator needs rag_response, use the value from answer"
Each evaluator automatically receives the parameters it needs
Complete Example:
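A sketch that ties the pieces together; evaluate() and score_fn_kwargs_mapping come from the SDK, while the remaining keyword names (task, dataset, evaluators, name) are assumptions to check against the reference.

```python
from fiddler_evals import evaluate  # import path assumed

results = evaluate(
    task=my_task,            # task function defined above
    dataset=dataset,         # dataset created earlier
    evaluators=evaluators,   # built-in + custom evaluators
    score_fn_kwargs_mapping={
        "user_query": "question",
        "rag_response": "answer",
        "retrieved_documents": "context",
        "response": "answer",
        "text": "answer",
    },
    name="quickstart_experiment_v1",
)
```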
This allows you to use any evaluator without changing your task function structure.
Analyze Experiment Results
After running your experiment, analyze the comprehensive results in your notebook or the Fiddler UI:
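A sketch of in-notebook analysis, assuming the results object exposes a tabular accessor; to_dataframe() and the evaluator/score column names are hypothetical, so inspect your results object for the actual attributes.

```python
# `results` comes from the evaluate() call above.
df = results.to_dataframe()  # hypothetical accessor; column names below are assumed

# Summarize each evaluator's scores across the dataset.
print(df.groupby("evaluator")["score"].describe())
```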

Output:
Export Results
To conduct further analysis, export the experiment results:
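A sketch using the same hypothetical to_dataframe() accessor; once you have a DataFrame, standard pandas exports apply.

```python
df = results.to_dataframe()  # hypothetical accessor (see previous step)

df.to_csv("quickstart_experiment_results.csv", index=False)
df.to_json("quickstart_experiment_results.json", orient="records")
```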
Next Steps
Now that you have the Fiddler Evals SDK set up, explore these advanced capabilities:
Experiments First Steps: An overview of Fiddler Experiments
Quick Start Notebook: Download and run a more expansive version of this quick start guide
Fiddler Evals SDK: Review the SDK technical reference
Advanced Evals Guide: Build sophisticated evaluation logic
Troubleshooting
Connection Issues
Issue: Cannot connect to Fiddler instance.
Solutions:
Verify credentials
Test network connectivity (see the snippet after this list)
Validate token:
Ensure your access token is valid and not expired
Regenerate token if needed from Settings > Credentials
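A quick reachability check for your instance URL (this verifies network access, not the token itself):

```python
import os

import requests

url = os.environ["FIDDLER_URL"]  # e.g. https://your-org.fiddler.ai
response = requests.get(url, timeout=10)
print(url, "->", response.status_code)
```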
Import Errors
Issue: ModuleNotFoundError: No module named 'fiddler_evals'
Solutions:
Verify installation (see the snippet after this list)
Reinstall the SDK:
Check Python version:
Requires Python 3.10 or higher
Run python --version to verify
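To confirm the package is importable and see which version is installed:

```python
from importlib.metadata import version

import fiddler_evals  # raises ModuleNotFoundError if the SDK is not installed

print("fiddler-evals version:", version("fiddler-evals"))
```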
Experiment Failures
Issue: Evaluators failing with errors.
Solutions:
Check parameter mapping:
Verify task output format:
Task must return a dictionary
Keys must match those referenced in score_fn_kwargs_mapping
Debug individual evaluators:
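Running a single evaluator directly on one example often surfaces the underlying error. The call convention below (an evaluate() method with keyword inputs) is an assumption to adapt to your SDK version.

```python
from fiddler_evals.evaluators import Sentiment  # import path assumed

sentiment = Sentiment()
# Hypothetical direct-call convention; the evaluator may instead be callable
# or expose a differently named scoring method.
print(sentiment.evaluate(text="I love how quickly support resolved my issue."))
```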
Performance Issues
Issue: Experiment is running slowly.
Solutions:
Use parallel processing:
Reduce dataset size for testing:
Start with a small subset
Scale up once the configuration is validated
Optimize LLM calls:
Use caching for repeated queries
Implement batching where possible
Configuration Options
Basic Configuration
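A minimal run with defaults, repeating the assumed evaluate() keywords from the complete example above (my_task, dataset, and Sentiment are defined earlier in this guide):

```python
results = evaluate(        # keyword names assumed, as in the complete example
    task=my_task,
    dataset=dataset,
    evaluators=[Sentiment()],
)
```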
Advanced Configuration
Concurrent Processing:
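A sketch assuming a worker-count style option on evaluate(); the parameter name (max_workers here) is hypothetical, so check the SDK reference for the supported concurrency settings.

```python
results = evaluate(
    task=my_task,
    dataset=dataset,
    evaluators=evaluators,
    max_workers=8,  # hypothetical parameter name for parallel execution
)
```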
Experiment Metadata:
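A sketch assuming evaluate() accepts a name and free-form metadata for tracking runs; the metadata keyword is hypothetical.

```python
results = evaluate(
    task=my_task,
    dataset=dataset,
    evaluators=evaluators,
    name="support_chatbot_v2_experiment",
    metadata={"model": "gpt-4o", "prompt_version": "v2"},  # hypothetical keyword
)
```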
Custom Evaluator Configuration:
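Evaluators are configured at initialization, so custom settings are plain constructor arguments. The examples below reuse the sketches from earlier in this guide (ContainsDisclaimer is the hypothetical custom evaluator defined above, and the RegexSearch keyword name is assumed).

```python
configured_evaluators = [
    # Custom evaluator from the sketch above, with a different disclaimer.
    ContainsDisclaimer(disclaimer="Consult a licensed professional"),
    # Built-in evaluators accept their settings the same way.
    RegexSearch(pattern=r"\bcase #\d{5}\b"),  # keyword name assumed
    AnswerRelevance(model="openai/gpt-4o-mini", credential="your-credential"),
]
```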