Evals SDK Quick Start
Private Preview Notice
The Fiddler Evals SDK is currently in private preview. This means:
API interfaces may change before general availability
Some features are still under active development
We welcome your feedback to help shape the final product
Please refer to our product maturity definitions for more details on policies and participation.
What You'll Learn
In this guide, you'll learn how to:
Connect to Fiddler and set up your evaluation environment
Create projects, applications, and datasets for organizing evaluations
Build evaluation datasets with test cases
Use built-in evaluators for common AI evaluation tasks
Create custom evaluators for domain-specific requirements
Run comprehensive evaluation experiments
Analyze results with detailed metrics and insights
Time to complete: ~20 minutes
Prerequisites
Before you begin, ensure you have:
Fiddler Account: An active account with access to create applications
Python 3.10+
Fiddler Evals SDK:
pip install fiddler-evals
Fiddler Access Token: Get your access token from Settings > Credentials in your Fiddler instance
Connect to Fiddler
First, establish a connection to your Fiddler instance using the Evals SDK.
Connection Setup:
from fiddler_evals import init
# Initialize connection to Fiddler
init(
url='https://your-org.fiddler.ai', # Your Fiddler URL
token='your-access-token' # Your access token
)
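To avoid hard-coding credentials, you can read them from environment variables instead; a minimal sketch, assuming you export FIDDLER_URL and FIDDLER_TOKEN in your shell (the variable names are illustrative, not required by the SDK):
import os
from fiddler_evals import init
# Read connection details from environment variables (illustrative names)
# rather than embedding secrets in source code.
init(
    url=os.environ["FIDDLER_URL"],      # e.g. https://your-org.fiddler.ai
    token=os.environ["FIDDLER_TOKEN"],
)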
Create Project and Application
Fiddler Evals uses a hierarchical structure to organize your evaluations:
Projects provide organizational boundaries for related applications
Applications represent specific AI systems you want to evaluate
Datasets contain test cases for evaluation
Experiments track individual evaluation runs
Create your organizational structure:
from fiddler_evals import Project, Application
# Create or get the project
project = Project.get_or_create(name='my_eval_project')
print(f'✅ Project: {project.name} (ID: {project.id})')
# Create or get the application within the project
application = Application.get_or_create(
name='my_llm_app',
project_id=project.id
)
print(f'✅ Application: {application.name} (ID: {application.id})')
What This Creates:
A project to organize all your evaluation work
An application representing your AI system under test
Persistent IDs for tracking results over time
Build Your Evaluation Dataset
Datasets contain the test cases you'll use to evaluate your AI applications. Each test case includes:
Inputs: Data passed to your AI application (questions, prompts, etc.)
Expected Outputs: What you expect the application to return
Metadata: Additional context (categories, types, tags)
Create a dataset and add test cases:
from fiddler_evals import Dataset
from fiddler_evals.pydantic_models.dataset import NewDatasetItem
# Create the dataset
dataset = Dataset.create(
name='evaluation_dataset',
application_id=application.id,
description='Test cases for LLM evaluation'
)
# Add test cases
test_cases = [
NewDatasetItem(
inputs={"question": "What happens to you if you eat watermelon seeds?"},
expected_outputs={"answer": "The watermelon seeds pass through your digestive system"},
metadata={"type": "Adversarial", "category": "Misconceptions"}
),
NewDatasetItem(
inputs={"question": "What is the capital of France?"},
expected_outputs={"answer": "Paris is the capital of France"},
metadata={"type": "Factual", "category": "Geography"}
)
]
dataset.insert(test_cases)
print(f'✅ Added {len(test_cases)} test cases to dataset')
Data Import Options:
You can also import test cases in bulk from a CSV file:
dataset.insert_from_csv_file(
file_path='data.csv',
input_columns=['question'],
expected_output_columns=['answer'],
metadata_columns=['category']
)
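The CSV only needs columns matching the names passed above. A minimal sketch that writes such a file with pandas (the rows are illustrative):
import pandas as pd
# Build an example data.csv whose columns match the input_columns,
# expected_output_columns, and metadata_columns arguments above.
pd.DataFrame([
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France",
        "category": "Geography",
    }
]).to_csv("data.csv", index=False)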
Use Built-in Evaluators
Fiddler Evals provides production-ready evaluators for common AI evaluation tasks. Let's test some key evaluators:
from fiddler_evals.evaluators import (
AnswerRelevance,
Coherence,
Conciseness,
Toxicity,
Sentiment
)
# Test Answer Relevance
relevance_evaluator = AnswerRelevance()
score = relevance_evaluator.score(
prompt="What is the capital of France?",
response="Paris is the capital of France."
)
print(f"Relevance Score: {score.value} - {score.reasoning}")
# Test Conciseness
conciseness_evaluator = Conciseness()
score = conciseness_evaluator.score(
response="Paris is the capital of France."
)
print(f"Conciseness Score: {score.value} - {score.reasoning}")
# Test Toxicity
toxicity_evaluator = Toxicity()
score = toxicity_evaluator.score(
text="Thank you for your question! I'd be happy to help."
)
print(f"Toxicity Score: {score.value}")
Available Built-in Evaluators:
AnswerRelevance: Checks if the response addresses the question (parameters: prompt, response)
Coherence: Evaluates logical flow and consistency (parameters: response, prompt)
Conciseness: Measures response brevity and clarity (parameters: response)
Toxicity: Detects harmful or toxic content (parameters: text)
Sentiment: Analyzes emotional tone (parameters: text)
RegexSearch: Pattern matching for specific formats (parameters: output, pattern)
FTLPromptSafety: Computes safety scores for prompts (parameters: text)
FTLResponseFaithfulness: Evaluates the faithfulness of LLM responses (parameters: response, context)
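As one more illustration, a hedged sketch of a faithfulness check; the import path and the response/context parameters are assumptions based on the listing above, not confirmed SDK signatures:
from fiddler_evals.evaluators import FTLResponseFaithfulness  # import path assumed
# Check whether the response is grounded in the provided context.
faithfulness_evaluator = FTLResponseFaithfulness()
score = faithfulness_evaluator.score(
    response="Paris is the capital of France.",
    context="France's capital city is Paris, located on the Seine."
)
print(f"Faithfulness Score: {score.value}")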
Cost-Effective Evaluation at Scale
These built-in evaluators run on Fiddler Trust Models within your environment:
No API Costs: Evaluate thousands of test cases with zero external API fees
Fast: <100ms latency per evaluation for real-time feedback
Secure: Your data never leaves your Fiddler instance
Unlike tools that charge per-request or require external API calls, Fiddler Evals provides unlimited evaluation at no additional cost.
Create Custom Evaluators
Build custom evaluation logic for your specific use cases by inheriting from the Evaluator base class:
from fiddler_evals.evaluators.base import Evaluator
from fiddler_evals.pydantic_models.score import Score
class LengthEvaluator(Evaluator):
"""
Custom evaluator that checks if a response length is appropriate.
Gives higher scores for responses that are neither too short nor too long.
"""
def __init__(self, min_length: int = 10, max_length: int = 200):
super().__init__()
self.min_length = min_length
self.max_length = max_length
def score(self, output: str) -> Score:
"""Score based on response length appropriateness."""
length = len(output.strip())
if length < self.min_length:
score_value = 0.0
reasoning = f"Response too short ({length} chars, minimum {self.min_length})"
elif length > self.max_length:
score_value = 0.5
reasoning = f"Response too long ({length} chars, maximum {self.max_length})"
else:
score_value = 1.0
reasoning = f"Response length appropriate ({length} chars)"
return Score(
name="length_check",
evaluator_name=self.name,
value=score_value,
reasoning=reasoning
)
# Test the custom evaluator
length_evaluator = LengthEvaluator(min_length=15, max_length=100)
score = length_evaluator.score("Paris is the capital of France.")
print(f"Length Score: {score.value} - {score.reasoning}")
Function-Based Evaluators:
You can also use simple functions:
def word_count_evaluator(output: str) -> float:
"""Returns word count normalized to 0-1 scale."""
word_count = len(output.split())
return min(word_count / 50.0, 1.0)
# Use directly in evaluators list
evaluators = [
AnswerRelevance(),
word_count_evaluator, # Function evaluator
]
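Function-based evaluators are handy for quick domain checks. For example, a hypothetical evaluator that rewards responses containing a required disclaimer phrase, which you can add to the evaluators list just like word_count_evaluator above:
def disclaimer_evaluator(output: str) -> float:
    """Hypothetical domain check: 1.0 if the response contains a disclaimer phrase."""
    return 1.0 if "not financial advice" in output.lower() else 0.0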
Run Evaluation Experiments
Now run a comprehensive evaluation experiment. The evaluate() function:
Runs your AI application task on each dataset item
Executes all evaluators on the results
Tracks the experiment in Fiddler
Returns comprehensive results with scores and timing
Define your evaluation task:
from fiddler_evals import evaluate
# Define your AI application task
def my_llm_task(inputs: dict, extras: dict, metadata: dict) -> dict:
"""
This function represents your AI application that you want to evaluate.
Args:
inputs: The input data from the dataset (e.g., {"question": "..."})
extras: Additional context data (e.g., {"context": "..."})
metadata: Any metadata associated with the test case
Returns:
dict: The outputs from your AI application (e.g., {"answer": "..."})
"""
question = inputs.get("question", "")
# Your LLM API call here
# For this example, we'll use a mock response
answer = f"Mock response to: {question}"
return {"answer": answer}
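# In a real task, replace the mock above with a call to your LLM provider.
# Illustration only (assumes an OpenAI-style client; adapt to your own stack):
#
#     from openai import OpenAI
#     client = OpenAI()
#     completion = client.chat.completions.create(
#         model="gpt-4",
#         messages=[{"role": "user", "content": question}],
#     )
#     answer = completion.choices[0].message.content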
# Set up evaluators
evaluators = [
AnswerRelevance(),
Conciseness(),
Sentiment(),
LengthEvaluator(),
]
# Run evaluation
experiment_result = evaluate(
dataset=dataset,
task=my_llm_task,
evaluators=evaluators,
name_prefix="my_evaluation",
description="Comprehensive LLM evaluation",
score_fn_kwargs_mapping={
"question": "question",
"response": "answer",
"output": "answer",
"text": "answer",
"prompt": lambda x: x["inputs"]["question"],
},
max_workers=4 # Process 4 test cases concurrently
)
print(f"✅ Evaluated {len(experiment_result.results)} test cases")
print(f"📈 Generated {sum(len(result.scores) for result in experiment_result.results)} scores")
Score Function Mapping:
The score_fn_kwargs_mapping parameter connects your task outputs to evaluator inputs. This is necessary because evaluators expect specific parameter names (like response, prompt, and text) while your task may use different names (like answer and question).
Simple String Mapping (use this for most cases):
# Your task returns: {"answer": "Paris is the capital of France"}
# Evaluators expect: response="..." or text="..."
# Map your output keys to evaluator parameter names:
score_fn_kwargs_mapping={
"response": "answer", # Map 'response' param → 'answer' output key
"text": "answer", # Map 'text' param → 'answer' output key
}
Advanced Mapping with Lambda Functions (for nested values):
# Use lambda to extract nested or computed values:
score_fn_kwargs_mapping={
"prompt": lambda x: x["inputs"]["question"], # Extract from inputs dict
"response": "answer", # Simple string mapping
}
How It Works:
Your task returns a dict: {"answer": "Some response"}
The mapping tells Fiddler: "When an evaluator needs response, use the value from answer"
Each evaluator gets the parameters it needs automatically
Complete Example:
# Task returns this structure:
{"answer": "Paris is the capital of France"}
# But evaluators need these parameters:
# - AnswerRelevance.score(prompt="...", response="...")
# - Conciseness.score(response="...")
# - Sentiment.score(text="...")
# Solution: Map parameter names to your output structure
score_fn_kwargs_mapping={
"response": "answer", # For AnswerRelevance and Conciseness
"text": "answer", # For Sentiment
"prompt": lambda x: x["inputs"]["question"], # Get prompt from inputs
}
This allows you to use any evaluator without changing your task function structure.
Analyze Experiment Results
After running your evaluation, analyze the comprehensive results in your notebook or the Fiddler UI:
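For a quick in-notebook summary, you can aggregate the per-item scores yourself; a minimal sketch using only the result attributes shown in the export example below (result.scores, score.name, score.value):
from collections import defaultdict
# Average each evaluator's score across all test cases.
totals = defaultdict(float)
counts = defaultdict(int)
for result in experiment_result.results:
    for score in result.scores:
        totals[score.name] += score.value
        counts[score.name] += 1
for name, total in totals.items():
    print(f"{name}: {total / counts[name]:.2f}")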

Export Results
To conduct further analysis, export the experiment results:
import pandas as pd
# Convert to DataFrame for further analysis
results_data = []
for result in experiment_result.results:
item = result.experiment_item
row = {
'dataset_item_id': item.dataset_item_id,
'status': item.status,
'duration_ms': item.duration_ms,
}
# Add scores as columns
for score in result.scores:
row[f'{score.name}_score'] = score.value
row[f'{score.name}_reasoning'] = score.reasoning
results_data.append(row)
results_df = pd.DataFrame(results_data)
results_df.to_csv('experiment_results.csv', index=False)
print("💾 Results exported to experiment_results.csv")
Next Steps
Now that you have the Fiddler Evaluations SDK set up, explore these advanced capabilities:
Evals First Steps: An overview of Fiddler Evaluations
Quick Start Notebook: Download and run a more expansive version of this quick start guide
Fiddler Evals SDK: Review the SDK technical reference
Advanced Evals Guide: Build sophisticated evaluation logic
Troubleshooting
Connection Issues
Issue: Cannot connect to Fiddler instance.
Solutions:
Verify credentials: Confirm that the URL and access token passed to init() are correct
Test network connectivity:
curl -I https://your-org.fiddler.ai
Validate token:
Ensure your access token is valid and not expired
Regenerate token if needed from Settings > Credentials
Import Errors
Issue: ModuleNotFoundError: No module named 'fiddler_evals'
Solutions:
Verify installation:
pip list | grep fiddler-evals
Reinstall the SDK:
pip uninstall fiddler-evals
pip install fiddler-evals
Check Python version:
Requires Python 3.10 or higher
Run python --version to verify
Evaluation Failures
Issue: Evaluators failing with errors.
Solutions:
Check parameter mapping:
# Ensure score_fn_kwargs_mapping matches evaluator requirements
score_fn_kwargs_mapping={
    "response": "answer",  # Maps to your task output key
    "prompt": lambda x: x["inputs"]["question"],
}
Verify task output format:
Task must return a dictionary
Keys must match those referenced in score_fn_kwargs_mapping
Debug individual evaluators:
# Test evaluators separately
score = evaluator.score(response="test response")
print(f"Score: {score.value}, Reasoning: {score.reasoning}")
Performance Issues
Issue: Evaluation is running slowly.
Solutions:
Use parallel processing:
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    max_workers=4  # Adjust based on your system
)
Reduce dataset size for testing:
Start with a small subset
Scale up once the configuration is validated
Optimize LLM calls:
Use caching for repeated queries (see the sketch after this list)
Implement batching where possible
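For the caching suggestion, a minimal sketch using functools.lru_cache from the standard library; the helper names are illustrative and assume identical questions should reuse the same answer:
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_llm_call(question: str) -> str:
    # Replace with your actual LLM API call; identical questions
    # are served from the cache on subsequent evaluations.
    return f"Mock response to: {question}"

def my_llm_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    return {"answer": cached_llm_call(inputs.get("question", ""))}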
Configuration Options
Basic Configuration
from fiddler_evals import init, evaluate
# Initialize connection
init(url='https://your-org.fiddler.ai', token='your-access-token')
# Run evaluation with basic settings
experiment_result = evaluate(
dataset=dataset,
task=my_llm_task,
evaluators=evaluators,
name_prefix="my_eval"
)
Advanced Configuration
Concurrent Processing:
experiment_result = evaluate(
dataset=dataset,
task=my_llm_task,
evaluators=evaluators,
max_workers=8, # Process 8 test cases in parallel
name_prefix="parallel_eval"
)
Experiment Metadata:
experiment_result = evaluate(
dataset=dataset,
task=my_llm_task,
evaluators=evaluators,
metadata={
"model_version": "gpt-4",
"evaluation_date": "2024-01-15",
"temperature": 0.7,
"environment": "production"
}
)
Custom Evaluator Configuration:
# Configure evaluators with specific thresholds
evaluators = [
AnswerRelevance(threshold=0.8),
Conciseness(max_words=100),
LengthEvaluator(min_length=20, max_length=150),
]
❓ Questions? Talk to a product expert or request a demo.
💡 Need help? Contact us at [email protected].