Fiddler Evals SDK
LLM evaluation framework with pre-built evaluators and custom metrics
GA | Native SDK
Evaluate LLM application quality with Fiddler's evaluation framework. Run batch evaluations with 14+ pre-built evaluators or create custom metrics for domain-specific quality assessment.
What You'll Need
Fiddler account
Python 3.10 or higher
Fiddler API key and access token
Dataset for evaluation
Quick Start
# Step 1: Install
pip install fiddler-evals
# Step 2: Initialize connection
from fiddler_evals import init
init(
    url='https://your-org.fiddler.ai',
    token='your-access-token'
)
# Step 3: Create project and application
from fiddler_evals import Project, Application, Dataset
project = Project.get_or_create(name='my_eval_project')
application = Application.get_or_create(
    name='my_llm_app',
    project_id=project.id
)
# Step 4: Create dataset and add test cases
from fiddler_evals.pydantic_models.dataset import NewDatasetItem
dataset = Dataset.create(
    name='evaluation_dataset',
    application_id=application.id,
    description='Test cases for LLM evaluation'
)
test_cases = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris is the capital of France"},
        metadata={"type": "Factual", "category": "Geography"}
    ),
]
dataset.insert(test_cases)
# Step 5: Run evaluation
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity
def my_llm_task(inputs, extras, metadata):
    """Your LLM application logic"""
    question = inputs.get("question", "")
    # Call your LLM here
    answer = f"Mock response to: {question}"
    return {"answer": answer}
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(),
        Conciseness(),
        Toxicity()
    ],
    name_prefix="my_evaluation",
    score_fn_kwargs_mapping={
        "response": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"]
    }
)
# Step 6: Analyze results in Fiddler UI
print(f"β
Evaluated {len(results.results)} test cases")Pre-Built Evaluators
Safety & Trust
Toxicity - Detect toxic language and harmful content
PII Detection - Identify 35+ types of personally identifiable information
Jailbreak - Detect prompt injection and manipulation attempts
Profanity - Flag offensive language
Quality & Accuracy
Faithfulness - Measure hallucination and groundedness to source material
Answer Relevance - Assess how well responses address the query
Context Relevance - Evaluate retrieval quality in RAG applications
Coherence - Measure logical flow and consistency
Conciseness - Evaluate response brevity and efficiency
Content Analysis
Sentiment - Analyze emotional tone
Topic Classification - Categorize content by topic
Language Detection - Identify language of text
Regex Matching - Custom pattern-based evaluation
Keyword Detection - Flag specific terms or phrases
Example Usage
Batch Evaluation with Multiple Evaluators
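As a sketch, the run below combines the safety and quality evaluators already imported in the Quick Start against the same dataset and task. Other evaluators from the lists above can be swapped in the same way; check the SDK reference for their exact class names and expected parameters.

# Minimal sketch reusing the dataset and task from the Quick Start.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity

results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        Toxicity(),          # safety
        AnswerRelevance(),   # quality
        Conciseness(),       # quality
    ],
    name_prefix="batch_safety_and_quality",
    score_fn_kwargs_mapping={
        "response": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    },
)
print(f"Evaluated {len(results.results)} test cases")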
Custom Evaluators
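A custom metric is ultimately a scoring function over the task's inputs and outputs. The function below is a standalone sketch of a domain-specific check (a required compliance disclaimer); how it is registered with evaluate() depends on the custom-evaluator interface, which is covered in the Evals SDK Reference rather than on this page.

def disclaimer_present(response: str) -> float:
    """Domain-specific check: 1.0 if the required disclaimer appears, else 0.0."""
    return 1.0 if "not financial advice" in response.lower() else 0.0

# Example score for a single task output:
print(disclaimer_present("This is not financial advice, but diversification helps."))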
Importing Test Cases from Files
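One way to bulk-load test cases is from a CSV file. The file name and column names below (question, expected_answer, category) are illustrative, not a required format; the items map onto the same NewDatasetItem fields used in the Quick Start.

import csv

from fiddler_evals.pydantic_models.dataset import NewDatasetItem

# 'test_cases.csv' and its column names are illustrative assumptions.
with open("test_cases.csv", newline="") as f:
    items = [
        NewDatasetItem(
            inputs={"question": row["question"]},
            expected_outputs={"answer": row["expected_answer"]},
            metadata={"category": row["category"]},
        )
        for row in csv.DictReader(f)
    ]

dataset.insert(items)  # dataset created as in the Quick Start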
Viewing Results
Results are automatically tracked in the Fiddler UI. Navigate to your application to:
View experiment results with detailed scores
Compare experiments side-by-side
Filter and analyze by metadata
Export results for further analysis
Programmatic Analysis
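The evaluate() call returns a results object whose results collection was used in the Quick Start. The exact per-item fields can vary by SDK version, so a safe starting point is to inspect one item rather than assuming a schema.

# 'results' is the return value of evaluate() from the Quick Start.
print(f"Evaluated {len(results.results)} test cases")

# Inspect one result item to see which fields (scores, outputs, metadata, ...)
# your SDK version exposes before building further analysis on top of them.
first = next(iter(results.results))
print(type(first))
print(first)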
Advanced Configuration
Parallel Processing
Experiment Metadata and Organization
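Two organization hooks already appear on this page: name_prefix on evaluate() names the experiment run, and the metadata set on each NewDatasetItem drives the metadata filters mentioned under Viewing Results. A small sketch, reusing the Quick Start objects; the naming convention is illustrative.

# Encode model/prompt details in the run name so experiments are easy to
# compare side-by-side in the Fiddler UI.
results = evaluate(
    dataset=dataset,                 # as created in the Quick Start
    task=my_llm_task,
    evaluators=[AnswerRelevance(), Conciseness()],
    name_prefix="gpt4o_prompt_v2",   # illustrative naming convention
    score_fn_kwargs_mapping={
        "response": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    },
)

# Per-item metadata set at insert time, e.g. {"type": "Factual", "category": "Geography"},
# is what you filter and analyze by in the UI.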
Custom Parameter Mapping
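score_fn_kwargs_mapping, shown in the Quick Start, connects your task's output fields and dataset inputs to the parameter names each evaluator expects. As used there, a value is either a key name from the task's output dict or a callable over the full record. The sketch below only rearranges the Quick Start mapping for a task that returns a different key name.

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Toxicity

# Suppose the task returns {"reply": ...} instead of {"answer": ...}.
def alt_task(inputs, extras, metadata):
    question = inputs.get("question", "")
    return {"reply": f"Mock response to: {question}"}

results = evaluate(
    dataset=dataset,                                   # from the Quick Start
    task=alt_task,
    evaluators=[AnswerRelevance(), Toxicity()],
    name_prefix="custom_mapping_demo",
    score_fn_kwargs_mapping={
        "response": "reply",                           # key in the task output
        "text": "reply",
        "prompt": lambda x: x["inputs"]["question"],   # callable over the record
    },
)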
Troubleshooting
Connection Issues
Problem: Cannot connect to Fiddler instance
Solution:
Verify your URL is correct (e.g., https://your-org.fiddler.ai)
Ensure your access token is valid and not expired
Check network connectivity: curl -I https://your-org.fiddler.ai
Regenerate token from Fiddler UI: Settings > Credentials
Import Errors
Problem: ModuleNotFoundError: No module named 'fiddler_evals'
Solution:
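A likely fix, assuming the package is simply missing from the active environment (remember the Python 3.10+ requirement above):

pip install fiddler-evals
python --version                     # must be 3.10 or higher
python -c "import fiddler_evals"     # should exit silently once installed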
Evaluation Failures
Problem: Evaluators failing with parameter errors
Solution:
Check score_fn_kwargs_mapping matches evaluator requirements
Verify task output format matches expected structure
Test evaluators individually:
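For example, rerun with a single evaluator and only the mapping it needs, reusing the Quick Start dataset and task; this isolates which evaluator's parameters are misconfigured. The "text" key here follows the Quick Start mapping; check each evaluator's expected parameters in the SDK reference.

# dataset, my_llm_task, and Toxicity are as defined/imported in the Quick Start.
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[Toxicity()],                     # one evaluator at a time
    name_prefix="debug_toxicity",
    score_fn_kwargs_mapping={"text": "answer"},  # only the mapping this run needs
)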
Performance Issues
Problem: Evaluation running slowly
Solution:
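Two remedies that rely only on the APIs shown above: evaluate fewer items while iterating (a small smoke-test dataset) and run fewer evaluators per pass. The dataset name below is illustrative.

# Illustrative smoke-test dataset: a handful of items for fast iteration.
# Dataset, application, test_cases, and my_llm_task come from the Quick Start.
smoke = Dataset.create(
    name='evaluation_dataset_smoke',
    application_id=application.id,
    description='Small slice of test cases for quick runs'
)
smoke.insert(test_cases[:5])

results = evaluate(
    dataset=smoke,
    task=my_llm_task,
    evaluators=[AnswerRelevance()],   # fewer evaluators per run
    name_prefix="smoke",
    score_fn_kwargs_mapping={
        "response": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    },
)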
Related Integrations
Evals SDK Quick Start - Detailed setup guide with code examples
Evals SDK Reference - Complete SDK API documentation
LangGraph SDK - Runtime monitoring for LangGraph agents
Strands Agents SDK - Monitor Strands Agents
Python Client SDK - Full platform API access
Next Steps
Quick Start Guide - Complete tutorial with working examples
Getting Started with Evaluations - Understand evaluation concepts and best practices
SDK API Reference - Explore all available classes and methods