Fiddler Evals SDK
LLM evaluation framework with pre-built evaluators and custom metrics
✓ GA | 🏆 Native SDK
Evaluate LLM application quality with Fiddler's evaluation framework. Run batch evaluations with 14+ pre-built evaluators or create custom metrics for domain-specific quality assessment.
What You'll Need
Fiddler account
Python 3.10 or higher
Dataset for evaluation (CSV, JSON, or DataFrame)
Fiddler API key
Quick Start
# Step 1: Install
pip install fiddler-evals
# Step 2: Run evaluation
from fiddler_evals import FiddlerEvaluator, Evaluators
evaluator = FiddlerEvaluator(
    api_key="fid_...",
    url="https://app.fiddler.ai"
)

# Evaluate your LLM outputs
results = evaluator.evaluate(
    dataset=my_dataframe,
    evaluators=[
        Evaluators.FAITHFULNESS,
        Evaluators.TOXICITY,
        Evaluators.PII_DETECTION
    ]
)

# View results
print(results.summary())

Pre-Built Evaluators
Safety & Trust
Toxicity - Detect toxic language and harmful content
PII Detection - Identify 35+ types of personally identifiable information
Jailbreak - Detect prompt injection and manipulation attempts
Profanity - Flag offensive language
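The safety evaluators can be combined in a single evaluate() call. A minimal sketch, assuming the remaining enum members are named Evaluators.JAILBREAK and Evaluators.PROFANITY (only TOXICITY and PII_DETECTION appear in the examples on this page; verify the exact names in the Evals SDK Documentation):

# Sketch: JAILBREAK and PROFANITY member names are assumed
safety_results = evaluator.evaluate(
    dataset=df,
    evaluators=[
        Evaluators.TOXICITY,
        Evaluators.PII_DETECTION,
        Evaluators.JAILBREAK,   # assumed member name
        Evaluators.PROFANITY    # assumed member name
    ]
)
print(safety_results.summary())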
Quality & Accuracy
Faithfulness - Measure hallucination and groundedness to source material
Answer Relevance - Assess how well responses address the query
Context Relevance - Evaluate retrieval quality in RAG applications
Coherence - Measure logical flow and consistency
Conciseness - Evaluate response brevity and efficiency
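Faithfulness, answer relevance, and context relevance are shown in the Quick Start and RAG examples on this page. A minimal sketch for the remaining two, assuming the members are named Evaluators.COHERENCE and Evaluators.CONCISENESS (verify against the API reference):

# Sketch: COHERENCE and CONCISENESS member names are assumed
quality_results = evaluator.evaluate(
    dataset=df,
    evaluators=[
        Evaluators.COHERENCE,    # assumed member name
        Evaluators.CONCISENESS   # assumed member name
    ]
)
print(quality_results.summary())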
Content Analysis
Sentiment - Analyze emotional tone
Topic Classification - Categorize content by topic
Language Detection - Identify language of text
Regex Matching - Custom pattern-based evaluation
Keyword Detection - Flag specific terms or phrases
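The regex and keyword evaluators need patterns or terms to match. The sketch below assumes member names like Evaluators.REGEX_MATCHING and Evaluators.KEYWORD_DETECTION and assumes they accept options through the same configure(evaluators={...}) pattern shown under Threshold Configuration below; the actual option keys may differ, so treat this as illustrative only:

# Sketch: member names and option keys ("patterns", "keywords") are assumed
evaluator.configure(
    evaluators={
        Evaluators.REGEX_MATCHING: {"patterns": [r"ORD-\d{6}"]},           # assumed option key
        Evaluators.KEYWORD_DETECTION: {"keywords": ["refund", "lawsuit"]}  # assumed option key
    }
)

content_results = evaluator.evaluate(
    dataset=df,
    evaluators=[
        Evaluators.REGEX_MATCHING,     # assumed member name
        Evaluators.KEYWORD_DETECTION   # assumed member name
    ]
)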
Example Usage
Batch Evaluation
import pandas as pd
from fiddler_evals import FiddlerEvaluator, Evaluators
# Load your dataset
df = pd.read_csv("llm_outputs.csv")
# Columns: prompt, response, context (optional)
# Create evaluator
evaluator = FiddlerEvaluator(api_key="fid_...")
# Run multiple evaluators
results = evaluator.evaluate(
    dataset=df,
    evaluators=[
        Evaluators.FAITHFULNESS,
        Evaluators.TOXICITY,
        Evaluators.ANSWER_RELEVANCE
    ],
    batch_size=100
)

# Access results
print(f"Average faithfulness: {results.mean('faithfulness')}")
print(f"Toxic responses: {results.count('toxicity', threshold=0.8)}")

Custom Evaluators
from fiddler_evals import CustomEvaluator
# Define custom metric
class DomainSpecificEvaluator(CustomEvaluator):
    def evaluate(self, prompt, response, context=None):
        # Your custom logic (calculate_domain_score is a placeholder for your own scoring function)
        score = calculate_domain_score(response)
        return {
            "score": score,
            "passed": score > 0.7,
            "details": {"reason": "..."}
        }

# Use custom evaluator
results = evaluator.evaluate(
    dataset=df,
    evaluators=[DomainSpecificEvaluator()]
)

RAG Application Evaluation
# Evaluate retrieval-augmented generation
results = evaluator.evaluate_rag(
    queries=queries,
    retrieved_contexts=contexts,
    generated_responses=responses,
    ground_truth=expected_answers,  # optional
    evaluators=[
        Evaluators.CONTEXT_RELEVANCE,
        Evaluators.FAITHFULNESS,
        Evaluators.ANSWER_RELEVANCE
    ]
)

Viewing Results
# Summary statistics
print(results.summary())
# Detailed results DataFrame
detailed_df = results.to_dataframe()
# Export to Fiddler platform for tracking
results.upload_to_fiddler(project="my-llm-evals")
# Generate report
results.generate_report(output_file="eval_report.html")

Advanced Features
Threshold Configuration
evaluator.configure(
    evaluators={
        Evaluators.TOXICITY: {"threshold": 0.8},
        Evaluators.FAITHFULNESS: {"threshold": 0.05, "invert": True}
    }
)

Parallel Execution
results = evaluator.evaluate(
    dataset=large_df,
    parallel=True,
    max_workers=8
)

Integration with Fiddler Platform
# Track evaluations over time
evaluator.track_evaluation(
    dataset=df,
    evaluators=[...],
    version="v2.1",
    environment="staging"
)

Troubleshooting
Rate Limiting
evaluator.configure(
    rate_limit=10,  # requests per second
    retry_config={
        "max_retries": 3,
        "backoff_factor": 2
    }
)

Memory Management for Large Datasets
# Process in chunks
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    results = evaluator.evaluate(dataset=chunk)
    results.save(append=True)

Related Integrations
LangGraph SDK - Runtime monitoring for LangGraph agents
Strands SDK - Monitor Strands Agents
Python Client - Full platform API access
API Reference
Evals SDK Documentation - Complete API reference