# Fiddler Evals SDK

✓ **GA** | 🏆 **Native SDK**

Evaluate LLM application quality with Fiddler's evaluation framework. Run batch experiments with 13 pre-built evaluators or create custom metrics for domain-specific quality assessment.

## What You'll Need

* Fiddler account
* Python 3.10 or higher
* Fiddler access token (generated in the Fiddler UI under **Settings** > **Credentials**)
* Dataset for experiments

## Quick Start

```bash
# Step 1: Install
pip install fiddler-evals
```

```python
# Step 2: Initialize connection
from fiddler_evals import init

init(
    url='https://your-org.fiddler.ai',
    token='your-access-token'
)

# Step 3: Create project and application
from fiddler_evals import Project, Application, Dataset

project = Project.get_or_create(name='my_eval_project')
application = Application.get_or_create(
    name='my_llm_app',
    project_id=project.id
)

# Step 4: Create dataset and add test cases
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

dataset = Dataset.create(
    name='experiment_dataset',
    application_id=application.id,
    description='Test cases for LLM experiments'
)

test_cases = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris is the capital of France"},
        metadata={"type": "Factual", "category": "Geography"}
    ),
]
dataset.insert(test_cases)

# Step 5: Run evaluation
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Coherence

MODEL = "openai/gpt-4o"
CREDENTIAL = "your-credential-name"

def my_llm_task(inputs, extras, metadata):
    """Your LLM application logic"""
    question = inputs.get("question", "")
    # Call your LLM here
    answer = f"Mock response to: {question}"
    return {"answer": answer}

results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model=MODEL, credential=CREDENTIAL),
        Conciseness(model=MODEL, credential=CREDENTIAL),
        Coherence(model=MODEL, credential=CREDENTIAL)
    ],
    name_prefix="my_experiment",
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["question"],
        "rag_response": "answer",
        "response": "answer",
    }
)

# Step 6: Analyze results in Fiddler UI
print(f"✅ Evaluated {len(results.results)} test cases")
```

## Pre-Built Evaluators

### Safety & Trust

* **FTLPromptSafety** - Detect prompt injection, jailbreaks, and unsafe prompts (runs on Fiddler Trust Models)

### Quality & Accuracy

* **AnswerRelevance** - Assess how well responses address user queries (High / Medium / Low)
* **ContextRelevance** - Evaluate whether retrieved documents are relevant to the query (High / Medium / Low). Available in Agentic Monitoring and Experiments only
* **RAGFaithfulness** - Check if responses are grounded in retrieved documents (Yes / No)
* **FTLResponseFaithfulness** - Fast Trust Model faithfulness for low-latency guardrails
* **Coherence** - Measure logical flow and consistency
* **Conciseness** - Evaluate response brevity and efficiency

### Content Analysis

* **Sentiment** - Analyze emotional tone
* **TopicClassification** - Categorize content by topic
* **RegexSearch** / **RegexMatch** - Custom pattern-based evaluation
* **EvalFn** - Wrap any Python function as an evaluator
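
Evaluators prefixed with **FTL** run on Fiddler Trust Models and take no `model`/`credential` arguments, while the LLM-judged evaluators need both; `EvalFn` wraps an ordinary Python function. The sketch below mixes the three families in one evaluator list. Note that the `EvalFn(fn=..., name=...)` constructor arguments shown here are an assumption, so check the SDK reference for the exact signature.

```python
import re

from fiddler_evals.evaluators import AnswerRelevance, EvalFn, FTLResponseFaithfulness

def no_email(response: str) -> bool:
    """Hypothetical custom check: pass only if the response contains no email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response) is None

evaluators = [
    # LLM-judged: needs a judge model and a credential registered in Fiddler
    AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name"),
    # Fiddler Trust Model: no model/credential required
    FTLResponseFaithfulness(),
    # Wrap a plain Python function as an evaluator (assumed constructor arguments)
    EvalFn(fn=no_email, name="no_email"),
]
```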

## Example Usage

### Batch Experiment with Multiple Evaluators

```python
from fiddler_evals import init, evaluate, Application, Dataset, Project
from fiddler_evals.evaluators import (
    AnswerRelevance,
    Conciseness,
    FTLResponseFaithfulness
)

MODEL = "openai/gpt-4o"
CREDENTIAL = "your-credential-name"

# Initialize connection
init(url='https://your-org.fiddler.ai', token='your-access-token')

# Look up the application that owns the dataset
project = Project.get_or_create(name='my_eval_project')
application = Application.get_or_create(
    name='my_llm_app',
    project_id=project.id
)

# Get existing dataset
dataset = Dataset.get_by_name(
    name='llm_outputs',
    application_id=application.id
)

# Define your LLM task
def evaluate_llm(inputs, extras, metadata):
    question = inputs['question']
    context = extras.get('context', '')

    # Your LLM call here
    response = my_llm_model.generate(question, context)

    return {
        "answer": response,
        "question": question,
        "context": context
    }

# Run evaluation with multiple evaluators
results = evaluate(
    dataset=dataset,
    task=evaluate_llm,
    evaluators=[
        AnswerRelevance(model=MODEL, credential=CREDENTIAL),
        Conciseness(model=MODEL, credential=CREDENTIAL),
        FTLResponseFaithfulness()  # FTL models don't require model= parameter
    ],
    name_prefix="llm-eval",
    description="Comprehensive LLM experiment",
    metadata={"model_version": "v2.1", "environment": "production"},
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["question"],
        "rag_response": "answer",
        "response": "answer",
        "context": lambda x: x["extras"].get("context", "")
    },
    max_workers=4  # Parallel processing
)

# Access results programmatically
for result in results.results:
    item = result.experiment_item
    print(f"\nTest Case: {item.dataset_item_id}")
    print(f"Status: {item.status}")
    print(f"Duration: {item.duration_ms}ms")

    for score in result.scores:
        print(f"  {score.name}: {score.value}")
```

### Custom Evaluators

```python
from fiddler_evals.evaluators import AnswerRelevance, Conciseness
from fiddler_evals.evaluators.base import Evaluator
from fiddler_evals.pydantic_models.score import Score

class LengthEvaluator(Evaluator):
    """Custom evaluator for response length"""

    def __init__(self, min_length=10, max_length=200):
        super().__init__()
        self.min_length = min_length
        self.max_length = max_length

    def score(self, output: str) -> Score:
        length = len(output.strip())

        if length < self.min_length:
            score_value = 0.0
            reasoning = f"Too short ({length} chars, min {self.min_length})"
        elif length > self.max_length:
            score_value = 0.5
            reasoning = f"Too long ({length} chars, max {self.max_length})"
        else:
            score_value = 1.0
            reasoning = f"Appropriate length ({length} chars)"

        return Score(
            name="length_check",
            evaluator_name=self.name,
            value=score_value,
            reasoning=reasoning
        )

# Use custom evaluator alongside built-in ones
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name"),
        Conciseness(model="openai/gpt-4o", credential="your-credential-name"),
        LengthEvaluator(min_length=15, max_length=100)
    ],
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["question"],
        "rag_response": "answer",
        "response": "answer",
        "output": "answer",
    }
)
```

### Importing Test Cases from Files

```python
# From CSV file
dataset.insert_from_csv_file(
    file_path='test_cases.csv',
    input_columns=['question'],
    expected_output_columns=['answer'],
    metadata_columns=['category', 'difficulty']
)

# From JSONL file
dataset.insert_from_jsonl_file(
    file_path='test_cases.jsonl',
    input_keys=['question'],
    expected_output_keys=['answer'],
    metadata_keys=['category']
)

# From pandas DataFrame
import pandas as pd

df = pd.DataFrame({
    'question': ['What is AI?', 'Explain ML'],
    'expected_answer': ['AI is...', 'ML is...'],
    'category': ['definition', 'definition']
})

dataset.insert_from_pandas(
    df=df,
    input_columns=['question'],
    expected_output_columns=['expected_answer'],
    metadata_columns=['category']
)
```

## Viewing Results

Results are automatically tracked in the Fiddler UI. Navigate to your application to:

* View experiment results with detailed scores
* Compare experiments side-by-side
* Filter and analyze by metadata
* Export results for further analysis

### Programmatic Analysis

```python
from fiddler_evals import ScoreStatus, ExperimentItemStatus

# Analyze individual results
for i, result in enumerate(results.results):
    item = result.experiment_item
    scores = result.scores

    print(f"\n📝 Test Case {i + 1}:")
    print(f"   Status: {item.status}")
    print(f"   Duration: {item.duration_ms}ms")

    if item.status == ExperimentItemStatus.SUCCESS:
        for score in scores:
            status_emoji = "✅" if score.status == ScoreStatus.SUCCESS else "❌"
            print(f"     {status_emoji} {score.name}: {score.value}")
            if score.reasoning:
                print(f"        {score.reasoning}")

# Calculate summary statistics
from collections import defaultdict

evaluator_scores = defaultdict(list)
for result in results.results:
    for score in result.scores:
        if score.value is not None:
            evaluator_scores[score.name].append(score.value)

print("\n🎯 Summary by Evaluator:")
for evaluator_name, values in evaluator_scores.items():
    avg_score = sum(values) / len(values) if values else 0
    print(f"   {evaluator_name}: {avg_score:.3f} (avg)")
```
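
In addition to the UI export, you can flatten the same result objects into a pandas DataFrame for ad-hoc analysis or a CSV dump. This sketch reuses only the attributes shown above (`experiment_item`, `scores`); the output filename is arbitrary, and non-numeric scores (e.g., High/Medium/Low labels) are coerced to `NaN` before averaging.

```python
import pandas as pd

# One row per (test case, evaluator score)
rows = []
for result in results.results:
    item = result.experiment_item
    for score in result.scores:
        rows.append({
            "dataset_item_id": item.dataset_item_id,
            "status": str(item.status),
            "duration_ms": item.duration_ms,
            "evaluator": score.name,
            "score": score.value,
            "reasoning": score.reasoning,
        })

df = pd.DataFrame(rows)
df.to_csv("experiment_results.csv", index=False)

# Average only the numeric scores; label-valued scores become NaN and are ignored
df["numeric_score"] = pd.to_numeric(df["score"], errors="coerce")
print(df.groupby("evaluator")["numeric_score"].mean())
```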

## Advanced Configuration

### Parallel Processing

```python
# Process multiple test cases concurrently
results = evaluate(
    dataset=large_dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name"),
        Conciseness(model="openai/gpt-4o", credential="your-credential-name"),
    ],
    max_workers=8,  # Use 8 parallel workers
    name_prefix="parallel-eval"
)
```

### Experiment Metadata and Organization

```python
# Track experiments with custom metadata
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name")],
    name_prefix="model-comparison",
    description="Comparing GPT-4 vs GPT-3.5",
    metadata={
        "model_name": "gpt-4",
        "temperature": 0.7,
        "max_tokens": 1000,
        "evaluation_date": "2024-01-15",
        "environment": "production",
        "version": "v2.1"
    }
)
```

### Custom Parameter Mapping

```python
# Map evaluator parameters to your task output structure
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name"),  # Needs: user_query, rag_response
        Conciseness(model="openai/gpt-4o", credential="your-credential-name"),      # Needs: response
    ],
    score_fn_kwargs_mapping={
        # Map evaluator parameters to task output keys
        "user_query": lambda x: x["inputs"]["question"],  # Lambda for nested values
        "rag_response": "answer",       # Simple key mapping
        "response": "answer",           # Multiple evaluators can use same output
        "context": lambda x: x["extras"].get("context", "")  # With defaults
    }
)
```

## Troubleshooting

### Connection Issues

**Problem**: Cannot connect to Fiddler instance

**Solution**:

1. Verify your URL is correct (e.g., `https://your-org.fiddler.ai`)
2. Ensure your access token is valid and not expired
3. Check network connectivity: `curl -I https://your-org.fiddler.ai`
4. Regenerate token from Fiddler UI: **Settings** > **Credentials**
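
If those checks pass but the SDK still fails, a minimal script that does nothing except call `init()` helps isolate whether the problem is the URL, the token, or the local environment. The broad `except Exception` below is a deliberate simplification; narrow it to the SDK's own exception types if you prefer.

```python
from fiddler_evals import init

try:
    init(url='https://your-org.fiddler.ai', token='your-access-token')
    print("✅ Connected to Fiddler")
except Exception as exc:
    print(f"❌ Connection failed: {exc}")
```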

### Import Errors

**Problem**: `ModuleNotFoundError: No module named 'fiddler_evals'`

**Solution**:

```bash
# Verify installation
pip list | grep fiddler-evals

# Reinstall if needed
pip uninstall fiddler-evals
pip install fiddler-evals

# Check Python version (requires 3.10+)
python --version
```

### Experiment Failures

**Problem**: Evaluators failing with parameter errors

**Solution**:

1. Check `score_fn_kwargs_mapping` matches evaluator requirements
2. Verify task output format matches expected structure
3. Test evaluators individually:

```python
from fiddler_evals.evaluators import AnswerRelevance

evaluator = AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name")
score = evaluator.score(
    user_query="What is AI?",
    rag_response="AI is artificial intelligence"
)
print(f"Score: {score.value}, Reasoning: {score.reasoning}")
```

### Performance Issues

**Problem**: Experiment running slowly

**Solution**:

```python
# Use parallel processing
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-credential-name"),
        Conciseness(model="openai/gpt-4o", credential="your-credential-name"),
    ],
    max_workers=4  # Adjust based on your system
)

# Or process in smaller batches
for i in range(0, len(all_test_cases), 100):
    batch = all_test_cases[i:i+100]
    batch_dataset = Dataset.create(name=f"batch_{i}", application_id=application.id)
    batch_dataset.insert(batch)
    results = evaluate(dataset=batch_dataset, ...)
```

## Related Integrations

* [**Evals SDK Quick Start**](https://app.gitbook.com/s/82RHcnYWV62fvrxMeeBB/evaluate-test/evals-sdk-quick-start) - Detailed setup guide with code examples
* [**Evals SDK Reference**](https://app.gitbook.com/s/rsvU8AIQ2ZL9arerribd/fiddler-evals-sdk) - Complete SDK API documentation
* [**LangGraph SDK**](https://docs.fiddler.ai/integrations/agentic-ai-and-llm-frameworks/agentic-ai/langgraph-sdk) - Runtime monitoring for LangGraph agents
* [**Strands Agents SDK**](https://docs.fiddler.ai/integrations/agentic-ai-and-llm-frameworks/agentic-ai/strands-sdk) - Monitor Strands Agents
* [**Python Client SDK**](https://app.gitbook.com/s/rsvU8AIQ2ZL9arerribd/fiddler-python-client-sdk) - Full platform API access

## Next Steps

1. [**Quick Start Guide**](https://app.gitbook.com/s/82RHcnYWV62fvrxMeeBB/evaluate-test/evals-sdk-quick-start) - Complete tutorial with working examples
2. [**Getting Started with Experiments**](https://app.gitbook.com/s/82RHcnYWV62fvrxMeeBB/getting-started/experiments) - Understand experiment concepts and best practices
3. [**SDK API Reference**](https://app.gitbook.com/s/rsvU8AIQ2ZL9arerribd/fiddler-evals-sdk) - Explore all available classes and methods
