Fiddler Evals SDK

LLM evaluation framework with pre-built evaluators and custom metrics

✓ GA | 🏆 Native SDK

Evaluate LLM application quality with Fiddler's evaluation framework. Run batch evaluations with 14+ pre-built evaluators or create custom metrics for domain-specific quality assessment.

What You'll Need

  • Fiddler account

  • Python 3.10 or higher

  • Fiddler access token (generated in the Fiddler UI under Settings > Credentials)

  • Dataset for evaluation

Quick Start

# Step 1: Install
pip install fiddler-evals

# Step 2: Initialize connection
from fiddler_evals import init

init(
    url='https://your-org.fiddler.ai',
    token='your-access-token'
)

# Step 3: Create project and application
from fiddler_evals import Project, Application, Dataset

project = Project.get_or_create(name='my_eval_project')
application = Application.get_or_create(
    name='my_llm_app',
    project_id=project.id
)

# Step 4: Create dataset and add test cases
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

dataset = Dataset.create(
    name='evaluation_dataset',
    application_id=application.id,
    description='Test cases for LLM evaluation'
)

test_cases = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris is the capital of France"},
        metadata={"type": "Factual", "category": "Geography"}
    ),
]
dataset.insert(test_cases)

# Step 5: Run evaluation
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity

def my_llm_task(inputs, extras, metadata):
    """Your LLM application logic"""
    question = inputs.get("question", "")
    # Call your LLM here
    answer = f"Mock response to: {question}"
    return {"answer": answer}

results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(),
        Conciseness(),
        Toxicity()
    ],
    name_prefix="my_evaluation",
    score_fn_kwargs_mapping={
        "response": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"]
    }
)

# Step 6: Analyze results in Fiddler UI
print(f"βœ… Evaluated {len(results.results)} test cases")

Pre-Built Evaluators

Safety & Trust

  • Toxicity - Detect toxic language and harmful content

  • PII Detection - Identify 35+ types of personally identifiable information

  • Jailbreak - Detect prompt injection and manipulation attempts

  • Profanity - Flag offensive language

Quality & Accuracy

  • Faithfulness - Measure hallucination and groundedness to source material

  • Answer Relevance - Assess how well responses address the query

  • Context Relevance - Evaluate retrieval quality in RAG applications

  • Coherence - Measure logical flow and consistency

  • Conciseness - Evaluate response brevity and efficiency

Content Analysis

  • Sentiment - Analyze emotional tone

  • Topic Classification - Categorize content by topic

  • Language Detection - Identify language of text

  • Regex Matching - Custom pattern-based evaluation

  • Keyword Detection - Flag specific terms or phrases

Example Usage

Batch Evaluation with Multiple Evaluators
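
A minimal sketch that reuses the dataset and my_llm_task from the Quick Start. Only AnswerRelevance, Conciseness, and Toxicity are confirmed class names in this guide; check the SDK API Reference for the import names of the other pre-built evaluators before adding them.

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness, Toxicity

# Run several evaluators over the whole dataset in a single experiment.
results = evaluate(
    dataset=dataset,                     # Dataset built in the Quick Start
    task=my_llm_task,                    # your LLM application callable
    evaluators=[
        AnswerRelevance(),               # does the answer address the question?
        Conciseness(),                   # is the answer brief and efficient?
        Toxicity(),                      # is the answer free of harmful language?
    ],
    name_prefix="batch_eval",
    score_fn_kwargs_mapping={
        "response": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    },
)

print(f"Scored {len(results.results)} test cases")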

Custom Evaluators
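
The extension interface is not shown in this guide, so the sketch below assumes a base Evaluator class with a score method; treat the import path, class name, and method signature as assumptions and confirm them against the SDK API Reference.

# Hypothetical sketch: the base class and score() signature are assumptions.
from fiddler_evals.evaluators import Evaluator  # assumed import path

class ContainsCitation(Evaluator):
    """Domain-specific check: does the answer include a bracketed citation?"""

    def score(self, response: str) -> float:
        # 1.0 when the response contains something like "[source]".
        return 1.0 if "[" in response and "]" in response else 0.0

# Once defined, it would be passed to evaluate() like any pre-built evaluator:
# evaluators=[ContainsCitation(), AnswerRelevance()]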

Importing Test Cases from Files
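
A sketch using the standard-library csv module together with NewDatasetItem from the Quick Start. The file name and column names (question, expected_answer, category) are assumptions about your own file layout.

import csv

from fiddler_evals.pydantic_models.dataset import NewDatasetItem

# Assumed CSV layout: question, expected_answer, category
items = []
with open("test_cases.csv", newline="") as f:
    for row in csv.DictReader(f):
        items.append(
            NewDatasetItem(
                inputs={"question": row["question"]},
                expected_outputs={"answer": row["expected_answer"]},
                metadata={"category": row["category"]},
            )
        )

dataset.insert(items)  # dataset created as in the Quick Start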

Viewing Results

Results are automatically tracked in the Fiddler UI. Navigate to your application to:

  • View experiment results with detailed scores

  • Compare experiments side-by-side

  • Filter and analyze by metadata

  • Export results for further analysis

Programmatic Analysis
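
The evaluate() call in the Quick Start returns an object whose results collection can be inspected in code. The shape of each entry is not documented in this guide, so the loop below only prints entries; check the SDK API Reference before relying on specific attributes.

# `results` is the object returned by evaluate() in the Quick Start.
print(f"Total test cases: {len(results.results)}")

# Print each entry to see the scores produced by the evaluators
# (attribute names are not documented here, so inspect before using them).
for item in results.results:
    print(item)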

Advanced Configuration

Parallel Processing
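
Any built-in concurrency options on evaluate() are not covered in this guide, so check the SDK API Reference for those first. As a framework-agnostic sketch, the example below shards test cases across separate datasets and runs one evaluate() call per shard in a thread pool; the datasets list and shard layout are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

from fiddler_evals import evaluate
from fiddler_evals.evaluators import Toxicity

def run_shard(shard_dataset):
    # One evaluate() call per dataset shard, using the Quick Start arguments.
    return evaluate(
        dataset=shard_dataset,
        task=my_llm_task,
        evaluators=[Toxicity()],
        name_prefix="parallel_eval",
        score_fn_kwargs_mapping={"text": "answer"},
    )

# `datasets` is an illustrative list of Dataset objects created beforehand.
with ThreadPoolExecutor(max_workers=2) as pool:
    shard_results = list(pool.map(run_shard, datasets))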

Experiment Metadata and Organization
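
A sketch of the organization hooks that appear in the Quick Start: per-item metadata on NewDatasetItem supports filtering in the UI, and name_prefix on evaluate() groups related experiment runs. Any experiment-level metadata arguments beyond these are not shown in this guide.

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

# Per-item metadata is attached when the dataset is built ...
item = NewDatasetItem(
    inputs={"question": "What is the capital of France?"},
    expected_outputs={"answer": "Paris is the capital of France"},
    metadata={"type": "Factual", "category": "Geography", "priority": "high"},
)

# ... and name_prefix tags each run so related experiments are easy to
# find and compare side-by-side in the Fiddler UI.
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[AnswerRelevance()],
    name_prefix="geography_v2",  # illustrative prefix
    score_fn_kwargs_mapping={
        "response": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    },
)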

Custom Parameter Mapping
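
score_fn_kwargs_mapping, shown in the Quick Start, connects your task's output and the dataset row to the parameter names each evaluator expects. Based on that example, string values appear to be looked up in the task's output dict, while callables receive the full row; the exact parameter names each evaluator requires are listed in the SDK API Reference.

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness

score_fn_kwargs_mapping = {
    # String values: keys into the dict returned by the task.
    "response": "answer",
    "text": "answer",
    # Callables: receive the full row and can reach into inputs or metadata.
    "prompt": lambda x: x["inputs"]["question"],
}

results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[AnswerRelevance(), Conciseness()],
    name_prefix="mapped_eval",
    score_fn_kwargs_mapping=score_fn_kwargs_mapping,
)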

Troubleshooting

Connection Issues

Problem: Cannot connect to Fiddler instance

Solution:

  1. Verify your URL is correct (e.g., https://your-org.fiddler.ai)

  2. Ensure your access token is valid and not expired

  3. Check network connectivity: curl -I https://your-org.fiddler.ai

  4. Regenerate token from Fiddler UI: Settings > Credentials
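
To confirm the URL and token from Python, a minimal check can reuse init() from the Quick Start; the exception type raised on failure is not documented here, so a broad except is used.

from fiddler_evals import init

try:
    init(
        url='https://your-org.fiddler.ai',
        token='your-access-token'
    )
    print("Connected to Fiddler")
except Exception as exc:  # exact exception type not documented here
    print(f"Connection failed: {exc}")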

Import Errors

Problem: ModuleNotFoundError: No module named 'fiddler_evals'

Solution:

  1. Install the package: pip install fiddler-evals

  2. Make sure the install targets the same Python environment you run from: python -m pip install fiddler-evals

  3. Confirm the installed version: pip show fiddler-evals

Evaluation Failures

Problem: Evaluators failing with parameter errors

Solution:

  1. Check score_fn_kwargs_mapping matches evaluator requirements

  2. Verify task output format matches expected structure

  3. Test evaluators individually:
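
For example, re-run evaluate() with a single evaluator and a minimal mapping to isolate which parameter is wrong (a sketch reusing the Quick Start objects):

from fiddler_evals import evaluate
from fiddler_evals.evaluators import Toxicity

# One evaluator at a time makes it obvious which mapping entry fails.
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[Toxicity()],
    name_prefix="debug_toxicity",
    score_fn_kwargs_mapping={"text": "answer"},
)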

Performance Issues

Problem: Evaluation running slowly

Solution:

  1. Reduce the number of evaluators per run, or split large datasets across smaller experiments

  2. Run independent evaluations concurrently, as outlined under Parallel Processing above

  3. Check network latency between your environment and the Fiddler instance

Next Steps

  1. Quick Start Guide - Complete tutorial with working examples

  2. Getting Started with Evaluations - Understand evaluation concepts and best practices

  3. SDK API Reference - Explore all available classes and methods
