from fiddler_evals import init
# Initialize connection
init(
    url='https://your-org.fiddler.ai',
    token='your-access-token'
)

from fiddler_evals import Project, Application, Dataset
from fiddler_evals.pydantic_models.dataset import NewDatasetItem
# Create organizational structure
project = Project.get_or_create(name='my_eval_project')
application = Application.get_or_create(
    name='my_llm_app',
    project_id=project.id
)

# Create experiment dataset
dataset = Dataset.create(
    name='experiment_dataset',
    application_id=application.id,
    description='Test cases for LLM experiments'
)

# Define test cases
test_cases = [
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris is the capital of France"},
        metadata={"type": "Factual", "category": "Geography"}
    ),
    NewDatasetItem(
        inputs={"question": "Explain photosynthesis"},
        expected_outputs={"answer": "Photosynthesis is the process by which plants convert sunlight into energy"},
        metadata={"type": "Explanation", "category": "Science"}
    ),
]

# Insert test cases into dataset
dataset.insert(test_cases)
print(f"✅ Added {len(test_cases)} test cases")def my_llm_task(inputs, extras, metadata):
"""Your LLM application logic."""
question = inputs.get("question", "")
# Call your LLM here (example uses placeholder)
# In production, call OpenAI, Anthropic, or your LLM
answer = f"Mock response to: {question}"
return {"answer": answer}from fiddler_evals import evaluate

from fiddler_evals import evaluate
from fiddler_evals.evaluators import (
    AnswerRelevance,
    Conciseness,
    FTLPromptSafety
)
MODEL = "openai/gpt-4o"
CREDENTIAL = "your-llm-credential" # From Settings > LLM Gateway
# Run evaluation
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model=MODEL, credential=CREDENTIAL),
        Conciseness(model=MODEL, credential=CREDENTIAL),
        FTLPromptSafety()  # FTL models run locally, no model= needed
    ],
    name_prefix="my_experiment",
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["question"],
        "rag_response": "answer",
        "response": "answer",
        "text": "answer",
    }
)
print(f"✅ Evaluated {len(results.results)} test cases")# Access results programmatically
for item_result in results.results:
print(f"\nTest Case: {item_result.dataset_item_id}")
print(f"Task Output: {item_result.task_output}")
# View scores from each evaluator
for score in item_result.scores:
print(f" {score.name}: {score.value} - {score.reasoning}")
# View results in Fiddler UI
print(f"\n🔗 View results: https://your-org.fiddler.ai")from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness
# Add context to test cases
rag_test_cases = [
    NewDatasetItem(
        inputs={
            "user_query": "What is the capital of France?",
            "retrieved_documents": "Paris is the capital and largest city of France."
        },
        expected_outputs={"rag_response": "Paris"}
    ),
]
dataset.insert(rag_test_cases)
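
# The evaluate() call below passes task=my_rag_task, which is not defined in
# this snippet. A minimal stand-in is sketched here so the example runs end to
# end; in practice, replace its body with your own retrieval + generation code.
def my_rag_task(inputs, extras, metadata):
    """Placeholder RAG task that answers directly from the provided documents."""
    return {
        "rag_response": f"Based on the documents: {inputs['retrieved_documents']}",
        "retrieved_documents": inputs["retrieved_documents"],
    }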
# Evaluate with RAG Health Metrics evaluators
rag_results = evaluate(
    dataset=dataset,
    task=my_rag_task,  # Your RAG system
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        ContextRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        RAGFaithfulness(model="openai/gpt-4o", credential="your-llm-credential")
    ],
    score_fn_kwargs_mapping={
        "rag_response": "rag_response",
        "retrieved_documents": lambda x: x["inputs"]["retrieved_documents"],
        "user_query": lambda x: x["inputs"]["user_query"]
    }
)

from fiddler_evals.evaluators.base import Evaluator
from fiddler_evals.pydantic_models.score import Score

class CustomToneEvaluator(Evaluator):
    """Evaluates if response matches desired tone."""

    def score(self, response: str, desired_tone: str = "professional") -> Score:
        # Your custom evaluation logic
        is_professional = self._check_tone(response, desired_tone)
        return Score(
            name="tone_match",
            value=1.0 if is_professional else 0.0,
            reasoning=f"Response {'matches' if is_professional else 'does not match'} {desired_tone} tone"
        )

    def _check_tone(self, text: str, tone: str) -> bool:
        # Implement your tone detection logic
        # Could use keyword matching, LLM-as-judge, or ML model
        return True  # Placeholder
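
# One possible way to fill in _check_tone, sketched as a standalone helper you
# could call from the class above. The keyword list and rule are illustrative
# assumptions, not part of the fiddler_evals API; an LLM-as-judge or a trained
# classifier would be more robust.
CASUAL_MARKERS = {"lol", "gonna", "wanna", "hey", "btw"}

def looks_professional(text: str) -> bool:
    """Return True if the text contains none of the casual markers."""
    return set(text.lower().split()).isdisjoint(CASUAL_MARKERS)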

# Use custom evaluator
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[CustomToneEvaluator()],
    score_fn_kwargs_mapping={
        "response": "answer",
        "desired_tone": "professional"
    }
)

# Evaluate with parallel workers for faster execution
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        Conciseness(model="openai/gpt-4o", credential="your-llm-credential")
    ],
    max_workers=5  # Process 5 test cases in parallel
)

# From CSV
dataset.insert_from_csv_file(
    csv_file_path='test_cases.csv',
    inputs_columns=['question'],
    expected_outputs_columns=['answer']
)
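
# For reference, a test_cases.csv matching the column arguments above might
# look like the following (an assumed layout with a header row; check the
# insert_from_csv_file documentation for the exact format it expects):
#
#   question,answer
#   What is the capital of France?,Paris is the capital of France
#   Explain photosynthesis,Photosynthesis converts sunlight into energy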
# From JSONL
dataset.insert_from_jsonl_file(
    jsonl_file_path='test_cases.jsonl'
)
# From Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'question': ['Q1', 'Q2'],
    'expected_answer': ['A1', 'A2']
})

dataset.insert_from_pandas(
    dataframe=df,
    inputs_columns=['question'],
    expected_outputs_columns=['expected_answer']
)

# Add experiment metadata for tracking
results = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential")],
    name_prefix="experiment_v2",  # Version your experiments
    score_fn_kwargs_mapping={"response": "answer"}
)

# Results are automatically tracked in Fiddler
# View experiment history in the Fiddler UI

from fiddler_evals import init, Project, Application, Dataset, evaluate
from fiddler_evals.pydantic_models.dataset import NewDatasetItem
from fiddler_evals.evaluators import (
    RAGFaithfulness,
    AnswerRelevance,
    ContextRelevance,
    Conciseness
)
# Step 1: Initialize
init(url='https://your-org.fiddler.ai', token='your-token')
# Step 2: Set up organization
project = Project.get_or_create(name='rag_experiments')
app = Application.get_or_create(name='doc_qa_system', project_id=project.id)
dataset = Dataset.create(name='qa_test_set', application_id=app.id)
# Step 3: Create test cases
test_cases = [
    NewDatasetItem(
        inputs={
            "user_query": "What is machine learning?",
            "retrieved_documents": "Machine learning is a subset of AI that enables "
                                   "systems to learn from data."
        },
        expected_outputs={
            "rag_response": "Machine learning is a subset of AI."
        },
        metadata={"difficulty": "easy"}
    ),
]
dataset.insert(test_cases)
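
# rag_task (Step 4) calls generate_answer(), which stands in for your own
# generation step. A trivial stub is sketched here so the script runs as
# written; swap in your actual RAG / LLM pipeline call.
def generate_answer(user_query: str, retrieved_documents: str) -> str:
    """Placeholder generation step: answer directly from the retrieved documents."""
    return f"Answer to '{user_query}' based on: {retrieved_documents}"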
# Step 4: Define RAG task
def rag_task(inputs, extras, metadata):
"""Your RAG system implementation."""
user_query = inputs["user_query"]
retrieved_documents = inputs["retrieved_documents"]
# Call your RAG system (simplified example)
rag_response = generate_answer(user_query, retrieved_documents)
return {
"rag_response": rag_response,
"retrieved_documents": retrieved_documents,
}
# Step 5: Run comprehensive evaluation
results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[
        RAGFaithfulness(model="openai/gpt-4o", credential="your-llm-credential"),   # Check factual grounding (Yes/No)
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),   # Check relevance to query (High/Medium/Low)
        ContextRelevance(model="openai/gpt-4o", credential="your-llm-credential"),  # Check retrieval quality (High/Medium/Low)
        Conciseness(model="openai/gpt-4o", credential="your-llm-credential"),       # Check for verbosity
    ],
    name_prefix="rag_eval_v1",
    score_fn_kwargs_mapping={
        "rag_response": "rag_response",
        "retrieved_documents": "retrieved_documents",
        "user_query": lambda x: x["inputs"]["user_query"],
    }
)
# Step 6: Analyze
print(f"Evaluated {len(results.results)} test cases")
for result in results.results:
print(f"\n{result.dataset_item_id}:")
for score in result.scores:
print(f" {score.name}: {score.value:.3f}")