# Running RAG Experiments at Scale

Move beyond ad-hoc evaluation to structured experiments that track results, validate against golden labels, and enable side-by-side comparison of RAG pipeline configurations.

**Use this cookbook when:** You want to compare different retrieval strategies, LLM models, or prompt configurations across a standardized test set.

**Time to complete**: \~25 minutes

{% @mermaid/diagram content="graph TD
A\["Project"] --> B\["Application"]
B --> C\["Dataset"]
C --> D\["Experiment v1"]
C --> E\["Experiment v2"]
D --> F\["Compare Results"]
E --> F

subgraph Evaluators
    G\["Context Relevance"]
    H\["RAG Faithfulness"]
    I\["Answer Relevance"]
end

D -.-> Evaluators
E -.-> Evaluators" %}

{% hint style="info" %}
**Prerequisites**

* Fiddler account with API access
* LLM credential configured in **Settings > LLM Gateway**
* `pip install fiddler-evals pandas`
* Familiarity with [RAG Evaluation Fundamentals](https://docs.fiddler.ai/developers/cookbooks/rag-evaluation-fundamentals) recommended

{% endhint %}

***

{% stepper %}
{% step %}

#### Set Up the Experiment Infrastructure

Experiments are organized as: **Project > Application > Dataset > Experiment**

{% hint style="info" %}
Replace `URL`, `TOKEN`, and credential names with your Fiddler account details. Find your credentials in **Settings > Access Tokens** and **Settings > LLM Gateway**.
{% endhint %}

```python
import pandas as pd
from fiddler_evals import Application, Dataset, Project, evaluate, init
from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'
LLM_MODEL_NAME = 'openai/gpt-4o'

init(url=URL, token=TOKEN)

project = Project.get_or_create(name='rag_experiments')
application = Application.get_or_create(
    name='rag-pipeline-comparison',
    project_id=project.id,
)
dataset = Dataset.get_or_create(
    name='rag-test-cases',
    application_id=application.id,
)
```

{% endstep %}

{% step %}

#### Create Test Cases with Golden Labels

Include `expected_quality` labels so you can validate whether evaluators correctly identify good and bad responses:

```python
rag_data = pd.DataFrame(
    [
        {
            'scenario': 'Perfect Match',
            'expected_quality': 'good',
            'user_query': 'What is the capital of France?',
            'retrieved_documents': [
                'Paris is the capital and largest city of France.',
                'France is located in Western Europe.',
            ],
            'rag_response': 'The capital of France is Paris.',
        },
        {
            'scenario': 'Irrelevant Context',
            'expected_quality': 'bad',
            'user_query': 'How do I reset my password?',
            'retrieved_documents': [
                'To make pasta, boil water and add salt.',
                'Italian cuisine features many pasta dishes.',
            ],
            'rag_response': 'To reset your password, go to the login page '
                'and click Forgot Password.',
        },
        {
            'scenario': 'Hallucination',
            'expected_quality': 'bad',
            'user_query': 'What are the business hours?',
            'retrieved_documents': [
                'Our office is located at 123 Main Street.',
                'We are closed on federal holidays.',
            ],
            'rag_response': 'Our business hours are Monday through Friday, '
                '9 AM to 5 PM.',
        },
        {
            'scenario': 'Irrelevant Answer',
            'expected_quality': 'bad',
            'user_query': 'What is your return policy?',
            'retrieved_documents': [
                'Returns are accepted within 30 days of purchase.',
                'Items must be unused and in original packaging.',
            ],
            'rag_response': 'We offer free shipping on orders over $50. '
                'Delivery takes 3-5 business days.',
        },
    ]
)
```

{% endstep %}

{% step %}

#### Insert Data into the Dataset

```python
if not list(dataset.get_items()):
    dataset.insert_from_pandas(
        df=rag_data,
        input_columns=['user_query', 'retrieved_documents', 'rag_response'],
        expected_output_columns=['expected_quality'],
        metadata_columns=['scenario'],
    )
    print(f'Inserted {len(rag_data)} test cases')
else:
    print('Dataset already has items, skipping insert')
```

**Expected output:**

```
Inserted 4 test cases
```

{% hint style="info" %}
The idempotency check (`if not list(dataset.get_items())`) prevents duplicate inserts if you re-run the notebook. Remove this check if you want to refresh the dataset.
{% endhint %}
{% endstep %}

{% step %}

#### Run the Experiment

Define a task function that returns the RAG response, then run the experiment with all three RAG Health evaluators:

```python
def rag_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    """Return pre-recorded RAG response.
    Replace with your actual RAG pipeline in production.
    """
    return {'rag_response': inputs['rag_response']}


evaluators = [
    ContextRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
]

result = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
        'rag_response': 'rag_response',
    },
)

print(f'Experiment: {result.experiment.name}')
print(f'Evaluated {len(result.results)} test cases')
```

{% hint style="info" %}
**Understanding `score_fn_kwargs_mapping`:** This dict maps evaluator parameter names to data sources. Use a **lambda** to extract values from dataset inputs (`x['inputs']['...']`), or a **string** to reference a key from the task function's return dict.
{% endhint %}

**Expected output:**

```
Experiment: rag-test-cases-2026-02-07-001
Evaluated 4 test cases
```
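
To make the mapping behavior concrete, here is a small standalone sketch in plain Python, independent of the SDK. The `item` and `task_output` shapes below are simplified assumptions for illustration only, not the library's internal representation; the sketch only shows how a lambda reads from dataset inputs while a string references a key in the task function's return dict.

```python
# Illustration only: how the two mapping styles resolve for one test case.
# The `item` and `task_output` shapes are simplified assumptions, not the
# fiddler-evals internals.
item = {
    'inputs': {
        'user_query': 'What is the capital of France?',
        'rag_response': 'The capital of France is Paris.',
    }
}
task_output = {'rag_response': 'The capital of France is Paris.'}

mapping = {
    'user_query': lambda x: x['inputs']['user_query'],  # lambda -> dataset inputs
    'rag_response': 'rag_response',                      # string -> task output key
}

resolved = {
    name: source(item) if callable(source) else task_output[source]
    for name, source in mapping.items()
}
print(resolved)
# {'user_query': 'What is the capital of France?',
#  'rag_response': 'The capital of France is Paris.'}
```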

{% endstep %}

{% step %}

#### Validate Against Golden Labels

Check whether the evaluators correctly identified quality issues by comparing their scores against your expected labels:

```python
from fiddler_evals.pydantic_models.experiment import ExperimentItemResult
from fiddler_evals.pydantic_models.score import Score

validation_results = []
correct = 0

for r in result.results:
    expected = r.dataset_item.expected_outputs.get('expected_quality')
    has_problem = any(s.value < 0.5 for s in r.scores)
    predicted = 'bad' if has_problem else 'good'

    if expected == predicted:
        correct += 1

    validation_results.append(
        ExperimentItemResult(
            experiment_item=r.experiment_item,
            dataset_item=r.dataset_item,
            scores=[
                Score(
                    name='predicted_quality',
                    evaluator_name='OverallQuality',
                    value=1.0 if predicted == 'good' else 0.0,
                    label=predicted,
                    reasoning=f'Expected: {expected}',
                )
            ],
        )
    )

result.experiment.add_results(validation_results)
print(
    f'Evaluator Accuracy: {correct}/{len(result.results)} '
    f'({100 * correct / len(result.results):.0f}%)'
)
```

**Expected output:**

```
Evaluator Accuracy: 4/4 (100%)
```

{% endstep %}

{% step %}

#### View Results

```python
print(f'View in Fiddler: {URL}/evals/experiments/{result.experiment.id}')

# Build results DataFrame
rows = []
for r, v in zip(result.results, validation_results):
    row = {
        'scenario': r.dataset_item.metadata.get('scenario'),
        'expected': r.dataset_item.expected_outputs.get('expected_quality'),
        'predicted': v.scores[0].label,
    }
    row.update({s.evaluator_name: s.value for s in r.scores})
    rows.append(row)

pd.DataFrame(rows)
```

**Expected output:**

| scenario           | expected | predicted | ContextRelevance | RAGFaithfulness | AnswerRelevance |
| ------------------ | -------- | --------- | ---------------- | --------------- | --------------- |
| Perfect Match      | good     | good      | 1.0              | 1.0             | 1.0             |
| Irrelevant Context | bad      | bad       | 0.0              | 0.0             | 0.5             |
| Hallucination      | bad      | bad       | 0.5              | 0.0             | 1.0             |
| Irrelevant Answer  | bad      | bad       | 1.0              | 0.0             | 0.0             |

{% endstep %}
{% endstepper %}

***

## Comparing Pipeline Configurations

To compare different RAG configurations, run multiple experiments against the same dataset:

{% hint style="warning" %}
Replace `rag_pipeline_v1` and `rag_pipeline_v2` with your actual RAG pipeline functions. Each function must accept `(inputs, extras, metadata)` and return a dict containing `rag_response`.
{% endhint %}
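
If you are unsure what that contract looks like in practice, here is a minimal sketch of two such functions. `retrieve` and `generate` are hypothetical placeholders for your own retrieval and generation components, not Fiddler APIs:

```python
# Hedged sketch of the task-function contract. `retrieve` and `generate` are
# hypothetical stand-ins for your own retrieval and generation components.
def retrieve(query: str, top_k: int = 5) -> list:
    # Placeholder: call your vector store or search index here.
    return ['<retrieved document>'] * top_k


def generate(query: str, documents: list) -> str:
    # Placeholder: call your LLM with the query and retrieved context here.
    return '<generated answer>'


def rag_pipeline_v1(inputs: dict, extras: dict, metadata: dict) -> dict:
    """Baseline pipeline: retrieve, then generate."""
    query = inputs['user_query']
    documents = retrieve(query, top_k=5)
    return {'rag_response': generate(query, documents)}


def rag_pipeline_v2(inputs: dict, extras: dict, metadata: dict) -> dict:
    """Variant pipeline: over-retrieve, re-rank, then generate."""
    query = inputs['user_query']
    candidates = retrieve(query, top_k=20)
    documents = candidates[:5]  # Placeholder: apply your re-ranker here.
    return {'rag_response': generate(query, documents)}
```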

```python
# Experiment 1: Default retrieval
result_v1 = evaluate(
    dataset=dataset,
    task=rag_pipeline_v1,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
        'rag_response': 'rag_response',
    },
)

# Experiment 2: Improved retrieval with re-ranking
result_v2 = evaluate(
    dataset=dataset,
    task=rag_pipeline_v2,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
        'rag_response': 'rag_response',
    },
)

# Compare results side-by-side in the Fiddler UI
print(f'V1: {URL}/evals/experiments/{result_v1.experiment.id}')
print(f'V2: {URL}/evals/experiments/{result_v2.experiment.id}')
```

Both experiments appear in the Fiddler UI under the same Application, enabling side-by-side comparison of scores across all test cases.
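
If you also want a quick programmatic comparison, a sketch like the following builds a side-by-side score table. It reuses the result and score attributes from the View Results step above, and assumes the evaluator names match the column headers shown in that step's output table:

```python
import pandas as pd

# Collect per-test-case scores from both experiments into one DataFrame.
rows = []
for label, res in [('v1', result_v1), ('v2', result_v2)]:
    for r in res.results:
        row = {'experiment': label, 'scenario': r.dataset_item.metadata.get('scenario')}
        row.update({s.evaluator_name: s.value for s in r.scores})
        rows.append(row)

comparison = pd.DataFrame(rows)

# Mean score per evaluator for each pipeline version.
print(
    comparison.groupby('experiment')[
        ['ContextRelevance', 'RAGFaithfulness', 'AnswerRelevance']
    ].mean()
)
```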

***

## Next Steps

* [RAG Evaluation Fundamentals](https://docs.fiddler.ai/developers/cookbooks/rag-evaluation-fundamentals) — Direct `.score()` API for quick iteration
* [Building Custom Judge Evaluators](https://docs.fiddler.ai/developers/cookbooks/custom-judge-evaluators) — Add domain-specific evaluation criteria
* [RAG Health Metrics Tutorial](https://docs.fiddler.ai/developers/tutorials/experiments/rag-health-metrics-tutorial) — Step-by-step tutorial

***

**Source notebook**: [Fiddler Cookbook: RAG Experiments at Scale](https://github.com/fiddler-labs/fiddler-examples/blob/main/cookbooks/Fiddler_Cookbook_RAG_Experiments_at_Scale.ipynb)

***

:question: Questions? [Talk](https://www.fiddler.ai/contact-sales) to a product expert or [request](https://www.fiddler.ai/demo) a demo.

:bulb: Need help? Contact us at <support@fiddler.ai>.
