Running RAG Experiments at Scale

Move beyond ad-hoc evaluation to structured experiments that track results, validate against golden labels, and enable side-by-side comparison of RAG pipeline configurations.

Use this cookbook when: You want to compare different retrieval strategies, LLM models, or prompt configurations across a standardized test set.

Time to complete: ~25 minutes


Prerequisites

  • Fiddler account with API access

  • LLM credential configured in Settings > LLM Gateway

  • pip install fiddler-evals pandas

  • Familiarity with RAG Evaluation Fundamentals recommended


1. Set Up the Experiment Infrastructure

Experiments are organized as: Project > Application > Dataset > Experiment

Note: Replace URL, TOKEN, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.

import pandas as pd
from fiddler_evals import Application, Dataset, Project, evaluate, init
from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'
LLM_MODEL_NAME = 'openai/gpt-4o'

init(url=URL, token=TOKEN)

project = Project.get_or_create(name='rag_experiments')
application = Application.get_or_create(
    name='rag-pipeline-comparison',
    project_id=project.id,
)
dataset = Dataset.get_or_create(
    name='rag-test-cases',
    application_id=application.id,
)
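
If you prefer not to hard-code the token in the notebook, a variant of the setup above reads it from environment variables before calling init. The FIDDLER_URL and FIDDLER_TOKEN names below are hypothetical; use whatever your environment defines:

import os

URL = os.environ.get('FIDDLER_URL', 'https://your-org.fiddler.ai')  # hypothetical env var name
TOKEN = os.environ['FIDDLER_TOKEN']  # hypothetical env var name

init(url=URL, token=TOKEN)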

2. Create Test Cases with Golden Labels

Include expected_quality labels so you can validate whether evaluators correctly identify good and bad responses:

rag_data = pd.DataFrame(
    [
        {
            'scenario': 'Perfect Match',
            'expected_quality': 'good',
            'user_query': 'What is the capital of France?',
            'retrieved_documents': [
                'Paris is the capital and largest city of France.',
                'France is located in Western Europe.',
            ],
            'rag_response': 'The capital of France is Paris.',
        },
        {
            'scenario': 'Irrelevant Context',
            'expected_quality': 'bad',
            'user_query': 'How do I reset my password?',
            'retrieved_documents': [
                'To make pasta, boil water and add salt.',
                'Italian cuisine features many pasta dishes.',
            ],
            'rag_response': 'To reset your password, go to the login page '
                'and click Forgot Password.',
        },
        {
            'scenario': 'Hallucination',
            'expected_quality': 'bad',
            'user_query': 'What are the business hours?',
            'retrieved_documents': [
                'Our office is located at 123 Main Street.',
                'We are closed on federal holidays.',
            ],
            'rag_response': 'Our business hours are Monday through Friday, '
                '9 AM to 5 PM.',
        },
        {
            'scenario': 'Irrelevant Answer',
            'expected_quality': 'bad',
            'user_query': 'What is your return policy?',
            'retrieved_documents': [
                'Returns are accepted within 30 days of purchase.',
                'Items must be unused and in original packaging.',
            ],
            'rag_response': 'We offer free shipping on orders over $50. '
                'Delivery takes 3-5 business days.',
        },
    ]
)

3. Insert Data into the Dataset

if not list(dataset.get_items()):
    dataset.insert_from_pandas(
        df=rag_data,
        input_columns=['user_query', 'retrieved_documents', 'rag_response'],
        expected_output_columns=['expected_quality'],
        metadata_columns=['scenario'],
    )
    print(f'Inserted {len(rag_data)} test cases')
else:
    print('Dataset already has items, skipping insert')

Expected output:

Inserted 4 test cases

Note: The idempotency check (if not list(dataset.get_items())) prevents duplicate inserts if you re-run the notebook. Remove this check if you want to refresh the dataset.
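
To double-check the insert, you can read the items back with the same get_items call used in the guard above:

items = list(dataset.get_items())
print(f'Dataset contains {len(items)} items')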

4. Run the Experiment

Define a task function that returns the RAG response, then run the experiment with all three RAG Health evaluators:

def rag_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    """Return pre-recorded RAG response.
    Replace with your actual RAG pipeline in production.
    """
    return {'rag_response': inputs['rag_response']}


evaluators = [
    ContextRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
]

result = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
        'rag_response': 'rag_response',
    },
)

print(f'Experiment: {result.experiment.name}')
print(f'Evaluated {len(result.results)} test cases')

Understanding score_fn_kwargs_mapping: This dict maps evaluator parameter names to data sources. Use a lambda to extract values from dataset inputs (x['inputs']['...']), or a string to reference a key from the task function's return dict.
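
To make the two styles concrete, here is the same mapping from the call above with comments added (no new parameters):

score_fn_kwargs_mapping = {
    # Lambda style: pull the value from the dataset item's inputs.
    'user_query': lambda x: x['inputs']['user_query'],
    'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
    # String style: look up the key in the dict returned by rag_task.
    'rag_response': 'rag_response',
}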

Expected output:

Experiment: rag-test-cases-2026-02-07-001
Evaluated 4 test cases

5. Validate Against Golden Labels

Check whether the evaluators correctly identified quality issues by comparing their scores against your expected labels:

from fiddler_evals.pydantic_models.experiment import ExperimentItemResult
from fiddler_evals.pydantic_models.score import Score

validation_results = []
correct = 0

for r in result.results:
    expected = r.dataset_item.expected_outputs.get('expected_quality')
    has_problem = any(s.value < 0.5 for s in r.scores)
    predicted = 'bad' if has_problem else 'good'

    if expected == predicted:
        correct += 1

    validation_results.append(
        ExperimentItemResult(
            experiment_item=r.experiment_item,
            dataset_item=r.dataset_item,
            scores=[
                Score(
                    name='predicted_quality',
                    evaluator_name='OverallQuality',
                    value=1.0 if predicted == 'good' else 0.0,
                    label=predicted,
                    reasoning=f'Expected: {expected}',
                )
            ],
        )
    )

result.experiment.add_results(validation_results)
print(
    f'Evaluator Accuracy: {correct}/{len(result.results)} '
    f'({100 * correct / len(result.results):.0f}%)'
)

Expected output:

Evaluator Accuracy: 4/4 (100%)
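
If accuracy drops below 100% on your own test set, it helps to see which evaluator flagged (or missed) each case. A short follow-up sketch that reuses only the fields already shown above:

for r in result.results:
    scenario = r.dataset_item.metadata.get('scenario')
    # Any score below 0.5 counts as a flag, matching the validation rule above.
    flagged_by = [s.evaluator_name for s in r.scores if s.value < 0.5]
    print(f"{scenario}: flagged by {flagged_by or 'none'}")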

6. View Results

print(f'View in Fiddler: {URL}/evals/experiments/{result.experiment.id}')

# Build results DataFrame
rows = []
for r, v in zip(result.results, validation_results):
    row = {
        'scenario': r.dataset_item.metadata.get('scenario'),
        'expected': r.dataset_item.expected_outputs.get('expected_quality'),
        'predicted': v.scores[0].label,
    }
    row.update({s.evaluator_name: s.value for s in r.scores})
    rows.append(row)

pd.DataFrame(rows)

Expected output:

scenario            expected  predicted  ContextRelevance  RAGFaithfulness  AnswerRelevance
Perfect Match       good      good       1.0               1.0              1.0
Irrelevant Context  bad       bad        0.0               0.0              0.5
Hallucination       bad       bad        0.5               0.0              1.0
Irrelevant Answer   bad       bad        1.0               0.0              0.0

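For a quick aggregate view, you can average each evaluator's score across the test cases from the rows list built above (plain pandas, nothing Fiddler-specific):

scores_df = pd.DataFrame(rows)
print(scores_df[['ContextRelevance', 'RAGFaithfulness', 'AnswerRelevance']].mean())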

Comparing Pipeline Configurations

To compare different RAG configurations, run multiple experiments against the same dataset:

# Experiment 1: Default retrieval
result_v1 = evaluate(
    dataset=dataset,
    task=rag_pipeline_v1,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
        'rag_response': 'rag_response',
    },
)

# Experiment 2: Improved retrieval with re-ranking
result_v2 = evaluate(
    dataset=dataset,
    task=rag_pipeline_v2,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
        'rag_response': 'rag_response',
    },
)

# Compare results side-by-side in the Fiddler UI
print(f'V1: {URL}/evals/experiments/{result_v1.experiment.id}')
print(f'V2: {URL}/evals/experiments/{result_v2.experiment.id}')

Both experiments appear in the Fiddler UI under the same Application, enabling side-by-side comparison of scores across all test cases.
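
The rag_pipeline_v1 and rag_pipeline_v2 names above are placeholders for your own task functions; they should use the same signature as rag_task and return a dict containing 'rag_response', the key referenced in score_fn_kwargs_mapping. A minimal stub for the second configuration (define rag_pipeline_v1 the same way for your baseline):

def rag_pipeline_v2(inputs: dict, extras: dict, metadata: dict) -> dict:
    """Placeholder for the re-ranked retrieval configuration.

    Swap the body for your real pipeline: retrieve documents for
    inputs['user_query'], re-rank them, generate an answer, and
    return it under 'rag_response'.
    """
    # Stub: echoes the pre-recorded response so the sketch stays
    # runnable without an external retrieval stack.
    return {'rag_response': inputs['rag_response']}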


Next Steps


Source notebook: Fiddler Cookbook: RAG Experiments at Scale


Questions? Talk to a product expert or request a demo.

💡 Need help? Contact us at [email protected].