Detecting Hallucinations in RAG

Build a hallucination detection pipeline that combines pre-deployment evaluation with the Evals SDK and continuous production monitoring through LLM Observability enrichments and Evaluator Rules.

Use this cookbook when: You want to monitor your RAG application for hallucinations across both testing and production environments.

Time to complete: ~25 minutes

Prerequisites

  • Fiddler account with API access

  • LLM credential configured in Settings > LLM Gateway

  • pip install fiddler-evals fiddler-client pandas


The Two-Layer Approach

Hallucination detection works best as a two-layer pipeline:

Layer            Tool                                   Purpose
Pre-deployment   Evals SDK                              Test against known scenarios, validate with golden labels
Production       LLM Observability + Evaluator Rules    Continuous monitoring of live traffic


Layer 1: Pre-Deployment Evaluation

Step 1: Set Up and Connect

Connect to Fiddler and import the RAG Health Metrics triad (RAG Faithfulness, Answer Relevance, Context Relevance), which you will use to distinguish hallucinations from other failure modes:

Note: Replace URL, TOKEN, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.

import pandas as pd
from fiddler_evals import init, evaluate, Project, Application, Dataset
from fiddler_evals.evaluators import (
    AnswerRelevance,
    ContextRelevance,
    RAGFaithfulness,
)

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'
LLM_MODEL_NAME = 'openai/gpt-4o'

init(url=URL, token=TOKEN)

project = Project.get_or_create(name='hallucination_detection')
app = Application.get_or_create(
    name='rag-hallucination-test',
    project_id=project.id,
)
dataset = Dataset.get_or_create(
    name='hallucination-scenarios',
    application_id=app.id,
)

Step 2: Create Hallucination-Focused Test Cases

Design test cases that specifically probe for hallucination patterns:

hallucination_scenarios = pd.DataFrame(
    [
        {
            'scenario': 'Grounded response',
            'user_query': 'What is the return policy?',
            'retrieved_documents': [
                'Returns accepted within 30 days with receipt.',
            ],
            'rag_response': 'You can return items within 30 days '
                'if you have a receipt.',
        },
        {
            'scenario': 'Fabricated details',
            'user_query': 'What is the return policy?',
            'retrieved_documents': [
                'Returns accepted within 30 days with receipt.',
            ],
            'rag_response': 'You can return items within 60 days. '
                'No receipt needed. We also offer free shipping on returns.',
        },
        {
            'scenario': 'Insufficient context',
            'user_query': 'What are the shipping costs?',
            'retrieved_documents': [
                'We ship to all 50 US states.',
            ],
            'rag_response': 'Standard shipping is $5.99 and express '
                'shipping is $12.99.',
        },
    ]
)

dataset.insert_from_pandas(
    df=hallucination_scenarios,
    input_columns=['user_query', 'retrieved_documents', 'rag_response'],
    metadata_columns=['scenario'],
)

Step 3: Run the Diagnostic Evaluation

def passthrough_task(inputs, extras, metadata):
    # The dataset already contains the RAG responses, so the task simply
    # passes them through for the evaluators to score.
    return {
        'rag_response': inputs['rag_response'],
        'retrieved_documents': inputs['retrieved_documents'],
    }

result = evaluate(
    dataset=dataset,
    task=passthrough_task,
    evaluators=[
        RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
        AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
        ContextRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    ],
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': 'retrieved_documents',
        'rag_response': 'rag_response',
    },
)

Step 4: Interpret Results

Use the diagnostic workflow to classify failures:

for r in result.results:
    scores = {s.evaluator_name: s for s in r.scores}
    scenario = r.dataset_item.metadata.get('scenario', 'unknown')

    faithfulness = scores.get('rag_faithfulness')
    relevance = scores.get('answer_relevance')
    context = scores.get('context_relevance')

    # Classify the failure mode
    if faithfulness and faithfulness.value == 0:
        diagnosis = 'HALLUCINATION'
    elif context and context.value < 0.5:
        diagnosis = 'BAD RETRIEVAL'
    elif relevance and relevance.value < 0.5:
        diagnosis = 'OFF-TOPIC'
    else:
        diagnosis = 'HEALTHY'

    print(f'{scenario}: {diagnosis}')
    if faithfulness:
        print(f'  Faithfulness: {faithfulness.label} — {faithfulness.reasoning}')

Expected output:

Grounded response: HEALTHY
  Faithfulness: yes — The response accurately reflects the return policy
  stated in the retrieved document.

Fabricated details: HALLUCINATION
  Faithfulness: no — The response claims a 60-day return window and no
  receipt requirement, but the source document states 30 days with receipt.

Insufficient context: HALLUCINATION
  Faithfulness: no — The response provides specific prices ($5.99, $12.99)
  that are not supported by the retrieved document.

Reading the diagnosis: The triad distinguishes why a response failed; the sketch after this list turns these diagnoses into a simple release gate:

  • HALLUCINATION = Faithfulness fails (response fabricates information)

  • BAD RETRIEVAL = Context Relevance fails (wrong documents retrieved)

  • OFF-TOPIC = Answer Relevance fails (response doesn't address the question)
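
A minimal sketch of turning these diagnoses into a release gate, assuming the same result object produced above; the OK/HALLUCINATION labels and the zero-tolerance threshold are illustrative choices, not SDK behavior:

from collections import Counter

# Collect one diagnosis per test case, reusing the faithfulness signal from above.
diagnoses = []
for r in result.results:
    scores = {s.evaluator_name: s for s in r.scores}
    faithfulness = scores.get('rag_faithfulness')
    is_hallucination = faithfulness is not None and faithfulness.value == 0
    diagnoses.append('HALLUCINATION' if is_hallucination else 'OK')

summary = Counter(diagnoses)
print(dict(summary))

# Against a regression set of expected-good cases, treat any hallucination as a blocker.
if summary['HALLUCINATION']:
    raise SystemExit(
        f"{summary['HALLUCINATION']} of {len(diagnoses)} responses hallucinated; blocking release."
    )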


Layer 2: Production Monitoring

For applications using Agentic Monitoring, configure Evaluator Rules to continuously evaluate production spans:

  1. Navigate to your application's Evaluator Rules tab

  2. Add a rule for RAG Faithfulness

  3. Map evaluator inputs to your span attributes (the instrumentation sketch below shows one way to emit them):

    • user_query → your query span attribute

    • rag_response → your response span attribute

    • retrieved_documents → your context span attribute

  4. Set alert thresholds (e.g., alert when faithfulness drops below 80%)

See Evaluator Rules for step-by-step instructions.
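
An Evaluator Rule can only read attributes that your spans actually carry. A minimal sketch of emitting them with generic OpenTelemetry instrumentation; retrieve() and generate() stand in for your own pipeline, and the rag.* attribute names are placeholders to map in the rule:

from opentelemetry import trace

tracer = trace.get_tracer('rag-app')

def answer(query: str) -> str:
    with tracer.start_as_current_span('rag.generate') as span:
        docs = retrieve(query)            # placeholder: your retriever
        response = generate(query, docs)  # placeholder: your LLM call

        # Placeholder attribute names; map them to user_query,
        # retrieved_documents, and rag_response in the Evaluator Rule.
        span.set_attribute('rag.user_query', query)
        span.set_attribute('rag.retrieved_documents', docs)
        span.set_attribute('rag.response', response)
        return response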


Combining Both Layers

The most effective hallucination detection pipeline uses both layers:

Stage           What to Do                                    Tool
Development     Test against known hallucination scenarios    Evals SDK + RAG Faithfulness
Pre-release     Run experiments comparing pipeline changes    Evals SDK + full diagnostic triad
Production      Continuous monitoring with alerting           Evaluator Rules or LLM Obs enrichments
Investigation   Deep-dive into flagged events                 Evals SDK .score() on specific cases
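
For the investigation stage, you can score a single flagged event directly rather than building a dataset. A minimal sketch, assuming RAGFaithfulness exposes a score() method that accepts the same inputs mapped earlier (check the Evals SDK reference for the exact parameter names):

# One-off check of an event flagged in production (illustrative values).
evaluator = RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)

score = evaluator.score(
    user_query='What is the return policy?',
    retrieved_documents=['Returns accepted within 30 days with receipt.'],
    rag_response='You can return items within 60 days. No receipt needed.',
)
print(score.label, score.reasoning)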


Next Steps


Questions? Talk to a product expert or request a demo.

💡 Need help? Contact us at [email protected].