RAG Health Metrics Tutorial

Evaluate your RAG application using the RAG Health Metrics diagnostic triad to pinpoint whether issues originate in retrieval, generation, or query understanding.

Time to complete: ~30 minutes

What You'll Learn

Set up a RAG experiment pipeline with the Fiddler Evals SDK
Use Answer Relevance, Context Relevance, and RAG Faithfulness together
Interpret diagnostic results to identify pipeline failures
Distinguish between retrieval and generation problems

Prerequisites

Fiddler Account: Active account with API access
Python 3.10+
Fiddler Evals SDK: pip install fiddler-evals
Familiarity with: Experiments Getting Started

Step 1: Connect and Set Up

from fiddler_evals import init, Project, Application, Dataset
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

# Initialize connection
init(
    url='https://your-org.fiddler.ai',
    token='your-access-token'
)

# Create organizational structure
project = Project.get_or_create(name='rag_health_experiments')
application = Application.get_or_create(
    name='my_rag_app',
    project_id=project.id
)
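
Hardcoding the access token in a script makes it easy to leak through version control. One option is to read the connection settings from environment variables and pass them to init(). The sketch below assumes two variable names, FIDDLER_URL and FIDDLER_TOKEN; both are hypothetical choices for this tutorial, not names defined by the SDK, and load_fiddler_settings is an illustrative helper.

```python
import os


def load_fiddler_settings(env=None):
    """Return init() keyword arguments read from environment variables.

    FIDDLER_URL and FIDDLER_TOKEN are hypothetical names chosen for this
    sketch; use whatever your deployment already defines.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("FIDDLER_URL", "FIDDLER_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"url": env["FIDDLER_URL"], "token": env["FIDDLER_TOKEN"]}
```

With both variables set, the connection call in this step could become init(**load_fiddler_settings()).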

Step 2: Create a RAG Experiment Dataset

Create test cases that include user queries and retrieved documents. The quality of your evaluation depends on realistic, representative test cases.

dataset = Dataset.create(
    name='rag_health_test_cases',
    application_id=application.id,
    description='RAG Health Metrics experiment dataset'
)

test_cases = [
    # Scenario 1: Good RAG response
    NewDatasetItem(
        inputs={
            "user_query": "What are the benefits of renewable energy?",
            "retrieved_documents": "Renewable energy sources like solar and wind reduce "
                "greenhouse gas emissions, decrease dependence on fossil fuels, and can "
                "lower long-term energy costs. Solar panel costs have dropped 89% since 2010."
        },
        metadata={"scenario": "good_response"}
    ),
    # Scenario 2: Irrelevant retrieval
    NewDatasetItem(
        inputs={
            "user_query": "What are the benefits of renewable energy?",
            "retrieved_documents": "The history of the automobile dates back to the 15th "
                "century. Karl Benz patented the first true automobile in 1886."
        },
        metadata={"scenario": "bad_retrieval"}
    ),
    # Scenario 3: Hallucination risk
    NewDatasetItem(
        inputs={
            "user_query": "What is the current price of solar panels?",
            "retrieved_documents": "Solar energy adoption has grown significantly worldwide. "
                "Many countries now have solar incentive programs."
        },
        metadata={"scenario": "insufficient_context"}
    ),
]

dataset.insert(test_cases)
print(f"Added {len(test_cases)} test cases")
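
Because the evaluators read both the query and the retrieved documents, a test case with an empty field tends to produce misleading scores. A small pre-insert check can catch that early. The sketch below works on plain dictionaries with the same shape as the inputs argument used above; find_invalid_inputs and REQUIRED_INPUT_KEYS are hypothetical names, not part of the SDK.

```python
REQUIRED_INPUT_KEYS = ("user_query", "retrieved_documents")


def find_invalid_inputs(inputs_list):
    """Return (index, problem) pairs for inputs missing a required field."""
    problems = []
    for index, inputs in enumerate(inputs_list):
        for key in REQUIRED_INPUT_KEYS:
            value = inputs.get(key)
            # Treat non-strings and whitespace-only strings as missing.
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty '{key}'"))
    return problems


sample_inputs = [
    {"user_query": "What are the benefits of renewable energy?",
     "retrieved_documents": "Renewable energy sources reduce emissions."},
    {"user_query": "What is the current price of solar panels?",
     "retrieved_documents": "   "},
]
print(find_invalid_inputs(sample_inputs))
```

Running the check before dataset.insert keeps malformed cases out of the experiment rather than discovering them in the scores later.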

Step 3: Define Your RAG Task

The task function represents your RAG application. It receives inputs and returns the generated response.

def my_rag_task(inputs, extras, metadata):
    """Your RAG application logic.

    Replace this with your actual RAG pipeline:
    1. Take the user query
    2. Use the retrieved documents as context
    3. Generate a response
    """
    user_query = inputs["user_query"]
    context = inputs["retrieved_documents"]

    # Call your LLM with the query and retrieved context.
    # generate_rag_response is a placeholder, not an SDK function: define it
    # yourself, for example as a wrapper around my_llm.generate(query=..., context=...).
    response = generate_rag_response(user_query, context)

    return {
        "rag_response": response,
        "retrieved_documents": context,
    }
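
The task above calls generate_rag_response, which the tutorial leaves for you to supply. To dry-run the experiment wiring before connecting a real model, a stand-in like the one below may be enough. Everything here is an assumption for illustration: build_prompt, the prompt wording, and the llm_call parameter are hypothetical, and the canned response will not produce meaningful scores.

```python
def build_prompt(user_query, context):
    """Combine the query and retrieved context into one grounded prompt."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )


def generate_rag_response(user_query, context, llm_call=None):
    """Placeholder generator: pass llm_call to use your real LLM client.

    llm_call is any callable that takes a prompt string and returns text.
    Without it, a canned string is returned so the pipeline can be dry-run.
    """
    prompt = build_prompt(user_query, context)
    if llm_call is None:
        return f"[stub response for: {user_query}]"
    return llm_call(prompt)
```

Instructing the model to answer only from the context is one common way to reduce unsupported claims, which is what RAG Faithfulness measures.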

Step 4: Run the RAG Health Experiment

Use all three evaluators together for comprehensive diagnostics:

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness

results = evaluate(
    dataset=dataset,
    task=my_rag_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),      # Is the response relevant? (High/Medium/Low)
        ContextRelevance(model="openai/gpt-4o", credential="your-llm-credential"),     # Are retrieved docs relevant? (High/Medium/Low)
        RAGFaithfulness(model="openai/gpt-4o", credential="your-llm-credential"),      # Is the response grounded? (Yes/No)
    ],
    name_prefix="rag_health_baseline",
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["user_query"],
        "rag_response": "rag_response",
        "retrieved_documents": "retrieved_documents",
    }
)

print(f"Evaluated {len(results.results)} test cases")
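
The score_fn_kwargs_mapping argument tells each evaluator where to find its inputs. Judging from the call above, a string value appears to name a key in the task's return value, while a callable pulls a value out of the dataset item. The sketch below is only a mental model of that behavior, not the SDK's implementation: resolve_score_kwargs is hypothetical, and the "outputs" key on the record passed to callables is an assumption.

```python
def resolve_score_kwargs(mapping, inputs, outputs):
    """Illustrative resolver, not the SDK's implementation.

    Assumption: a string value names a key in the task output, and a callable
    receives a dict that exposes the dataset item's inputs under "inputs".
    """
    record = {"inputs": inputs, "outputs": outputs}  # "outputs" key is assumed
    resolved = {}
    for kwarg_name, source in mapping.items():
        if callable(source):
            resolved[kwarg_name] = source(record)
        else:
            resolved[kwarg_name] = outputs[source]
    return resolved


mapping = {
    "user_query": lambda x: x["inputs"]["user_query"],
    "rag_response": "rag_response",
    "retrieved_documents": "retrieved_documents",
}
kwargs = resolve_score_kwargs(
    mapping,
    inputs={"user_query": "What are the benefits of renewable energy?"},
    outputs={"rag_response": "Lower emissions.", "retrieved_documents": "Solar and wind..."},
)
print(sorted(kwargs))
```

If an evaluator reports a missing argument, checking that every string in the mapping matches a key returned by the task is a reasonable first step.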

Step 5: Analyze Diagnostic Results

Examine the results to identify which pipeline stage is causing issues:

for result in results.results:
    print(f"\nScenario: {result.dataset_item.metadata.get('scenario', 'unknown')}")

    scores = {score.name: score for score in result.scores}

    # Extract scores
    ar_score = scores.get("answer_relevance")
    cr_score = scores.get("context_relevance")
    rf_score = scores.get("rag_faithfulness")

    if ar_score:
        print(f"  Answer Relevance: {ar_score.label} ({ar_score.value})")
        print(f"    Reasoning: {ar_score.reasoning}")

    if cr_score:
        print(f"  Context Relevance: {cr_score.label} ({cr_score.value})")
        print(f"    Reasoning: {cr_score.reasoning}")

    if rf_score:
        print(f"  RAG Faithfulness: {rf_score.label} ({rf_score.value})")
        print(f"    Reasoning: {rf_score.reasoning}")

    # Diagnostic interpretation: branches are checked in order, first match wins
    if ar_score and cr_score and rf_score:
        if ar_score.value >= 0.5 and rf_score.value == 0:
            print("  Diagnosis: HALLUCINATION - response is relevant but not grounded")
        elif rf_score.value == 1 and ar_score.value < 0.5:
            print("  Diagnosis: OFF-TOPIC - response is grounded but doesn't answer the query")
        elif cr_score.value < 0.5:
            print("  Diagnosis: BAD RETRIEVAL - retrieved documents are not relevant")
        elif ar_score.value >= 0.5 and rf_score.value == 1 and cr_score.value >= 0.5:
            print("  Diagnosis: HEALTHY - all metrics indicate good RAG performance")
        else:
            # Without this branch, unmatched combinations would print nothing
            print("  Diagnosis: UNCLASSIFIED - scores do not match a known pattern")
    else:
        print("  Diagnosis: INCOMPLETE - one or more scores are missing")
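
The branching in the loop above can also be factored into a pure function, which is easier to unit-test and reuse across experiments. The sketch below keeps the same 0.5 threshold and branch order, and adds two cases the loop does not name: "lucky generation" (relevant, grounded answer despite weak retrieval) and a fallback for unmatched combinations. The function name and label strings are hypothetical.

```python
def diagnose(answer_relevance, context_relevance, faithful, threshold=0.5):
    """Map the three RAG health scores to a diagnosis label.

    answer_relevance and context_relevance are numeric scores; values at or
    above threshold count as high. faithful is True when the response is
    grounded in the retrieved context.
    """
    answer_ok = answer_relevance >= threshold
    context_ok = context_relevance >= threshold

    if answer_ok and not faithful:
        return "hallucination"
    if faithful and not answer_ok:
        return "query_misunderstanding"
    if not context_ok:
        # Either relevant and grounded despite weak retrieval, or failing everywhere.
        return "lucky_generation" if answer_ok and faithful else "bad_retrieval"
    if answer_ok and faithful:
        return "healthy"
    return "unclassified"


print(diagnose(1.0, 1.0, True))
print(diagnose(1.0, 0.0, True))
```

Inside the loop, a call could look like diagnose(ar_score.value, cr_score.value, rf_score.value == 1), assuming all three scores are present.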

Step 6: Compare RAG Configurations

Use experiments to compare different RAG configurations:

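# Evaluate with a different retrieval strategy.
# my_improved_rag_task is a placeholder for a second task function with the
# same signature as my_rag_task; define it before running this step.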
results_v2 = evaluate(
    dataset=dataset,
    task=my_improved_rag_task,  # Different retrieval or generation config
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        ContextRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        RAGFaithfulness(model="openai/gpt-4o", credential="your-llm-credential"),
    ],
    name_prefix="rag_health_improved",
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["user_query"],
        "rag_response": "rag_response",
        "retrieved_documents": "retrieved_documents",
    }
)

# Compare results side-by-side in the Fiddler UI
print("Compare experiments in your Fiddler dashboard")
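
For a quick numeric comparison outside the UI, the per-evaluator averages of two runs can be diffed locally. The sketch below assumes each run has already been reduced to a list of {evaluator_name: value} dictionaries, for example with {score.name: score.value for score in result.scores} as in Step 5, and that every value is numeric. The helper names are hypothetical.

```python
from statistics import mean


def mean_scores(rows):
    """Average each evaluator's numeric value across test cases.

    rows is a list of {evaluator_name: value} dicts, one per test case.
    """
    names = sorted({name for row in rows for name in row})
    return {name: mean(row[name] for row in rows if name in row) for name in names}


def score_deltas(baseline_rows, candidate_rows):
    """Return candidate mean minus baseline mean for evaluators present in both."""
    base = mean_scores(baseline_rows)
    cand = mean_scores(candidate_rows)
    return {name: round(cand[name] - base[name], 4) for name in base if name in cand}
```

A positive delta means the second configuration scored higher on average for that evaluator. With only a handful of test cases, small differences are likely noise, so treat the numbers as a pointer to which cases to inspect.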

Understanding the Results

Score Interpretation

| Evaluator | High Score | Low Score |
| --- | --- | --- |
| Answer Relevance | Response directly addresses the query | Response misses the point or is off-topic |
| Context Relevance | Retrieved documents support the query | Retrieved documents are irrelevant |
| RAG Faithfulness | Response is grounded in context | Response contains unsupported claims |

Common Diagnostic Patterns

| Answer Relevance | Context Relevance | RAG Faithfulness | Diagnosis |
| --- | --- | --- | --- |
| High | High | Yes | Healthy RAG pipeline |
| High | | No | Hallucination: fix generation |
| Low | High | Yes | Query misunderstanding: fix prompt |
| | Low | | Bad retrieval: fix retrieval |
| High | Low | Yes | Lucky generation: retrieval needs work |
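
Once every test case has a diagnosis, counting the labels shows which stage to fix first. The sketch below takes a plain list of diagnosis strings, one per test case, however you produce them; summarize_diagnoses and the label spellings are hypothetical.

```python
from collections import Counter


def summarize_diagnoses(diagnoses):
    """Return (diagnosis, count, share) tuples, most frequent first."""
    counts = Counter(diagnoses)
    total = sum(counts.values())
    return [(label, count, count / total) for label, count in counts.most_common()]


for label, count, share in summarize_diagnoses(
    ["bad_retrieval", "healthy", "bad_retrieval", "hallucination"]
):
    print(f"{label}: {count} ({share:.0%})")
```

If one pattern dominates, start with the fix the table associates with that diagnosis.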

Next Steps

RAG Health Diagnostics — Conceptual deep-dive into the diagnostic framework
Evals SDK Advanced Guide — Advanced evaluation patterns
Evaluator Rules — Set up continuous RAG monitoring in production
