RAG Evaluation Fundamentals

Evaluate your RAG application's retrieval and generation quality using Fiddler's built-in evaluators. This cookbook demonstrates the direct .score() API for rapid iteration on test cases before scaling to full experiments.

Use this cookbook when: You have a RAG application and want to quickly assess whether responses are faithful to retrieved documents and relevant to user queries.

Time to complete: ~15 minutes

Prerequisites

  • Fiddler account with API access

  • LLM credential configured in Settings > LLM Gateway

  • pip install fiddler-evals pandas


1

Connect and Initialize Evaluators

Replace URL, TOKEN, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.

import pandas as pd
from fiddler_evals import init
from fiddler_evals.evaluators import RAGFaithfulness, AnswerRelevance

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'   # From Settings > LLM Gateway
LLM_MODEL_NAME = 'openai/gpt-4o'              # Or your preferred model

init(url=URL, token=TOKEN)

# Initialize evaluators
faithfulness = RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
relevance = AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
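
If you want a quick sanity check that the URL, token, and LLM credential are wired up correctly, a single call is enough. This is a minimal sketch; it uses the same .score() inputs shown in step 3 below.

# Optional sanity check: score one hand-written example end to end.
check = faithfulness.score(
    user_query='What is the capital of France?',
    rag_response='The capital of France is Paris.',
    retrieved_documents=['Paris is the capital of France.'],
)
print(check.label)  # a grounded response should come back as 'yes'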
2

Create Test Cases

Define representative test cases that cover both successful and failing RAG scenarios:

test_cases = pd.DataFrame(
    [
        {
            'scenario': 'Perfect Match',
            'user_query': 'What is the capital of France?',
            'retrieved_documents': ['Paris is the capital of France.'],
            'rag_response': 'The capital of France is Paris.',
        },
        {
            'scenario': 'Hallucination',
            'user_query': 'What are the office hours?',
            'retrieved_documents': ['We are closed on weekends.'],
            'rag_response': 'We are open 9 AM to 5 PM every day.',
        },
        {
            'scenario': 'Irrelevant Answer',
            'user_query': 'How do I reset my password?',
            'retrieved_documents': ['To reset, click "Forgot Password".'],
            'rag_response': 'Our system is very secure and uses 256-bit encryption.',
        },
    ]
)
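
Before scoring, you can optionally confirm the DataFrame has the columns the evaluators read. This is a small sketch using the column names defined above:

# Optional: fail fast if a required column is missing or misspelled.
required_columns = {'user_query', 'retrieved_documents', 'rag_response'}
missing = required_columns - set(test_cases.columns)
assert not missing, f'Missing columns: {missing}'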
3

Evaluate Each Test Case

Use the .score() method to evaluate each test case directly. Each evaluator returns a Score object with value, label, and reasoning:

def evaluate_row(row):
    f_score = faithfulness.score(
        user_query=row['user_query'],
        rag_response=row['rag_response'],
        retrieved_documents=row['retrieved_documents'],
    )

    r_score = relevance.score(
        user_query=row['user_query'],
        rag_response=row['rag_response'],
    )

    return pd.Series(
        {
            'Faithfulness': f_score.label,
            'Relevance': r_score.label,
            'Status': 'HEALTHY'
            if f_score.label == 'yes' and r_score.value >= 0.5
            else 'ISSUE DETECTED',
        }
    )

results = test_cases.join(test_cases.apply(evaluate_row, axis=1))
4

View Results

results[['scenario', 'Faithfulness', 'Relevance', 'Status']]

Expected output:

scenario            Faithfulness    Relevance    Status
Perfect Match       yes             high         HEALTHY
Hallucination       no              high         ISSUE DETECTED
Irrelevant Answer   yes             low          ISSUE DETECTED

The hallucination case scores high on relevance (it addresses the question) but fails faithfulness (the response fabricates hours not in the context). The irrelevant answer is faithful to the context but doesn't actually answer the user's question.
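
To see why an evaluator made a particular call, inspect the reasoning field on the returned Score object. Here is a quick sketch that re-scores the hallucination row from step 2:

# Re-score the hallucination case (row index 1) and print the explanation.
hallucination = test_cases.loc[1]
score = faithfulness.score(
    user_query=hallucination['user_query'],
    rag_response=hallucination['rag_response'],
    retrieved_documents=hallucination['retrieved_documents'],
)
print(score.label)      # expected: 'no'
print(score.reasoning)  # the evaluator's explanation for the judgment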


Understanding the Evaluators

RAG Faithfulness checks whether the response is grounded in the retrieved documents.

  • Inputs: user_query, rag_response, retrieved_documents

  • Scoring: Binary — Yes (1.0) / No (0.0)

  • Use for: Detecting hallucinations where the LLM generates plausible but unsupported claims

Answer Relevance measures how well the response addresses the user's query.

  • Inputs: user_query, rag_response (+ optional retrieved_documents)

  • Scoring: Ordinal — High (1.0), Medium (0.5), Low (0.0)

  • Use for: Detecting off-topic responses where the LLM answers a different question
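
Because each Score object also carries a numeric value on the scales above, you can gate results on values instead of labels. A minimal sketch, with the 0.5 relevance cutoff mirroring the Status logic in step 3:

def is_healthy(f_score, r_score, relevance_threshold=0.5):
    # Faithfulness is binary (1.0 grounded / 0.0 not grounded);
    # relevance is ordinal (1.0 high, 0.5 medium, 0.0 low).
    return f_score.value == 1.0 and r_score.value >= relevance_threshold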

Next Steps


Source notebook: Fiddler Cookbook: RAG Evaluation Fundamentals


Questions? Talk to a product expert or request a demo.

💡 Need help? Contact us at [email protected].