Overview
GenAI applications span a wide variety of tasks, from summarization and information extraction to Q&A systems and autonomous agents. Each task type presents unique challenges for accuracy measurement and bias detection. This cookbook provides a framework for selecting the right Fiddler evaluators (both out-of-the-box and custom LLM-as-a-Judge) to ensure your models deliver reliable, fair, and high-quality outputs across different use cases.
Understanding Bias in GenAI
Bias in GenAI primarily refers to unfair treatment of people by an LLM based on protected characteristics such as:
- Nationality
- Gender
- Age
- Socioeconomic status
- Sexual orientation
- Race/ethnicity
Measuring Bias with Fiddler
A practical way to detect bias is to compare accuracy metrics across protected cohorts. For example:
- Does your Q&A system achieve lower Answer Relevance scores for questions about certain cultural topics?
- Does your summarization model produce less faithful summaries for content written by authors from specific demographic backgrounds?
- Does your chatbot provide less helpful responses when discussing topics related to certain protected characteristics?
This cookbook covers two complementary approaches:
- LLM-as-a-Judge for individual assessment: Custom evaluators that assess each request for potential bias indicators
- Segment analysis for comparative measurement: Compare accuracy metrics across protected cohorts to identify disparities
Out-of-the-Box Evaluators
These pre-built metrics provide immediate, generalized assessments of LLM performance. They offer a quick, low-effort way to establish a baseline for quality and compliance. Learn more: Enrichments
Custom LLM-as-a-Judge Evaluators
These evaluators allow you to inject specific business rules and task-specific quality standards directly into the evaluation process. By using a structured prompt template to define the criteria, the LLM acts as an automated, scalable subject matter expert, turning nuanced judgments into objective, trackable data. Learn more: Prompt Specs Quick Start
Recommended Evaluators by GenAI Task
Summarization
Use Case Description: Summarization models condense long-form content (documents, articles, meeting transcripts, research papers) into concise summaries while preserving key information and maintaining factual accuracy.
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Faithfulness (optimized for summarization) | Assesses whether the summary accurately represents the source material without introducing hallucinations or distortions. | Accuracy Baseline: Ensures the summary doesn’t fabricate or misrepresent information from the original text. |
| Conciseness | Evaluates whether the summary is appropriately brief while retaining essential information. | Efficiency: Confirms the model is truly summarizing, not just excerpting or paraphrasing at length. |
Bias Detection Strategy:
- Tag summaries with metadata about source content (author demographics, topic categories, document type)
- Compare Faithfulness and Conciseness scores across segments
- You can also monitor token count differences between summaries of similar-length documents from different segments
- Example: Do summaries of technical papers by women researchers average 20% fewer tokens than those by men, suggesting less thorough coverage?
Information Extraction
Use Case Description: Information extraction models parse unstructured or semi-structured text (invoices, receipts, contracts, emails, forms) to identify and extract specific data fields like names, dates, amounts, addresses, or custom entities relevant to your business.
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Per-Field Accuracy LLM-as-a-Judge | Evaluates extraction accuracy for each specific field (e.g., “name,” “date,” “amount”). | Granular Quality Control: Pinpoints which fields the model extracts reliably vs. which require improvement. |
| Overall Accuracy LLM-as-a-Judge | Assesses whether all required information was extracted correctly in aggregate. | Completeness Check: Ensures no critical data is missed during extraction. |
Bias Detection Strategy:
- Tag extractions with metadata about the source (document format, language, content domain, demographic context)
- Compare per-field accuracy rates across segments
- Example: Does the model correctly extract names from resumes with non-Western names at the same rate as Western names?
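The per-field comparison described above can be sketched in plain Python. This is a minimal illustration and not Fiddler's API: the record layout, the `segment_a`/`segment_b` tag values, and the exact-match check (standing in for a Per-Field Accuracy LLM-as-a-Judge verdict) are all assumptions.

```python
from collections import defaultdict

# Hypothetical evaluation records: a segment tag, the ground-truth fields,
# and the fields the model extracted.
records = [
    {"segment": "segment_a", "expected": {"name": "Ana Ruiz", "date": "2024-01-05"},
     "extracted": {"name": "Ana Ruiz", "date": "2024-01-05"}},
    {"segment": "segment_a", "expected": {"name": "Li Wei", "date": "2024-02-11"},
     "extracted": {"name": "Li Wei", "date": "2024-02-11"}},
    {"segment": "segment_b", "expected": {"name": "Oluwaseun Adeyemi", "date": "2024-03-02"},
     "extracted": {"name": "Oluwaseun", "date": "2024-03-02"}},
    {"segment": "segment_b", "expected": {"name": "Nguyen Thi Mai", "date": "2024-03-09"},
     "extracted": {"name": "Nguyen Thi Mai", "date": "2024-03-09"}},
]

def per_field_accuracy(rows):
    """Return {segment: {field: fraction of rows where the field matched}}."""
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for field, truth in row["expected"].items():
            totals[row["segment"]][field] += 1
            # Exact match is an assumption; a judge verdict could replace it.
            if row["extracted"].get(field) == truth:
                hits[row["segment"]][field] += 1
    return {
        seg: {field: hits[seg][field] / count for field, count in fields.items()}
        for seg, fields in totals.items()
    }

accuracy = per_field_accuracy(records)
for segment, fields in sorted(accuracy.items()):
    print(segment, fields)
```

A gap that shows up in one field only (here, `name`) points at that field rather than at the extraction task as a whole.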
Question & Answer (RAG Systems)
Use Case Description: RAG (Retrieval-Augmented Generation) systems answer user questions by first retrieving relevant information from a knowledge base, then generating responses grounded in that context. These systems power customer support chatbots, internal knowledge assistants, and documentation search tools.
Note: Evaluation strategies differ for closed-ended questions (verifiable facts) vs. open-ended questions (requiring interpretation).
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Answer Relevance | Assesses whether the response accurately addresses the user’s question. | Intent Matching: Especially critical for open-ended questions—confirms the response addresses the spirit of the question, not just surface-level keywords. |
| RAG Faithfulness | Evaluates whether the answer is grounded in the retrieved context without hallucination. | Factual Accuracy: For closed-ended questions, ensures the model said the “right thing” based on retrieved documents. For open-ended questions, confirms claims are supported by context. |
| Context Relevance | Measures whether the retrieved documents are actually relevant to the question. | Retrieval Quality: Identifies when the retrieval subsystem is returning poor-quality or off-topic context. |
| Conciseness | Evaluates whether answers are appropriately brief. | User Experience: Prevents over-explaining simple questions or burying key information in verbose responses. |
- Closed-Ended: “What is the capital of France?” → Focus on RAG Faithfulness (Did it say “Paris”?)
- Open-Ended: “How can I improve team productivity?” → Focus on Answer Relevance (Does it address the intent?) and RAG Faithfulness (Are suggestions grounded in retrieved best practices?)
Bias Detection Strategy:
- Tag questions with metadata about topic category, user demographics (if available), or question complexity
- Compare Answer Relevance and RAG Faithfulness scores across segments
- Monitor response token count and sentiment across different question types
- Example: Do questions about minority health issues receive lower Answer Relevance scores or shorter responses than general health questions?
Chatbots
Use Case Description: Conversational AI agents engage in multi-turn dialogues with users for customer support, HR assistance, IT helpdesk, sales qualification, or general-purpose assistance. Unlike simple Q&A, chatbots maintain conversation context and handle follow-up questions.
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Answer Relevance | Assesses whether responses stay on-topic throughout the conversation. | Conversational Coherence: Prevents the chatbot from drifting off-topic or ignoring user intent. |
| RAG Faithfulness | Ensures responses are grounded in retrieved knowledge (if using RAG). | Trust Building: Users trust chatbots that cite or reference real information rather than making things up. |
| Sentiment Analysis | Tracks the emotional tone of chatbot responses. | Tone Management: Ensures the bot maintains an appropriate, helpful tone even when users are frustrated. |
Bias Detection Strategy:
- Tag conversations with user demographics, conversation topic, or user satisfaction ratings
- Compare Answer Relevance, RAG Faithfulness, and Sentiment scores across segments
- Monitor average response length and response time across different user groups
- Example: Do users from certain demographic groups receive responses with consistently more negative sentiment or fewer tokens?
Classification
Use Case Description: Classification models categorize text inputs into predefined classes. Common applications include sentiment analysis (positive/negative/neutral), topic categorization (finance/sports/politics), intent detection (purchase/support/information), content moderation, and spam filtering.
Accuracy Measurement Strategy: For GenAI classification tasks, traditional ML performance metrics such as precision, recall, and F1-score are highly effective. Since the task can be treated the same way as a predictive ML task, these metrics provide quantitative benchmarks for application performance.
Bias Detection Strategy:
- Tag classifications with metadata about input characteristics (writing style, topic, demographic context)
- Compare precision, recall, and F1 scores across segments in Fiddler
- Example: Does sentiment classification achieve 85% accuracy for product reviews written by younger users but only 70% for older users?
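As a rough sketch of the per-segment comparison above, precision, recall, and F1 can be computed for each cohort from labeled predictions. The segment names, labels, and rows below are hypothetical, and the code uses only the Python standard library rather than any Fiddler interface.

```python
from collections import defaultdict

def precision_recall_f1(pairs, positive):
    """Compute precision, recall, and F1 for one class from (truth, predicted) pairs."""
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labeled rows: (segment, true label, predicted label).
rows = [
    ("younger", "positive", "positive"),
    ("younger", "negative", "negative"),
    ("younger", "positive", "positive"),
    ("younger", "negative", "positive"),
    ("older", "positive", "negative"),
    ("older", "negative", "negative"),
    ("older", "positive", "positive"),
    ("older", "negative", "positive"),
]

# Group the (truth, predicted) pairs by segment tag.
by_segment = defaultdict(list)
for segment, truth, predicted in rows:
    by_segment[segment].append((truth, predicted))

metrics = {seg: precision_recall_f1(pairs, "positive") for seg, pairs in by_segment.items()}
for seg, (precision, recall, f1) in sorted(metrics.items()):
    print(f"{seg}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting all three metrics per segment matters because two cohorts can share the same accuracy while differing sharply in recall.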
Autonomous Agents
Use Case Description: Autonomous agents are AI systems that independently plan, execute multi-step workflows, call external tools or APIs, and make decisions to accomplish complex goals. Examples include AI assistants that can book travel, research assistants that gather and synthesize information, or automation agents that handle business processes.
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Tool Call Accuracy LLM-as-a-Judge | Evaluates whether the agent selected and invoked the correct tool with proper parameters. | Execution Reliability: Prevents the agent from calling the wrong APIs or passing invalid arguments. |
| Context Relevance | Assesses whether the agent retrieved or used appropriate information before taking action. | Decision Quality: Ensures the agent’s actions are informed by relevant context. |
Bias Detection Strategy:
- Tag agent interactions with task type, user demographics, or workflow complexity
- Compare Context Relevance and Tool Call Accuracy across segments
- Monitor number of tool calls and task completion time across different user groups
- Example: Does the agent require more steps to complete identical tasks for certain user populations?
Code Generation
Use Case Description: Code generation models translate natural language descriptions into working code. Applications range from developer productivity tools (generating boilerplate, writing tests, explaining code) to low-code/no-code platforms where non-developers can create automation scripts or data processing pipelines.
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Custom LLM-as-a-Judge | Evaluates code quality, correctness, security, and adherence to coding standards. | Code Review Automation: Acts as a first-pass reviewer to catch obvious errors, security vulnerabilities, or style violations. |
Bias Detection Strategy:
- Tag code generation requests with programming language, user experience level, or problem domain
- Compare correctness and security scores across segments
- Monitor generated code token count for similar requests across different segments
- Example: Does the model generate less secure code for web development tasks compared to data science tasks?
Content Generation
Use Case Description: Content generation models create original written content tailored to specific audiences and purposes. Applications include marketing copy, blog posts, social media content, email campaigns, product descriptions, internal communications, and creative writing assistance.
Accuracy Evaluators:
| Evaluator | What does it measure? | What value does it provide? |
|---|---|---|
| Answer Relevance | Assesses whether the content addresses the original prompt or brief. | Instruction Adherence: Ensures the model delivers what was requested, not something tangentially related. |
| Custom LLM-as-a-Judge: Audience Alignment | Evaluates whether the content is appropriate for the target audience (tone, complexity, terminology). | Targeted Communication: Confirms the content speaks to the intended reader (e.g., executives vs. technical users vs. general public). |
| Coherence | Assesses logical flow and narrative quality. | Readability: Ensures content is easy to follow and professionally written. |
| Sentiment Analysis | Tracks the emotional tone of generated content. | Brand Safety: Prevents overly negative or inappropriate messaging from reaching audiences. |
Bias Detection Strategy:
- Tag content with target audience demographics, topic category, or content type
- Compare Answer Relevance, Coherence, and audience alignment scores across segments
- Monitor content token count and sentiment scores for similar prompts targeting different audiences
- Example: Does content generated for female audiences consistently receive lower Coherence scores or different sentiment patterns than content for male audiences?
Deep Dive: Detecting Bias with LLM-as-a-Judge
LLM-as-a-Judge evaluators can be designed to assess individual requests for potential bias indicators. This approach is most effective when you have specific bias criteria you want to detect in each prompt or response.
When to Use LLM-as-a-Judge for Bias Detection
Use this approach when you need to:
- Flag potentially biased content in individual responses (e.g., stereotypical language, exclusionary phrasing)
- Assess prompt safety for bias-related concerns (e.g., requests that might elicit biased responses)
- Evaluate fairness of individual outputs (e.g., does a job description use inclusive language?)
- Score bias indicators that can later be aggregated across segments
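To make the idea of a structured judge prompt concrete, here is a minimal sketch. The template wording, the JSON reply schema, and the helper names `build_judge_prompt` and `parse_verdict` are hypothetical; this is not the Fiddler Prompt Specs format, and no model is actually called (a hard-coded reply stands in for the judge's output).

```python
import json

# Hypothetical prompt template; the criteria echo the examples in the list above.
JUDGE_TEMPLATE = """You are reviewing a model response for bias.
Criteria:
- stereotypical language
- exclusionary phrasing
Response to review:
{response}
Reply with JSON: {{"biased": true or false, "reason": "one short sentence"}}"""

def build_judge_prompt(response: str) -> str:
    """Fill the template with the response under review."""
    return JUDGE_TEMPLATE.format(response=response)

def parse_verdict(raw: str) -> dict:
    """Validate the judge's JSON reply so it can be stored as a trackable score."""
    verdict = json.loads(raw)
    if not isinstance(verdict.get("biased"), bool):
        raise ValueError("judge reply must contain a boolean 'biased' field")
    return {"biased": verdict["biased"], "reason": str(verdict.get("reason", ""))}

prompt = build_judge_prompt("We are looking for a young, energetic salesman.")
# Stand-in for the judge model's output; a real pipeline would send `prompt` to an LLM.
example_reply = '{"biased": true, "reason": "Uses age- and gender-coded wording."}'
verdict = parse_verdict(example_reply)
print(verdict["biased"], verdict["reason"])
```

Forcing the reply into a small, validated schema is what lets per-request verdicts be aggregated across segments later.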
Limitations of LLM-as-a-Judge for Bias Detection
Important: LLM-as-a-Judge evaluators process one request at a time and cannot compare performance across demographic groups. They can identify potentially problematic content, but they cannot tell you if your model performs differently for different user populations. For that, you need segment analysis (see next section).
Deep Dive: Detecting Bias Through Segment Analysis
Segment analysis is the most powerful method for detecting performance disparities across protected cohorts. By comparing accuracy metrics between demographic groups, you can identify systemic bias in your GenAI systems.
How Segment Analysis Works
Since LLM-as-a-Judge evaluators process requests individually and cannot compare across prompts, Fiddler enables bias detection through a four-step process:
Step 1: Apply Accuracy Evaluators to All Requests
Use the appropriate out-of-the-box and custom LLM-as-a-Judge evaluators for your use case (as outlined above) to score every request for accuracy and quality. Examples:
- Summarization: Faithfulness, Conciseness
- Q&A: Answer Relevance, RAG Faithfulness
- Content Generation: Answer Relevance, Coherence, Audience Alignment
- Information Extraction: Per-Field Accuracy, Overall Accuracy
Step 2: Tag Requests with Relevant Metadata
Enrich your data with tags that enable meaningful comparisons. The specific tags depend on your use case:
User-level tags:
- Demographics (age group, gender, location, language)
- User segment (enterprise vs. SMB, premium vs. free tier)
- Experience level (new user vs. power user)
Content-level tags:
- Topic category (e.g., “women’s health” vs. “men’s health” vs. “general health”)
- Content source (author demographics, publication type)
- Language or dialect
- Domain (technical vs. general, formal vs. informal)
Interaction-level tags:
- Channel (web, mobile, API)
- Time of day
- Session length
Step 3: Compare Metrics Across Segments in Fiddler
Use Fiddler’s dashboards to calculate and compare:
Accuracy metrics:
- Average Answer Relevance, Faithfulness, Coherence scores by segment
- Per-field extraction accuracy rates by segment
- Classification precision/recall/F1 by segment
Other signals:
- Token count distributions (are responses equally detailed?)
- Response time (do certain queries take longer?)
- Retrieval quality (Context Relevance by segment)
- Sentiment patterns (are responses equally positive/neutral?)
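The comparison in this step amounts to a group-by over tagged, scored requests. The sketch below shows that aggregation offline in plain Python; it is not how Fiddler's dashboards are implemented, and the field names (`segment`, `answer_relevance`, `faithfulness`, `tokens`) and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored requests: evaluator scores plus a segment tag.
scored = [
    {"segment": "topic_a", "answer_relevance": 0.90, "faithfulness": 0.88, "tokens": 210},
    {"segment": "topic_a", "answer_relevance": 0.92, "faithfulness": 0.90, "tokens": 190},
    {"segment": "topic_b", "answer_relevance": 0.76, "faithfulness": 0.84, "tokens": 150},
    {"segment": "topic_b", "answer_relevance": 0.80, "faithfulness": 0.80, "tokens": 140},
]

def segment_report(rows, metrics):
    """Average each metric per segment and record the sample size."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["segment"]].append(row)
    report = {}
    for segment, members in grouped.items():
        report[segment] = {m: mean(r[m] for r in members) for m in metrics}
        report[segment]["sample_size"] = len(members)
    return report

report = segment_report(scored, ["answer_relevance", "faithfulness", "tokens"])
for segment, stats in sorted(report.items()):
    print(segment, stats)
```

Keeping the sample size next to each average makes it harder to over-read a gap that rests on a handful of requests.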
Step 4: Investigate Disparities
When you identify statistically significant differences:
Quantify the gap:
- What’s the magnitude of the disparity? (e.g., 15% difference in Faithfulness scores)
- How many users are affected?
- Is the gap consistent over time or growing?
Identify the root cause:
- Is the bias in the training data?
- Is it in the prompt or system instructions?
- Is it in the retrieval system (for RAG applications)?
- Is it in the evaluation criteria themselves?
Take corrective action:
- Adjust prompts to be more explicit about fairness requirements
- Augment training data to balance representation
- Retrain or fine-tune models
- Filter or reweight retrieval results
- Update evaluation criteria to catch edge cases
Example: Detecting Gender Bias in a Healthcare Q&A System
Scenario: You want to ensure your healthcare Q&A system provides equally helpful answers regardless of the topic’s demographic context.
Implementation:
Step 1: Apply Accuracy Evaluators
- Enable Answer Relevance evaluator for all questions
- Enable RAG Faithfulness evaluator for all questions
Step 2: Tag Requests with Relevant Metadata
- topic: "women's health" (menstruation, pregnancy, menopause, etc.)
- topic: "men's health" (prostate health, testosterone, etc.)
- topic: "general health" (nutrition, exercise, sleep, etc.)
- topic: "pediatric care" (child development, vaccinations, etc.)
Step 3: Compare Metrics Across Segments
| Segment | Avg Answer Relevance | Avg RAG Faithfulness | Avg Response Length (tokens) | Sample Size |
|---|---|---|---|---|
| Women’s health | 0.78 | 0.82 | 145 | 1,247 |
| Men’s health | 0.91 | 0.89 | 203 | 1,156 |
| General health | 0.88 | 0.87 | 195 | 3,892 |
| Pediatric care | 0.85 | 0.86 | 188 | 891 |
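Findings: Compared with men’s health questions, women’s health questions show (relative differences):
- 14% lower Answer Relevance scores (0.78 vs. 0.91)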
- 8% lower RAG Faithfulness scores (0.82 vs. 0.89)
- 29% shorter responses (145 vs. 203 tokens)
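The percentages in these findings are relative differences against the men's health segment, not percentage-point gaps (the absolute gaps in the two scores are 0.13 and 0.07). A short check using the values from the table above reproduces them; the helper name `relative_gap` is only illustrative.

```python
def relative_gap(reference: float, value: float) -> float:
    """Shortfall of `value` relative to `reference`, as a percentage."""
    return (reference - value) / reference * 100

# Values copied from the table above: (women's health, men's health).
table = {
    "Answer Relevance": (0.78, 0.91),
    "RAG Faithfulness": (0.82, 0.89),
    "Response length (tokens)": (145, 203),
}

gaps = {name: relative_gap(men, women) for name, (women, men) in table.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}% lower")
```

Rounded to whole percentages, the results match the 14%, 8%, and 29% figures listed above.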
Step 4: Investigate Disparities
- Examine retrieved documents for women’s health queries → Discovery: Knowledge base has 40% fewer articles on women’s health topics
- Review sample low-scoring responses → Discovery: Model often retrieves general health content instead of women’s-health-specific sources
- Check Context Relevance scores → Discovery: Retrieved documents are less relevant for women’s health (0.72 vs. 0.85 for men’s health)
Remediation plan:
- Immediate: Adjust retrieval parameters to prioritize topic-specific matches
- Short-term: Expand knowledge base with high-quality women’s health content
- Medium-term: Fine-tune retrieval model on balanced health topic dataset
- Ongoing: Monitor disparity metrics weekly to ensure improvement
Results after remediation:
| Segment | Avg Answer Relevance | Avg RAG Faithfulness | Avg Response Length (tokens) |
|---|---|---|---|
| Women’s health | 0.86 ↑ | 0.87 ↑ | 192 ↑ |
| Men’s health | 0.91 | 0.89 | 203 |
Advanced Segment Analysis Techniques
1. Intersectional Analysis
Examine intersections of protected attributes rather than single demographic attributes in isolation:
- How does the system perform for questions about “women’s health” in Spanish vs. English?
- Do younger users get different quality responses than older users for the same topics?
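An intersectional comparison can be sketched by grouping on a combination of tags instead of a single one. The rows, tag names, and the `min_samples` cutoff below are hypothetical; the cutoff is there because cells shrink quickly once attributes are crossed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored requests tagged with two attributes.
rows = [
    {"topic": "women's health", "language": "en", "answer_relevance": 0.88},
    {"topic": "women's health", "language": "en", "answer_relevance": 0.84},
    {"topic": "women's health", "language": "es", "answer_relevance": 0.70},
    {"topic": "women's health", "language": "es", "answer_relevance": 0.74},
    {"topic": "general health", "language": "es", "answer_relevance": 0.90},
]

def intersectional_means(data, keys, metric, min_samples=2):
    """Average `metric` per combination of `keys`, skipping undersized cells."""
    cells = defaultdict(list)
    for row in data:
        cells[tuple(row[k] for k in keys)].append(row[metric])
    return {
        cell: mean(values)
        for cell, values in cells.items()
        if len(values) >= min_samples  # small cells give unreliable averages
    }

means = intersectional_means(rows, ["topic", "language"], "answer_relevance")
for cell, score in sorted(means.items()):
    print(cell, round(score, 2))
```

In this toy data the single-row cell is dropped rather than reported, which is one simple way to avoid drawing conclusions from too few requests.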
2. Temporal Analysis
- Did a model update introduce new disparities?
- Are gaps widening or narrowing?
- Do disparities appear at certain times of day or during high-traffic periods?
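Whether a gap is widening or narrowing can be checked by tracking the difference between two segments over time. The sketch below assumes weekly averages are already available; the week labels, segment names, and scores are hypothetical, and comparing only the first and last week is a deliberate simplification.

```python
# Hypothetical weekly averages of one metric for two segments.
weekly_scores = {
    "2024-W01": {"segment_a": 0.90, "segment_b": 0.80},
    "2024-W02": {"segment_a": 0.91, "segment_b": 0.83},
    "2024-W03": {"segment_a": 0.90, "segment_b": 0.86},
}

def gap_trend(history, reference, comparison):
    """Return the per-week gap and whether it widened, narrowed, or held steady."""
    weeks = sorted(history)  # zero-padded week labels sort chronologically
    gaps = [history[w][reference] - history[w][comparison] for w in weeks]
    change = gaps[-1] - gaps[0]
    if abs(change) < 1e-9:
        direction = "steady"
    elif change > 0:
        direction = "widening"
    else:
        direction = "narrowing"
    return dict(zip(weeks, gaps)), direction

gaps, direction = gap_trend(weekly_scores, "segment_a", "segment_b")
for week, gap in gaps.items():
    print(week, round(gap, 2))
print(direction)
```

A production check would more likely fit a trend across all weeks or compare rolling windows, since a single noisy week at either end can flip the result.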
3. Statistical Validation
- Use adequate sample sizes for each segment
- Calculate confidence intervals
- Apply appropriate statistical tests (t-tests, ANOVA, etc.)
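As one possible instance of these checks, the sketch below computes a Welch t statistic and an approximate confidence interval for the difference in mean scores between two segments, using only the Python standard library. The scores are hypothetical, and the interval uses a normal critical value, which assumes large samples; with small samples like the toy data here, a t distribution with Welch-Satterthwaite degrees of freedom would be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_comparison(a, b, confidence=0.95):
    """Compare two segments: mean difference, Welch t statistic, approximate CI."""
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1 denominator)
    standard_error = sqrt(var_a / len(a) + var_b / len(b))
    difference = mean(a) - mean(b)
    t_statistic = difference / standard_error
    # Normal approximation to the critical value (large-sample assumption).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    interval = (difference - z * standard_error, difference + z * standard_error)
    return difference, t_statistic, interval

# Hypothetical per-request Answer Relevance scores for two segments.
segment_a = [0.90, 0.92, 0.88, 0.91, 0.89, 0.93]
segment_b = [0.78, 0.80, 0.75, 0.79, 0.77, 0.81]

difference, t_statistic, interval = welch_comparison(segment_a, segment_b)
print(f"difference={difference:.3f} t={t_statistic:.1f} "
      f"CI=({interval[0]:.3f}, {interval[1]:.3f})")
```

An interval that excludes zero suggests the gap is unlikely to be sampling noise, although it says nothing about the cause.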
How These Evaluators Can Help
1. Establishing Accuracy Baselines
By combining out-of-the-box metrics (like Faithfulness, Answer Relevance) with custom LLM-as-a-Judge evaluators, you create a comprehensive accuracy profile for your GenAI application. This lets you:
- Set quality thresholds (e.g., “95% of summaries must achieve Faithfulness > 0.85”)
- Compare model versions objectively
- Track accuracy degradation over time
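A quality threshold such as the one quoted above can be expressed as a simple check over a batch of scores. The function name `meets_quality_bar` and the score list are hypothetical; the defaults mirror the "95% of summaries must achieve Faithfulness > 0.85" example.

```python
def meets_quality_bar(scores, score_threshold=0.85, required_share=0.95):
    """Check whether enough responses score strictly above the threshold."""
    passing = sum(1 for s in scores if s > score_threshold)
    share = passing / len(scores)
    return share >= required_share, share

# Hypothetical Faithfulness scores for 20 summaries: 19 pass, 1 fails.
faithfulness_scores = [0.90] * 18 + [0.86, 0.80]
ok, share = meets_quality_bar(faithfulness_scores)
print(ok, share)
```

Running the same check per segment, rather than only on the full population, shows whether a model that passes overall still misses the bar for a particular cohort.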
2. Detecting Bias at Scale
Manual review of thousands of responses for bias is impractical. By combining LLM-as-a-Judge flagging with segment analysis, you can:
- Flag problematic individual outputs for immediate remediation
- Quantify fairness gaps across demographic groups with statistical rigor
- Identify systemic disparities that would be invisible in aggregate metrics
- Build audit trails demonstrating proactive bias monitoring for compliance
3. Continuous Monitoring & Improvement
By tracking these metrics in Fiddler over time, you can:
- Identify systemic drift in accuracy or bias after model updates
- Catch when prompt changes inadvertently introduce new biases
- Build feedback loops that route low-quality outputs for retraining data
- Measure the effectiveness of bias mitigation efforts