LLM Evaluation Quick Start

Systematically evaluate your LLM applications, RAG systems, and AI agents with the Fiddler Evals SDK, using its built-in evaluators alongside your own custom metrics.

Time to complete: ~20 minutes

What You'll Learn

  • Initialize the Fiddler Evals SDK and organize your evaluations

  • Create evaluation datasets with test cases

  • Use built-in evaluators (faithfulness, toxicity, PII, coherence, etc.)

  • Create custom evaluators for domain-specific requirements

  • Run evaluation experiments and analyze results


Prerequisites

  • Fiddler Account: Active account with API access

  • Python Environment: Python 3.10 or later

  • Fiddler Evals SDK: pip install fiddler-evals

  • Access Token: Generate one from Settings > Credentials in the Fiddler UI


Quick Start

Step 1: Connect to Fiddler
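
A minimal sketch of the connection step, assuming the package installed by pip install fiddler-evals is imported as fiddler_evals and that init() accepts your instance URL and access token; check the SDK reference for the exact signature.

```python
# Minimal sketch (assumed import path and keyword names):
from fiddler_evals import init

init(
    url="https://your_company.fiddler.ai",  # your Fiddler instance URL
    token="YOUR_ACCESS_TOKEN",              # from Settings > Credentials
)
```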

Step 2: Create Project and Application
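
Projects and Applications give your evaluations a home in the Projects > Applications > Datasets hierarchy. The class and constructor names below are illustrative assumptions, not the confirmed SDK API.

```python
# Illustrative sketch (assumed class and method names):
from fiddler_evals import Project, Application

project = Project.get_or_create(name="llm_eval_quickstart")
application = Application.get_or_create(
    name="rag_chatbot",
    project_id=project.id,
)
```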

Step 3: Add Test Cases
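
Each test case pairs an input with the fields your evaluators need, for example an expected answer for AnswerCorrectness and a retrieved context for Faithfulness. The Dataset API shown here is an assumption; the field layout is the part meant to carry over.

```python
# Sketch of building a small dataset (assumed Dataset API):
from fiddler_evals import Dataset

dataset = Dataset.create(name="qa_test_cases", application_id=application.id)
dataset.add([
    {
        "question": "What is the capital of France?",
        "expected_answer": "Paris",
        "context": "France is a country in Western Europe. Its capital is Paris.",
    },
    # Start with 10-20 cases (see Best Practices) and grow from there.
])
```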

Step 4: Define Your LLM Task
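
The task is a plain Python callable that runs your application for one test case and returns its output, plus any retrieved context for the RAG evaluators. The retrieval and generation lines below are stubs standing in for your real pipeline.

```python
def rag_task(test_case: dict) -> dict:
    """Run the application under test for a single test case."""
    # Placeholders: swap these two lines for your retriever and LLM call.
    context = test_case.get("context", "")
    answer = f"Answer derived from: {context[:80]}"
    return {"output": answer, "context": context}
```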

Step 5: Run Evaluation with Built-In Evaluators
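
With a dataset, a task, and a list of evaluators, a single evaluate() call runs the experiment. The evaluator import path and keyword arguments are assumptions based on the features this guide names (name_prefix, the built-in evaluator list).

```python
# Sketch of an evaluation run (assumed import path and keywords):
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Faithfulness, Toxicity

results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance(), Faithfulness(), Toxicity()],
    name_prefix="quickstart",  # versions this experiment run
)
```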

Step 6: Analyze Results
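
Results can be inspected programmatically or in the Fiddler UI under the application created in Step 2. The record shape below is an assumption; treat it as a starting point.

```python
# Assumed shape: evaluate() yields one record per test case with the inputs,
# the task output, and each evaluator's score.
for record in results:
    print(record)
```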


Built-In Evaluators

The Fiddler Evals SDK includes 14+ pre-built evaluators, grouped below by category:

Quality & Accuracy

  • AnswerRelevance - Measures response relevance to the question

  • Coherence - Evaluates logical flow and consistency

  • Conciseness - Checks for unnecessary verbosity

  • AnswerCorrectness - Compares output to expected answer

Safety & Ethics

  • Toxicity - Detects harmful or offensive content

  • PIIDetection - Identifies personally identifiable information

  • Bias - Detects potential biases in responses

RAG-Specific

  • Faithfulness - Checks if response is supported by context

  • ContextRelevance - Evaluates relevance of retrieved context

  • GroundedAnswerRelevance - Combines faithfulness and relevance

Example: RAG Evaluation
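
A sketch that targets the RAG-specific evaluators listed above, reusing the dataset and task from the Quick Start; as before, the import path is an assumption.

```python
from fiddler_evals import evaluate
from fiddler_evals.evaluators import (  # assumed import path
    Faithfulness,
    ContextRelevance,
    GroundedAnswerRelevance,
)

rag_results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[Faithfulness(), ContextRelevance(), GroundedAnswerRelevance()],
    name_prefix="rag_baseline",
)
```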


Custom Evaluators

Create domain-specific evaluators for your use case:
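
A sketch of what a custom evaluator could look like, assuming the SDK exposes a base Evaluator class with a scoring hook; the actual extension point and return type may differ, so treat the names here as placeholders.

```python
from fiddler_evals.evaluators import Evaluator  # assumed base class

class ContainsCitation(Evaluator):
    """Domain-specific check: did the answer cite at least one source?"""

    def evaluate(self, test_case: dict, output: str, **kwargs) -> dict:
        has_citation = "[" in output and "]" in output  # naive bracket check
        return {"score": 1.0 if has_citation else 0.0}
```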


Advanced Features

Batch Evaluation with Parallel Processing
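
The max_workers option mentioned in Best Practices parallelizes evaluation across test cases. A sketch, reusing the names from the Quick Start:

```python
results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance(), Faithfulness()],
    max_workers=8,  # run test cases in parallel for large datasets
)
```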

Import Datasets from Files
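
A sketch that loads test cases from a local CSV with pandas and adds them to the dataset created earlier; whether the SDK accepts records this way, or ships a dedicated file-import helper, is an assumption.

```python
import pandas as pd

# Expected columns: question, expected_answer, context
df = pd.read_csv("test_cases.csv")
dataset.add(df.to_dict(orient="records"))  # assumed bulk-add method
```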

Track Experiment Metadata
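
Attaching metadata such as the model name and prompt version makes experiment runs easy to compare later. The metadata keyword is an assumption; name_prefix is the option this guide names for versioning runs.

```python
results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance()],
    name_prefix="gpt-4o-prompt-v2",                        # version the run
    metadata={"model": "gpt-4o", "prompt_version": "v2"},  # assumed keyword
)
```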


Complete Example: RAG Evaluation Pipeline
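
The sketch below strings the six Quick Start steps together end to end. Everything marked "assumed" is a placeholder for the real fiddler-evals API; the flow it shows (connect, organize, build a dataset, define the task, evaluate, inspect) is what this section covers.

```python
"""End-to-end RAG evaluation sketch (assumed fiddler-evals API names)."""
from fiddler_evals import init, evaluate, Project, Application, Dataset       # assumed
from fiddler_evals.evaluators import AnswerRelevance, Faithfulness, Toxicity  # assumed

# 1. Connect
init(url="https://your_company.fiddler.ai", token="YOUR_ACCESS_TOKEN")

# 2. Organize: Project > Application > Dataset
project = Project.get_or_create(name="llm_eval_quickstart")
app = Application.get_or_create(name="rag_chatbot", project_id=project.id)
dataset = Dataset.create(name="qa_test_cases", application_id=app.id)

# 3. Test cases
dataset.add([
    {
        "question": "What is the capital of France?",
        "expected_answer": "Paris",
        "context": "France is a country in Western Europe. Its capital is Paris.",
    },
])

# 4. Task under test (replace the stubs with your retrieval + generation pipeline)
def rag_task(test_case: dict) -> dict:
    context = test_case.get("context", "")
    answer = f"Answer derived from: {context[:80]}"
    return {"output": answer, "context": context}

# 5. Run the experiment with built-in evaluators
results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance(), Faithfulness(), Toxicity()],
    name_prefix="rag_baseline",
    max_workers=4,
)

# 6. Inspect scores here or in the Fiddler UI
for record in results:
    print(record)
```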


Best Practices

  1. Start Small: Begin with 10-20 test cases to validate your setup

  2. Use Multiple Evaluators: Combine quality, safety, and domain-specific evaluators

  3. Version Your Experiments: Use name_prefix to track different experiment runs

  4. Monitor Over Time: Run evaluations regularly to catch regressions

  5. Write Custom Evaluators: Cover requirements that the built-in evaluators don't measure

  6. Leverage Parallelization: Use max_workers for faster evaluation of large datasets

  7. Organize Hierarchically: Use the Projects > Applications > Datasets structure


Next Steps

Complete Guides

Concepts & Background

Integration Guides


Summary

You've learned how to:

  • ✅ Initialize the Fiddler Evals SDK with init()

  • ✅ Create Projects, Applications, and Datasets for organization

  • ✅ Build evaluation datasets with test cases

  • ✅ Use 14+ built-in evaluators for quality, safety, and RAG metrics

  • ✅ Create custom evaluators for domain-specific requirements

  • ✅ Run evaluations with the evaluate() function

  • ✅ Analyze results programmatically and in the Fiddler UI

The Fiddler Evals SDK provides a comprehensive framework for systematic LLM evaluation, enabling you to ensure quality, safety, and accuracy before deploying your AI applications.