Experiments Quick Start

Systematically evaluate your LLM applications, RAG systems, and AI agents using the Fiddler Evals SDK with built-in evaluators and custom metrics.

Time to complete: ~20 minutes

What You'll Learn

  • Initialize the Fiddler Evals SDK and organize your experiments

  • Create experiment datasets with test cases

  • Use built-in evaluators (faithfulness, relevance, coherence, etc.)

  • Create custom evaluators for domain-specific requirements

  • Run experiments and analyze results


Prerequisites

  • Fiddler Account: Active account with API access

  • Python 3.10+

  • Fiddler Evals SDK: pip install fiddler-evals

  • Access Token: From Settings > Credentials


Quick Start

Step 1: Connect to Fiddler
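A minimal connection sketch. The module alias `fiddler_evals` and the `url`/`token` keyword arguments are assumptions for illustration; check the SDK reference for the exact init() signature.

```python
# Sketch: connect the Evals SDK to your Fiddler instance.
# The module name and keyword arguments below are illustrative assumptions.
import fiddler_evals as fe

fe.init(
    url="https://your-org.fiddler.ai",  # your Fiddler instance URL
    token="YOUR_ACCESS_TOKEN",          # generated under Settings > Credentials
)
```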

Step 2: Create Project and Application
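A sketch of the Projects > Applications hierarchy described under Best Practices. The `Project` and `Application` classes and their `get_or_create()` helpers are assumed names, not confirmed SDK APIs.

```python
# Sketch: organize experiments under a project and application.
# Class and method names here are illustrative assumptions.
from fiddler_evals import Project, Application

project = Project.get_or_create(name="evals-quickstart")
application = Application.get_or_create(name="rag-chatbot", project_id=project.id)
```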

Step 3: Add Test Cases
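A sketch of a small experiment dataset. The `Dataset.create()` call and the `question`/`expected_answer` fields are assumptions; use whatever schema your task function expects.

```python
# Sketch: build a dataset of test cases; the Dataset API shown is an assumption.
from fiddler_evals import Dataset

test_cases = [
    {
        "question": "What is retrieval-augmented generation?",
        "expected_answer": "A technique that grounds LLM responses in retrieved documents.",
    },
    {
        "question": "Why should I evaluate LLM outputs before deployment?",
        "expected_answer": "To catch quality, safety, and accuracy issues early.",
    },
]

dataset = Dataset.create(
    name="quickstart-test-cases",
    application_id=application.id,  # from Step 2
    items=test_cases,
)
```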

Step 4: Define Your LLM Task
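The task is the function the experiment calls once per test case. The signature and return keys below are illustrative; RAG evaluators need the retrieved context as well as the response, so the sketch returns both.

```python
# Sketch: the task function invoked for each test case.
# Replace the body with your own LLM or RAG pipeline; my_rag_pipeline is hypothetical.
def rag_task(item: dict) -> dict:
    question = item["question"]
    result = my_rag_pipeline(question)   # your retrieval + generation code
    return {
        "response": result["answer"],
        "context": result["documents"],  # retrieved documents, used by RAG evaluators
    }
```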

Step 5: Run Experiment with Built-In Evaluators
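A sketch of running the experiment with evaluate() and a few of the built-in evaluators listed below. The evaluator import path and the exact evaluate() keyword arguments are assumptions.

```python
# Sketch: run the experiment; import path and keyword arguments are illustrative.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Coherence, Conciseness

results = evaluate(
    dataset=dataset,           # from Step 3
    task=rag_task,             # from Step 4
    evaluators=[AnswerRelevance(), Coherence(), Conciseness()],
    name_prefix="quickstart",  # version your runs (see Best Practices)
)
```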

Step 6: Analyze Results
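Results can be reviewed in the Fiddler UI or inspected programmatically. The row attributes (`inputs`, `scores`) in this sketch are assumed names used for illustration.

```python
# Sketch: summarize scores programmatically; attribute names are illustrative.
for row in results:
    print(row.inputs["question"], row.scores)

# Example aggregate: share of responses rated "High" on relevance.
high = sum(1 for row in results if row.scores.get("answer_relevance") == "High")
print(f"High-relevance responses: {high}/{len(results)}")
```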


Built-In Evaluators

The Fiddler Evals SDK includes 13 pre-built evaluators, including:

Quality & Accuracy

  • AnswerRelevance - Measures response relevance to the question (High / Medium / Low)

  • Coherence - Evaluates logical flow and consistency

  • Conciseness - Checks for unnecessary verbosity

Safety & Trust

  • FTLPromptSafety - Detects prompt injection, jailbreaks, and unsafe prompts

  • FTLResponseFaithfulness - Evaluates faithfulness of LLM responses (Fast Trust Model)

RAG Health Metrics

  • AnswerRelevance - Measures how well responses address user queries (High / Medium / Low)

  • ContextRelevance - Evaluates whether retrieved documents are relevant to the query (High / Medium / Low)

  • RAGFaithfulness - Checks if response is grounded in retrieved documents (Yes / No)

Use these three evaluators together as a diagnostic triad to pinpoint whether RAG pipeline issues originate in retrieval, generation, or query understanding.

Example: RAG Experiment
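A sketch that applies the diagnostic triad to the task defined in Step 4. As in the earlier steps, the import paths and keyword arguments are assumptions.

```python
# Sketch: the RAG diagnostic triad; names and arguments are illustrative.
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness

rag_results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance(), ContextRelevance(), RAGFaithfulness()],
    name_prefix="rag-triad",
)
```

Read the three scores together: low ContextRelevance points to a retrieval problem, a "No" from RAGFaithfulness means the generator is not grounded in its context, and low AnswerRelevance despite relevant context suggests the query was misunderstood.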


Custom Evaluators

Create domain-specific evaluators for your use case:
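A sketch of a domain-specific evaluator. The `Evaluator` base class and the `score()` hook are assumed names; follow the SDK's custom-evaluator interface for the real contract.

```python
# Sketch: a custom evaluator; base class and hook name are illustrative assumptions.
from fiddler_evals.evaluators import Evaluator

class MentionsDisclaimer(Evaluator):
    """Checks that financial answers include a 'not financial advice' disclaimer."""

    def score(self, inputs: dict, outputs: dict) -> dict:
        text = outputs["response"].lower()
        present = "not financial advice" in text
        return {"mentions_disclaimer": "Yes" if present else "No"}
```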


Advanced Features

Batch Experiments with Parallel Processing
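A sketch showing the `max_workers` option mentioned under Best Practices to parallelize evaluation over large datasets; the rest of the call mirrors Step 5, and the argument names remain assumptions.

```python
# Sketch: parallel evaluation of a large dataset; max_workers per Best Practices.
results = evaluate(
    dataset=large_dataset,  # e.g. imported from a file (see below)
    task=rag_task,
    evaluators=[AnswerRelevance(), RAGFaithfulness()],
    max_workers=8,          # evaluate test cases concurrently
    name_prefix="batch-run",
)
```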

Import Datasets from Files
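A sketch of loading test cases from a CSV file with pandas and creating a dataset from the rows; the `Dataset.create()` call follows the same assumption made in Step 3.

```python
# Sketch: import test cases from a CSV file; the Dataset API is an assumption.
import pandas as pd
from fiddler_evals import Dataset

df = pd.read_csv("test_cases.csv")  # columns: question, expected_answer
large_dataset = Dataset.create(
    name="imported-test-cases",
    application_id=application.id,
    items=df.to_dict(orient="records"),
)
```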

Track Experiment Metadata
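A sketch of tagging a run with metadata (model version, prompt version, commit hash) so results can be compared over time; the `metadata` keyword argument is an assumption.

```python
# Sketch: attach metadata to a run for later comparison; argument name is illustrative.
results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance()],
    name_prefix="v2-prompt",
    metadata={
        "model": "gpt-4o-mini",
        "prompt_version": "2.1",
        "git_commit": "abc1234",
    },
)
```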


Complete Example: RAG Experiment Pipeline
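An end-to-end sketch combining Steps 1-6 with the RAG diagnostic triad. Every SDK name below repeats the assumptions made in the earlier sketches; adapt it to the actual API.

```python
# Sketch: end-to-end RAG experiment pipeline (all SDK names are illustrative assumptions).
import fiddler_evals as fe
from fiddler_evals import Project, Application, Dataset, evaluate
from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness

fe.init(url="https://your-org.fiddler.ai", token="YOUR_ACCESS_TOKEN")

project = Project.get_or_create(name="evals-quickstart")
application = Application.get_or_create(name="rag-chatbot", project_id=project.id)

dataset = Dataset.create(
    name="rag-regression-suite",
    application_id=application.id,
    items=[{"question": "What is retrieval-augmented generation?"}],
)

def rag_task(item: dict) -> dict:
    result = my_rag_pipeline(item["question"])  # your retrieval + generation code
    return {"response": result["answer"], "context": result["documents"]}

results = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=[AnswerRelevance(), ContextRelevance(), RAGFaithfulness()],
    name_prefix="rag-pipeline",
    max_workers=4,
)

for row in results:
    print(row.inputs["question"], row.scores)
```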


Best Practices

  1. Start Small: Begin with 10-20 test cases to validate your setup

  2. Use Multiple Evaluators: Combine quality, safety, and domain-specific evaluators

  3. Version Your Experiments: Use name_prefix to track different experiment runs

  4. Monitor Over Time: Run experiments regularly to catch regressions

  5. Custom Evaluators: Create domain-specific evaluators for specialized needs

  6. Leverage Parallelization: Use max_workers for faster evaluation of large datasets

  7. Organize Hierarchically: Use the Projects > Applications > Datasets structure


Next Steps

Complete Guides

Concepts & Background

Integration Guides


Summary

You've learned how to:

  • ✅ Initialize the Fiddler Evals SDK with init()

  • ✅ Create Projects, Applications, and Datasets for organization

  • ✅ Build experiment datasets with test cases

  • ✅ Use 13 built-in evaluators for quality, safety, and RAG metrics

  • ✅ Create custom evaluators for domain-specific requirements

  • ✅ Run experiments with the evaluate() function

  • ✅ Analyze results programmatically and in the Fiddler UI

The Fiddler Evals SDK provides a comprehensive framework for systematic LLM experiments, enabling you to ensure quality, safety, and accuracy before deploying your AI applications.