Overview

Quick Start Guides

Ready to start testing your LLM applications? Choose the hands-on guide that matches your evaluation needs. Each quick start provides step-by-step instructions and code examples, and takes 15-20 minutes to complete.

New to Fiddler Evals? Start with our comprehensive Evaluations guide to understand core concepts, workflows, and best practices before diving into these quick starts.


Evaluations SDK Quick Start

Build comprehensive evaluation workflows with built-in and custom evaluators

[Screenshot: Fiddler Evaluations experiment results page, where you analyze experiment results with detailed metrics and insights]

What you'll learn:

  • Connect to Fiddler and set up evaluation projects

  • Create datasets with test cases (CSV, JSONL, or DataFrame)

  • Use production-ready evaluators (Relevance, Coherence, Toxicity, Sentiment)

  • Build custom evaluators for domain-specific requirements (see the sketch after this list)

  • Run evaluation experiments with parallel processing

  • Analyze results and export data for further analysis
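
The quick start covers each of these steps with the SDK itself. To give a flavor of what a custom, domain-specific evaluator can look like, here is a minimal plain-Python sketch run over a small test set; the function names, column names, and CSV export are illustrative assumptions, not the Fiddler Evals SDK's actual API.

```python
# Illustrative only: a plain-Python custom evaluator run over a small test set.
# Names and columns here are hypothetical and do not reflect the Fiddler Evals
# SDK's actual classes or signatures; see the quick start for those.
import pandas as pd

def keyword_coverage(response: str, required_keywords: list[str]) -> float:
    """Toy domain-specific metric: fraction of required keywords present in the response."""
    text = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords) if required_keywords else 1.0

# A tiny dataset of test cases (in practice, loaded from CSV, JSONL, or a DataFrame).
test_cases = pd.DataFrame(
    [
        {"prompt": "Explain our refund policy.",
         "response": "Refunds are issued within 30 days.",
         "keywords": ["refund", "30 days"]},
        {"prompt": "How do I reset my password?",
         "response": "Use the 'Forgot password' link on the login page.",
         "keywords": ["password", "login"]},
    ]
)

# Score every test case, inspect aggregates, and export results for further analysis.
test_cases["keyword_coverage"] = test_cases.apply(
    lambda row: keyword_coverage(row["response"], row["keywords"]), axis=1
)
print(test_cases[["prompt", "keyword_coverage"]])
print("mean score:", test_cases["keyword_coverage"].mean())
test_cases.to_csv("experiment_results.csv", index=False)
```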

Perfect for:

  • Teams needing full control over evaluation logic

  • Building comprehensive test suites with multiple quality dimensions

  • Creating domain-specific custom metrics

  • Programmatic evaluation workflows and CI/CD integration

Time to complete: ~20 minutes

Start Evaluations SDK Quick Start →


Prompt Specs Quick Start

Create custom LLM-as-a-Judge evaluations without manual prompt engineering

What you'll build: A news article topic classifier that demonstrates:

  • Schema-based evaluation definition (no prompt writing!)

  • Validation and testing workflows

  • Iterative improvement with field descriptions

  • Production deployment as Fiddler enrichments

What you'll learn:

  • Define evaluation schemas using JSON (illustrated after this list)

  • Validate Prompt Specs before deployment

  • Test evaluation logic with sample data

  • Improve accuracy through structured descriptions

  • Deploy custom evaluators to production monitoring
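
As a rough sketch of what a schema-driven evaluation definition can look like, consider the snippet below. The structure and field names are hypothetical illustrations of the idea, not Fiddler's actual Prompt Specs format; the quick start walks through the real schema.

```python
# Hypothetical sketch of a schema-driven evaluation definition.
# Field names and structure are illustrative assumptions, not Fiddler's
# actual Prompt Specs format; follow the quick start for the real schema.
import json

topic_classifier_spec = {
    "name": "news_topic_classifier",
    "description": "Classify a news article into a single topic.",
    "inputs": {
        "article": "The full text of the news article to classify.",
    },
    "output": {
        "topic": {
            "type": "enum",
            "values": ["politics", "business", "sports", "technology", "other"],
            "description": "The single topic that best describes the article.",
        }
    },
}

# Schema-driven workflows typically serialize the definition to JSON for
# validation, testing, and deployment.
print(json.dumps(topic_classifier_spec, indent=2))
```

The evaluation is described entirely through fields and descriptions; iterating on those descriptions, rather than rewriting a judge prompt, is what the quick start means by improving accuracy through structured descriptions.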

Perfect for:

  • Teams needing domain-specific evaluation logic

  • Avoiding time-consuming prompt engineering

  • Rapid iteration on evaluation criteria

  • Schema-driven evaluation workflows

Time to complete: ~15 minutes

Start Prompt Specs Quick Start →


Compare LLM Outputs

Systematically compare different LLM models to make data-driven decisions

What you'll learn:

  • Compare outputs from different LLM models (GPT-4, Claude, Llama, etc.), as sketched after this list

  • Evaluate multiple prompt variations side-by-side

  • Use Fiddler's observability features for pre-production testing

  • Balance quality, cost, and latency trade-offs
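
As a rough sketch of the comparison workflow, the snippet below runs the same prompts through two placeholder model functions and tabulates quality and latency per model. The model calls and the quality metric are stand-ins you would replace with real provider clients and evaluators.

```python
# Illustrative sketch: score the same prompts against two candidate models and
# compare quality and latency side by side. The model-calling functions are
# placeholders; swap in your own provider clients and evaluators.
import time
import pandas as pd

def call_model_a(prompt: str) -> str:
    return f"Model A answer to: {prompt}"   # placeholder for a real API call

def call_model_b(prompt: str) -> str:
    return f"Model B answer to: {prompt}"   # placeholder for a real API call

def quality_score(response: str) -> float:
    # Placeholder metric; in practice use an evaluator (relevance, faithfulness, ...).
    return min(len(response) / 100.0, 1.0)

prompts = ["Summarize today's earnings report.", "Draft a polite refund email."]
rows = []
for name, call in [("model_a", call_model_a), ("model_b", call_model_b)]:
    for prompt in prompts:
        start = time.perf_counter()
        response = call(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        rows.append({"model": name, "prompt": prompt,
                     "quality": quality_score(response), "latency_ms": latency_ms})

results = pd.DataFrame(rows)
print(results.groupby("model")[["quality", "latency_ms"]].mean())
```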

Perfect for:

  • Model selection and validation

  • Prompt A/B testing and optimization

  • Cost optimization through model comparison

  • Pre-production evaluation of LLM outputs

Time to complete: ~15 minutes

Interactive notebook:

Start Comparing Models →


Choosing the Right Quick Start

Not sure which guide to start with? Quick recommendations:

  • Evaluations SDK Quick Start - you want full programmatic control, custom metrics, or CI/CD integration

  • Prompt Specs Quick Start - you want LLM-as-a-Judge evaluations without manual prompt engineering

  • Compare LLM Outputs - you are choosing between models or prompt variations

Core Evaluation Concepts

These quick starts demonstrate key Fiddler Evals capabilities:

Built-in Evaluators

Production-ready metrics that run on Fiddler Trust Models:

  • Quality: Answer Relevance, Coherence, Conciseness, Completeness

  • Safety: Toxicity Detection, Prompt Injection, PII Detection

  • RAG-Specific: Faithfulness, Context Relevance

  • Sentiment: Multi-score sentiment and topic classification

Key benefits:

  • Zero external API costs

  • <100ms latency for real-time evaluation

  • Your data never leaves your environment

Custom Evaluation Frameworks

Build domain-specific evaluators using:

  • Python-based evaluators - Full programmatic control

  • Prompt Specs - Schema-driven LLM-as-a-Judge (no manual prompting)

  • Function wrappers - Integrate existing evaluation logic (sketched below)
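
As a sketch of the function-wrapper idea, the snippet below adapts an existing scoring function to a common evaluator interface. The `Evaluator` protocol shown is an assumption made for illustration, not a class from the Fiddler Evals SDK.

```python
# Illustrative sketch of wrapping existing evaluation logic behind a common
# interface. The `Evaluator` protocol is an assumption for this example,
# not a class from the Fiddler Evals SDK.
from typing import Callable, Protocol

class Evaluator(Protocol):
    def evaluate(self, prompt: str, response: str) -> float: ...

def wrap_function(fn: Callable[[str, str], float]) -> Evaluator:
    """Adapt any (prompt, response) -> score function to the evaluator interface."""
    class _Wrapped:
        def evaluate(self, prompt: str, response: str) -> float:
            return fn(prompt, response)
    return _Wrapped()

# Existing logic you already trust, reused without modification:
def mentions_refund(prompt: str, response: str) -> float:
    return 1.0 if "refund" in response.lower() else 0.0

evaluator = wrap_function(mentions_refund)
print(evaluator.evaluate("What is the refund policy?",
                         "Refunds are issued within 30 days."))
```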

Experiment Tracking & Comparison

Every evaluation run becomes a tracked experiment:

  • Complete lineage of inputs, outputs, and scores

  • Side-by-side experiment comparison in Fiddler UI

  • Aggregate statistics and drill-down analysis

  • Export capabilities for further processing

Common Evaluation Workflows

These quick starts support various evaluation scenarios:

Pre-Production Testing

  • Regression Testing: Run comprehensive test suites before deployment

  • Quality Gates: Set score thresholds that must be met (see the example after this list)

  • Version Validation: Compare model versions on same datasets
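
A quality gate can be as simple as a script in your CI pipeline that fails the build when an aggregate evaluation score falls below a threshold. The sketch below stubs out score loading; in practice you would read exported experiment results from your evaluation run.

```python
# Illustrative CI quality gate: fail the build if an aggregate evaluation score
# drops below a threshold. Score loading is stubbed; in practice, read the
# exported experiment results from your evaluation pipeline.
import sys

SCORE_THRESHOLD = 0.85  # assumed threshold; tune to your application

def load_aggregate_score() -> float:
    # Placeholder: return the mean relevance/faithfulness score from an
    # exported experiment results file.
    return 0.91

score = load_aggregate_score()
if score < SCORE_THRESHOLD:
    print(f"Quality gate failed: {score:.2f} < {SCORE_THRESHOLD:.2f}")
    sys.exit(1)
print(f"Quality gate passed: {score:.2f} >= {SCORE_THRESHOLD:.2f}")
```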

Model & Prompt Optimization

  • A/B Testing: Compare prompt variations quantitatively

  • Model Selection: Evaluate multiple LLMs on same tasks

  • Hyperparameter Tuning: Test temperature, top-p, and other configs

RAG System Evaluation

  • Faithfulness Checking: Verify responses are grounded in context (illustrated after this list)

  • Context Relevance: Assess quality of retrieved documents

  • Source Attribution: Validate proper citation of sources
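
To illustrate what "grounded in context" means, the snippet below flags response sentences that share few words with the retrieved context. This naive overlap heuristic is only for illustration; Fiddler's built-in Faithfulness evaluator runs on Trust Models rather than word overlap.

```python
# Naive illustration of faithfulness checking: flag response sentences with
# little word overlap against the retrieved context. This is NOT Fiddler's
# Faithfulness metric; it only conveys the idea of grounding responses in
# retrieved context.
def ungrounded_sentences(response: str, context: str, min_overlap: float = 0.5) -> list[str]:
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence.strip())
    return flagged

context = "The warranty covers manufacturing defects for two years from purchase."
response = "The warranty covers defects for two years. It also includes free shipping."
print(ungrounded_sentences(response, context))  # flags the unsupported shipping claim
```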

Safety & Compliance

  • Adversarial Testing: Test with jailbreak attempts and prompt injections

  • Content Moderation: Measure toxicity, bias, and PII exposure

  • Policy Validation: Ensure outputs meet organizational standards

From Development to Production

Fiddler Evals integrates seamlessly with production monitoring:

Unified Workflow Benefits:

  • Consistent Metrics: Same evaluators in development and production

  • Continuous Learning: Production insights feed back into test datasets

  • Seamless Transition: Deploy with confidence—monitoring matches testing

Complete AI Lifecycle:

  1. Build → Design and instrument your applications

  2. Test → Evaluate with Fiddler Evals (these quick starts)

  3. Monitor → Track production with Agentic Monitoring

  4. Improve → Refine based on insights

Learn more about Fiddler's end-to-end agentic AI lifecycle.

Getting Started Checklist

Ready to evaluate your LLM applications?

Additional Resources

Learn More:

Example Notebooks:

Related Capabilities:

