Overview

Quick Start Guides

Ready to start testing your LLM applications? Choose the hands-on guide that matches your evaluation needs. Each quick start provides step-by-step instructions and code examples, and takes 15-20 minutes to complete.

New to Fiddler Evals? Start with our comprehensive Evaluations guide to understand core concepts, workflows, and best practices before diving into these quick starts.


Evaluations SDK Quick Start

Build comprehensive evaluation workflows with built-in and custom evaluators

[Screenshot: Fiddler Evaluations experiment results page, where you analyze experiment results with detailed metrics and insights]

What you'll learn:

  • Connect to Fiddler and set up evaluation projects

  • Create datasets with test cases (CSV, JSONL, or DataFrame)

  • Use production-ready evaluators (Relevance, Coherence, Toxicity, Sentiment)

  • Build custom evaluators for domain-specific requirements (see the sketch after this list)

  • Run evaluation experiments with parallel processing

  • Analyze results and export data for further analysis
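
The quick start covers each of these steps with the SDK itself. To give a flavor of what a custom, domain-specific evaluator can look like, here is a minimal plain-Python sketch run over a small test set; the function names, column names, and CSV export are illustrative assumptions, not the Fiddler Evals SDK's actual API.

```python
# Illustrative only: a plain-Python custom evaluator run over a small test set.
# Names and columns here are hypothetical and do not reflect the Fiddler Evals
# SDK's actual classes or signatures; see the quick start for those.
import pandas as pd

def keyword_coverage(response: str, required_keywords: list[str]) -> float:
    """Toy domain-specific metric: fraction of required keywords present in the response."""
    text = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords) if required_keywords else 1.0

# A tiny dataset of test cases (in practice, loaded from CSV, JSONL, or a DataFrame).
test_cases = pd.DataFrame(
    [
        {"prompt": "Explain our refund policy.",
         "response": "Refunds are issued within 30 days.",
         "keywords": ["refund", "30 days"]},
        {"prompt": "How do I reset my password?",
         "response": "Use the 'Forgot password' link on the login page.",
         "keywords": ["password", "login"]},
    ]
)

# Score every test case, inspect aggregates, and export results for further analysis.
test_cases["keyword_coverage"] = test_cases.apply(
    lambda row: keyword_coverage(row["response"], row["keywords"]), axis=1
)
print(test_cases[["prompt", "keyword_coverage"]])
print("mean score:", test_cases["keyword_coverage"].mean())
test_cases.to_csv("experiment_results.csv", index=False)
```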

Perfect for:

  • Teams needing full control over evaluation logic

  • Building comprehensive test suites with multiple quality dimensions

  • Creating domain-specific custom metrics

  • Programmatic evaluation workflows and CI/CD integration

Time to complete: ~20 minutes

Start Evaluations SDK Quick Start →


Prompt Specs Quick Start

Create custom LLM-as-a-Judge evaluations without manual prompt engineering

What you'll build: A news article topic classifier that demonstrates:

  • Schema-based evaluation definition (no prompt writing!)

  • Validation and testing workflows

  • Iterative improvement with field descriptions

  • Production deployment as Fiddler enrichments

What you'll learn:

  • Define evaluation schemas using JSON (illustrated after this list)

  • Validate Prompt Specs before deployment

  • Test evaluation logic with sample data

  • Improve accuracy through structured descriptions

  • Deploy custom evaluators to production monitoring
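
As a rough sketch of what a schema-driven evaluation definition can look like, consider the snippet below. The structure and field names are hypothetical illustrations of the idea, not Fiddler's actual Prompt Specs format; the quick start walks through the real schema.

```python
# Hypothetical sketch of a schema-driven evaluation definition.
# Field names and structure are illustrative assumptions, not Fiddler's
# actual Prompt Specs format; follow the quick start for the real schema.
import json

topic_classifier_spec = {
    "name": "news_topic_classifier",
    "description": "Classify a news article into a single topic.",
    "inputs": {
        "article": "The full text of the news article to classify.",
    },
    "output": {
        "topic": {
            "type": "enum",
            "values": ["politics", "business", "sports", "technology", "other"],
            "description": "The single topic that best describes the article.",
        }
    },
}

# Schema-driven workflows typically serialize the definition to JSON for
# validation, testing, and deployment.
print(json.dumps(topic_classifier_spec, indent=2))
```

The evaluation is described entirely through fields and descriptions; iterating on those descriptions, rather than rewriting a judge prompt, is what the quick start means by improving accuracy through structured descriptions.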

Perfect for:

  • Teams needing domain-specific evaluation logic

  • Avoiding time-consuming prompt engineering

  • Rapid iteration on evaluation criteria

  • Schema-driven evaluation workflows

Time to complete: ~15 minutes

Start Prompt Specs Quick Start →


Compare LLM Outputs

Systematically compare different LLM models to make data-driven decisions

What you'll learn:

  • Compare outputs from different LLM models (GPT-4, Claude, Llama, etc.), as sketched after this list

  • Evaluate multiple prompt variations side-by-side

  • Use Fiddler's observability features for pre-production testing

  • Balance quality, cost, and latency trade-offs
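
As a rough sketch of the comparison workflow, the snippet below runs the same prompts through two placeholder model functions and tabulates quality and latency per model. The model calls and the quality metric are stand-ins you would replace with real provider clients and evaluators.

```python
# Illustrative sketch: score the same prompts against two candidate models and
# compare quality and latency side by side. The model-calling functions are
# placeholders; swap in your own provider clients and evaluators.
import time
import pandas as pd

def call_model_a(prompt: str) -> str:
    return f"Model A answer to: {prompt}"   # placeholder for a real API call

def call_model_b(prompt: str) -> str:
    return f"Model B answer to: {prompt}"   # placeholder for a real API call

def quality_score(response: str) -> float:
    # Placeholder metric; in practice use an evaluator (relevance, faithfulness, ...).
    return min(len(response) / 100.0, 1.0)

prompts = ["Summarize today's earnings report.", "Draft a polite refund email."]
rows = []
for name, call in [("model_a", call_model_a), ("model_b", call_model_b)]:
    for prompt in prompts:
        start = time.perf_counter()
        response = call(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        rows.append({"model": name, "prompt": prompt,
                     "quality": quality_score(response), "latency_ms": latency_ms})

results = pd.DataFrame(rows)
print(results.groupby("model")[["quality", "latency_ms"]].mean())
```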

Perfect for:

  • Model selection and validation

  • Prompt A/B testing and optimization

  • Cost optimization through model comparison

  • Pre-production evaluation of LLM outputs

Time to complete: ~15 minutes

Interactive notebook:

Start Comparing Models →


Choosing the Right Quick Start

Not sure which guide to start with? Quick recommendations:

  • Evaluations SDK Quick Start - you want full programmatic control, custom metrics, or CI/CD integration

  • Prompt Specs Quick Start - you want LLM-as-a-Judge evaluations without manual prompt engineering

  • Compare LLM Outputs - you are choosing between models or prompt variations

Core Evaluation Concepts

These quick starts demonstrate key Fiddler Evals capabilities:

Built-in Evaluators

Production-ready metrics that run on Fiddler Trust Models:

  • Quality: Answer Relevance, Coherence, Conciseness, Completeness

  • Safety: Toxicity Detection, Prompt Injection, PII Detection

  • RAG-Specific: Faithfulness, Context Relevance

  • Sentiment: Multi-score sentiment and topic classification

Key benefits:

  • Zero external API costs

  • <100ms latency for real-time evaluation

  • Your data never leaves your environment

Custom Evaluation Frameworks

Build domain-specific evaluators using:

  • Python-based evaluators - Full programmatic control

  • Prompt Specs - Schema-driven LLM-as-a-Judge (no manual prompting)

  • Function wrappers - Integrate existing evaluation logic (sketched below)
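
As a sketch of the function-wrapper idea, the snippet below adapts an existing scoring function to a common evaluator interface. The `Evaluator` protocol shown is an assumption made for illustration, not a class from the Fiddler Evals SDK.

```python
# Illustrative sketch of wrapping existing evaluation logic behind a common
# interface. The `Evaluator` protocol is an assumption for this example,
# not a class from the Fiddler Evals SDK.
from typing import Callable, Protocol

class Evaluator(Protocol):
    def evaluate(self, prompt: str, response: str) -> float: ...

def wrap_function(fn: Callable[[str, str], float]) -> Evaluator:
    """Adapt any (prompt, response) -> score function to the evaluator interface."""
    class _Wrapped:
        def evaluate(self, prompt: str, response: str) -> float:
            return fn(prompt, response)
    return _Wrapped()

# Existing logic you already trust, reused without modification:
def mentions_refund(prompt: str, response: str) -> float:
    return 1.0 if "refund" in response.lower() else 0.0

evaluator = wrap_function(mentions_refund)
print(evaluator.evaluate("What is the refund policy?",
                         "Refunds are issued within 30 days."))
```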

Experiment Tracking & Comparison

Every evaluation run becomes a tracked experiment:

  • Complete lineage of inputs, outputs, and scores

  • Side-by-side experiment comparison in Fiddler UI

  • Aggregate statistics and drill-down analysis

  • Export capabilities for further processing

Common Evaluation Workflows

These quick starts support various evaluation scenarios:

Pre-Production Testing

  • Regression Testing: Run comprehensive test suites before deployment

  • Quality Gates: Set score thresholds that must be met (see the example after this list)

  • Version Validation: Compare model versions on same datasets
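
A quality gate can be as simple as a script in your CI pipeline that fails the build when an aggregate evaluation score falls below a threshold. The sketch below stubs out score loading; in practice you would read exported experiment results from your evaluation run.

```python
# Illustrative CI quality gate: fail the build if an aggregate evaluation score
# drops below a threshold. Score loading is stubbed; in practice, read the
# exported experiment results from your evaluation pipeline.
import sys

SCORE_THRESHOLD = 0.85  # assumed threshold; tune to your application

def load_aggregate_score() -> float:
    # Placeholder: return the mean relevance/faithfulness score from an
    # exported experiment results file.
    return 0.91

score = load_aggregate_score()
if score < SCORE_THRESHOLD:
    print(f"Quality gate failed: {score:.2f} < {SCORE_THRESHOLD:.2f}")
    sys.exit(1)
print(f"Quality gate passed: {score:.2f} >= {SCORE_THRESHOLD:.2f}")
```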

Model & Prompt Optimization

  • A/B Testing: Compare prompt variations quantitatively

  • Model Selection: Evaluate multiple LLMs on same tasks

  • Hyperparameter Tuning: Test temperature, top-p, and other configs

RAG System Evaluation

  • Faithfulness Checking: Verify responses are grounded in context (illustrated after this list)

  • Context Relevance: Assess quality of retrieved documents

  • Source Attribution: Validate proper citation of sources
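
To illustrate what "grounded in context" means, the snippet below flags response sentences that share few words with the retrieved context. This naive overlap heuristic is only for illustration; Fiddler's built-in Faithfulness evaluator runs on Trust Models rather than word overlap.

```python
# Naive illustration of faithfulness checking: flag response sentences with
# little word overlap against the retrieved context. This is NOT Fiddler's
# Faithfulness metric; it only conveys the idea of grounding responses in
# retrieved context.
def ungrounded_sentences(response: str, context: str, min_overlap: float = 0.5) -> list[str]:
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence.strip())
    return flagged

context = "The warranty covers manufacturing defects for two years from purchase."
response = "The warranty covers defects for two years. It also includes free shipping."
print(ungrounded_sentences(response, context))  # flags the unsupported shipping claim
```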

Safety & Compliance

  • Adversarial Testing: Test with jailbreak attempts and prompt injections

  • Content Moderation: Measure toxicity, bias, and PII exposure

  • Policy Validation: Ensure outputs meet organizational standards

From Development to Production

Fiddler Evals integrates seamlessly with production monitoring:

Unified Workflow Benefits:

  • Consistent Metrics: Same evaluators in development and production

  • Continuous Learning: Production insights feed back into test datasets

  • Seamless Transition: Deploy with confidence—monitoring matches testing

Complete AI Lifecycle:

  1. Build → Design and instrument your applications

  2. Test → Evaluate with Fiddler Evals (these quick starts)

  3. Monitor → Track production with Agentic Monitoring

  4. Improve → Refine based on insights

Learn more about Fiddler's end-to-end agentic AI lifecycle.

Getting Started Checklist

Ready to evaluate your LLM applications?

Additional Resources

Learn More:

Example Notebooks:

Related Capabilities:

