Overview
Quick Start Guides
Ready to start testing your LLM applications? Choose the hands-on guide that matches your evaluation needs. Each quick start provides step-by-step instructions and code examples, and takes 15-20 minutes to complete.
New to Fiddler Evals? Start with our comprehensive Evaluations guide to understand core concepts, workflows, and best practices before diving into these quick starts.
Evaluations SDK Quick Start
Build comprehensive evaluation workflows with built-in and custom evaluators

What you'll learn:
Connect to Fiddler and set up evaluation projects
Create datasets with test cases (CSV, JSONL, or DataFrame)
Use production-ready evaluators (Relevance, Coherence, Toxicity, Sentiment)
Build custom evaluators for domain-specific requirements
Run evaluation experiments with parallel processing
Analyze results and export data for further analysis
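As a preview of that flow, here is a minimal sketch of an end-to-end run. Every name in it (fiddler_evals, init, Dataset.from_csv, AnswerRelevance, run_experiment) is an illustrative assumption rather than the exact SDK API; the quick start itself shows the authoritative imports and signatures.

```python
# Minimal sketch of an end-to-end SDK evaluation run. Every name here is an
# illustrative assumption about the Fiddler Evals SDK, not its exact API --
# follow the quick start for the real imports and signatures.
import fiddler_evals as fe  # hypothetical module name

# 1. Connect to Fiddler and pick a project.
fe.init(url="https://your-org.fiddler.ai", token="YOUR_API_TOKEN")
project = fe.Project.get_or_create(name="evals-quickstart")

# 2. Load test cases (CSV, JSONL, or a pandas DataFrame).
dataset = fe.Dataset.from_csv("test_cases.csv", project=project)

# 3. The application under test: any callable mapping inputs to a response.
def my_llm_app(inputs: dict) -> str:
    return "placeholder response for " + inputs["question"]

# 4. Mix built-in evaluators with a custom, domain-specific one.
def has_disclaimer(inputs: dict, output: str) -> float:
    return 1.0 if "not financial advice" in output.lower() else 0.0

evaluators = [fe.evaluators.AnswerRelevance(), fe.evaluators.Toxicity(), has_disclaimer]

# 5. Run the experiment with parallel workers, then inspect aggregate scores.
experiment = fe.run_experiment(
    dataset=dataset,
    task=my_llm_app,
    evaluators=evaluators,
    max_workers=8,
)
print(experiment.summary())
```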
Perfect for:
Teams needing full control over evaluation logic
Building comprehensive test suites with multiple quality dimensions
Creating domain-specific custom metrics
Programmatic evaluation workflows and CI/CD integration
Time to complete: ~20 minutes
Start Evaluations SDK Quick Start →
Prompt Specs Quick Start
Create custom LLM-as-a-Judge evaluations without manual prompt engineering
What you'll build: A news article topic classifier that demonstrates:
Schema-based evaluation definition (no prompt writing!)
Validation and testing workflows
Iterative improvement with field descriptions
Production deployment as Fiddler enrichments
What you'll learn:
Define evaluation schemas using JSON
Validate Prompt Specs before deployment
Test evaluation logic with sample data
Improve accuracy through structured descriptions
Deploy custom evaluators to production monitoring
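To show what schema-based means in practice, the sketch below assembles a hypothetical topic-classification spec as a plain Python dict and prints it as JSON. The field names (name, inputs, output, choices, choice_descriptions) are assumptions for illustration; the quick start documents the actual Prompt Spec format.

```python
import json

# Illustrative sketch of a Prompt Spec for a news-topic classifier. The exact
# field names and structure below are assumptions -- the quick start documents
# the real Prompt Spec schema.
topic_spec = {
    "name": "news_topic",
    "description": "Classify the primary topic of a news article.",
    "inputs": ["article_text"],
    "output": {
        "type": "choice",
        "choices": ["politics", "business", "sports", "technology", "other"],
        # Richer descriptions of each choice are the main lever for accuracy.
        "choice_descriptions": {
            "business": "Markets, earnings, corporate strategy, or the economy.",
            "technology": "Software, hardware, AI, or the tech industry.",
        },
    },
}

print(json.dumps(topic_spec, indent=2))  # inspect the schema before uploading
```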
Perfect for:
Teams needing domain-specific evaluation logic
Avoiding time-consuming prompt engineering
Rapid iteration on evaluation criteria
Schema-driven evaluation workflows
Time to complete: ~15 minutes
Start Prompt Specs Quick Start →
Compare LLM Outputs
Systematically compare different LLMs to make data-driven decisions
What you'll learn:
Compare outputs from different LLMs (GPT-4, Claude, Llama, etc.)
Evaluate multiple prompt variations side-by-side
Use Fiddler's observability features for pre-production testing
Balance quality, cost, and latency trade-offs
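The core comparison loop is simple enough to sketch in plain Python: run the same test cases through each candidate, score the outputs, and compare averages. The candidate functions below are stand-ins for real model calls, and the exact-match scorer is a deliberately crude placeholder for the evaluators covered in this guide.

```python
from statistics import mean
from typing import Callable

# Toy A/B harness: same test cases, two candidate "models" (stand-ins for real
# LLM API calls), one scoring function. Swap in real calls and real evaluators.
test_cases = [
    {"question": "What is the capital of France?", "expected": "paris"},
    {"question": "Who wrote Hamlet?", "expected": "shakespeare"},
]

def candidate_a(question: str) -> str:  # placeholder for e.g. a GPT-4 call
    answers = {
        "What is the capital of France?": "Paris",
        "Who wrote Hamlet?": "Shakespeare",
    }
    return answers[question]

def candidate_b(question: str) -> str:  # placeholder for a smaller, cheaper model
    answers = {
        "What is the capital of France?": "Paris, France",
        "Who wrote Hamlet?": "Not sure",
    }
    return answers[question]

def exact_contains(expected: str, output: str) -> float:
    return 1.0 if expected in output.lower() else 0.0

def score_model(generate: Callable[[str], str]) -> float:
    return mean(exact_contains(tc["expected"], generate(tc["question"])) for tc in test_cases)

print("candidate A:", score_model(candidate_a), "candidate B:", score_model(candidate_b))
```

On real tasks you would replace the exact-match scorer with quality evaluators and record latency and cost alongside the scores, which is exactly the trade-off analysis this guide walks through.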
Perfect for:
Model selection and validation
Prompt A/B testing and optimization
Cost optimization through model comparison
Pre-production evaluation of LLM outputs
Time to complete: ~15 minutes
Interactive notebook:
Choosing the Right Quick Start
Not sure which guide to start with? Use this decision tree:
Quick recommendations:
🎯 First-time users: Start with Evaluations SDK Quick Start to learn the fundamentals
🔧 Custom evaluations needed: Use Prompt Specs Quick Start for schema-based approach
📊 Model comparison: Jump to Compare LLM Outputs for side-by-side testing
Core Evaluation Concepts
These quick starts demonstrate key Fiddler Evals capabilities:
Built-in Evaluators
Production-ready metrics that run on Fiddler Trust Models:
Quality: Answer Relevance, Coherence, Conciseness, Completeness
Safety: Toxicity Detection, Prompt Injection, PII Detection
RAG-Specific: Faithfulness, Context Relevance
Sentiment: Multi-score sentiment and topic classification
Key benefits:
Zero external API costs
<100ms latency for real-time evaluation
Your data never leaves your environment
Custom Evaluation Frameworks
Build domain-specific evaluators using:
Python-based evaluators - Full programmatic control
Prompt Specs - Schema-driven LLM-as-a-Judge (no manual prompting)
Function wrappers - Integrate existing evaluation logic
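For instance, a Python-based evaluator can be an ordinary function that returns a score; how you register or wrap it for an experiment run depends on the SDK, so that step is omitted here.

```python
import re

def cites_sources(output: str) -> float:
    """Toy domain-specific evaluator: returns 1.0 if the response contains a
    bracketed citation like [1] or a URL, 0.0 otherwise. Real evaluators can
    return richer payloads (labels, explanations) depending on how you wrap
    them for an experiment run."""
    has_bracket_citation = bool(re.search(r"\[\d+\]", output))
    has_url = bool(re.search(r"https?://\S+", output))
    return 1.0 if (has_bracket_citation or has_url) else 0.0

# Quick sanity checks
assert cites_sources("See the 2023 filing [1] for details.") == 1.0
assert cites_sources("I think revenue went up last year.") == 0.0
```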
Experiment Tracking & Comparison
Every evaluation run becomes a tracked experiment:
Complete lineage of inputs, outputs, and scores
Side-by-side experiment comparison in Fiddler UI
Aggregate statistics and drill-down analysis
Export capabilities for further processing
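For example, once row-level results are exported (the file layout below is an assumed example, not a guaranteed format), standard pandas operations cover the aggregate and drill-down analysis:

```python
import pandas as pd

# Assumes the experiment's row-level results were exported to CSV (from the
# Fiddler UI or the SDK; the exact export call and column names are
# assumptions for illustration).
results = pd.read_csv("experiment_results.csv")  # e.g. test_case_id, evaluator, score

# Aggregate statistics per evaluator, then drill into the weakest test cases.
summary = results.groupby("evaluator")["score"].agg(["mean", "std", "count"])
print(summary)

worst = results.sort_values("score").head(10)
print(worst[["test_case_id", "evaluator", "score"]])
```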
Common Evaluation Workflows
These quick starts support various evaluation scenarios:
Pre-Production Testing
Regression Testing: Run comprehensive test suites before deployment
Quality Gates: Set score thresholds that must be met
Version Validation: Compare model versions on the same datasets
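A quality gate can be as small as the CI-style sketch below, which assumes per-evaluator mean scores have already been computed (for example from an exported results file):

```python
import sys

# Minimal CI quality gate: fail the pipeline if any metric falls below its
# threshold. The metric names and scores here are illustrative; in practice
# the scores come from your evaluation run or exported results.
THRESHOLDS = {"answer_relevance": 0.80, "faithfulness": 0.85, "toxicity_free": 0.99}

def check_quality_gate(mean_scores: dict) -> bool:
    failures = {
        name: (score, THRESHOLDS[name])
        for name, score in mean_scores.items()
        if name in THRESHOLDS and score < THRESHOLDS[name]
    }
    for name, (score, threshold) in failures.items():
        print(f"FAIL {name}: {score:.2f} < {threshold:.2f}")
    return not failures

if __name__ == "__main__":
    scores = {"answer_relevance": 0.91, "faithfulness": 0.82, "toxicity_free": 1.00}
    sys.exit(0 if check_quality_gate(scores) else 1)
```

The same check can run as a pytest assertion or a standalone pipeline step, so a regression in any metric blocks the deployment.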
Model & Prompt Optimization
A/B Testing: Compare prompt variations quantitatively
Model Selection: Evaluate multiple LLMs on the same tasks
Hyperparameter Tuning: Test temperature, top-p, and other configs
RAG System Evaluation
Faithfulness Checking: Verify responses are grounded in context
Context Relevance: Assess quality of retrieved documents
Source Attribution: Validate proper citation of sources
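To make the faithfulness idea concrete, here is a toy grounding heuristic. It is a much cruder stand-in for Fiddler's Faithfulness metric, shown only to illustrate what "grounded in context" means:

```python
import re

def ungrounded_sentences(response: str, context: str, min_overlap: int = 2) -> list:
    """Toy grounding check (NOT Fiddler's Faithfulness metric): flag response
    sentences that share fewer than `min_overlap` content words with the
    retrieved context."""
    context_words = set(re.findall(r"[a-z]{4,}", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = set(re.findall(r"[a-z]{4,}", sentence.lower()))
        if sentence and len(words & context_words) < min_overlap:
            flagged.append(sentence)
    return flagged

context = "Our refund policy allows returns within 30 days of delivery for unused items."
response = "You can return unused items within 30 days. Shipping upgrades are always free."
print(ungrounded_sentences(response, context))  # flags the second, unsupported sentence
```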
Safety & Compliance
Adversarial Testing: Test with jailbreak attempts and prompt injections
Content Moderation: Measure toxicity, bias, and PII exposure
Policy Validation: Ensure outputs meet organizational standards
From Development to Production
Fiddler Evals integrates seamlessly with production monitoring:
Unified Workflow Benefits:
Consistent Metrics: Same evaluators in development and production
Continuous Learning: Production insights feed back into test datasets
Seamless Transition: Deploy with confidence—monitoring matches testing
Complete AI Lifecycle:
Build → Design and instrument your applications
Test → Evaluate with Fiddler Evals (these quick starts)
Monitor → Track production with Agentic Monitoring
Improve → Refine based on insights
Learn more about Fiddler's end-to-end agentic AI lifecycle.
Getting Started Checklist
Ready to evaluate your LLM applications?
Additional Resources
Learn More:
Evaluations Overview - Comprehensive guide to Fiddler Evals
Evals SDK Advanced Guide - Production patterns
Fiddler Evals SDK Reference - Complete API documentation
Evaluations Glossary - Key terminology
Example Notebooks:
Related Capabilities:
Agentic Monitoring - Production agent observability
LLM Monitoring - Production LLM tracking
Guardrails - Real-time safety validation
❓ Questions? Talk to a product expert or request a demo.
💡 Need help? Contact us at [email protected].