Compare LLM Outputs
Overview
Choosing the right LLM is critical to your application's success, but comparing models effectively takes more than intuition. This quick start demonstrates how to use Fiddler's pre-production evaluation environment to perform systematic, side-by-side comparisons of different LLMs using the same prompts and consistent evaluation metrics.
What You'll Learn
In this hands-on notebook guide, you'll learn how to:
Upload model outputs from different LLMs (GPT-3.5 and Claude) to Fiddler's pre-production environment
Define a consistent evaluation schema that works across multiple models
Apply Fiddler enrichments for automated quality assessment:
Faithfulness - Detect hallucinations in RAG responses
Sentiment Analysis - Understand response tone
PII Detection - Identify privacy risks
Embeddings - Track semantic patterns and outliers
Build comparison dashboards using metric cards to visualize differences
Make data-driven decisions about which model best fits your needs
Why Compare in Pre-Production?
Pre-production evaluation lets you test models before committing to production deployment:
Apples-to-apples comparison - Same prompts, same metrics, consistent evaluation
Cost optimization - Identify the most cost-effective model that meets quality requirements
Risk assessment - Understand safety and quality trade-offs before production
Informed decisions - Replace guesswork with quantitative evidence
What You'll Build
By the end of this guide, you'll have:
Two datasets uploaded to Fiddler (GPT-3.5 and Claude outputs)
Consistent enrichments applied to both datasets for fair comparison
Metric card dashboards showing side-by-side model performance
Clear insights into which model performs better for your use case
Prerequisites
Before you begin:
Fiddler account with access to create projects
Python environment with Jupyter notebook support
Fiddler Python client (installed via pip)
Sample data (provided in the notebook from Fiddler examples repository)
Time to Complete
~15 minutes - Follow the interactive notebook for a guided experience
Get Started
Choose your preferred environment:
Option 1: Google Colab (Recommended)
Run the notebook directly in your browser with zero setup:
Option 2: Local Jupyter Notebook
Download and run locally in your own environment:
Workflow Overview
The notebook walks through this evaluation workflow (a condensed code sketch follows the step list below):
Step-by-Step Breakdown
Connect to Fiddler - Initialize the Python client with your credentials
Create Project - Set up a project to organize your evaluation
Load Model Outputs - Read sample datasets containing GPT-3.5 and Claude responses
Configure Enrichments - Enable automated quality metrics:
Text embeddings for semantic analysis
Faithfulness scoring for hallucination detection
Sentiment analysis for response tone
PII detection for privacy compliance
Define Schema - Specify input/output columns and metadata
Publish Datasets - Upload both model outputs to pre-production environment
Build Dashboards - Create metric cards for visual comparison
Compare & Decide - Analyze results to choose the best model
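The condensed sketch below covers the first of these steps (connect, create a project, load outputs, define a shared schema), assuming the Fiddler Python client 3.x interface. The project name, model name, file paths, and column names (question, response, source_docs) are placeholders; the notebook remains the authoritative code.

```python
# A minimal sketch, assuming the Fiddler Python client 3.x (pip install fiddler-client).
# Names, paths, and columns are placeholders; follow the notebook for the real code.
import pandas as pd
import fiddler as fdl

# Step 1: Connect to Fiddler with your instance URL and access token
fdl.init(url='https://your_company.fiddler.ai', token='YOUR_ACCESS_TOKEN')

# Step 2: Create a project to organize the evaluation
project = fdl.Project(name='llm_comparison')
project.create()

# Step 3: Load the sample model outputs, one DataFrame per model
gpt35_df = pd.read_csv('gpt35_outputs.csv')    # placeholder path
claude_df = pd.read_csv('claude_outputs.csv')  # placeholder path

# Step 4: Enrichment definitions (see the enrichment sketch in the next section);
# an empty list keeps this sketch runnable on its own.
enrichments = []

# Step 5: Define a schema shared by both datasets: prompt, response, retrieved
# source documents, plus any metadata you want to slice on.
model_spec = fdl.ModelSpec(
    inputs=['question', 'response', 'source_docs'],
    metadata=['session_id'],
    custom_features=enrichments,
)

model = fdl.Model.from_data(
    source=gpt35_df,
    name='llm_comparison',
    project_id=project.id,
    spec=model_spec,
    task=fdl.ModelTask.LLM,
)
model.create()
```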
Key Features Demonstrated
Fiddler Enrichments
The notebook demonstrates several powerful Fiddler enrichments (a configuration sketch follows the descriptions below):
Text Embeddings:
Generate semantic representations of prompts, responses, and source documents
Track outliers and drift in your evaluation datasets
Enable similarity-based analysis
Faithfulness Assessment:
Automatically detect hallucinations in RAG-based responses
Compare how well each model grounds responses in source documents
Critical for applications requiring factual accuracy
Sentiment & Safety:
Analyze response tone and sentiment
Detect PII leakage across models
Assess safety and compliance risks
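The sketch below shows roughly how these enrichments could be declared with the 3.x client. The enrichment identifiers ('embedding', 'faithfulness', 'sentiment', 'pii') and column names are illustrative assumptions; some enrichments, such as faithfulness, may require additional configuration keys. Check the notebook and the enrichments documentation for the exact names and options your Fiddler version supports.

```python
# A sketch of the enrichment configuration, assuming the Fiddler 3.x client.
# Enrichment identifiers and column names are illustrative, not authoritative.
import fiddler as fdl

enrichments = [
    # Embeddings: derive an embedding column from the response text, then
    # register it as a TextEmbedding custom feature for drift/outlier tracking
    fdl.Enrichment(
        name='Response Embedding',
        enrichment='embedding',
        columns=['response'],
    ),
    fdl.TextEmbedding(
        name='Response TextEmbedding',
        source_column='response',
        column='Response Embedding',
    ),
    # Faithfulness: does the response stay grounded in the source documents?
    fdl.Enrichment(
        name='Faithfulness',
        enrichment='faithfulness',
        columns=['response', 'source_docs'],
    ),
    # Sentiment of the model's response
    fdl.Enrichment(
        name='Response Sentiment',
        enrichment='sentiment',
        columns=['response'],
    ),
    # PII detection on the user prompt
    fdl.Enrichment(
        name='Question PII',
        enrichment='pii',
        columns=['question'],
    ),
]
```

Because the enrichment list is part of the model's schema, the same definitions apply to every dataset published against that model, which is what keeps the comparison fair.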
Pre-Production Comparison
Side-by-Side Analysis:
Upload multiple model outputs to the same project (see the publishing sketch after this list)
Apply identical enrichments for fair comparison
Visualize differences with metric cards
Data-Driven Decisions:
Quantitative metrics replace subjective judgment
Compare quality, safety, and consistency across models
Balance performance with cost considerations
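Continuing from the earlier sketches and assuming the 3.x client's publish interface, each model's outputs can be published as a separate pre-production dataset on the same Fiddler model, so identical enrichments run over both. Dataset names are placeholders.

```python
# A sketch of the side-by-side setup, continuing from the sketches above.
model.publish(
    source=gpt35_df,
    environment=fdl.EnvType.PRE_PRODUCTION,
    dataset_name='gpt35_outputs',
)
model.publish(
    source=claude_df,
    environment=fdl.EnvType.PRE_PRODUCTION,
    dataset_name='claude_outputs',
)
```

With both datasets in the same project, dashboard metric cards can be scoped to each dataset for a like-for-like view of faithfulness, sentiment, and the other enrichment metrics.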
Use Cases
This comparison approach works for:
Model selection - Choose among GPT-4, Claude, Llama, and other LLMs
Prompt optimization - Compare different prompt strategies on the same model
Version testing - Evaluate new model versions before upgrading
Cost optimization - Find the most cost-effective model meeting quality standards
RAG system tuning - Compare retrieval strategies and grounding effectiveness
Example Results
The notebook shows how to create comparison dashboards like this:

Metric cards provide at-a-glance comparison of faithfulness, sentiment, and other quality dimensions
What Happens Next?
After completing this quick start:
Analyze your results - Use metric cards to understand performance differences
Iterate on your evaluation - Add more test cases or different models
Scale to production - Deploy the winning model with confidence
Continue monitoring - Use Fiddler's production monitoring for ongoing quality tracking
Related Resources
Expand Your Evaluation Capabilities:
Evaluations SDK Quick Start - Build custom evaluation workflows
Prompt Specs Quick Start - Create custom LLM-as-a-Judge evaluators
Evaluations Overview - Comprehensive guide to Fiddler Evals
Production Deployment:
Agentic Monitoring - Monitor LLM applications in production
LLM Monitoring - Track production LLM performance
Guardrails - Add real-time safety validation
Need Help?
Questions about this guide?
Check out our documentation for detailed explanations
Contact your Fiddler team for personalized support
Explore more example notebooks on GitHub
Ready to compare your models? Click the Google Colab link above to get started in minutes!