Compare LLM Outputs

Overview

Choosing the right LLM is critical to your application's success, but comparing models effectively requires more than intuition. This quick start demonstrates how to use Fiddler's pre-production evaluation environment to perform systematic, side-by-side comparisons of different LLMs using the same prompts and consistent evaluation metrics.

What You'll Learn

In this hands-on notebook guide, you'll learn how to:

  • Upload model outputs from different LLMs (GPT-3.5 and Claude) to Fiddler's pre-production environment

  • Define a consistent evaluation schema that works across multiple models

  • Apply Fiddler enrichments for automated quality assessment:

    • Faithfulness - Detect hallucinations in RAG responses

    • Sentiment Analysis - Understand response tone

    • PII Detection - Identify privacy risks

    • Embeddings - Track semantic patterns and outliers

  • Build comparison dashboards using metric cards to visualize differences

  • Make data-driven decisions about which model best fits your needs

Why Compare in Pre-Production?

Pre-production evaluation lets you test models before committing to production deployment:

  • Apples-to-apples comparison - Same prompts, same metrics, consistent evaluation

  • Cost optimization - Identify the most cost-effective model that meets quality requirements

  • Risk assessment - Understand safety and quality trade-offs before production

  • Informed decisions - Replace guesswork with quantitative evidence

What You'll Build

By the end of this guide, you'll have:

  1. Two datasets uploaded to Fiddler (GPT-3.5 and Claude outputs)

  2. Consistent enrichments applied to both datasets for fair comparison

  3. Metric card dashboards showing side-by-side model performance

  4. Clear insights into which model performs better for your use case

Prerequisites

Before you begin:

  • Fiddler account with access to create projects

  • Python environment with Jupyter notebook support

  • Fiddler Python client (installed via pip)

  • Sample data (provided in the notebook from Fiddler examples repository)
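If you are running locally, a quick sanity check that the client is installed (the PyPI package name for the Fiddler Python client is fiddler-client):

```python
# Install once from PyPI, then confirm the client imports:
#   pip install fiddler-client
import fiddler as fdl

print(f"Fiddler client version: {fdl.__version__}")
```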

Time to Complete

~15 minutes - Follow the interactive notebook for a guided experience

Get Started

Choose your preferred environment:

Option 1: Google Colab

Run the notebook directly in your browser with zero setup:

Open in Google Colab →

Option 2: Local Jupyter Notebook

Download and run locally in your own environment:

Download from GitHub →

Workflow Overview

The notebook walks through the evaluation workflow below; a condensed code sketch follows the step list.

Step-by-Step Breakdown

  1. Connect to Fiddler - Initialize the Python client with your credentials

  2. Create Project - Set up a project to organize your evaluation

  3. Load Model Outputs - Read sample datasets containing GPT-3.5 and Claude responses

  4. Configure Enrichments - Enable automated quality metrics:

    • Text embeddings for semantic analysis

    • Faithfulness scoring for hallucination detection

    • Sentiment analysis for response tone

    • PII detection for privacy compliance

  5. Define Schema - Specify input/output columns and metadata

  6. Publish Datasets - Upload both model outputs to pre-production environment

  7. Build Dashboards - Create metric cards for visual comparison

  8. Compare & Decide - Analyze results to choose the best model
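The sketch below condenses steps 1-5, assuming the Fiddler Python client (3.x). The URL, token, project name, file names, and column names (question, source_docs, response) are placeholders; the notebook supplies the real values and the full enrichment list.

```python
import pandas as pd
import fiddler as fdl

# 1. Connect to Fiddler with your deployment URL and access token.
fdl.init(url='https://your_company.fiddler.ai', token='YOUR_ACCESS_TOKEN')

# 2. Create (or reuse) a project to organize the evaluation.
project = fdl.Project.get_or_create(name='llm_comparison')

# 3. Load the sample outputs: the same prompts answered by two different LLMs.
gpt35_df = pd.read_csv('gpt35_outputs.csv')
claude_df = pd.read_csv('claude_outputs.csv')

# 4-5. One schema and enrichment set, reused for both models so the
# comparison stays apples-to-apples. A single representative enrichment is
# shown here; the sections below cover embeddings, faithfulness, and PII.
spec = fdl.ModelSpec(
    inputs=['question', 'source_docs'],
    outputs=['response'],
    custom_features=[
        fdl.Enrichment(name='response_sentiment',
                       enrichment='sentiment',
                       columns=['response']),
    ],
)

model = fdl.Model.from_data(
    name='llm_comparison_model',
    project_id=project.id,
    source=gpt35_df,  # used only to infer column types
    spec=spec,
    task=fdl.ModelTask.LLM,
)
model.create()

# Steps 6-8 (publishing both datasets, dashboards, comparison) follow below.
```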

Key Features Demonstrated

Fiddler Enrichments

The notebook demonstrates several powerful Fiddler enrichments:

Text Embeddings:

  • Generate semantic representations of prompts, responses, and source documents

  • Track outliers and drift in your evaluation datasets

  • Enable similarity-based analysis
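As a sketch (assuming the Fiddler 3.x client), each text column gets an embedding enrichment paired with a TextEmbedding custom feature, which is what lets Fiddler track drift and outliers on the resulting vectors. Column names are illustrative:

```python
import fiddler as fdl

# Build an embedding for each text column and register it as a TextEmbedding
# feature so drift and outlier tracking are available on the vectors.
embedding_features = []
for text_col in ['question', 'response', 'source_docs']:
    embedding_col = f'{text_col}_embedding'
    embedding_features += [
        fdl.Enrichment(name=embedding_col,
                       enrichment='embedding',
                       columns=[text_col]),
        fdl.TextEmbedding(name=f'{text_col}_tve',
                          source_column=text_col,
                          column=embedding_col),
    ]

# Pass embedding_features into fdl.ModelSpec(custom_features=...) together
# with the other enrichments.
```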

Faithfulness Assessment:

  • Automatically detect hallucinations in RAG-based responses

  • Compare how well each model grounds responses in source documents

  • Critical for applications requiring factual accuracy
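A hedged sketch of how a faithfulness enrichment can be declared. The enrichment identifier and config keys below are assumptions based on Fiddler's enrichment documentation; the notebook is the authoritative reference for the exact names your client version expects.

```python
import fiddler as fdl

# Score each response for faithfulness against the retrieved source documents.
# NOTE: 'faithfulness', 'context_field', and 'response_field' are assumed
# names -- verify them against the notebook and the enrichment docs.
faithfulness = fdl.Enrichment(
    name='faithfulness',
    enrichment='faithfulness',
    columns=['source_docs', 'response'],
    config={
        'context_field': 'source_docs',
        'response_field': 'response',
    },
)
```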

Sentiment & Safety:

  • Analyze response tone and sentiment

  • Detect PII leakage across models

  • Assess safety and compliance risks
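A sketch of the sentiment and PII enrichments (column names are illustrative; the enrichment identifiers are assumed to match Fiddler's enrichment docs):

```python
import fiddler as fdl

# Tone of each model's responses.
sentiment = fdl.Enrichment(name='response_sentiment',
                           enrichment='sentiment',
                           columns=['response'])

# Flag personally identifiable information in prompts and responses.
pii = fdl.Enrichment(name='pii_check',
                     enrichment='pii',
                     columns=['question', 'response'])
```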

Pre-Production Comparison

Side-by-Side Analysis:

  • Upload multiple model outputs to the same project

  • Apply identical enrichments for fair comparison

  • Visualize differences with metric cards
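Continuing the workflow sketch above (reusing model, gpt35_df, and claude_df from it), both output sets are published to the same model's pre-production environment as separately named datasets, so the identical enrichment pipeline runs over each. The dataset names are placeholders:

```python
import fiddler as fdl

# `model`, `gpt35_df`, and `claude_df` come from the workflow sketch above.
model.publish(source=gpt35_df,
              environment=fdl.EnvType.PRE_PRODUCTION,
              dataset_name='gpt35_outputs')
model.publish(source=claude_df,
              environment=fdl.EnvType.PRE_PRODUCTION,
              dataset_name='claude_outputs')
```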

Data-Driven Decisions:

  • Quantitative metrics replace subjective judgment

  • Compare quality, safety, and consistency across models

  • Balance performance with cost considerations

Use Cases

This comparison approach works for:

  • Model selection - Choose between GPT-4, Claude, Llama, or other LLMs

  • Prompt optimization - Compare different prompt strategies on the same model

  • Version testing - Evaluate new model versions before upgrading

  • Cost optimization - Find the most cost-effective model meeting quality standards

  • RAG system tuning - Compare retrieval strategies and grounding effectiveness

Example Results

The notebook shows how to create comparison dashboards like this:

Example metric cards comparing GPT-3.5 and Claude performance

Metric cards provide an at-a-glance comparison of faithfulness, sentiment, and other quality dimensions

What Happens Next?

After completing this quick start:

  1. Analyze your results - Use metric cards to understand performance differences

  2. Iterate on your evaluation - Add more test cases or different models

  3. Scale to production - Deploy the winning model with confidence

  4. Continue monitoring - Use Fiddler's production monitoring for ongoing quality tracking

