Compare LLM Outputs

Overview

Choosing the right LLM is critical to your application's success, but comparing models effectively requires more than intuition. This quick start demonstrates how to use Fiddler's pre-production evaluation environment to perform systematic, side-by-side comparisons of different LLMs using the same prompts and consistent evaluation metrics.

What You'll Learn

In this hands-on notebook guide, you'll learn how to:

  • Upload model outputs from different LLMs (GPT-3.5 and Claude) to Fiddler's pre-production environment

  • Define a consistent evaluation schema that works across multiple models

  • Apply Fiddler enrichments for automated quality assessment:

    • Faithfulness - Detect hallucinations in RAG responses

    • Sentiment Analysis - Understand response tone

    • PII Detection - Identify privacy risks

    • Embeddings - Track semantic patterns and outliers

  • Build comparison dashboards using metric cards to visualize differences

  • Make data-driven decisions about which model best fits your needs

Why Compare in Pre-Production?

Pre-production evaluation lets you test models before committing to production deployment:

  • Apples-to-apples comparison - Same prompts, same metrics, consistent evaluation

  • Cost optimization - Identify the most cost-effective model that meets quality requirements

  • Risk assessment - Understand safety and quality trade-offs before production

  • Informed decisions - Replace guesswork with quantitative evidence

What You'll Build

By the end of this guide, you'll have:

  1. Two datasets uploaded to Fiddler (GPT-3.5 and Claude outputs)

  2. Consistent enrichments applied to both datasets for fair comparison

  3. Metric card dashboards showing side-by-side model performance

  4. Clear insights into which model performs better for your use case

Prerequisites

Before you begin:

  • Fiddler account with access to create projects

  • Python environment with Jupyter notebook support

  • Fiddler Python client (installed via pip)

  • Sample data (provided in the notebook from Fiddler examples repository)
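If you are running locally, a quick sanity check that the client is installed (the PyPI package name for the Fiddler Python client is fiddler-client):

```python
# Install once from PyPI, then confirm the client imports:
#   pip install fiddler-client
import fiddler as fdl

print(f"Fiddler client version: {fdl.__version__}")
```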

Time to Complete

~15 minutes - Follow the interactive notebook for a guided experience

Get Started

Choose your preferred environment:

Option 1: Google Colab

Run the notebook directly in your browser with zero setup:

Open in Google Colab →

Option 2: Local Jupyter Notebook

Download and run locally in your own environment:

Download from GitHub →

Workflow Overview

The notebook walks through the evaluation workflow below; a condensed code sketch follows the step list.

Step-by-Step Breakdown

  1. Connect to Fiddler - Initialize the Python client with your credentials

  2. Create Project - Set up a project to organize your evaluation

  3. Load Model Outputs - Read sample datasets containing GPT-3.5 and Claude responses

  4. Configure Enrichments - Enable automated quality metrics:

    • Text embeddings for semantic analysis

    • Faithfulness scoring for hallucination detection

    • Sentiment analysis for response tone

    • PII detection for privacy compliance

  5. Define Schema - Specify input/output columns and metadata

  6. Publish Datasets - Upload both model outputs to pre-production environment

  7. Build Dashboards - Create metric cards for visual comparison

  8. Compare & Decide - Analyze results to choose the best model
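The sketch below condenses steps 1-5, assuming the Fiddler Python client (3.x). The URL, token, project name, file names, and column names (question, source_docs, response) are placeholders; the notebook supplies the real values and the full enrichment list.

```python
import pandas as pd
import fiddler as fdl

# 1. Connect to Fiddler with your deployment URL and access token.
fdl.init(url='https://your_company.fiddler.ai', token='YOUR_ACCESS_TOKEN')

# 2. Create (or reuse) a project to organize the evaluation.
project = fdl.Project.get_or_create(name='llm_comparison')

# 3. Load the sample outputs: the same prompts answered by two different LLMs.
gpt35_df = pd.read_csv('gpt35_outputs.csv')
claude_df = pd.read_csv('claude_outputs.csv')

# 4-5. One schema and enrichment set, reused for both models so the
# comparison stays apples-to-apples. A single representative enrichment is
# shown here; the sections below cover embeddings, faithfulness, and PII.
spec = fdl.ModelSpec(
    inputs=['question', 'source_docs'],
    outputs=['response'],
    custom_features=[
        fdl.Enrichment(name='response_sentiment',
                       enrichment='sentiment',
                       columns=['response']),
    ],
)

model = fdl.Model.from_data(
    name='llm_comparison_model',
    project_id=project.id,
    source=gpt35_df,  # used only to infer column types
    spec=spec,
    task=fdl.ModelTask.LLM,
)
model.create()

# Steps 6-8 (publishing both datasets, dashboards, comparison) follow below.
```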

Key Features Demonstrated

Fiddler Enrichments

The notebook demonstrates several powerful Fiddler enrichments:

Text Embeddings:

  • Generate semantic representations of prompts, responses, and source documents

  • Track outliers and drift in your evaluation datasets

  • Enable similarity-based analysis
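As a sketch (assuming the Fiddler 3.x client), each text column gets an embedding enrichment paired with a TextEmbedding custom feature, which is what lets Fiddler track drift and outliers on the resulting vectors. Column names are illustrative:

```python
import fiddler as fdl

# Build an embedding for each text column and register it as a TextEmbedding
# feature so drift and outlier tracking are available on the vectors.
embedding_features = []
for text_col in ['question', 'response', 'source_docs']:
    embedding_col = f'{text_col}_embedding'
    embedding_features += [
        fdl.Enrichment(name=embedding_col,
                       enrichment='embedding',
                       columns=[text_col]),
        fdl.TextEmbedding(name=f'{text_col}_tve',
                          source_column=text_col,
                          column=embedding_col),
    ]

# Pass embedding_features into fdl.ModelSpec(custom_features=...) together
# with the other enrichments.
```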

Faithfulness Assessment:

  • Automatically detect hallucinations in RAG-based responses

  • Compare how well each model grounds responses in source documents

  • Critical for applications requiring factual accuracy
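A hedged sketch of how a faithfulness enrichment can be declared. The enrichment identifier and config keys below are assumptions based on Fiddler's enrichment documentation; the notebook is the authoritative reference for the exact names your client version expects.

```python
import fiddler as fdl

# Score each response for faithfulness against the retrieved source documents.
# NOTE: 'faithfulness', 'context_field', and 'response_field' are assumed
# names -- verify them against the notebook and the enrichment docs.
faithfulness = fdl.Enrichment(
    name='faithfulness',
    enrichment='faithfulness',
    columns=['source_docs', 'response'],
    config={
        'context_field': 'source_docs',
        'response_field': 'response',
    },
)
```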

Sentiment & Safety:

  • Analyze response tone and sentiment

  • Detect PII leakage across models

  • Assess safety and compliance risks
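A sketch of the sentiment and PII enrichments (column names are illustrative; the enrichment identifiers are assumed to match Fiddler's enrichment docs):

```python
import fiddler as fdl

# Tone of each model's responses.
sentiment = fdl.Enrichment(name='response_sentiment',
                           enrichment='sentiment',
                           columns=['response'])

# Flag personally identifiable information in prompts and responses.
pii = fdl.Enrichment(name='pii_check',
                     enrichment='pii',
                     columns=['question', 'response'])
```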

Pre-Production Comparison

Side-by-Side Analysis:

  • Upload multiple model outputs to the same project

  • Apply identical enrichments for fair comparison

  • Visualize differences with metric cards
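Continuing the workflow sketch above (reusing model, gpt35_df, and claude_df from it), both output sets are published to the same model's pre-production environment as separately named datasets, so the identical enrichment pipeline runs over each. The dataset names are placeholders:

```python
import fiddler as fdl

# `model`, `gpt35_df`, and `claude_df` come from the workflow sketch above.
model.publish(source=gpt35_df,
              environment=fdl.EnvType.PRE_PRODUCTION,
              dataset_name='gpt35_outputs')
model.publish(source=claude_df,
              environment=fdl.EnvType.PRE_PRODUCTION,
              dataset_name='claude_outputs')
```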

Data-Driven Decisions:

  • Quantitative metrics replace subjective judgment

  • Compare quality, safety, and consistency across models

  • Balance performance with cost considerations

Use Cases

This comparison approach works for:

  • Model selection - Choose between GPT-4, Claude, Llama, or other LLMs

  • Prompt optimization - Compare different prompt strategies on the same model

  • Version testing - Evaluate new model versions before upgrading

  • Cost optimization - Find the most cost-effective model meeting quality standards

  • RAG system tuning - Compare retrieval strategies and grounding effectiveness

Example Results

The notebook shows how to create comparison dashboards like this:

Example metric cards comparing GPT-3.5 and Claude performance

Metric cards provide an at-a-glance comparison of faithfulness, sentiment, and other quality dimensions

What Happens Next?

After completing this quick start:

  1. Analyze your results - Use metric cards to understand performance differences

  2. Iterate on your evaluation - Add more test cases or different models

  3. Scale to production - Deploy the winning model with confidence

  4. Continue monitoring - Use Fiddler's production monitoring for ongoing quality tracking

