Building Custom Judge Evaluators

Create domain-specific evaluators using CustomJudge to encode business rules, quality criteria, or classification tasks that built-in evaluators don't cover.

Use this cookbook when: You need evaluation criteria specific to your domain, such as topic classification, brand voice matching, compliance checking, or custom quality rubrics.

Time to complete: ~20 minutes


Prerequisites

  • Fiddler account with API access

  • LLM credential configured in Settings > LLM Gateway

  • pip install fiddler-evals pandas


1. Connect to Fiddler

Replace URL, TOKEN, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.

import pandas as pd
from fiddler_evals import init
from fiddler_evals.evaluators import CustomJudge

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'
LLM_MODEL_NAME = 'openai/gpt-4o'

init(url=URL, token=TOKEN)

2. Prepare Test Data

This example classifies news summaries into topics — Sci/Tech, Sports, Business, or World:

df = pd.DataFrame(
    [
        {
            'text': 'Google announces new AI chip designed to accelerate '
                'machine learning workloads.',
            'ground_truth': 'Sci/Tech',
        },
        {
            'text': 'The Lakers defeated the Celtics 112-108 in overtime, '
                'with LeBron James scoring 35 points.',
            'ground_truth': 'Sports',
        },
        {
            'text': 'Federal Reserve raises interest rates by 0.25% citing '
                'persistent inflation concerns.',
            'ground_truth': 'Business',
        },
        {
            'text': 'United Nations Security Council votes to impose new '
                'sanctions on North Korea.',
            'ground_truth': 'World',
        },
        {
            'text': 'Microsoft acquires gaming company Activision Blizzard '
                'for $69 billion.',
            'ground_truth': 'Sci/Tech',
        },
    ]
)

3. Create a CustomJudge

Define your evaluation criteria using a prompt_template with {{ placeholder }} markers and output_fields that describe the structured response:

simple_judge = CustomJudge(
    prompt_template="""
        Determine the topic of the given news summary.
        Pick one of: Sports, World, Sci/Tech, Business.

        News Summary:
        {{ news_summary }}
    """,
    output_fields={
        'topic': {'type': 'string'},
        'reasoning': {'type': 'string'},
    },
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
)

How It Works

  • prompt_template: Your evaluation prompt with {{ placeholder }} markers (Jinja syntax). Placeholders are filled from the inputs dict passed to .score().

  • output_fields: Schema defining the expected outputs. Each field specifies a type (string, boolean, integer, number) and optional choices or description.
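
For example, a single call to the judge above shows how an inputs key fills its placeholder and how each output field comes back as a score. The sample text is illustrative; the full evaluation loop follows in the next step:

# The 'news_summary' key fills the {{ news_summary }} placeholder in the prompt.
scores = simple_judge.score(inputs={'news_summary': 'NASA launches a new space telescope.'})

# One score is returned per output field defined above.
for score in scores:
    print(score.name, '->', score.label)  # e.g. topic -> 'Sci/Tech', plus the reasoning text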

4. Run Evaluator

results = []
for _, row in df.iterrows():
    scores = simple_judge.score(inputs={'news_summary': row['text']})
    scores_dict = {s.name: s for s in scores}
    results.append(
        {
            'ground_truth': row['ground_truth'],
            'predicted': scores_dict['topic'].label,
            'reasoning': scores_dict['reasoning'].label,
        }
    )

results_df = pd.DataFrame(results)
accuracy = (results_df['ground_truth'] == results_df['predicted']).mean()
print(f'Accuracy: {accuracy:.0%}')

# Show misclassified
misclassified = results_df[results_df['ground_truth'] != results_df['predicted']]
if len(misclassified) > 0:
    print(f'\nMisclassified ({len(misclassified)}):')
    for _, row in misclassified.iterrows():
        print(f'  Expected: {row["ground_truth"]}, Predicted: {row["predicted"]}')

Expected output:

Accuracy: 80%

Misclassified (1):
  Expected: Sci/Tech, Predicted: Business

5. Improve the Prompt

Add clearer topic guidelines and constrain outputs with choices:

improved_judge = CustomJudge(
    prompt_template="""
        Determine the topic of the given news summary.

        Use topic 'Sci/Tech' if the news summary is about a company or
        business in the tech industry, or if the news summary is about
        a scientific discovery or research, including health and medicine.
        Use topic 'Sports' if the news summary is about a sports event
        or athlete.
        Use topic 'Business' if the news summary is about a company or
        industry outside of science, technology, or sports.
        Use topic 'World' if the news summary is about a global event
        or issue.

        News Summary:
        {{ news_summary }}
    """,
    output_fields={
        'topic': {
            'type': 'string',
            'choices': ['Sci/Tech', 'Sports', 'Business', 'World'],
        },
        'reasoning': {'type': 'string'},
    },
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
)

Key improvements:

  • Explicit guidelines for each topic reduce ambiguity

  • choices constrains the LLM output to valid categories only

Compare Results

improved_results = []
for _, row in df.iterrows():
    scores = improved_judge.score(inputs={'news_summary': row['text']})
    scores_dict = {s.name: s for s in scores}
    improved_results.append(
        {
            'ground_truth': row['ground_truth'],
            'predicted': scores_dict['topic'].label,
        }
    )

improved_df = pd.DataFrame(improved_results)
original_accuracy = (results_df['ground_truth'] == results_df['predicted']).mean()
improved_accuracy = (improved_df['ground_truth'] == improved_df['predicted']).mean()

print(f'Simple prompt:   {original_accuracy:.0%}')
print(f'Improved prompt: {improved_accuracy:.0%}')

Expected output:

Simple prompt:   80%
Improved prompt: 100%

Output Field Types

CustomJudge supports four output field types:

Type      Description                                    Example Use
string    Free-form text or categorical (with choices)   Topic classification, reasoning
boolean   True/False                                     Compliance checks, binary quality gates
integer   Whole numbers                                  1-5 rating scales
number    Floating-point                                 0.0-1.0 confidence scores
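
For instance, a quality-rubric judge could combine several of these types in one output_fields schema. A minimal sketch; the field names below are illustrative:

rubric_output_fields = {
    'is_compliant': {'type': 'boolean'},    # binary quality gate
    'clarity_rating': {'type': 'integer'},  # e.g. a 1-5 rating scale
    'confidence': {'type': 'number'},       # e.g. a 0.0-1.0 confidence score
    'reasoning': {'type': 'string'},        # free-form explanation
}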

Using choices for Categorical Output
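
Adding choices to a string field restricts the judge to a fixed label set, as with the topic field above. A minimal sketch for a different classification task; the field name, labels, and prompt wording are illustrative:

sentiment_judge = CustomJudge(
    prompt_template="""
        Classify the sentiment of the customer message.

        Customer Message:
        {{ message }}
    """,
    output_fields={
        'sentiment': {
            'type': 'string',
            'choices': ['Positive', 'Neutral', 'Negative'],  # output limited to these labels
        },
        'reasoning': {'type': 'string'},
    },
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
)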

Using description to Guide the LLM
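
An optional description on an output field tells the LLM how to fill it. A minimal sketch; the field names and wording are illustrative:

rating_output_fields = {
    'helpfulness': {
        'type': 'integer',
        'description': 'Rate how helpful the response is on a 1-5 scale, where 5 is most helpful.',
    },
    'reasoning': {
        'type': 'string',
        'description': 'One or two sentences explaining the rating.',
    },
}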


Real-World Examples

Brand Voice Match

Evaluate whether generated content adheres to brand guidelines:
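
A minimal sketch of such a judge, assuming the brand guidelines are passed in alongside the generated response; all names and example text here are illustrative:

brand_voice_judge = CustomJudge(
    prompt_template="""
        Determine whether the response follows the brand voice guidelines.

        Brand Voice Guidelines:
        {{ guidelines }}

        Response:
        {{ response }}
    """,
    output_fields={
        'matches_brand_voice': {'type': 'boolean'},
        'reasoning': {'type': 'string'},
    },
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
)

# Score a single response against the guidelines.
scores = brand_voice_judge.score(
    inputs={
        'guidelines': 'Friendly, concise, and jargon-free.',
        'response': 'Hey there! Your order is on its way and should arrive Friday.',
    }
)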


Compliance Checking

Verify responses meet regulatory requirements:
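
A minimal sketch using a boolean gate plus an explanation field; the policy text, field names, and wording are illustrative:

compliance_judge = CustomJudge(
    prompt_template="""
        Check whether the response complies with the policy below.

        Policy:
        {{ policy }}

        Response:
        {{ response }}
    """,
    output_fields={
        'is_compliant': {'type': 'boolean'},
        'violation_reason': {
            'type': 'string',
            'description': 'If not compliant, briefly describe the violation; otherwise say "None".',
        },
    },
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
)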


Next Steps


Source notebook: Fiddler Cookbook: Custom Judge Evaluators


Questions? Talk to a product expert or request a demo.

💡 Need help? Contact us at [email protected].