CustomJudge

Create a fully customizable LLM-as-a-Judge evaluator with your own prompt and output schema. The CustomJudge evaluator allows you to define arbitrary evaluation criteria by specifying a custom prompt template and structured output fields. This is the most flexible evaluator in the Fiddler Evals SDK, enabling you to build domain-specific evaluation logic without writing custom code.

Key Features:

Custom Prompts: Define your own evaluation prompt with {{ placeholder }} syntax
Structured Outputs: Specify typed output fields (string, boolean, integer, number)
Categorical Choices: Constrain string outputs to predefined categories
Multi-Field Outputs: Return multiple scores/labels from a single evaluation
Field Descriptions: Guide the LLM with descriptions for each output field
Numeric Constraints: Set minimum/maximum bounds on numeric output fields
Multi-Message Prompts: Use structured message lists with system/user/assistant roles
Input Metadata: Define input field requirements and documentation
Output Transforms: Map LLM response fields to final output fields with value mapping
Intermediate Response Schema: Define a separate LLM response schema with transforms
CustomJudgeSpec Object: Bundle prompt, inputs, and outputs into a reusable CustomJudgeSpec

Use Cases:

Domain-Specific Evaluation: Create evaluators tailored to your industry or use case
Custom Rubrics: Implement grading rubrics with specific criteria
Multi-Aspect Scoring: Evaluate multiple dimensions (e.g., tone, accuracy, helpfulness)
Classification Tasks: Categorize responses into predefined labels
Compliance Checking: Verify responses meet specific guidelines or policies

Output Field Types:

string: Free-form text output, or categorical if choices is specified
boolean: True/False classification
integer: Whole number scores (e.g., 1-5 rating scale)
number: Floating-point scores (e.g., 0.0-1.0 confidence)

Parameters

prompt_template: str | list[Message], optional

default:"None"

The evaluation prompt. Can be either a plain string with {{ placeholder }} markers (wrapped in a single user message automatically) or a list of Message dicts for multi-message prompts. Required unless prompt_spec is provided.
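As a rough illustration of the two accepted prompt forms, the sketch below normalizes either form into a message list and collects the placeholder names. The helpers `normalize_prompt` and `find_placeholders` are hypothetical and are not part of the Fiddler Evals SDK; the SDK's own parsing may differ, for example in which characters it allows in placeholder names.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
# Restricting names to identifier characters is an assumption.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def normalize_prompt(prompt):
    """Return a list of message dicts for either accepted prompt form."""
    if isinstance(prompt, str):
        # A plain string becomes a single user message.
        return [{"role": "user", "content": prompt}]
    return [dict(message) for message in prompt]


def find_placeholders(prompt):
    """Collect placeholder names in order of first appearance."""
    names = []
    for message in normalize_prompt(prompt):
        for name in PLACEHOLDER_RE.findall(message["content"]):
            if name not in names:
                names.append(name)
    return names


string_prompt = "Question: {{ question }}\nResponse: {{ response }}"
message_prompt = [
    {"role": "system", "content": "You are an expert code reviewer."},
    {"role": "user", "content": "Review this code:\n{{ code }}"},
]

print(normalize_prompt(string_prompt)[0]["role"])  # user
print(find_placeholders(string_prompt))            # ['question', 'response']
print(find_placeholders(message_prompt))           # ['code']
```

Listing the placeholders this way before building an evaluator is a cheap check that the keys you plan to pass at scoring time line up with the template.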

output_fields: Dict[str, OutputField], optional

default:"None"

Schema defining the expected outputs. Required unless prompt_spec is provided. Each field has:

type: One of 'string', 'boolean', 'integer', 'number'
choices (optional): List of allowed values for categorical string fields
description (optional): Instructions for the LLM about this field
title (optional): Human-readable title for the field
transform (optional): Transform from LLM response field to output field
default (optional): Default value if field is missing from LLM response
minimum (optional): Minimum allowed value for numeric fields
maximum (optional): Maximum allowed value for numeric fields
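To make the field properties above concrete, here is a minimal sketch of how a single value could be checked against an output field definition. `validate_output` is a hypothetical helper written for this illustration; it does not reproduce the SDK's validation, and the exception types are assumptions.

```python
def validate_output(value, field):
    """Check one value against an output field spec; return it or raise."""
    expected = field["type"]
    if expected == "string":
        ok = isinstance(value, str)
    elif expected == "boolean":
        ok = isinstance(value, bool)
    elif expected == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif expected == "number":
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        raise ValueError(f"unknown field type: {expected!r}")
    if not ok:
        raise TypeError(f"expected {expected}, got {type(value).__name__}")

    # Categorical and numeric constraints apply only when declared.
    if "choices" in field and value not in field["choices"]:
        raise ValueError(f"{value!r} is not one of {field['choices']}")
    if "minimum" in field and value < field["minimum"]:
        raise ValueError(f"{value} is below minimum {field['minimum']}")
    if "maximum" in field and value > field["maximum"]:
        raise ValueError(f"{value} is above maximum {field['maximum']}")
    return value


quality = {"type": "integer", "minimum": 1, "maximum": 5}
sentiment = {"type": "string", "choices": ["positive", "negative", "neutral"]}

print(validate_output(4, quality))            # 4
print(validate_output("neutral", sentiment))  # neutral

try:
    validate_output(7, quality)
except ValueError as exc:
    print(exc)  # 7 is above maximum 5
```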

prompt_spec: CustomJudgeSpec, optional

default:"None"

A CustomJudgeSpec object bundling prompt_template, output_fields, inputs, and llm_response_fields into a single reusable specification. Mutually exclusive with providing prompt_template and output_fields directly.

model: str

required

LLM Gateway model name in {provider}/{model} format. E.g., openai/gpt-4o, anthropic/claude-3-sonnet
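The {provider}/{model} format can be checked up front with a few lines of plain Python. `split_model_name` is a hypothetical helper for illustration only; splitting on the first slash, so that the model part may itself contain slashes, is an assumption rather than documented SDK behavior.

```python
def split_model_name(name):
    """Split 'provider/model' on the first slash and reject malformed names."""
    provider, sep, model = name.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '{{provider}}/{{model}}', got {name!r}")
    return provider, model


print(split_model_name("openai/gpt-4o"))              # ('openai', 'gpt-4o')
print(split_model_name("anthropic/claude-3-sonnet"))  # ('anthropic', 'claude-3-sonnet')
```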

credential: str, optional

Name of the LLM Gateway credential for the provider.

inputs: Dict[str, InputFieldSpec], optional

default:"None"

Metadata for template variables. Keys must match {{ placeholder }} names in the prompt template. Each value can specify:

title (optional): Human-readable title
description (optional): Description of the input
required (optional): Whether this input must be provided (default: False)
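One way to picture how this metadata could be used is a pre-flight check of the values you intend to pass at scoring time. The sketch below is an assumption-laden illustration: `check_inputs` is a hypothetical helper, and whether the SDK itself reports unexpected keys is not stated here.

```python
def check_inputs(input_specs, values):
    """Return problems found when matching supplied values to input specs."""
    problems = []
    for name, spec in input_specs.items():
        # Inputs are optional unless the spec marks them as required.
        if spec.get("required", False) and name not in values:
            problems.append(f"missing required input: {name}")
    for name in values:
        # Flagging unknown keys is an assumption made for this sketch.
        if name not in input_specs:
            problems.append(f"unexpected input: {name}")
    return problems


input_specs = {
    "code": {"required": True, "description": "The code to review"},
    "language": {"title": "Language", "description": "Programming language"},
}

print(check_inputs(input_specs, {"code": "print('hi')"}))
# []
print(check_inputs(input_specs, {"language": "python", "notes": "n/a"}))
# ['missing required input: code', 'unexpected input: notes']
```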

llm_response_fields: Dict[str, OutputField], optional

default:"None"

Schema for the LLM response before transformation. When provided, the LLM is instructed to return fields matching this schema, and output_fields with transform specs define how to map the response to final outputs. Required when any output field uses a transform.
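The mapping from LLM response to final outputs can be sketched in plain Python. The keys `source_field` and `value_map` are taken from the transform example on this page; the helper `apply_transforms`, the pass-through of unmapped values, and the fallback to `default` are assumptions made for illustration and may not match the SDK.

```python
def apply_transforms(llm_response, output_fields):
    """Build final outputs from a raw LLM response dict."""
    outputs = {}
    for name, field in output_fields.items():
        transform = field.get("transform")
        if transform is None:
            # No transform: copy the same-named response field, if present.
            if name in llm_response:
                outputs[name] = llm_response[name]
            elif "default" in field:
                outputs[name] = field["default"]
            continue
        raw = llm_response[transform["source_field"]]
        value_map = transform.get("value_map", {})
        # Values without a mapping entry pass through unchanged (assumption).
        outputs[name] = value_map.get(raw, raw)
    return outputs


output_fields = {
    "label": {
        "type": "string",
        "choices": ["yes", "no"],
        "transform": {
            "source_field": "is_faithful",
            "value_map": {"faithful": "yes", "not_faithful": "no"},
        },
    },
    "reasoning": {"type": "string"},
}
llm_response = {"is_faithful": "faithful", "reasoning": "Matches the source."}

print(apply_transforms(llm_response, output_fields))
# {'label': 'yes', 'reasoning': 'Matches the source.'}
```

Keeping the LLM-facing vocabulary ("faithful") separate from the reported label ("yes") lets you reword the prompt without changing what downstream consumers see.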

Returns

A list of Score objects, one for each output field defined. Each Score contains:

name: The output field name (e.g., "sentiment", "confidence")
value: The numeric value (for number/integer/boolean fields)
label: The string label (for string/categorical fields)
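reasoning: Always None for CustomJudge. To capture the judge's explanation, define it as an output field (for example, a string field named reasoning); it is then returned as its own Score.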
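The split between value and label can be pictured with the sketch below. The `Score` dataclass is a stand-in that carries only the four attributes listed above; it is not the SDK class, and `to_scores` and `display` are hypothetical helpers. Whether the SDK stores booleans as `True`/`False` or as numbers is also an assumption here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Score:
    """Stand-in with the four attributes described above (not the SDK class)."""
    name: str
    value: Optional[Union[bool, int, float]] = None
    label: Optional[str] = None
    reasoning: Optional[str] = None


def to_scores(outputs, output_fields):
    """Route each output to `value` or `label` based on its declared type."""
    scores = []
    for name, field in output_fields.items():
        if field["type"] == "string":
            scores.append(Score(name=name, label=outputs[name]))
        else:
            scores.append(Score(name=name, value=outputs[name]))
    return scores


def display(score):
    """Prefer `value`, but keep falsy values such as 0 or False."""
    return score.value if score.value is not None else score.label


fields = {"helpful": {"type": "boolean"}, "sentiment": {"type": "string"}}
scores = to_scores({"helpful": False, "sentiment": "negative"}, fields)

for score in scores:
    print(f"{score.name}: {display(score)}")
# helpful: False
# sentiment: negative
```

The explicit `is not None` test matters: a shortcut such as `score.value or score.label` would discard a legitimate `False` or `0` and print `None` instead.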

Example

Basic sentiment analysis with categorical output:

from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="""
        Analyze the sentiment of the following customer review:

        Review: {{ review_text }}

        Classify the sentiment and explain your reasoning.
    """,
    output_fields={
        "sentiment": {
            "type": "string",
            "choices": ["positive", "negative", "neutral"],
        },
        "confidence": {
            "type": "number",
            "description": "Confidence score between 0 and 1"
        },
        "reasoning": {
            "type": "string",
        }
    }
)

scores = evaluator.score(inputs={
    "review_text": "The product exceeded my expectations! Fast shipping too."
})

# Access individual scores by index or iterate
for score in scores:
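    # Use an explicit None check so values such as 0 or False are not skipped
    result = score.value if score.value is not None else score.label
    print(f"{score.name}: {result}")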
# Output:
# sentiment: positive
# confidence: 0.95
# reasoning: The review expresses satisfaction...

Example

Multi-criteria response quality evaluation:

from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="anthropic/claude-3-sonnet",
    credential="my-anthropic-key",
    prompt_template="""
        Evaluate the quality of this customer support response.

        Customer Question: {{ question }}
        Support Response: {{ response }}

        Rate the response on multiple criteria.
    """,
    output_fields={
        "helpful": {
            "type": "boolean",
            "description": "Does the response address the customer's question?"
        },
        "professional_tone": {
            "type": "boolean",
            "description": "Is the tone professional and courteous?"
        },
        "quality_score": {
            "type": "integer",
            "description": "Overall quality rating from 1 (poor) to 5 (excellent)"
        }
    }
)

scores = evaluator.score(inputs={
    "question": "How do I reset my password?",
    "response": "Click 'Forgot Password' on the login page and follow the steps."
})

# Convert to dict for easy access
scores_dict = {s.name: s for s in scores}
print(f"Helpful: {scores_dict['helpful'].value}")  # True
print(f"Quality: {scores_dict['quality_score'].value}")  # 4

Example

Code review evaluator:

from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="""
        Review this code change for potential issues:

        ```{{ language }}
        {{ code_diff }}
        ```

        Context: {{ pr_description }}
    """,
    output_fields={
        "has_bugs": {
            "type": "boolean",
            "description": "Are there any obvious bugs or logic errors?"
        },
        "severity": {
            "type": "string",
            "choices": ["critical", "major", "minor", "none"],
            "description": "Severity of issues found"
        },
        "feedback": {
            "type": "string",
            "description": "Specific feedback for the code author"
        }
    }
)

Example

Multi-message prompt with system instructions and numeric constraints:

from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template=[
        {"role": "system", "content": "You are an expert code reviewer."},
        {"role": "user", "content": "Review this code:\n{{ code }}"},
    ],
    output_fields={
        "quality_score": {
            "type": "integer",
            "minimum": 1,
            "maximum": 10,
            "description": "Code quality score from 1 to 10"
        },
        "feedback": {
            "type": "string",
            "description": "Specific feedback for the code author"
        }
    },
    inputs={
        "code": {"required": True, "description": "The code to review"}
    }
)

Example

Using llm_response_fields with transforms for value mapping:

from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="Is the response faithful? Response: {{ response }}",
    llm_response_fields={
        "is_faithful": {
            "type": "string",
            "choices": ["faithful", "not_faithful"],
        },
        "reasoning": {"type": "string"},
    },
    output_fields={
        "label": {
            "type": "string",
            "choices": ["yes", "no"],
            "transform": {
                "source_field": "is_faithful",
                "value_map": {"faithful": "yes", "not_faithful": "no"},
            },
        },
        "reasoning": {"type": "string"},
    },
)

Example

Using a reusable CustomJudgeSpec object:

from fiddler_evals.evaluators.custom_judge import (
    CustomJudge, CustomJudgeSpec, Message, InputFieldSpec,
)

spec = CustomJudgeSpec(
    prompt_template=[
        Message(role="system", content="You are an expert evaluator."),
        Message(role="user", content="Rate this response:\n{{ response }}"),
    ],
    inputs={"response": InputFieldSpec(required=True)},
    output_fields={
        "quality": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5,
            "description": "Quality rating from 1 to 5",
        }
    },
)

evaluator = CustomJudge(prompt_spec=spec, model="openai/gpt-4o")

Notes:

Placeholder names in {{ }} must exactly match keys in the inputs dict
The LLM is instructed to return JSON matching your output schema
For best results, include clear descriptions for each output field
Use choices for categorical fields to ensure consistent outputs
Use minimum/maximum for numeric fields to constrain values
Use CustomJudgeSpec to bundle prompt configuration into a reusable object
This evaluator requires an active connection to the Fiddler API
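To give a feel for what "JSON matching your output schema" could look like, the sketch below translates an output field dict into a JSON Schema object. This is an illustration under assumptions: `to_json_schema` is a hypothetical helper, and the schema the SDK actually sends to the LLM is not documented here. It relies only on the fact that the four field types and the minimum/maximum properties share their names with standard JSON Schema keywords.

```python
import json


def to_json_schema(output_fields):
    """Translate output field specs into a JSON Schema object definition."""
    properties = {}
    for name, field in output_fields.items():
        prop = {"type": field["type"]}
        if "choices" in field:
            # Categorical string fields map naturally onto `enum`.
            prop["enum"] = list(field["choices"])
        # These keys share their names with JSON Schema keywords.
        for key in ("title", "description", "minimum", "maximum", "default"):
            if key in field:
                prop[key] = field[key]
        properties[name] = prop
    return {
        "type": "object",
        "properties": properties,
        # Treating fields without a default as required is an assumption.
        "required": [n for n, f in output_fields.items() if "default" not in f],
        "additionalProperties": False,
    }


schema = to_json_schema({
    "sentiment": {
        "type": "string",
        "choices": ["positive", "negative", "neutral"],
    },
    "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Confidence score between 0 and 1",
    },
})

print(json.dumps(schema, indent=2))
```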

name = 'custom_judge'

score()

Score using the Custom Judge.

Parameters

inputs: Dict[str, Any]

required

Values for the {{ placeholders }} in your prompt_template. Keys must match placeholder names exactly.

Returns

A list of Score objects, one for each output field defined.

Raises

ValueError – If inputs is empty.
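When scoring many rows, it can help to catch this error per row so that one empty row does not stop the batch. The sketch below uses `StubJudge`, a stand-in class that mimics only the documented empty-input behavior and makes no network calls; it and `score_rows` are hypothetical and not part of the SDK. Any object with a compatible `score(inputs=...)` method could be passed instead.

```python
class StubJudge:
    """Stand-in evaluator; mimics only the documented empty-input error."""

    def score(self, inputs):
        if not inputs:
            raise ValueError("inputs must not be empty")
        # A real evaluator would return Score objects here.
        return [f"scored {sorted(inputs)}"]


def score_rows(evaluator, rows):
    """Score each row, recording the error message for rows that fail."""
    results = []
    for index, row in enumerate(rows):
        try:
            results.append((index, evaluator.score(inputs=row), None))
        except ValueError as exc:
            results.append((index, None, str(exc)))
    return results


rows = [{"response": "Click 'Forgot Password' on the login page."}, {}]
for index, scores, error in score_rows(StubJudge(), rows):
    print(index, scores if error is None else f"skipped: {error}")
# 0 ["scored ['response']"]
# 1 skipped: inputs must not be empty
```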
