Create a fully customizable LLM-as-a-Judge evaluator with your own prompt and output schema.

The CustomJudge evaluator lets you define arbitrary evaluation criteria by specifying a custom prompt template and structured output fields. It is the most flexible evaluator in the Fiddler Evals SDK and lets you build domain-specific evaluation logic without writing custom code.

Key Features:
  • Custom Prompts: Define your own evaluation prompt with {{ placeholder }} syntax
  • Structured Outputs: Specify typed output fields (string, boolean, integer, number)
  • Categorical Choices: Constrain string outputs to predefined categories
  • Multi-Field Outputs: Return multiple scores/labels from a single evaluation
  • Field Descriptions: Guide the LLM with descriptions for each output field
  • Numeric Constraints: Set minimum/maximum bounds on numeric output fields
  • Multi-Message Prompts: Use structured message lists with system/user/assistant roles
  • Input Metadata: Define input field requirements and documentation
  • Output Transforms: Map LLM response fields to final output fields with value mapping
  • Intermediate Response Schema: Define a separate LLM response schema with transforms
  • CustomJudgeSpec Object: Bundle prompt, inputs, and outputs into a reusable CustomJudgeSpec
Use Cases:
  • Domain-Specific Evaluation: Create evaluators tailored to your industry or use case
  • Custom Rubrics: Implement grading rubrics with specific criteria
  • Multi-Aspect Scoring: Evaluate multiple dimensions (e.g., tone, accuracy, helpfulness)
  • Classification Tasks: Categorize responses into predefined labels
  • Compliance Checking: Verify responses meet specific guidelines or policies
Output Field Types:
  • string: Free-form text output, or categorical if choices is specified
  • boolean: True/False classification
  • integer: Whole number scores (e.g., 1-5 rating scale)
  • number: Floating-point scores (e.g., 0.0-1.0 confidence)
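
The four field types, together with choices and minimum/maximum, line up closely with JSON Schema keywords. As a rough illustration of what a structured output schema could look like, the sketch below converts an output_fields-style dict into a JSON Schema object. The helper `output_fields_to_json_schema` is hypothetical and is not part of the Fiddler Evals SDK; how the SDK actually builds the schema it sends to the LLM is not documented here.

```python
import json


def output_fields_to_json_schema(output_fields):
    """Build a JSON Schema object from an output_fields-style dict.

    Illustrative only: this is an assumption about one possible mapping,
    not the SDK's implementation.
    """
    properties = {}
    for name, spec in output_fields.items():
        # The four documented type names are also valid JSON Schema types.
        prop = {"type": spec["type"]}
        if "choices" in spec:
            prop["enum"] = list(spec["choices"])
        for key in ("description", "title", "minimum", "maximum"):
            if key in spec:
                prop[key] = spec[key]
        properties[name] = prop
    # Assumption: a field with a default may be omitted by the LLM.
    required = [n for n, s in output_fields.items() if "default" not in s]
    return {"type": "object", "properties": properties, "required": required}


schema = output_fields_to_json_schema({
    "sentiment": {"type": "string", "choices": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
})
print(json.dumps(schema, indent=2))
```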

Parameters

prompt_template
str | list[Message], optional
default:"None"
The evaluation prompt. Can be either a plain string with {{ placeholder }} markers (wrapped in a single user message automatically) or a list of Message dicts for multi-message prompts. Required unless prompt_spec is provided.
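
To make the placeholder behavior concrete, here is a minimal sketch of how a {{ placeholder }} template might be rendered and how a plain string might be wrapped in a single user message. The functions `placeholder_names`, `render`, and `as_messages` are hypothetical stand-ins written for this illustration; the SDK's real templating engine is not specified here and may behave differently (for example, around whitespace or missing values).

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholder_names(template):
    """Return placeholder names in order of first appearance, without repeats."""
    seen = []
    for name in PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def render(template, values):
    """Substitute each {{ name }} with str(values[name]).

    Raises KeyError if a placeholder has no matching key (an assumption
    made for this sketch).
    """
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


def as_messages(prompt_template):
    """Wrap a plain string in one user message; pass message lists through."""
    if isinstance(prompt_template, str):
        return [{"role": "user", "content": prompt_template}]
    return list(prompt_template)


template = "Question: {{ question }}\nResponse: {{response}}"
names = placeholder_names(template)
rendered = render(template, {"question": "Why?", "response": "Because."})
messages = as_messages(rendered)
print(names)
print(messages)
```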
output_fields
Dict[str, OutputField], optional
default:"None"
Schema defining the expected outputs. Required unless prompt_spec is provided. Each field has:
  • type: One of 'string', 'boolean', 'integer', 'number'
  • choices (optional): List of allowed values for categorical string fields
  • description (optional): Instructions for the LLM about this field
  • title (optional): Human-readable title for the field
  • transform (optional): Transform from LLM response field to output field
  • default (optional): Default value if field is missing from LLM response
  • minimum (optional): Minimum allowed value for numeric fields
  • maximum (optional): Maximum allowed value for numeric fields
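
The sketch below shows one way the type, choices, minimum, and maximum attributes could be checked against a value returned by the LLM. It can be handy for sanity-checking a schema offline. `validate_value` is a hypothetical helper, not an SDK function, and the SDK's own validation rules (for example, whether an integer is accepted for a number field) are assumptions here.

```python
def validate_value(name, value, spec):
    """Return a list of problems found for one value against one field spec."""
    expected = spec["type"]
    if expected == "string":
        ok = isinstance(value, str)
    elif expected == "boolean":
        ok = isinstance(value, bool)
    elif expected == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif expected == "number":
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        return [f"{name}: unknown type {expected!r}"]
    if not ok:
        return [f"{name}: expected {expected}, got {type(value).__name__}"]

    problems = []
    if "choices" in spec and value not in spec["choices"]:
        problems.append(f"{name}: {value!r} is not one of {spec['choices']}")
    if expected in ("integer", "number"):
        if "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name}: {value} is below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            problems.append(f"{name}: {value} is above maximum {spec['maximum']}")
    return problems


quality_spec = {"type": "integer", "minimum": 1, "maximum": 10}
in_range = validate_value("quality_score", 7, quality_spec)
too_high = validate_value("quality_score", 11, quality_spec)
wrong_type = validate_value("quality_score", True, quality_spec)
print(in_range, too_high, wrong_type)
```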
prompt_spec
CustomJudgeSpec, optional
default:"None"
A CustomJudgeSpec object bundling prompt_template, output_fields, inputs, and llm_response_fields into a single reusable specification. Mutually exclusive with providing prompt_template and output_fields directly.
model
str
required
LLM Gateway model name in {provider}/{model} format. E.g., openai/gpt-4o, anthropic/claude-3-sonnet
credential
str, optional
default:"None"
Name of the LLM Gateway credential for the provider.
inputs
Dict[str, InputFieldSpec], optional
default:"None"
Metadata for template variables. Keys must match {{ placeholder }} names in the prompt template. Each value can specify:
  • title (optional): Human-readable title
  • description (optional): Description of the input
  • required (optional): Whether this input must be provided (default: False)
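
As a rough sketch of the two rules above (keys must match placeholder names, and required inputs must be supplied), the following check can be run before calling an evaluator. `check_inputs` is a hypothetical helper that uses plain dicts in place of InputFieldSpec; it does not reproduce the SDK's own error handling.

```python
import re


def check_inputs(template, input_specs, values):
    """Return a list of problems with an inputs spec and the supplied values."""
    names = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))
    errors = []
    # Every key in the inputs spec should correspond to a placeholder.
    for key in input_specs:
        if key not in names:
            errors.append(f"inputs key {key!r} has no matching placeholder")
    # Inputs marked required must be present in the supplied values.
    for key, spec in input_specs.items():
        if spec.get("required", False) and key not in values:
            errors.append(f"required input {key!r} was not provided")
    return errors


template = "Review this code:\n{{ code }}"
problems = check_inputs(template, {"code": {"required": True}, "style": {}}, {})
no_problems = check_inputs(template, {"code": {"required": True}}, {"code": "x = 1"})
print(problems)
print(no_problems)
```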
llm_response_fields
Dict[str, OutputField], optional
default:"None"
Schema for the LLM response before transformation. When provided, the LLM is instructed to return fields matching this schema, and output_fields with transform specs define how to map the response to final outputs. Required when any output field uses a transform.
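
The mapping from LLM response fields to final output fields can be pictured with the sketch below. It assumes a transform is a dict with source_field and an optional value_map, as in the examples on this page. The function `apply_transforms` is hypothetical, and its handling of unmapped values (passed through unchanged) and missing fields (the field's default) is an assumption rather than documented SDK behavior.

```python
def apply_transforms(llm_response, output_fields):
    """Map a raw LLM response dict to final outputs using transform specs."""
    outputs = {}
    for name, spec in output_fields.items():
        transform = spec.get("transform")
        # Without a transform, the output field reads the same-named response field.
        source = transform["source_field"] if transform else name
        if source not in llm_response:
            outputs[name] = spec.get("default")
            continue
        raw = llm_response[source]
        value_map = transform.get("value_map", {}) if transform else {}
        # Assumption: values missing from value_map pass through unchanged.
        outputs[name] = value_map.get(raw, raw)
    return outputs


output_fields = {
    "label": {
        "type": "string",
        "choices": ["yes", "no"],
        "transform": {
            "source_field": "is_faithful",
            "value_map": {"faithful": "yes", "not_faithful": "no"},
        },
    },
    "reasoning": {"type": "string"},
}
result = apply_transforms(
    {"is_faithful": "not_faithful", "reasoning": "Claim is unsupported."},
    output_fields,
)
print(result)
```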

Returns

A list of Score objects, one for each output field defined. Each Score contains:
  • name: The output field name (e.g., "sentiment", "confidence")
  • value: The numeric value (for number/integer/boolean fields)
  • label: The string label (for string/categorical fields)
  • reasoning: Always None for CustomJudge (reasoning is returned as a field)
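
Because each Score carries either a value or a label, code that reads scores generically needs to pick the right attribute. The sketch below uses a stand-in dataclass, `ScoreSketch`, which is not the SDK's Score class and only mirrors the four attributes listed above. It also shows why `value or label` is a risky shortcut: a value of 0 or False is falsy and would be skipped.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ScoreSketch:
    """Stand-in for a Score object (hypothetical, for illustration only)."""
    name: str
    value: Optional[Union[int, float, bool]] = None
    label: Optional[str] = None
    reasoning: Optional[str] = None


def to_scores(outputs):
    """Put strings in label and everything else in value (an assumption)."""
    scores = []
    for name, raw in outputs.items():
        if isinstance(raw, str):
            scores.append(ScoreSketch(name=name, label=raw))
        else:
            scores.append(ScoreSketch(name=name, value=raw))
    return scores


def display(score):
    """Prefer value whenever it is set, even if it is 0 or False."""
    return score.value if score.value is not None else score.label


scores = to_scores({"sentiment": "positive", "helpful": False, "quality_score": 4})
by_name = {s.name: s for s in scores}
for s in scores:
    print(f"{s.name}: {display(s)}")
```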

Example

Basic sentiment analysis with categorical output:
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="""
        Analyze the sentiment of the following customer review:

        Review: {{ review_text }}

        Classify the sentiment and explain your reasoning.
    """,
    output_fields={
        "sentiment": {
            "type": "string",
            "choices": ["positive", "negative", "neutral"],
        },
        "confidence": {
            "type": "number",
            "description": "Confidence score between 0 and 1"
        },
        "reasoning": {
            "type": "string",
        }
    }
)

scores = evaluator.score(inputs={
    "review_text": "The product exceeded my expectations! Fast shipping too."
})

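# Access individual scores by index or iterate
for score in scores:
    # Compare against None so a value of 0 or False is still printed
    result = score.value if score.value is not None else score.label
    print(f"{score.name}: {result}")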
# Output:
# sentiment: positive
# confidence: 0.95
# reasoning: The review expresses satisfaction...

Example

Multi-criteria response quality evaluation:
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="anthropic/claude-3-sonnet",
    credential="my-anthropic-key",
    prompt_template="""
        Evaluate the quality of this customer support response.

        Customer Question: {{ question }}
        Support Response: {{ response }}

        Rate the response on multiple criteria.
    """,
    output_fields={
        "helpful": {
            "type": "boolean",
            "description": "Does the response address the customer's question?"
        },
        "professional_tone": {
            "type": "boolean",
            "description": "Is the tone professional and courteous?"
        },
        "quality_score": {
            "type": "integer",
            "description": "Overall quality rating from 1 (poor) to 5 (excellent)"
        }
    }
)

scores = evaluator.score(inputs={
    "question": "How do I reset my password?",
    "response": "Click 'Forgot Password' on the login page and follow the steps."
})

# Convert to dict for easy access
scores_dict = {s.name: s for s in scores}
print(f"Helpful: {scores_dict['helpful'].value}")  # True
print(f"Quality: {scores_dict['quality_score'].value}")  # 4

Example

Code review evaluator:
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="""
        Review this code change for potential issues:

        ```{{ language }}
        {{ code_diff }}
        ```

        Context: {{ pr_description }}
    """,
    output_fields={
        "has_bugs": {
            "type": "boolean",
            "description": "Are there any obvious bugs or logic errors?"
        },
        "severity": {
            "type": "string",
            "choices": ["critical", "major", "minor", "none"],
            "description": "Severity of issues found"
        },
        "feedback": {
            "type": "string",
            "description": "Specific feedback for the code author"
        }
    }
)

Example

Multi-message prompt with system instructions and numeric constraints:
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template=[
        {"role": "system", "content": "You are an expert code reviewer."},
        {"role": "user", "content": "Review this code:\n{{ code }}"},
    ],
    output_fields={
        "quality_score": {
            "type": "integer",
            "minimum": 1,
            "maximum": 10,
            "description": "Code quality score from 1 to 10"
        },
        "feedback": {
            "type": "string",
            "description": "Specific feedback for the code author"
        }
    },
    inputs={
        "code": {"required": True, "description": "The code to review"}
    }
)

Example

Using llm_response_fields with transforms for value mapping:
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="Is the response faithful? Response: {{ response }}",
    llm_response_fields={
        "is_faithful": {
            "type": "string",
            "choices": ["faithful", "not_faithful"],
        },
        "reasoning": {"type": "string"},
    },
    output_fields={
        "label": {
            "type": "string",
            "choices": ["yes", "no"],
            "transform": {
                "source_field": "is_faithful",
                "value_map": {"faithful": "yes", "not_faithful": "no"},
            },
        },
        "reasoning": {"type": "string"},
    },
)

Example

Using a reusable CustomJudgeSpec object:
from fiddler_evals.evaluators.custom_judge import (
    CustomJudge, CustomJudgeSpec, Message, InputFieldSpec,
)

spec = CustomJudgeSpec(
    prompt_template=[
        Message(role="system", content="You are an expert evaluator."),
        Message(role="user", content="Rate this response:\n{{ response }}"),
    ],
    inputs={"response": InputFieldSpec(required=True)},
    output_fields={
        "quality": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5,
            "description": "Quality rating from 1 to 5",
        }
    },
)

evaluator = CustomJudge(prompt_spec=spec, model="openai/gpt-4o")

Notes:
  • Placeholder names in {{ }} must exactly match keys in the inputs dict
  • The LLM is instructed to return JSON matching your output schema
  • For best results, include clear descriptions for each output field
  • Use choices for categorical fields to ensure consistent outputs
  • Use minimum/maximum for numeric fields to constrain values
  • Use CustomJudgeSpec to bundle prompt configuration into a reusable object
  • This evaluator requires an active connection to the Fiddler API

name = 'custom_judge'

score()

Run the Custom Judge on the given inputs and return one Score per output field.

Parameters

inputs
Dict[str, Any]
required
Values for the {{ placeholders }} in your prompt_template. Keys must match placeholder names exactly.

Returns

A list of Score objects, one for each output field defined.

Raises

ValueError – If inputs is empty.