# CustomJudge

Create a fully customizable LLM-as-a-Judge evaluator with your own prompt and output schema.

The CustomJudge evaluator allows you to define arbitrary evaluation criteria by specifying a custom prompt template and structured output fields. This is the most flexible evaluator in the Fiddler Evals SDK, enabling you to build domain-specific evaluation logic without writing custom code.

Key Features:

* **Custom Prompts**: Define your own evaluation prompt with `{{ placeholder }}` syntax
* **Structured Outputs**: Specify typed output fields (string, boolean, integer, number)
* **Categorical Choices**: Constrain string outputs to predefined categories
* **Multi-Field Outputs**: Return multiple scores/labels from a single evaluation
* **Field Descriptions**: Guide the LLM with descriptions for each output field

Use Cases:

* **Domain-Specific Evaluation**: Create evaluators tailored to your industry or use case
* **Custom Rubrics**: Implement grading rubrics with specific criteria
* **Multi-Aspect Scoring**: Evaluate multiple dimensions (e.g., tone, accuracy, helpfulness)
* **Classification Tasks**: Categorize responses into predefined labels
* **Compliance Checking**: Verify responses meet specific guidelines or policies

Output Field Types:

* **string**: Free-form text output, or categorical if `choices` is specified
* **boolean**: True/False classification
* **integer**: Whole number scores (e.g., 1-5 rating scale)
* **number**: Floating-point scores (e.g., 0.0-1.0 confidence)
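
The field types above can be checked before an evaluator is constructed. The following is a minimal sketch, not part of the SDK: `check_output_fields` and `ALLOWED_TYPES` are hypothetical names, and the rules it applies are only the ones stated on this page (four allowed types, `choices` on string fields).

```python
from typing import Any, Dict, List

# Field types listed in the "Output Field Types" section of this page.
ALLOWED_TYPES = {"string", "boolean", "integer", "number"}


def check_output_fields(output_fields: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in an output_fields dict (hypothetical helper)."""
    problems: List[str] = []
    for name, spec in output_fields.items():
        field_type = spec.get("type")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown type {field_type!r}")
        # `choices` is documented only for categorical string fields.
        if "choices" in spec and field_type != "string":
            problems.append(f"{name}: 'choices' is only documented for string fields")
    return problems


fields = {
    "sentiment": {"type": "string", "choices": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "description": "Confidence score between 0 and 1"},
    "rating": {"type": "float"},  # not one of the documented types
}
print(check_output_fields(fields))
```

A check like this catches schema typos locally; whether the SDK performs its own validation is not stated here.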

## Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `prompt_template` | `str` | ✗ | `None` | The evaluation prompt with `{{ placeholder }}` markers for dynamic content. Placeholders are filled from the `inputs` dict passed to the `score()` method. |
| `output_fields` | `Dict[str, OutputField]` | ✗ | `None` | Schema defining the expected outputs. Each field has: `type`: one of `'string'`, `'boolean'`, `'integer'`, `'number'`; `choices` (optional): list of allowed values for categorical string fields; `description` (optional): instructions for the LLM about this field. |
| `model` | `str` | ✗ | `None` | LLM Gateway model name in `{provider}/{model}` format, e.g. `openai/gpt-4o` or `anthropic/claude-3-sonnet`. |
| `credential` | `str` | ✗ | `None` | Name of the LLM Gateway credential for the provider. |
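
Because placeholders are filled from the `inputs` dict, it can help to confirm that every placeholder has a matching key before calling `score()`. The sketch below is an illustration only: `missing_placeholders` is a hypothetical helper, and the regular expression assumes placeholders are simple identifiers, which may not match how the SDK parses templates.

```python
import re
from typing import Any, Dict, Set

# Matches `{{ name }}` with optional whitespace inside the braces (assumed syntax).
PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_placeholders(prompt_template: str, inputs: Dict[str, Any]) -> Set[str]:
    """Return placeholder names in the template that have no key in `inputs`."""
    return set(PLACEHOLDER_RE.findall(prompt_template)) - set(inputs)


template = "Customer Question: {{ question }}\nSupport Response: {{ response }}"
print(missing_placeholders(template, {"question": "How do I reset my password?"}))
```

An empty set means every placeholder found by the pattern has a corresponding key.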

## Returns

A list of `Score` objects, one for each output field defined. Each `Score` contains:

* `name`: The output field name (e.g., "sentiment", "confidence")
* `value`: The numeric value (for number/integer/boolean fields)
* `label`: The string label (for string/categorical fields)
* `reasoning`: Always `None` for CustomJudge. To get reasoning, define it as an output field; it is then returned as its own `Score`.

**Return type:** list\[Score]
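
Since each score carries either a `value` or a `label`, reading results needs a small amount of care: a falsy `value` such as `0.0` or `False` is still a real result. The sketch below uses a stand-in `Score` dataclass that mirrors only the attributes listed above; it is not the SDK class, and `score_result` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Score:
    """Stand-in with only the attributes described above; not the SDK class."""

    name: str
    value: Optional[Union[int, float, bool]] = None
    label: Optional[str] = None
    reasoning: Optional[str] = None


def score_result(score: Score) -> Union[int, float, bool, str, None]:
    """Return `value` if set, else `label`; uses `is not None` so 0 and False survive."""
    return score.value if score.value is not None else score.label


scores: List[Score] = [
    Score("sentiment", label="neutral"),
    Score("confidence", value=0.0),
    Score("helpful", value=False),
]
results = {s.name: score_result(s) for s in scores}
print(results)
```

Writing `score.value or score.label` instead would silently replace `0.0` and `False` with the (empty) label.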

## Example

Basic sentiment analysis with categorical output:

```python
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="""
        Analyze the sentiment of the following customer review:

        Review: {{ review_text }}

        Classify the sentiment and explain your reasoning.
    """,
    output_fields={
        "sentiment": {
            "type": "string",
            "choices": ["positive", "negative", "neutral"],
        },
        "confidence": {
            "type": "number",
            "description": "Confidence score between 0 and 1"
        },
        "reasoning": {
            "type": "string",
        }
    }
)

scores = evaluator.score(inputs={
    "review_text": "The product exceeded my expectations! Fast shipping too."
})

# Iterate over the scores; test `is not None` so values such as 0 or False are not dropped
for score in scores:
    result = score.value if score.value is not None else score.label
    print(f"{score.name}: {result}")
# Output:
# sentiment: positive
# confidence: 0.95
# reasoning: The review expresses satisfaction...
```

## Example

Multi-criteria response quality evaluation:

```python
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="anthropic/claude-3-sonnet",
    credential="my-anthropic-key",
    prompt_template="""
        Evaluate the quality of this customer support response.

        Customer Question: {{ question }}
        Support Response: {{ response }}

        Rate the response on multiple criteria.
    """,
    output_fields={
        "helpful": {
            "type": "boolean",
            "description": "Does the response address the customer's question?"
        },
        "professional_tone": {
            "type": "boolean",
            "description": "Is the tone professional and courteous?"
        },
        "quality_score": {
            "type": "integer",
            "description": "Overall quality rating from 1 (poor) to 5 (excellent)"
        }
    }
)

scores = evaluator.score(inputs={
    "question": "How do I reset my password?",
    "response": "Click 'Forgot Password' on the login page and follow the steps."
})

# Convert to dict for easy access
scores_dict = {s.name: s for s in scores}
print(f"Helpful: {scores_dict['helpful'].value}")  # True
print(f"Quality: {scores_dict['quality_score'].value}")  # 4
```

## Example

Code review evaluator:

````python
from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-key",
    prompt_template="""
        Review this code change for potential issues:

        ```{{ language }}
        {{ code_diff }}
        ```

        Context: {{ pr_description }}
    """,
    output_fields={
        "has_bugs": {
            "type": "boolean",
            "description": "Are there any obvious bugs or logic errors?"
        },
        "severity": {
            "type": "string",
            "choices": ["critical", "major", "minor", "none"],
            "description": "Severity of issues found"
        },
        "feedback": {
            "type": "string",
            "description": "Specific feedback for the code author"
        }
    }
)
````

<div data-gb-custom-block data-tag="hint" data-style='info'>

- Placeholder names in `{{ }}` must exactly match keys in the `inputs` dict
- The LLM is instructed to return JSON matching your output schema
- For best results, include clear descriptions for each output field
- Use `choices` for categorical fields to ensure consistent outputs
- This evaluator requires an active connection to the Fiddler API

</div>

#### name *= 'custom_judge'*

#### score()

Score using the Custom Judge.

#### Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `inputs` | `Dict[str, Any]` | ✗ | `None` | Values for the `{{ placeholders }}` in your `prompt_template`. Keys must match placeholder names exactly. |

#### Returns

A list of `Score` objects, one for each output field defined.

**Return type:** list\[Score]

#### Raises

* **ValueError**: If `inputs` is empty.

