Advanced Prompt Specs
This comprehensive guide covers advanced LLM-as-a-Judge capabilities, including custom prompting, model configuration, performance optimization, and enterprise deployment patterns.
Prerequisites
Understanding of Prompt Specs fundamentals
Completion of the LLM Evaluation Quickstart
Familiarity with LLM evaluation concepts
Download this tutorial directly from GitHub or run it in Google Colab
Set Up Your Environment
import json
import pandas as pd
import requests
FIDDLER_BASE_URL = "https://your_company.fiddler.ai"
PROMPT_SPEC_URL = f"{FIDDLER_BASE_URL}/v3/llm-as-a-judge/prompt-spec"
FIDDLER_TOKEN = "your_fiddler_api_token"  # your Fiddler API access token
FIDDLER_HEADERS = {
    "Authorization": f"Bearer {FIDDLER_TOKEN}",
    "Content-Type": "application/json",
}
Prepare Sample Data
We'll use news article data for this example:
# Load sample news data (using AG News dataset)
df_ag_news = pd.read_parquet(
"hf://datasets/fancyzhx/ag_news/data/test-00000-of-00001.parquet"
).sample(20, random_state=25)
# Map labels to topic names
df_news["original_topic"] = df_news["label"].map({
0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"
})
# Summarize the count of each unique topic
print(df_ag_news["original_topic"].value_counts())
Start with a Basic Prompt Spec
Define a simple evaluation schema:
prompt_spec_basic = {
"input_fields": {
"news_summary": {"type": "string"}
},
"output_fields": {
"topic": {
"type": "string",
"choices": ["World", "Sports", "Business", "Sci/Tech"]
},
"reasoning": {"type": "string"}
}
}
Validate
Validate your Prompt Spec schema:
validate_response = requests.post(
f"{PROMPT_SPEC_URL}/validate",
headers=FIDDLER_HEADERS,
json={"prompt_spec": prompt_spec_basic},
)
validate_response.raise_for_status()
print("Status Code:", validate_response.status_code)
print(json.dumps(validate_response.json(), indent=2))
Test with Ad-hoc Data
Define a helper function for calling the predict endpoint and try it on a single ad-hoc example:
def get_prediction(prompt_spec, input_data):
predict_response = requests.post(
f"{PROMPT_SPEC_URL}/predict",
headers=FIDDLER_HEADERS,
json={"prompt_spec": prompt_spec, "input_data": input_data},
)
if predict_response.status_code != 200:
print(f"Error ({predict_response.status_code}): {predict_response.text}")
return {"topic": None, "reasoning": None}
return predict_response.json()["prediction"]
print(
json.dumps(
get_prediction(
prompt_spec_basic, {"news_summary": "Wimbledon 2025 is under way!"}
),
indent=2,
)
)
Test with a DataFrame
Evaluate a batch of data:
df_ag_news[["topic", "reasoning"]] = df_ag_news.apply(
lambda row: get_prediction(prompt_spec_basic, {"news_summary": row["text"]}),
axis=1,
result_type="expand",
)
Inspect the Results
Note that several Sci/Tech articles were misclassified as World. The reasoning field helps identify trends in these errors, which we'll use to update our Prompt Spec in the next section.
accuracy = (df_ag_news["original_topic"] == df_ag_news["topic"]).mean()
print(f"Accuracy: {accuracy:.0%}")
df_ag_news.value_counts(subset=["original_topic", "topic"])
for r in df_ag_news[
(df_ag_news["original_topic"] == "Sci/Tech") & (df_ag_news["topic"] != "Sci/Tech")
]["reasoning"]:
    print(r)
Improve the Accuracy with Descriptions
Just as descriptive field names help improve model performance, so do a task instruction and field descriptions. Here, we add a description to topic to help with classifying Sci/Tech articles. Note the improved results.
prompt_spec_rich = {
"instruction": "Determine the topic of the given news summary.",
"input_fields": {
"news_summary": {
"type": "string",
}
},
"output_fields": {
"topic": {
"type": "string",
"choices": df_ag_news["original_topic"].unique().tolist(),
"description": """Use topic 'Sci/Tech' if the news summary is about a company or business in the tech industry, or if the news summary is about a scientific discovery or research, including health and medicine.
Use topic 'Sports' if the news summary is about a sports event or athlete.
Use topic 'Business' if the news summary is about a company or industry outside of science, technology, or sports.
Use topic 'World' if the news summary is about a global event or issue.
""",
},
"reasoning": {
"type": "string",
"description": "The reasoning behind the predicted topic.",
},
},
}
print(
json.dumps(
get_prediction(prompt_spec_rich, {"news_summary": df_ag_news.loc[267, "text"]}),
indent=2,
)
)
Reevaluate the DataFrame with the New Prompt
Note the improved accuracy in the results:
df_ag_news[["topic", "reasoning"]] = df_ag_news.apply(
lambda row: get_prediction(prompt_spec_rich, {"news_summary": row["text"]}),
axis=1,
result_type="expand",
)
accuracy = (df_ag_news["original_topic"] == df_ag_news["topic"]).mean()
print(f"Accuracy: {accuracy:.0%}")
df_ag_news.value_counts(subset=["original_topic", "topic"])
Deploying Your Evaluation to Production
Once you see the results you expect with your test data, deploy the custom evaluation to production and monitor your production application:
Create a Fiddler Project for your Monitoring Application
from datetime import datetime
import fiddler as fdl
fdl.init(url=FIDDLER_BASE_URL, token=FIDDLER_TOKEN)
PROJECT_NAME = "quickstart_examples" # If the project already exists, the notebook will create the model under the existing project.
MODEL_NAME = "fiddler_news_classifier"
project = fdl.Project.get_or_create(name=PROJECT_NAME)
Update the DataFrame Schema Names
Recall we used news_summary in our prompt. Let's make our dataframe match this and add some metadata.
platform_df = df_ag_news.rename(columns={"text": "news_summary"})
platform_df["id"] = platform_df.indexAdd the Prediction as a Fiddler GenAI Enrichment
name will be used as part of the generated column name; set it to something meaningful for your use case.
enrichment must always be llm_as_a_judge.
columns matches all the input columns your prompt spec uses.
config must set the prompt spec.
Then define the remainder of the schema that makes up this application to be monitored. For more details on setting up Fiddler to monitor your ML models and LLM/GenAI applications, refer to the ML Monitoring Quick Start and the LLM Monitoring Quick Start guides.
fiddler_llm_enrichments = [
fdl.Enrichment(
name="news_topic",
enrichment="llm_as_a_judge",
columns=["news_summary"],
config={"prompt_spec": prompt_spec_rich},
)
]
model_spec = fdl.ModelSpec(
inputs=["news_summary"],
metadata=["id", "original_topic"],
custom_features=fiddler_llm_enrichments,
)
llm_application = fdl.Model.from_data(
source=platform_df,
name=MODEL_NAME,
project_id=project.id,
spec=model_spec,
task=fdl.ModelTask.LLM,
max_cardinality=5,
)
llm_application.create()
Publish Data to Simulate LLM Activity
Our enrichment will add two columns: FDL news_topic (topic) and FDL news_topic (reasoning).
Note: The column names follow the pattern FDL {enrichment name} ({prompt spec output field}), using the values defined in the Enrichment and the Prompt Spec above.
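As a quick illustrative check (not a required step, and derived purely from the naming pattern above), you can construct the expected column names in code:
# Illustrative: build the expected enriched column names from the pattern above
enrichment_name = "news_topic"
output_fields = ["topic", "reasoning"]
expected_columns = [f"FDL {enrichment_name} ({field})" for field in output_fields]
print(expected_columns)  # ['FDL news_topic (topic)', 'FDL news_topic (reasoning)']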
start_time = datetime.now()  # record the start of the publishing window; used when downloading the data below
production_publish_job = llm_application.publish(platform_df)
# Wait for the job to complete
production_publish_job.wait(interval=20)
Download the Data Enriched by Fiddler
llm_application.download_data(
output_dir="test_download",
env_type=fdl.EnvType.PRODUCTION,
start_time=start_time,
end_time=datetime.now(),
columns=[
"id",
"news_summary",
"original_topic",
"FDL news_topic (topic)",
"FDL news_topic (reasoning)",
],
)
# See the original data and the results of LLM-as-a-Judge
fdl_data = pd.read_parquet("test_download/output.parquet")
fdl_data.sample(15)
Advanced Prompt Specs Configuration
Schema Design Patterns
Multi-Output Evaluation
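A single Prompt Spec can return several output fields from one evaluation call. The sketch below is illustrative rather than part of the tutorial above (the sentiment field and its choices are assumptions); it simply follows the same input_fields/output_fields schema used throughout this guide:
# Illustrative sketch: one Prompt Spec producing multiple outputs per evaluation
prompt_spec_multi_output = {
    "instruction": "Evaluate the given news summary.",
    "input_fields": {
        "news_summary": {"type": "string"},
    },
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": ["World", "Sports", "Business", "Sci/Tech"],
        },
        "sentiment": {
            "type": "string",
            "choices": ["positive", "neutral", "negative"],
            "description": "The overall tone of the news summary.",
        },
        "reasoning": {"type": "string", "description": "The reasoning behind the predicted fields."},
    },
}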
Domain-Specific Classification
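The news classifier built earlier in this guide is already a domain-specific classification: accuracy improved once the topic description encoded domain knowledge, such as treating health and medicine stories as Sci/Tech. The same pattern carries over to other domains. The sketch below is a hypothetical support-ticket triage spec; its field names, choices, and rules are invented for illustration only:
# Hypothetical sketch: the same schema pattern applied to a different domain
prompt_spec_ticket_triage = {
    "instruction": "Classify the customer support ticket.",
    "input_fields": {
        "ticket_text": {"type": "string"},
    },
    "output_fields": {
        "category": {
            "type": "string",
            "choices": ["billing", "technical_issue", "account_access", "other"],
            "description": (
                "Use 'billing' for invoices, refunds, and payment questions. "
                "Use 'technical_issue' for errors, outages, and bugs. "
                "Use 'account_access' for login, password, and permission problems. "
                "Use 'other' for anything else."
            ),
        },
        "reasoning": {"type": "string"},
    },
}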
Performance Optimization Techniques
Field Description Best Practices
Be Specific: Use concrete examples rather than abstract descriptions
Avoid Ambiguity: Define edge cases and boundary conditions
Include Context: Reference domain-specific knowledge when needed, as in the sketch after this list
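As a minimal sketch of these practices (reusing the topic field from the news example; the exact wording is illustrative), a good description names concrete rules and spells out edge cases rather than restating the field name:
# Illustrative: a field definition whose description is specific, covers edge cases, and adds domain context
topic_field = {
    "type": "string",
    "choices": ["World", "Sports", "Business", "Sci/Tech"],
    "description": (
        "Use 'Sci/Tech' for technology companies, scientific discoveries, and research, "
        "including health and medicine. "  # edge case made explicit
        "Use 'Business' for companies and industries outside science, technology, or sports."
    ),
}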
Bring-Your-Own-Prompt
For maximum customization, Fiddler supports custom prompt templates with multiple output format options.
Free-Form Output
Best for open-ended evaluations where structure is less important:
{
"prompt_template": {
"user": "Analyze the following text for potential bias: {text}. Provide detailed analysis."
},
"output_fields": ["analysis"],
"output_format": {
"type": "free_form"
}
}
Guided Choice Output
For single categorical outputs with high accuracy requirements:
{
"prompt_template": {
"system": "You are an expert content moderator.",
"user": "Classify this content: {content}. Choose the most appropriate category."
},
"output_fields": ["classification"],
"output_format": {
"type": "guided_choice",
"choices": ["safe", "questionable", "harmful", "requires_review"]
}
}
Guided JSON Output
For complex structured outputs with validation:
{
"prompt_template": {
"system": "Extract structured information from job postings.",
"user": "Extract key details from: {job_posting}"
},
"output_fields": ["title", "company", "salary_range", "remote_work"],
"output_format": {
"type": "guided_json",
"schema": {
"type": "object",
"properties": {
"title": { "type": "string" },
"company": { "type": "string" },
"salary_range": { "type": "string" },
"remote_work": { "type": "boolean" }
},
"required": ["title", "company"]
}
}
}
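Assuming these bring-your-own-prompt specs can be passed as the prompt_spec payload to the same endpoints used earlier (a sketch under that assumption, reusing PROMPT_SPEC_URL and FIDDLER_HEADERS from the setup section and the free-form spec as an example), a validation call would mirror the earlier step:
# Sketch under an assumption: validating a bring-your-own-prompt spec with the same endpoint as before
byop_prompt_spec = {
    "prompt_template": {
        "user": "Analyze the following text for potential bias: {text}. Provide detailed analysis."
    },
    "output_fields": ["analysis"],
    "output_format": {"type": "free_form"},
}
validate_response = requests.post(
    f"{PROMPT_SPEC_URL}/validate",
    headers=FIDDLER_HEADERS,
    json={"prompt_spec": byop_prompt_spec},
)
print(validate_response.status_code)
print(json.dumps(validate_response.json(), indent=2))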
Additional Documentation
LLM Observability Overview: Understanding Fiddler's broader LLM monitoring capabilities
Enrichments: Technical details on Fiddler's evaluation infrastructure
API Reference: Complete API documentation