Advanced Prompt Specs
This comprehensive guide covers advanced LLM-as-a-Judge capabilities, including custom prompting, model configuration, performance optimization, and enterprise deployment patterns.
Prerequisites
Understanding of Prompt Specs fundamentals
Completion of the LLM Evaluation Quickstart
Familiarity with LLM evaluation concepts
Download this tutorial directly from GitHub or run it in Google Colab
Set Up Your Environment
import json
import pandas as pd
import requests
FIDDLER_BASE_URL = "https://your_company.fiddler.ai"
PROMPT_SPEC_URL = f"{FIDDLER_BASE_URL}/v3/llm-as-a-judge/prompt-spec"
FIDDLER_TOKEN = "your_fiddler_api_token"  # your Fiddler API access token
FIDDLER_HEADERS = {
    "Authorization": f"Bearer {FIDDLER_TOKEN}",
    "Content-Type": "application/json",
}
Prepare Sample Data
We'll use news article data for this example:
# Load sample news data (using AG News dataset)
df_ag_news = pd.read_parquet(
"hf://datasets/fancyzhx/ag_news/data/test-00000-of-00001.parquet"
).sample(20, random_state=25)
# Map labels to topic names
df_news["original_topic"] = df_news["label"].map({
0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"
})
# Summarize the count of each unique topic
print(df_ag_news["original_topic"].value_counts())
Start with a Basic Prompt Spec
Define a simple evaluation schema:
prompt_spec_basic = {
"input_fields": {
"news_summary": {"type": "string"}
},
"output_fields": {
"topic": {
"type": "string",
"choices": ["World", "Sports", "Business", "Sci/Tech"]
},
"reasoning": {"type": "string"}
}
}
Validate
Validate your Prompt Spec schema:
validate_response = requests.post(
f"{PROMPT_SPEC_URL}/validate",
headers=FIDDLER_HEADERS,
json={"prompt_spec": prompt_spec_basic},
)
validate_response.raise_for_status()
print("Status Code:", validate_response.status_code)
print(json.dumps(validate_response.json(), indent=2))
Test with Ad-hoc Data
Define a helper function for calling the predict endpoint and try it on a single ad-hoc example:
def get_prediction(prompt_spec, input_data):
predict_response = requests.post(
f"{PROMPT_SPEC_URL}/predict",
headers=FIDDLER_HEADERS,
json={"prompt_spec": prompt_spec, "input_data": input_data},
)
if predict_response.status_code != 200:
print(f"Error ({predict_response.status_code}): {predict_response.text}")
return {"topic": None, "reasoning": None}
return predict_response.json()["prediction"]
print(
json.dumps(
get_prediction(
prompt_spec_basic, {"news_summary": "Wimbledon 2025 is under way!"}
),
indent=2,
)
)
Test with a DataFrame
Evaluate a batch of data:
df_ag_news[["topic", "reasoning"]] = df_ag_news.apply(
lambda row: get_prediction(prompt_spec_basic, {"news_summary": row["text"]}),
axis=1,
result_type="expand",
)
Inspect the Results
Note that several Sci/Tech articles were misclassified as World. The reasoning field helps identify trends in these errors, which we'll use to update our Prompt Spec in the next section.
accuracy = (df_ag_news["original_topic"] == df_ag_news["topic"]).mean()
print(f"Accuracy: {accuracy:.0%}")
df_ag_news.value_counts(subset=["original_topic", "topic"])
for r in df_ag_news[
(df_ag_news["original_topic"] == "Sci/Tech") & (df_ag_news["topic"] != "Sci/Tech")
]["reasoning"]:
    print(r)
Improve the Accuracy with Descriptions
Just as descriptive field names help improve model performance, so do a task instruction and field descriptions. Here, we add a description to topic to help with classifying Sci/Tech articles. Note the improved results.
prompt_spec_rich = {
"instruction": "Determine the topic of the given news summary.",
"input_fields": {
"news_summary": {
"type": "string",
}
},
"output_fields": {
"topic": {
"type": "string",
"choices": df_ag_news["original_topic"].unique().tolist(),
"description": """Use topic 'Sci/Tech' if the news summary is about a company or business in the tech industry, or if the news summary is about a scientific discovery or research, including health and medicine.
Use topic 'Sports' if the news summary is about a sports event or athlete.
Use topic 'Business' if the news summary is about a company or industry outside of science, technology, or sports.
Use topic 'World' if the news summary is about a global event or issue.
""",
},
"reasoning": {
"type": "string",
"description": "The reasoning behind the predicted topic.",
},
},
}
print(
json.dumps(
get_prediction(prompt_spec_rich, {"news_summary": df_ag_news.loc[267, "text"]}),
indent=2,
)
)
Reevaluate the DataFrame with the New Prompt
Note the improved accuracy in the results:
df_ag_news[["topic", "reasoning"]] = df_ag_news.apply(
lambda row: get_prediction(prompt_spec_rich, {"news_summary": row["text"]}),
axis=1,
result_type="expand",
)
accuracy = (df_ag_news["original_topic"] == df_ag_news["topic"]).mean()
print(f"Accuracy: {accuracy:.0%}")
df_ag_news.value_counts(subset=["original_topic", "topic"])
Deploying Your Evaluation to Production
Once you see the results you expect with your test data, deploy the custom evaluation to production and monitor your production application:
Create a Fiddler Project for your Monitoring Application
from datetime import datetime
import fiddler as fdl
fdl.init(url=FIDDLER_BASE_URL, token=FIDDLER_TOKEN)
PROJECT_NAME = "quickstart_examples" # If the project already exists, the notebook will create the model under the existing project.
MODEL_NAME = "fiddler_news_classifier"
project = fdl.Project.get_or_create(name=PROJECT_NAME)
Update the DataFrame Schema Names
Recall we used news_summary in our prompt. Let's make our dataframe match this and add some metadata.
platform_df = df_ag_news.rename(columns={"text": "news_summary"})
platform_df["id"] = platform_df.indexAdd the Prediction as a Fiddler GenAI Enrichment
name will be used as part of the generated column name; set it to something meaningful for your use case.
enrichment must always be llm_as_a_judge.
columns matches all the input columns your prompt spec uses.
config must set the prompt spec.
Then define the remainder of the schema that makes up this application to be monitored. For more details on setting up Fiddler to monitor your ML models and LLM/GenAI applications, refer to the ML Monitoring Quick Start and the LLM Monitoring Quick Start guides.
fiddler_llm_enrichments = [
fdl.Enrichment(
name="news_topic",
enrichment="llm_as_a_judge",
columns=["news_summary"],
config={"prompt_spec": prompt_spec_rich},
)
]
model_spec = fdl.ModelSpec(
inputs=["news_summary"],
metadata=["id", "original_topic"],
custom_features=fiddler_llm_enrichments,
)
llm_application = fdl.Model.from_data(
source=platform_df,
name=MODEL_NAME,
project_id=project.id,
spec=model_spec,
task=fdl.ModelTask.LLM,
max_cardinality=5,
)
llm_application.create()
Publish Data to Simulate LLM Activity
Our enrichment will add two columns: FDL news_topic (topic) and FDL news_topic (reasoning).
Note: The column names follow the pattern FDL {enrichment name} ({prompt spec output field}), using the values defined in the Enrichment and the Prompt Spec above.
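As a quick illustrative check (not a required step, and derived purely from the naming pattern above), you can construct the expected column names in code:
# Illustrative: build the expected enriched column names from the pattern above
enrichment_name = "news_topic"
output_fields = ["topic", "reasoning"]
expected_columns = [f"FDL {enrichment_name} ({field})" for field in output_fields]
print(expected_columns)  # ['FDL news_topic (topic)', 'FDL news_topic (reasoning)']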
start_time = datetime.now()  # record the start of the publishing window; used when downloading the data below
production_publish_job = llm_application.publish(platform_df)
# Wait for the job to complete
production_publish_job.wait(interval=20)
Download the Data Enriched by Fiddler
llm_application.download_data(
output_dir="test_download",
env_type=fdl.EnvType.PRODUCTION,
start_time=start_time,
end_time=datetime.now(),
columns=[
"id",
"news_summary",
"original_topic",
"FDL news_topic (topic)",
"FDL news_topic (reasoning)",
],
)
# See the original data and the results of LLM-as-a-Judge
fdl_data = pd.read_parquet("test_download/output.parquet")
fdl_data.sample(15)
Advanced Prompt Specs Configuration
Schema Design Patterns
Multi-Output Evaluation
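A single Prompt Spec can return several output fields from one evaluation call. The sketch below is illustrative rather than part of the tutorial above (the sentiment field and its choices are assumptions); it simply follows the same input_fields/output_fields schema used throughout this guide:
# Illustrative sketch: one Prompt Spec producing multiple outputs per evaluation
prompt_spec_multi_output = {
    "instruction": "Evaluate the given news summary.",
    "input_fields": {
        "news_summary": {"type": "string"},
    },
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": ["World", "Sports", "Business", "Sci/Tech"],
        },
        "sentiment": {
            "type": "string",
            "choices": ["positive", "neutral", "negative"],
            "description": "The overall tone of the news summary.",
        },
        "reasoning": {"type": "string", "description": "The reasoning behind the predicted fields."},
    },
}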
Domain-Specific Classification
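The news classifier built earlier in this guide is already a domain-specific classification: accuracy improved once the topic description encoded domain knowledge, such as treating health and medicine stories as Sci/Tech. The same pattern carries over to other domains. The sketch below is a hypothetical support-ticket triage spec; its field names, choices, and rules are invented for illustration only:
# Hypothetical sketch: the same schema pattern applied to a different domain
prompt_spec_ticket_triage = {
    "instruction": "Classify the customer support ticket.",
    "input_fields": {
        "ticket_text": {"type": "string"},
    },
    "output_fields": {
        "category": {
            "type": "string",
            "choices": ["billing", "technical_issue", "account_access", "other"],
            "description": (
                "Use 'billing' for invoices, refunds, and payment questions. "
                "Use 'technical_issue' for errors, outages, and bugs. "
                "Use 'account_access' for login, password, and permission problems. "
                "Use 'other' for anything else."
            ),
        },
        "reasoning": {"type": "string"},
    },
}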
Performance Optimization Techniques
Field Description Best Practices
Be Specific: Use concrete examples rather than abstract descriptions
Avoid Ambiguity: Define edge cases and boundary conditions
Include Context: Reference domain-specific knowledge when needed, as in the sketch after this list
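As a minimal sketch of these practices (reusing the topic field from the news example; the exact wording is illustrative), a good description names concrete rules and spells out edge cases rather than restating the field name:
# Illustrative: a field definition whose description is specific, covers edge cases, and adds domain context
topic_field = {
    "type": "string",
    "choices": ["World", "Sports", "Business", "Sci/Tech"],
    "description": (
        "Use 'Sci/Tech' for technology companies, scientific discoveries, and research, "
        "including health and medicine. "  # edge case made explicit
        "Use 'Business' for companies and industries outside science, technology, or sports."
    ),
}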
Bring-Your-Own-Prompt
For maximum customization, Fiddler supports custom prompt templates with multiple output format options.
Free-Form Output
Best for open-ended evaluations where structure is less important:
{
"prompt_template": {
"user": "Analyze the following text for potential bias: {text}. Provide detailed analysis."
},
"output_fields": ["analysis"],
"output_format": {
"type": "free_form"
}
}
Guided Choice Output
For single categorical outputs with high accuracy requirements:
{
"prompt_template": {
"system": "You are an expert content moderator.",
"user": "Classify this content: {content}. Choose the most appropriate category."
},
"output_fields": ["classification"],
"output_format": {
"type": "guided_choice",
"choices": ["safe", "questionable", "harmful", "requires_review"]
}
}
Guided JSON Output
For complex structured outputs with validation:
{
"prompt_template": {
"system": "Extract structured information from job postings.",
"user": "Extract key details from: {job_posting}"
},
"output_fields": ["title", "company", "salary_range", "remote_work"],
"output_format": {
"type": "guided_json",
"schema": {
"type": "object",
"properties": {
"title": { "type": "string" },
"company": { "type": "string" },
"salary_range": { "type": "string" },
"remote_work": { "type": "boolean" }
},
"required": ["title", "company"]
}
}
}
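Assuming these bring-your-own-prompt specs can be passed as the prompt_spec payload to the same endpoints used earlier (a sketch under that assumption, reusing PROMPT_SPEC_URL and FIDDLER_HEADERS from the setup section and the free-form spec as an example), a validation call would mirror the earlier step:
# Sketch under an assumption: validating a bring-your-own-prompt spec with the same endpoint as before
byop_prompt_spec = {
    "prompt_template": {
        "user": "Analyze the following text for potential bias: {text}. Provide detailed analysis."
    },
    "output_fields": ["analysis"],
    "output_format": {"type": "free_form"},
}
validate_response = requests.post(
    f"{PROMPT_SPEC_URL}/validate",
    headers=FIDDLER_HEADERS,
    json={"prompt_spec": byop_prompt_spec},
)
print(validate_response.status_code)
print(json.dumps(validate_response.json(), indent=2))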
Additional Documentation
LLM Observability Overview: Understanding Fiddler's broader LLM monitoring capabilities
Enrichments: Technical details on Fiddler's evaluation infrastructure
API Reference: Complete API documentation