LLM Evaluation - Prompt Specs Quick Start
Private Preview Notice
Prompt Specs are currently in private preview. This means:
API interfaces may change before general availability
Some features are still under active development
We welcome your feedback to help shape the final product
Please refer to our product maturity definitions for more details on policies and participation.
Get your first custom LLM evaluation running in minutes using Prompt Specs with Fiddler's LLM-as-a-Judge solution. This guide walks you through creating, testing, and deploying a custom evaluation using Prompt Specs.
What You'll Build
In this quick start, you'll create a news article topic classifier that:
Takes a news summary as input
Classifies it into one of four categories: World, Sports, Business, or Sci/Tech
Provides reasoning for its classification
Deploys to production monitoring in Fiddler
Prerequisites
Fiddler platform access with Private Preview enabled
Basic familiarity with Python and REST APIs
A Fiddler API token and base URL
Set Up Your Environment
Refer to the Fiddler Python client Installation and Setup Guide for details on the Fiddler Access Token, URL, and client initialization.
import json
import fiddler as fdl
import pandas as pd
import requests
# Replace with your actual values
FIDDLER_TOKEN = "your_token_here"
FIDDLER_BASE_URL = "https://your_company.fiddler.ai"
PROMPT_SPEC_URL = f"{FIDDLER_BASE_URL}/v3/llm-as-a-judge/prompt-spec"
FIDDLER_HEADERS = {
    "Authorization": f"Bearer {FIDDLER_TOKEN}",
    "Content-Type": "application/json",
}
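Before calling any endpoints, it can help to fail fast if the placeholders above are still in place. A minimal sanity check:
# Optional sanity check: fail fast if the placeholders above were not replaced
assert "your_token_here" not in FIDDLER_TOKEN, "Set FIDDLER_TOKEN to your API token"
assert "your_company" not in FIDDLER_BASE_URL, "Set FIDDLER_BASE_URL to your deployment URL"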
Prepare Sample Data
We'll use news article data for this example:
# Load sample news data (using AG News dataset)
df_news = pd.read_parquet(
    "hf://datasets/fancyzhx/ag_news/data/test-00000-of-00001.parquet"
).sample(20, random_state=25)
# Map labels to topic names
df_news["original_topic"] = df_news["label"].map({
0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"
})
# Summarize the count of each unique topic
print(df_news["original_topic"].value_counts())
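Reading the hf:// path requires packages such as huggingface_hub and pyarrow in your environment. To confirm the data loaded as expected, peek at a few rows:
# Confirm the sample loaded as expected
print(df_news[["text", "original_topic"]].head(3))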
Create Your First Prompt Spec
Define a simple evaluation schema:
basic_prompt_spec = {
    "input_fields": {
        "news_summary": {"type": "string"}
    },
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": ["World", "Sports", "Business", "Sci/Tech"]
        },
        "reasoning": {"type": "string"}
    }
}
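Because a Prompt Spec is plain JSON, you can keep it under version control alongside your code. For example:
# Save the spec so it can be versioned and reused
with open("news_topic_spec.json", "w") as f:
    json.dump(basic_prompt_spec, f, indent=2)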
Validate
Validate your Prompt Spec schema:
validate_response = requests.post(
    f"{PROMPT_SPEC_URL}/validate",
    headers=FIDDLER_HEADERS,
    json={"prompt_spec": basic_prompt_spec}
)

if validate_response.status_code == 200:
    print("✅ Schema validation successful!")
else:
    print("❌ Validation failed:", validate_response.text)
Test with Sample Data
def get_prediction(prompt_spec, input_data):
    """Call the predict endpoint and return the judge's structured output."""
    response = requests.post(
        f"{PROMPT_SPEC_URL}/predict",
        headers=FIDDLER_HEADERS,
        json={"prompt_spec": prompt_spec, "input_data": input_data}
    )
    if response.status_code == 200:
        return response.json()["prediction"]
    # Fall back to empty fields so downstream code can handle failures uniformly
    return {"topic": None, "reasoning": None}

# Test with a single example
test_result = get_prediction(
    basic_prompt_spec,
    {"news_summary": "Wimbledon 2025 is under way!"}
)
print(json.dumps(test_result, indent=2))
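For larger test runs, transient HTTP errors (rate limits, gateway timeouts) are worth retrying. A minimal retry wrapper, sketched here as one possible approach rather than an official client feature:
import time

def get_prediction_with_retry(prompt_spec, input_data, retries=3, backoff_s=2.0):
    # Retry transient failures (429/5xx) with linear backoff
    for attempt in range(retries):
        response = requests.post(
            f"{PROMPT_SPEC_URL}/predict",
            headers=FIDDLER_HEADERS,
            json={"prompt_spec": prompt_spec, "input_data": input_data}
        )
        if response.status_code == 200:
            return response.json()["prediction"]
        if response.status_code not in (429, 500, 502, 503, 504):
            break  # non-retryable client error; give up immediately
        time.sleep(backoff_s * (attempt + 1))
    return {"topic": None, "reasoning": None}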
Improve Accuracy With Descriptions
Add a top-level instruction and field descriptions to improve classification accuracy:
enhanced_prompt_spec = {
    "instruction": "Determine the topic of the given news summary.",
    "input_fields": {
        "news_summary": {"type": "string"}
    },
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": ["World", "Sports", "Business", "Sci/Tech"],
            "description": """Use 'Sci/Tech' for technology companies, scientific discoveries, or health/medical research.
Use 'Sports' for sports events or athletes.
Use 'Business' for companies outside of tech/sports.
Use 'World' for global events or issues."""
        },
        "reasoning": {
            "type": "string",
            "description": "Explain why you chose this topic."
        }
    }
}
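It is worth re-running the validation call from the earlier step on the enhanced spec before testing it:
# Re-validate the enhanced spec before testing it
resp = requests.post(
    f"{PROMPT_SPEC_URL}/validate",
    headers=FIDDLER_HEADERS,
    json={"prompt_spec": enhanced_prompt_spec}
)
print("✅ Enhanced spec is valid" if resp.status_code == 200 else f"❌ {resp.text}")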
Evaluate Performance
Test your enhanced Prompt Spec on multiple examples:
# Test on your dataset
results = []
for _, row in df_news.iterrows():
    prediction = get_prediction(
        enhanced_prompt_spec,
        {"news_summary": row["text"]}
    )
    results.append({
        "original": row["original_topic"],
        "predicted": prediction["topic"],
        "reasoning": prediction["reasoning"]
    })

# Calculate accuracy
df_results = pd.DataFrame(results)
accuracy = (df_results["original"] == df_results["predicted"]).mean()
print(f"Accuracy: {accuracy:.1%}")
Deploy to Production Monitoring
Once satisfied with your Prompt Spec, deploy it as a Fiddler enrichment:
import fiddler as fdl

# Initialize Fiddler client
fdl.init(url=FIDDLER_BASE_URL, token=FIDDLER_TOKEN)

# Create project and enrichment
project = fdl.Project.get_or_create(name="llm_evaluation_demo")

enrichment = fdl.Enrichment(
    name="news_topic_classifier",
    enrichment="llm_as_a_judge",
    columns=["news_summary"],
    config={"prompt_spec": enhanced_prompt_spec}
)

# Create model with enrichment
model_spec = fdl.ModelSpec(
    inputs=["news_summary"],
    custom_features=[enrichment]
)

model = fdl.Model.from_data(
    source=df_news.rename(columns={"text": "news_summary"}),
    name="news_classifier",
    project_id=project.id,
    spec=model_spec,
    task=fdl.ModelTask.LLM
)
model.create()

print(f"Model created: {model.name}")
Publish Events and Monitor
Publish your data and start monitoring:
# Publish production events
job = model.publish(df_news.rename(columns={"text": "news_summary"}))
job.wait()

if job.status == "SUCCESS":
    print("✅ Data published successfully!")
    print("🎯 Your evaluation is now running in production monitoring")
What Happens Next
After completing this quick start:
View Results: Check the Fiddler UI to see your model and enrichment results
Monitor Performance: Set up alerts based on classification accuracy or confidence scores
Iterate: Refine your Prompt Spec descriptions to improve accuracy
Scale: Apply the same approach to your own evaluation use cases
Key Takeaways
Fast Setup: From zero to production evaluation in minutes, not weeks
No Manual Prompting: JSON schema approach eliminates prompt engineering bottlenecks
Built-in Monitoring: Seamless integration with Fiddler's observability platform
Easy Iteration: Update schemas without rewriting prompts
Next Steps
Complete Interactive Notebook: Follow along with a full working example
Prompt Specs Guide: Learn more about the underlying framework
❓ Questions? Talk to a product expert or request a demo.
💡 Need help? Contact us at [email protected].