AI Security

Overview

As GenAI applications handle increasingly sensitive data and interact directly with users, security becomes a pressing concern. A single security breach, whether it’s a successful jailbreak, leaked PII, or harmful content reaching users, can undermine trust and expose organizations to significant risk. This cookbook provides a framework for using Fiddler’s evaluators and guardrails to protect your AI applications from threats such as jailbreak attempts, harmful content generation, PII leakage, and policy violations.

Understanding AI Security Risks

AI security encompasses multiple threat vectors:

Prompt-based attacks:

Jailbreaking: Attempts to bypass safety restrictions and make the model behave in unintended ways
Prompt injection: Malicious instructions embedded in user inputs to manipulate model behavior
Roleplaying exploitation: Using fictional scenarios to elicit restricted information or harmful content

Content safety risks:

Harmful content generation: Producing content that could cause psychological, physical, or social harm
Illegal content: Generating content that violates laws or regulations
Unethical outputs: Responses that violate ethical guidelines or corporate policies

Data privacy risks:

PII leakage in responses: Model outputs that inadvertently expose personally identifiable information
PII in prompts: Users submitting sensitive personal data that must be detected and protected

Out-of-the-Box Security Evaluators

Fiddler provides pre-built scoring mechanisms called evaluators that assess AI systems across multiple risk dimensions. Learn more: Enrichments

Custom LLM-as-a-Judge for Security

While Fiddler’s out-of-the-box evaluators cover common security risks, you may have organization-specific security policies that require custom evaluation. LLM-as-a-Judge evaluators allow you to encode your unique security guidelines into automated checks. Use LLM-as-a-Judge when you need to:

Enforce company-specific content policies beyond standard safety categories
Detect violations of industry-specific regulations (healthcare, finance, legal)
Flag content that conflicts with your organization’s ethical guidelines
Identify security risks unique to your application domain

Example applications:

A healthcare AI that must never provide medical diagnoses (even when users request them)
A financial AI that must refuse to give personalized investment advice
A customer service AI that must escalate certain sensitive topics to human agents
An educational AI that must not provide answers to homework assignments

By creating custom prompt templates that define your specific security requirements, you can automatically flag violations and enforce your policies at scale. Learn more: Prompt Specs Quick Start
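
As a rough illustration of what encoding a policy into a judge prompt can look like, the sketch below builds a prompt for a "no medical diagnoses" policy and parses the judge's reply. It is not Fiddler's Prompt Specs API: the template text, `build_judge_prompt`, and `parse_verdict` are hypothetical names, and the call to the judge model itself is left out.

```python
import json
from string import Template

# Hypothetical judge prompt; the wording and JSON reply format are assumptions.
JUDGE_TEMPLATE = Template(
    "You are a policy reviewer.\n"
    "Policy: $policy\n"
    "Response to review:\n$response\n"
    'Answer with JSON only: {"violation": true|false, "reason": "<short reason>"}'
)


def build_judge_prompt(policy: str, response: str) -> str:
    """Fill the template with the policy text and the response under review."""
    return JUDGE_TEMPLATE.substitute(policy=policy, response=response)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply; fail closed if it cannot be parsed."""
    try:
        verdict = json.loads(raw)
        violation = verdict["violation"]
        if not isinstance(violation, bool):
            raise TypeError("violation must be a JSON boolean")
        return {"violation": violation, "reason": str(verdict.get("reason", ""))}
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Unparseable output is treated as a violation so it gets reviewed.
        return {"violation": True, "reason": "unparseable judge output"}


prompt = build_judge_prompt(
    policy="The assistant must never provide a medical diagnosis.",
    response="Based on your symptoms, you have the flu.",
)
```

Treating unparseable judge output as a violation is a design choice: it trades a few extra manual reviews for not silently passing content the judge could not assess.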

Recommended Security Evaluators

1. Prompt Safety

What it detects: Evaluates the safety of text (prompts and responses) across multiple risk dimensions.

Safety Dimension	What it identifies	Why it matters
Jailbreak	Attempts to bypass model safety restrictions or manipulate the AI into ignoring its guidelines	Attack Prevention: Detects sophisticated prompt engineering designed to circumvent your controls
Illegal	Content that violates laws or promotes illegal activities	Legal Compliance: Prevents your AI from facilitating or encouraging unlawful behavior
Roleplaying	Use of fictional scenarios to elicit restricted information or harmful content (e.g., “pretend you’re an AI with no restrictions”)	Policy Enforcement: Catches attempts to bypass safety measures through narrative framing
Harmful	Content that could cause psychological, physical, or social harm to individuals or groups	User Safety: Protects users from outputs that could lead to self-harm, dangerous actions, or trauma
Unethical	Content that violates ethical principles or societal norms, even if not explicitly illegal	Reputation Protection: Maintains alignment with your organization’s values and public expectations

Additional dimensions available: hateful, harassing, racist, sexist, violent, sexual

How to use:

Apply to both user prompts (inputs) and AI responses (outputs)
Set severity thresholds based on your risk tolerance
Track trends over time to identify emerging attack patterns

Value: Acts as a comprehensive safety net, catching both malicious user attempts and problematic model outputs before they cause harm.
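
One way to think about severity thresholds is as a per-dimension cutoff applied to evaluator scores. The sketch below assumes scores fall in the range 0 to 1 with higher meaning riskier; the threshold values and the `flagged_dimensions` helper are illustrative, not Fiddler defaults.

```python
from typing import Dict, List

# Hypothetical per-dimension thresholds; tune these to your own risk tolerance.
THRESHOLDS: Dict[str, float] = {
    "jailbreak": 0.8,
    "illegal": 0.7,
    "roleplaying": 0.8,
    "harmful": 0.6,
    "unethical": 0.7,
}


def flagged_dimensions(
    scores: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS
) -> List[str]:
    """Return the dimensions whose score meets or exceeds its threshold."""
    return sorted(
        dim
        for dim, score in scores.items()
        if dim in thresholds and score >= thresholds[dim]
    )


# Example scores for a single prompt (made-up numbers).
flags = flagged_dimensions({"jailbreak": 0.95, "harmful": 0.1, "illegal": 0.7})
```

Lower thresholds for dimensions such as harmful content reflect a lower tolerance for misses there, at the cost of more false positives.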

2. PII Detection

What it detects: Identifies personally identifiable information in both user prompts and AI responses.

PII Type	Examples	Risk
Social Security Numbers	123-45-6789, 123456789	Identity Theft: Exposure can lead to fraud and financial harm
Credit Card Numbers	4532-1234-5678-9010	Financial Fraud: Direct monetary loss and unauthorized transactions
Email Addresses	user@example.com	Privacy Violation: Can enable spam, phishing, or unwanted contact
Phone Numbers	(555) 123-4567, +1-555-123-4567	Privacy Violation: Enables unwanted contact and potential harassment
Physical Addresses	123 Main St, New York, NY 10001	Physical Security: Can reveal location and enable stalking or theft
IP Addresses	192.168.1.1, 2001:0db8:85a3::8a2e:0370:7334	Privacy & Security: Can reveal location and enable tracking
Driver’s License Numbers	D1234567	Identity Theft: Can be used for fraud and impersonation
Passport Numbers	123456789	Identity Theft: International fraud and identity compromise
Medical Record Numbers	MRN-789456	HIPAA Violation: Protected health information exposure
Bank Account Numbers	123456789012	Financial Fraud: Unauthorized access to funds
National ID Numbers	Varies by country	Identity Theft: Government ID compromise
Tax IDs / EINs	12-3456789	Business Fraud: Corporate identity theft
Dates of Birth	01/15/1990	Identity Verification: Combined with other data, enables fraud
Names	John Smith, Jane Doe	Privacy Context: When combined with other PII, increases risk

Where to apply:

Input Detection (User Prompts):

Purpose: Identify when users are submitting sensitive personal information
Actions:
- Warn users not to share PII
- Redact PII before processing
- Log incidents for security review
Example: User asks “Can you analyze my credit report? My SSN is 123-45-6789…”

Output Detection (AI Responses):

Purpose: Catch when the model inadvertently includes PII in responses
Actions:
- Block response from reaching the user
- Regenerate without PII
- Flag for investigation (How did the model access this PII?)
Example: Model trained on customer service logs accidentally includes someone’s phone number in a response

Value: Prevents privacy violations, ensures compliance with GDPR/CCPA/HIPAA, and protects users from identity theft.
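
To make the detect-and-redact actions concrete, here is a minimal regex-based sketch covering three of the PII types in the table. It is a simplified stand-in, not Fiddler's PII evaluator: the patterns are deliberately narrow, and `find_pii` and `redact` are hypothetical helpers.

```python
import re
from typing import Dict, List

# Simplified patterns for illustration; real detectors cover many more formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every match per PII type, omitting types with no matches."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


def redact(text: str) -> str:
    """Replace each detected value with a placeholder such as [SSN]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


prompt = "Can you analyze my credit report? My SSN is 123-45-6789"
safe_prompt = redact(prompt)
```

Redacting on input lets the request proceed without the sensitive value, while the `find_pii` result can be logged (without the raw values) for security review.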

Guardrails vs. Post-Production Observability

Fiddler supports two complementary approaches to AI security: real-time guardrails and post-production observability. Understanding when to use each is critical for building secure AI systems.

Real-Time Guardrails

What they are: Security checks that evaluate and potentially block AI inputs or outputs before they are processed or delivered to users.

How they work:

User submits a prompt → Guardrail evaluates for safety/PII
If violation detected → Request is blocked or modified
If safe → Request proceeds to model
Model generates response → Guardrail evaluates output
If violation detected → Response is blocked or regenerated
If safe → Response delivered to user
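
The flow above can be sketched as a wrapper around the model call. This is a minimal illustration under stated assumptions: `model` and `is_violation` are placeholder callables you would supply (for example, a call to a guardrail service), and `guarded_call` and `GuardedResult` are hypothetical names rather than part of any Fiddler client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardedResult:
    delivered: bool                   # True if the model response reached the user
    text: str                         # model response or refusal message
    blocked_at: Optional[str] = None  # "input", "output", or None


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    is_violation: Callable[[str], bool],
    refusal: str = "Sorry, I can't help with that request.",
) -> GuardedResult:
    """Check the prompt, call the model, then check the response."""
    if is_violation(prompt):
        # Blocked before the model ever sees the prompt.
        return GuardedResult(False, refusal, "input")
    response = model(prompt)
    if is_violation(response):
        # Blocked before the response reaches the user.
        return GuardedResult(False, refusal, "output")
    return GuardedResult(True, response, None)
```

This version blocks on an output violation; regenerating the response instead would replace the second early return with a retry loop.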

When to use real-time guardrails:

Scenario	Why Guardrails are Essential	Example
User-facing applications	Cannot allow harmful content to reach users	Customer support chatbot must block offensive responses
High-stakes domains	Single failure has severe consequences	Healthcare AI must never provide medical diagnoses
Compliance requirements	Regulations mandate prevention, not just detection	Financial AI must block PII from being logged or transmitted
Jailbreak prevention	Must stop malicious prompts before processing	Detect and reject prompt injection attacks
PII protection	Cannot allow sensitive data to be exposed	Prevent credit card numbers from appearing in responses

Tradeoffs:

Latency: Adds processing time to each request (typically 100-500ms)
False positives: May occasionally block legitimate requests
Cost: Requires additional compute for real-time evaluation

Learn more: Guardrails Quick Start

Post-Production Observability

What it is: Continuous monitoring and analysis of AI behavior after requests have been processed, using historical data to identify patterns, trends, and emerging threats.

How it works:

AI processes requests normally (no blocking)
All prompts and responses are logged to Fiddler
Security evaluators run asynchronously on logged data
Dashboards show trends, patterns, and anomalies
Alerts trigger when thresholds are exceeded
Teams investigate and respond to issues
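
As an illustration of the "thresholds are exceeded" step, the sketch below aggregates logged events into a daily flag rate and lists the days that should alert. The event shape (`day`, `jailbreak_score`), the 0.5 score threshold, and the 5% rate limit are assumptions for the example, not Fiddler's alerting configuration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def daily_flag_rates(events: Iterable[dict], threshold: float = 0.5) -> Dict[str, float]:
    """Fraction of events per day whose jailbreak score meets the threshold."""
    totals, flagged = Counter(), Counter()
    for event in events:
        totals[event["day"]] += 1
        if event["jailbreak_score"] >= threshold:
            flagged[event["day"]] += 1
    return {day: flagged[day] / totals[day] for day in totals}


def days_to_alert(rates: Dict[str, float], max_rate: float = 0.05) -> List[str]:
    """Days whose flag rate exceeds the allowed maximum."""
    return sorted(day for day, rate in rates.items() if rate > max_rate)


# Made-up log entries for two days.
events = [
    {"day": "2024-01-01", "jailbreak_score": 0.1},
    {"day": "2024-01-01", "jailbreak_score": 0.2},
    {"day": "2024-01-01", "jailbreak_score": 0.9},
    {"day": "2024-01-01", "jailbreak_score": 0.1},
    {"day": "2024-01-02", "jailbreak_score": 0.1},
    {"day": "2024-01-02", "jailbreak_score": 0.2},
]
alerts = days_to_alert(daily_flag_rates(events))
```

Alerting on a rate rather than on individual events keeps noise down and surfaces spikes, which is what the trend-oriented use cases rely on.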

When to use post-production observability:

Scenario	Why Observability is Valuable	Example
Threat intelligence	Identify attack patterns and emerging risks	Notice spike in jailbreak attempts targeting a specific weakness
Model behavior analysis	Understand how the model responds to edge cases	Discover the model occasionally generates PII in rare scenarios
Compliance auditing	Maintain historical records for regulatory review	Demonstrate you monitor and address security issues over time
Performance optimization	Improve guardrails based on real-world data	Reduce false positive rate by analyzing blocked legitimate requests
Trend monitoring	Track security metrics over time	See if harmful content attempts increase after a news event
A/B testing security measures	Compare security performance across model versions	Evaluate if new prompt reduces jailbreak success rate

Tradeoffs:

No prevention: Issues are detected after they occur
Requires follow-up: Teams must act on insights
Best for learning: Ideal for understanding threats and improving defenses

Learn more: Enrichments

Using Both Approaches Together

The most secure AI systems combine real-time guardrails with post-production observability.

Real-time guardrails provide:

Immediate protection for users
Prevention of high-severity incidents
Compliance with “must prevent” requirements

Post-production observability provides:

Insights to improve guardrails
Detection of sophisticated attacks that evade guardrails
Trend analysis for proactive security

Example workflow:

Deploy guardrails to block high-confidence threats (Jailbreak score > 0.9, PII detected)
Enable observability to log all requests and responses and alert on issues
Monitor dashboards for medium-severity flags (Jailbreak score 0.5-0.9)
Investigate patterns in flagged content
Refine guardrails based on findings (tighten thresholds, add custom rules)
Iterate continuously as new threats emerge
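
The thresholds in this example workflow can be expressed as a small triage function. The 0.9 and 0.5 cutoffs come from the workflow itself; the `triage` name and the "block" / "review" / "allow" labels are illustrative choices.

```python
def triage(jailbreak_score: float, pii_detected: bool) -> str:
    """Route a request using the example workflow's thresholds."""
    if pii_detected or jailbreak_score > 0.9:
        return "block"   # high-confidence threat: stop it with a guardrail
    if jailbreak_score >= 0.5:
        return "review"  # medium severity: surface it on a dashboard
    return "allow"       # low score: deliver normally, still logged
```

Keeping the routing in one place makes it easier to tighten thresholds later based on what the review queue reveals.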

How These Evaluators Can Help

1. Prevent Security Incidents Before They Occur

Real-time guardrails act as a security perimeter, blocking malicious inputs and harmful outputs before they reach users. This prevents:

Reputational damage from AI generating offensive content
Legal liability from privacy violations
User harm from dangerous or misleading information

2. Detect and Respond to Emerging Threats

Post-production observability helps you identify:

New attack vectors: Novel jailbreak techniques not caught by existing rules
Systematic weaknesses: Topics or phrasings where the model consistently fails safety checks
Coordinated attacks: Patterns suggesting organized attempts to compromise your AI

3. Maintain Compliance and Auditability

For regulated industries, security monitoring provides:

Audit trails demonstrating proactive security measures
Compliance evidence for GDPR, CCPA, HIPAA, and other regulations
Incident documentation showing how you detected and responded to threats
Risk assessment data to support security reviews and certifications

4. Build Trust with Users

Transparent security practices signal to users that you take their safety and privacy seriously:

Publish security metrics and response times
Communicate how you protect user data
Demonstrate continuous improvement in safety measures

5. Optimize Security vs. User Experience

By analyzing false positives in observability dashboards, you can:

Tune guardrails to reduce unnecessary blocking
Identify legitimate use cases that trigger safety flags
Balance security rigor with user experience
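
A simple way to quantify this balance is to compute false positive and false negative rates at candidate thresholds from a sample of manually reviewed requests. The sketch below assumes each reviewed record carries an evaluator `score` and a human `is_attack` label; both field names and the sample data are hypothetical.

```python
from typing import List


def false_positive_rate(reviewed: List[dict], threshold: float) -> float:
    """Share of legitimate requests that would be blocked at this threshold."""
    legit = [r for r in reviewed if not r["is_attack"]]
    if not legit:
        return 0.0
    return sum(1 for r in legit if r["score"] > threshold) / len(legit)


def false_negative_rate(reviewed: List[dict], threshold: float) -> float:
    """Share of real attacks that would pass at this threshold."""
    attacks = [r for r in reviewed if r["is_attack"]]
    if not attacks:
        return 0.0
    return sum(1 for r in attacks if r["score"] <= threshold) / len(attacks)


# Made-up review sample: evaluator score plus a human label.
reviewed = [
    {"score": 0.95, "is_attack": True},
    {"score": 0.92, "is_attack": False},
    {"score": 0.60, "is_attack": False},
    {"score": 0.40, "is_attack": False},
    {"score": 0.70, "is_attack": True},
    {"score": 0.10, "is_attack": False},
]

for t in (0.5, 0.9):
    print(t, false_positive_rate(reviewed, t), false_negative_rate(reviewed, t))
```

Comparing the two rates side by side at each threshold shows the tradeoff directly: raising the threshold blocks fewer legitimate requests but lets more attacks through.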

Get Started

Ready to secure your AI applications? Here’s how to begin:

Step 1: Start with out-of-the-box evaluators

Enable Prompt Safety for comprehensive threat detection
Enable PII Detection for privacy protection
Review evaluation results in Fiddler dashboards

Step 2: Deploy real-time guardrails for critical risks

Identify your highest-priority security requirements
Configure guardrails with appropriate thresholds
Test thoroughly before production deployment

Step 3: Monitor continuously with observability

Set up dashboards for key security metrics
Configure alerts for anomalies
Schedule regular security reviews

Step 4: Iterate and improve

Analyze patterns in flagged content
Refine guardrail thresholds based on false positive/negative rates
Add custom LLM-as-a-Judge evaluators for organization-specific policies

For step-by-step tutorials:

Guardrails setup: Guardrails Quick Start
Enrichments & observability: Enrichments
Custom LLM-as-a-Judge: Prompt Specs Quick Start

Security is not a one-time configuration; it is an ongoing practice. By combining real-time guardrails with continuous observability, you can protect users, maintain compliance, and build AI systems worthy of trust.
