Overview
As GenAI applications handle increasingly sensitive data and interact directly with users, security becomes a pressing concern. A single security breach, whether it’s a successful jailbreak attempt, leaked PII, or harmful content reaching users, can undermine trust and expose organizations to significant risk. This cookbook provides a framework for using Fiddler’s evaluators and guardrails to protect your AI applications from threats, including jailbreak attempts, harmful content generation, PII leakage, and policy violations.
Understanding AI Security Risks
AI security encompasses multiple threat vectors:
Prompt-based attacks:
- Jailbreaking: Attempts to bypass safety restrictions and make the model behave in unintended ways
- Prompt injection: Malicious instructions embedded in user inputs to manipulate model behavior
- Roleplaying exploitation: Using fictional scenarios to elicit restricted information or harmful content
Content risks:
- Harmful content generation: Producing content that could cause psychological, physical, or social harm
- Illegal content: Generating content that violates laws or regulations
- Unethical outputs: Responses that violate ethical guidelines or corporate policies
Privacy risks:
- PII leakage in responses: Model outputs that inadvertently expose personally identifiable information
- PII in prompts: Users submitting sensitive personal data that must be detected and protected
Out-of-the-Box Security Evaluators
Fiddler provides pre-built scoring mechanisms called evaluators that assess AI systems across multiple risk dimensions. Learn more: Enrichments
Custom LLM-as-a-Judge for Security
While Fiddler’s out-of-the-box evaluators cover common security risks, you may have organization-specific security policies that require custom evaluation. LLM-as-a-Judge evaluators allow you to encode your unique security guidelines into automated checks.
Use LLM-as-a-Judge when you need to:
- Enforce company-specific content policies beyond standard safety categories
- Detect violations of industry-specific regulations (healthcare, finance, legal)
- Flag content that conflicts with your organization’s ethical guidelines
- Identify security risks unique to your application domain
Examples:
- A healthcare AI that must never provide medical diagnoses (even when users request them)
- A financial AI that must refuse to give personalized investment advice
- A customer service AI that must escalate certain sensitive topics to human agents
- An educational AI that must not provide answers to homework assignments
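To make the idea of encoding a policy into an automated check concrete, here is a minimal sketch of an LLM-as-a-Judge wrapper. It is not Fiddler's API: `POLICY`, `build_judge_prompt`, `judge`, and the injected `call_llm` callable are all hypothetical names, and the JSON verdict format is an assumption made for illustration.

```python
import json
from typing import Callable

# Hypothetical policy text; replace with your organization's own wording.
POLICY = "The assistant must never provide a medical diagnosis, even when the user asks for one."


def build_judge_prompt(policy: str, response: str) -> str:
    """Build the instruction sent to the judge model."""
    return (
        "You are a policy reviewer.\n"
        f"Policy: {policy}\n"
        f"Response to review: {response}\n"
        'Answer with JSON only: {"violation": true or false, "reason": "<short reason>"}'
    )


def judge(response: str, call_llm: Callable[[str], str], policy: str = POLICY) -> dict:
    """Ask the judge model for a verdict; treat unparseable output as a violation."""
    raw = call_llm(build_judge_prompt(policy, response))
    try:
        verdict = json.loads(raw)
        return {
            "violation": bool(verdict["violation"]),
            "reason": str(verdict.get("reason", "")),
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fail closed: an unreadable verdict is flagged for human review.
        return {"violation": True, "reason": "unparseable judge output"}
```

Passing the model call in as `call_llm` keeps the sketch independent of any particular provider and makes it easy to test with a stub. Failing closed on unparseable output is one possible design choice; a lower-risk application might prefer to log and allow instead.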
Recommended Security Evaluators
1. Prompt Safety
What it detects: Evaluates the safety of text (prompts and responses) across multiple risk dimensions.

| Safety Dimension | What it identifies | Why it matters |
|---|---|---|
| Jailbreak | Attempts to bypass model safety restrictions or manipulate the AI into ignoring its guidelines | Attack Prevention: Detects sophisticated prompt engineering designed to circumvent your controls |
| Illegal | Content that violates laws or promotes illegal activities | Legal Compliance: Prevents your AI from facilitating or encouraging unlawful behavior |
| Roleplaying | Use of fictional scenarios to elicit restricted information or harmful content (e.g., “pretend you’re an AI with no restrictions”) | Policy Enforcement: Catches attempts to bypass safety measures through narrative framing |
| Harmful | Content that could cause psychological, physical, or social harm to individuals or groups | User Safety: Protects users from outputs that could lead to self-harm, dangerous actions, or trauma |
| Unethical | Content that violates ethical principles or societal norms, even if not explicitly illegal | Reputation Protection: Maintains alignment with your organization’s values and public expectations |
- Apply to both user prompts (inputs) and AI responses (outputs)
- Set severity thresholds based on your risk tolerance
- Track trends over time to identify emerging attack patterns
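As an illustration of per-dimension severity thresholds, the sketch below compares a dictionary of scores against configurable limits. The score format (one float in [0, 1] per dimension), the threshold values, and the names `THRESHOLDS` and `flagged_dimensions` are assumptions for this example; the output shape of the actual evaluator may differ.

```python
from typing import Dict, List

# Hypothetical thresholds in [0, 1]; lower values are stricter.
THRESHOLDS: Dict[str, float] = {
    "jailbreak": 0.5,
    "illegal": 0.5,
    "roleplaying": 0.7,
    "harmful": 0.4,
    "unethical": 0.6,
}


def flagged_dimensions(
    scores: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS
) -> List[str]:
    """Return, in alphabetical order, the dimensions whose score meets or exceeds its threshold."""
    return sorted(
        dim
        for dim, score in scores.items()
        if dim in thresholds and score >= thresholds[dim]
    )
```

The same function can be applied to a prompt's scores and to a response's scores, with a separate threshold table for each if inputs and outputs warrant different risk tolerances.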
2. PII Detection
What it detects: Identifies personally identifiable information in both user prompts and AI responses.

| PII Type | Examples | Risk |
|---|---|---|
| Social Security Numbers | 123-45-6789, 123456789 | Identity Theft: Exposure can lead to fraud and financial harm |
| Credit Card Numbers | 4532-1234-5678-9010 | Financial Fraud: Direct monetary loss and unauthorized transactions |
| Email Addresses | user@example.com | Privacy Violation: Can enable spam, phishing, or unwanted contact |
| Phone Numbers | (555) 123-4567, +1-555-123-4567 | Privacy Violation: Enables unwanted contact and potential harassment |
| Physical Addresses | 123 Main St, New York, NY 10001 | Physical Security: Can reveal location and enable stalking or theft |
| IP Addresses | 192.168.1.1, 2001:0db8:85a3::8a2e:0370:7334 | Privacy & Security: Can reveal location and enable tracking |
| Driver’s License Numbers | D1234567 | Identity Theft: Can be used for fraud and impersonation |
| Passport Numbers | 123456789 | Identity Theft: International fraud and identity compromise |
| Medical Record Numbers | MRN-789456 | HIPAA Violation: Protected health information exposure |
| Bank Account Numbers | 123456789012 | Financial Fraud: Unauthorized access to funds |
| National ID Numbers | Varies by country | Identity Theft: Government ID compromise |
| Tax IDs / EINs | 12-3456789 | Business Fraud: Corporate identity theft |
| Dates of Birth | 01/15/1990 | Identity Verification: Combined with other data, enables fraud |
| Names | John Smith, Jane Doe | Privacy Context: When combined with other PII, increases risk |
PII in prompts:
- Purpose: Identify when users are submitting sensitive personal information
- Actions:
  - Warn users not to share PII
  - Redact PII before processing
  - Log incidents for security review
- Example: User asks “Can you analyze my credit report? My SSN is 123-45-6789…”
PII in responses:
- Purpose: Catch when the model inadvertently includes PII in responses
- Actions:
  - Block response from reaching the user
  - Regenerate without PII
  - Flag for investigation (How did the model access this PII?)
- Example: Model trained on customer service logs accidentally includes someone’s phone number in a response
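The "redact PII before processing" action can be sketched with a few regular expressions. This is a simplified stand-in, not the detector Fiddler uses: the three patterns below cover only dashed SSNs, 16-digit card numbers, and email addresses, and the names `PATTERNS` and `redact` are hypothetical. A production detector needs far broader coverage and validation.

```python
import re
from typing import List, Tuple

# Deliberately narrow illustrative patterns; real PII detection is much broader.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def redact(text: str) -> Tuple[str, List[str]]:
    """Replace each match with a [LABEL] placeholder and report which types were found."""
    found: List[str] = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found
```

The list of detected types can feed the "log incidents for security review" action without storing the sensitive values themselves.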
Guardrails vs. Post-Production Observability
Fiddler supports two complementary approaches to AI security: real-time guardrails and post-production observability. Understanding when to use each is critical for building secure AI systems.
Real-Time Guardrails
What they are: Security checks that evaluate and potentially block AI inputs or outputs before they are processed or delivered to users.
How they work:
- User submits a prompt → Guardrail evaluates for safety/PII
- If violation detected → Request is blocked or modified
- If safe → Request proceeds to model
- Model generates response → Guardrail evaluates output
- If violation detected → Response is blocked or regenerated
- If safe → Response delivered to user
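The flow above can be sketched as a small wrapper around a model call. Everything here is hypothetical scaffolding rather than Fiddler's guardrail API: `guarded_call`, `BLOCKED_MESSAGE`, and the injected `model` and `is_violation` callables are assumed names, and `is_violation` stands in for whatever safety or PII check you use.

```python
from typing import Callable

BLOCKED_MESSAGE = "Sorry, I can't help with that request."


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    is_violation: Callable[[str], bool],
    max_retries: int = 1,
) -> str:
    """Check the prompt, call the model, then check the response."""
    if is_violation(prompt):
        return BLOCKED_MESSAGE  # input blocked before reaching the model
    for _ in range(max_retries + 1):
        response = model(prompt)
        if not is_violation(response):
            return response  # safe output is delivered
    return BLOCKED_MESSAGE  # every attempt was flagged, so block
```

With `max_retries=1` the wrapper regenerates once before giving up; setting it to `0` blocks on the first flagged response.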
| Scenario | Why Guardrails are Essential | Example |
|---|---|---|
| User-facing applications | Cannot allow harmful content to reach users | Customer support chatbot must block offensive responses |
| High-stakes domains | Single failure has severe consequences | Healthcare AI must never provide medical diagnoses |
| Compliance requirements | Regulations mandate prevention, not just detection | Financial AI must block PII from being logged or transmitted |
| Jailbreak prevention | Must stop malicious prompts before processing | Detect and reject prompt injection attacks |
| PII protection | Cannot allow sensitive data to be exposed | Prevent credit card numbers from appearing in responses |
- Latency: Adds processing time to each request (typically 100-500ms)
- False positives: May occasionally block legitimate requests
- Cost: Requires additional compute for real-time evaluation
Post-Production Observability
What it is: Continuous monitoring and analysis of AI behavior after requests have been processed, using historical data to identify patterns, trends, and emerging threats.
How it works:
- AI processes requests normally (no blocking)
- All prompts and responses are logged to Fiddler
- Security evaluators run asynchronously on logged data
- Dashboards show trends, patterns, and anomalies
- Alerts trigger when thresholds are exceeded
- Teams investigate and respond to issues
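The threshold-based alerting step can be illustrated with a few lines of Python. The `LoggedEvent` record, the `flagged_rate` and `should_alert` helpers, and the 5% default threshold are all assumptions for this sketch; in practice the platform's dashboards and alert rules perform this role.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LoggedEvent:
    prompt: str
    response: str
    flagged: bool  # result of an asynchronous security evaluation


def flagged_rate(events: Iterable[LoggedEvent]) -> float:
    """Fraction of logged events that a security evaluator flagged."""
    events = list(events)
    if not events:
        return 0.0
    return sum(e.flagged for e in events) / len(events)


def should_alert(events: Iterable[LoggedEvent], threshold: float = 0.05) -> bool:
    """Alert when the flagged rate in this window exceeds the threshold."""
    return flagged_rate(events) > threshold
```

Running `should_alert` over a sliding time window (for example, the last hour of traffic) turns the raw log into a signal a team can act on.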
| Scenario | Why Observability is Valuable | Example |
|---|---|---|
| Threat intelligence | Identify attack patterns and emerging risks | Notice spike in jailbreak attempts targeting a specific weakness |
| Model behavior analysis | Understand how the model responds to edge cases | Discover the model occasionally generates PII in rare scenarios |
| Compliance auditing | Maintain historical records for regulatory review | Demonstrate you monitor and address security issues over time |
| Performance optimization | Improve guardrails based on real-world data | Reduce false positive rate by analyzing blocked legitimate requests |
| Trend monitoring | Track security metrics over time | See if harmful content attempts increase after a news event |
| A/B testing security measures | Compare security performance across model versions | Evaluate if new prompt reduces jailbreak success rate |
- No prevention: Issues are detected after they occur
- Requires follow-up: Teams must act on insights
- Best for learning: Ideal for understanding threats and improving defenses
Using Both Approaches Together
The most secure AI systems combine real-time guardrails with post-production observability:
Real-time guardrails provide:
- Immediate protection for users
- Prevention of high-severity incidents
- Compliance with “must prevent” requirements
Post-production observability provides:
- Insights to improve guardrails
- Detection of sophisticated attacks that evade guardrails
- Trend analysis for proactive security
How These Evaluators Can Help
1. Prevent Security Incidents Before They Occur
Real-time guardrails act as a security perimeter, blocking malicious inputs and harmful outputs before they reach users. This prevents:
- Reputational damage from AI generating offensive content
- Legal liability from privacy violations
- User harm from dangerous or misleading information
2. Detect and Respond to Emerging Threats
Post-production observability helps you identify:
- New attack vectors: Novel jailbreak techniques not caught by existing rules
- Systematic weaknesses: Topics or phrasings where the model consistently fails safety checks
- Coordinated attacks: Patterns suggesting organized attempts to compromise your AI
3. Maintain Compliance and Auditability
For regulated industries, security monitoring provides:
- Audit trails demonstrating proactive security measures
- Compliance evidence for GDPR, CCPA, HIPAA, and other regulations
- Incident documentation showing how you detected and responded to threats
- Risk assessment data to support security reviews and certifications
4. Build Trust with Users
Transparent security practices signal to users that you take their safety and privacy seriously:
- Publish security metrics and response times
- Communicate how you protect user data
- Demonstrate continuous improvement in safety measures
5. Optimize Security vs. User Experience
By analyzing false positives in observability dashboards, you can:
- Tune guardrails to reduce unnecessary blocking
- Identify legitimate use cases that trigger safety flags
- Balance security rigor with user experience
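One way to reason about this trade-off is to replay human-reviewed samples against candidate thresholds. The sketch below assumes each sample pairs an evaluator score in [0, 1] with a reviewer's verdict; the `Sample` type and both helper functions are hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Each sample: (evaluator score, True if a human reviewer confirmed a real violation)
Sample = Tuple[float, bool]


def false_positive_rate(samples: List[Sample], threshold: float) -> float:
    """Share of legitimate requests that would be blocked at this threshold."""
    legit = [score for score, is_violation in samples if not is_violation]
    if not legit:
        return 0.0
    return sum(score >= threshold for score in legit) / len(legit)


def missed_violation_rate(samples: List[Sample], threshold: float) -> float:
    """Share of confirmed violations that would slip through at this threshold."""
    bad = [score for score, is_violation in samples if is_violation]
    if not bad:
        return 0.0
    return sum(score < threshold for score in bad) / len(bad)
```

Comparing both rates across several thresholds shows how much unnecessary blocking you can remove before confirmed violations start slipping through.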
Get Started
Ready to secure your AI applications? Here’s how to begin:
Step 1: Start with out-of-the-box evaluators
- Enable Prompt Safety for comprehensive threat detection
- Enable PII Detection for privacy protection
- Review evaluation results in Fiddler dashboards
Step 2: Deploy real-time guardrails for critical risks
- Identify your highest-priority security requirements
- Configure guardrails with appropriate thresholds
- Test thoroughly before production deployment
Step 3: Monitor continuously with observability
- Set up dashboards for key security metrics
- Configure alerts for anomalies
- Schedule regular security reviews
- Guardrails setup: Guardrails Quick Start
- Enrichments & observability: Enrichments
- Custom LLM-as-a-Judge: Prompt Specs Quick Start