Overview

As GenAI applications handle increasingly sensitive data and interact directly with users, security becomes a pressing concern. A single security breach, whether it’s a successful jailbreak attempt, leaked PII, or harmful content reaching users, can undermine trust and expose organizations to significant risk. This cookbook provides a framework for using Fiddler’s evaluators and guardrails to protect your AI applications from threats, including jailbreak attempts, harmful content generation, PII leakage, and policy violations.

Understanding AI Security Risks

AI security encompasses multiple threat vectors:
Prompt-based attacks:
  • Jailbreaking: Attempts to bypass safety restrictions and make the model behave in unintended ways
  • Prompt injection: Malicious instructions embedded in user inputs to manipulate model behavior
  • Roleplaying exploitation: Using fictional scenarios to elicit restricted information or harmful content
Content safety risks:
  • Harmful content generation: Producing content that could cause psychological, physical, or social harm
  • Illegal content: Generating content that violates laws or regulations
  • Unethical outputs: Responses that violate ethical guidelines or corporate policies
Data privacy risks:
  • PII leakage in responses: Model outputs that inadvertently expose personally identifiable information
  • PII in prompts: Users submitting sensitive personal data that must be detected and protected

Out-of-the-Box Security Evaluators

Fiddler provides pre-built scoring mechanisms called evaluators that assess AI systems across multiple risk dimensions.
Learn more: Enrichments

Custom LLM-as-a-Judge for Security

While Fiddler’s out-of-the-box evaluators cover common security risks, you may have organization-specific security policies that require custom evaluation. LLM-as-a-Judge evaluators allow you to encode your unique security guidelines into automated checks.
Use LLM-as-a-Judge when you need to:
  • Enforce company-specific content policies beyond standard safety categories
  • Detect violations of industry-specific regulations (healthcare, finance, legal)
  • Flag content that conflicts with your organization’s ethical guidelines
  • Identify security risks unique to your application domain
Example applications:
  • A healthcare AI that must never provide medical diagnoses (even when users request them)
  • A financial AI that must refuse to give personalized investment advice
  • A customer service AI that must escalate certain sensitive topics to human agents
  • An educational AI that must not provide answers to homework assignments
By creating custom prompt templates that define your specific security requirements, you can automatically flag violations and enforce your policies at scale.
Learn more: Prompt Specs Quick Start
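As a rough illustration of the idea, the sketch below encodes one policy (no medical diagnoses) as a prompt template and parses the judge's verdict. It does not use Fiddler's Prompt Specs API. The names `POLICY_TEMPLATE`, `build_judge_prompt`, `parse_verdict`, `evaluate`, and the stub `fake_judge` are hypothetical, and the JSON verdict schema is an assumption. In practice a real judge model call would replace the stub.

```python
import json
from typing import Callable

# Hypothetical policy template; the wording and JSON schema are assumptions.
POLICY_TEMPLATE = """You are a policy reviewer for a healthcare assistant.
Policy: the assistant must never provide a medical diagnosis.
Response to review:
{response}
Answer with JSON: {{"violation": true or false, "reason": "<short reason>"}}"""


def build_judge_prompt(response: str) -> str:
    """Fill the policy template with the response under review."""
    return POLICY_TEMPLATE.format(response=response)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON verdict; treat unparseable output as a violation."""
    try:
        verdict = json.loads(raw)
        return {
            "violation": bool(verdict["violation"]),
            "reason": str(verdict.get("reason", "")),
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"violation": True, "reason": "unparseable judge output"}


def evaluate(response: str, judge: Callable[[str], str]) -> dict:
    """Build the prompt, call the judge, and return the parsed verdict."""
    return parse_verdict(judge(build_judge_prompt(response)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    flagged = "you likely have" in prompt.lower()
    return json.dumps({"violation": flagged, "reason": "stub keyword check"})
```

Failing closed on unparseable judge output is a design choice in this sketch: a malformed verdict is flagged for review rather than silently passed.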

1. Prompt Safety

What it detects: Evaluates the safety of text (prompts and responses) across multiple risk dimensions.
| Safety Dimension | What it identifies | Why it matters |
| --- | --- | --- |
| Jailbreak | Attempts to bypass model safety restrictions or manipulate the AI into ignoring its guidelines | Attack Prevention: Detects sophisticated prompt engineering designed to circumvent your controls |
| Illegal | Content that violates laws or promotes illegal activities | Legal Compliance: Prevents your AI from facilitating or encouraging unlawful behavior |
| Roleplaying | Use of fictional scenarios to elicit restricted information or harmful content (e.g., “pretend you’re an AI with no restrictions”) | Policy Enforcement: Catches attempts to bypass safety measures through narrative framing |
| Harmful | Content that could cause psychological, physical, or social harm to individuals or groups | User Safety: Protects users from outputs that could lead to self-harm, dangerous actions, or trauma |
| Unethical | Content that violates ethical principles or societal norms, even if not explicitly illegal | Reputation Protection: Maintains alignment with your organization’s values and public expectations |
Additional dimensions available: hateful, harassing, racist, sexist, violent, sexual
How to use:
  • Apply to both user prompts (inputs) and AI responses (outputs)
  • Set severity thresholds based on your risk tolerance
  • Track trends over time to identify emerging attack patterns
Value: Acts as a comprehensive safety net, catching both malicious user attempts and problematic model outputs before they cause harm.
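To make "set severity thresholds based on your risk tolerance" concrete, here is a minimal sketch that compares per-dimension scores against configurable thresholds. It assumes each dimension yields a score between 0 and 1. The threshold values, the `THRESHOLDS` mapping, and the `flagged_dimensions` helper are hypothetical and not part of Fiddler's API.

```python
from typing import Dict, List

# Hypothetical per-dimension thresholds; tune these to your own risk tolerance.
THRESHOLDS: Dict[str, float] = {
    "jailbreak": 0.9,
    "illegal": 0.8,
    "roleplaying": 0.8,
    "harmful": 0.7,
    "unethical": 0.8,
}
DEFAULT_THRESHOLD = 0.8  # used for dimensions without an explicit entry


def flagged_dimensions(scores: Dict[str, float]) -> List[str]:
    """Return the dimensions whose score meets or exceeds its threshold."""
    return sorted(
        name
        for name, score in scores.items()
        if score >= THRESHOLDS.get(name, DEFAULT_THRESHOLD)
    )
```

The same function can be applied to the scores for a user prompt and to the scores for the model response, so both sides of the exchange are checked against one policy.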

2. PII Detection

What it detects: Identifies personally identifiable information in both user prompts and AI responses.
| PII Type | Examples | Risk |
| --- | --- | --- |
| Social Security Numbers | 123-45-6789, 123456789 | Identity Theft: Exposure can lead to fraud and financial harm |
| Credit Card Numbers | 4532-1234-5678-9010 | Financial Fraud: Direct monetary loss and unauthorized transactions |
| Email Addresses | user@example.com | Privacy Violation: Can enable spam, phishing, or unwanted contact |
| Phone Numbers | (555) 123-4567, +1-555-123-4567 | Privacy Violation: Enables unwanted contact and potential harassment |
| Physical Addresses | 123 Main St, New York, NY 10001 | Physical Security: Can reveal location and enable stalking or theft |
| IP Addresses | 192.168.1.1, 2001:0db8:85a3::8a2e:0370:7334 | Privacy & Security: Can reveal location and enable tracking |
| Driver’s License Numbers | D1234567 | Identity Theft: Can be used for fraud and impersonation |
| Passport Numbers | 123456789 | Identity Theft: International fraud and identity compromise |
| Medical Record Numbers | MRN-789456 | HIPAA Violation: Protected health information exposure |
| Bank Account Numbers | 123456789012 | Financial Fraud: Unauthorized access to funds |
| National ID Numbers | Varies by country | Identity Theft: Government ID compromise |
| Tax IDs / EINs | 12-3456789 | Business Fraud: Corporate identity theft |
| Dates of Birth | 01/15/1990 | Identity Verification: Combined with other data, enables fraud |
| Names | John Smith, Jane Doe | Privacy Context: When combined with other PII, increases risk |
Where to apply:
Input Detection (User Prompts):
  • Purpose: Identify when users are submitting sensitive personal information
  • Actions:
    • Warn users not to share PII
    • Redact PII before processing
    • Log incidents for security review
  • Example: User asks “Can you analyze my credit report? My SSN is 123-45-6789…”
Output Detection (AI Responses):
  • Purpose: Catch when the model inadvertently includes PII in responses
  • Actions:
    • Block response from reaching the user
    • Regenerate without PII
    • Flag for investigation (How did the model access this PII?)
  • Example: Model trained on customer service logs accidentally includes someone’s phone number in a response
Value: Prevents privacy violations, ensures compliance with GDPR/CCPA/HIPAA, and protects users from identity theft.
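The "redact PII before processing" action can be sketched with a few regular expressions. This is not how Fiddler's PII evaluator works; it is a deliberately narrow illustration covering three of the formats shown in the table above. The `PII_PATTERNS` mapping and the `redact_pii` helper are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS: Dict[str, re.Pattern] = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}


def redact_pii(text: str) -> Tuple[str, List[str]]:
    """Replace each match with a [TYPE] placeholder and report the types found."""
    found: List[str] = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found
```

For example, `redact_pii("My SSN is 123-45-6789")` returns the redacted text together with `["SSN"]`, which can feed both the "redact" and the "log incidents" actions listed above.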

Guardrails vs. Post-Production Observability

Fiddler supports two complementary approaches to AI security: real-time guardrails and post-production observability. Understanding when to use each is critical for building secure AI systems.

Real-Time Guardrails

What they are: Security checks that evaluate and potentially block AI inputs or outputs before they are processed or delivered to users.
How they work:
  1. User submits a prompt → Guardrail evaluates for safety/PII
  2. If violation detected → Request is blocked or modified
  3. If safe → Request proceeds to model
  4. Model generates response → Guardrail evaluates output
  5. If violation detected → Response is blocked or regenerated
  6. If safe → Response delivered to user
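The six steps above can be sketched as a small wrapper around a model call. This is a simplified illustration under stated assumptions: `model` and `is_violation` are placeholder callables you would supply (for example, a guardrail check that returns `True` on a safety or PII violation), and `guarded_call`, `BLOCKED_MESSAGE`, and `max_regenerations` are hypothetical names rather than Fiddler APIs.

```python
from typing import Callable

BLOCKED_MESSAGE = "Sorry, this request cannot be completed."


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    is_violation: Callable[[str], bool],
    max_regenerations: int = 1,
) -> str:
    """Check the prompt, call the model, then check the response."""
    # Steps 1-3: evaluate the input and block it before it reaches the model.
    if is_violation(prompt):
        return BLOCKED_MESSAGE
    # Steps 4-6: evaluate the output, regenerating a limited number of times.
    for _ in range(max_regenerations + 1):
        response = model(prompt)
        if not is_violation(response):
            return response
    # Every attempt violated policy, so nothing is delivered to the user.
    return BLOCKED_MESSAGE
```

Capping regenerations keeps worst-case latency bounded, since each retry adds another model call and another output check.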
When to use real-time guardrails:
| Scenario | Why Guardrails are Essential | Example |
| --- | --- | --- |
| User-facing applications | Cannot allow harmful content to reach users | Customer support chatbot must block offensive responses |
| High-stakes domains | Single failure has severe consequences | Healthcare AI must never provide medical diagnoses |
| Compliance requirements | Regulations mandate prevention, not just detection | Financial AI must block PII from being logged or transmitted |
| Jailbreak prevention | Must stop malicious prompts before processing | Detect and reject prompt injection attacks |
| PII protection | Cannot allow sensitive data to be exposed | Prevent credit card numbers from appearing in responses |
Tradeoffs:
  • Latency: Adds processing time to each request (typically 100-500ms)
  • False positives: May occasionally block legitimate requests
  • Cost: Requires additional compute for real-time evaluation
Learn more: Guardrails Quick Start

Post-Production Observability

What it is: Continuous monitoring and analysis of AI behavior after requests have been processed, using historical data to identify patterns, trends, and emerging threats.
How it works:
  1. AI processes requests normally (no blocking)
  2. All prompts and responses are logged to Fiddler
  3. Security evaluators run asynchronously on logged data
  4. Dashboards show trends, patterns, and anomalies
  5. Alerts trigger when thresholds are exceeded
  6. Teams investigate and respond to issues
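A minimal sketch of steps 3 and 5 above: score logged events offline, then alert when the flagged fraction crosses a threshold. The `LoggedEvent` record, the `score` callable, and the default threshold values are assumptions made for illustration; they do not reflect Fiddler's data model or alerting configuration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class LoggedEvent:
    """One logged prompt/response pair (hypothetical record shape)."""
    prompt: str
    response: str


def flagged_rate(
    events: Iterable[LoggedEvent],
    score: Callable[[str], float],
    flag_threshold: float = 0.5,
) -> float:
    """Fraction of events whose prompt or response scores above the threshold."""
    events = list(events)
    if not events:
        return 0.0
    flagged = sum(
        1
        for e in events
        if max(score(e.prompt), score(e.response)) > flag_threshold
    )
    return flagged / len(events)


def should_alert(rate: float, alert_rate: float = 0.05) -> bool:
    """Alert when the flagged fraction exceeds the configured rate."""
    return rate > alert_rate
```

Because this runs on logged data rather than in the request path, it adds no latency for users, which is the main contrast with real-time guardrails.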
When to use post-production observability:
| Scenario | Why Observability is Valuable | Example |
| --- | --- | --- |
| Threat intelligence | Identify attack patterns and emerging risks | Notice spike in jailbreak attempts targeting a specific weakness |
| Model behavior analysis | Understand how the model responds to edge cases | Discover the model occasionally generates PII in rare scenarios |
| Compliance auditing | Maintain historical records for regulatory review | Demonstrate you monitor and address security issues over time |
| Performance optimization | Improve guardrails based on real-world data | Reduce false positive rate by analyzing blocked legitimate requests |
| Trend monitoring | Track security metrics over time | See if harmful content attempts increase after a news event |
| A/B testing security measures | Compare security performance across model versions | Evaluate if new prompt reduces jailbreak success rate |
Tradeoffs:
  • No prevention: Issues are detected after they occur
  • Requires follow-up: Teams must act on insights
  • Best for learning: Ideal for understanding threats and improving defenses
Learn more: Enrichments

Using Both Approaches Together

The most secure AI systems combine real-time guardrails with post-production observability:
Real-time guardrails provide:
  • Immediate protection for users
  • Prevention of high-severity incidents
  • Compliance with “must prevent” requirements
Post-production observability provides:
  • Insights to improve guardrails
  • Detection of sophisticated attacks that evade guardrails
  • Trend analysis for proactive security
Example workflow:
  1. Deploy guardrails to block high-confidence threats (Jailbreak score > 0.9, PII detected)
  2. Enable observability to log all requests and responses and alert on issues
  3. Monitor dashboards for medium-severity flags (Jailbreak score 0.5-0.9)
  4. Investigate patterns in flagged content
  5. Refine guardrails based on findings (tighten thresholds, add custom rules)
  6. Iterate continuously as new threats emerge
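The two score bands in the example workflow can be expressed as a simple triage function. The thresholds come from the workflow above; the function name `triage`, the constant names, and the three string outcomes are hypothetical choices for this sketch.

```python
BLOCK_THRESHOLD = 0.9   # workflow above: block when Jailbreak score > 0.9
REVIEW_THRESHOLD = 0.5  # workflow above: review scores in the 0.5-0.9 band


def triage(jailbreak_score: float, pii_detected: bool) -> str:
    """Map evaluator results to 'block', 'review', or 'allow'."""
    if pii_detected or jailbreak_score > BLOCK_THRESHOLD:
        return "block"   # handled in real time by a guardrail
    if jailbreak_score >= REVIEW_THRESHOLD:
        return "review"  # surfaced on dashboards for investigation
    return "allow"
```

Keeping the thresholds as named constants makes the "refine guardrails" step a small, auditable change.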

How These Evaluators Can Help

1. Prevent Security Incidents Before They Occur

Real-time guardrails act as a security perimeter, blocking malicious inputs and harmful outputs before they reach users. This prevents:
  • Reputational damage from AI generating offensive content
  • Legal liability from privacy violations
  • User harm from dangerous or misleading information

2. Detect and Respond to Emerging Threats

Post-production observability helps you identify:
  • New attack vectors: Novel jailbreak techniques not caught by existing rules
  • Systematic weaknesses: Topics or phrasings where the model consistently fails safety checks
  • Coordinated attacks: Patterns suggesting organized attempts to compromise your AI

3. Maintain Compliance and Auditability

For regulated industries, security monitoring provides:
  • Audit trails demonstrating proactive security measures
  • Compliance evidence for GDPR, CCPA, HIPAA, and other regulations
  • Incident documentation showing how you detected and responded to threats
  • Risk assessment data to support security reviews and certifications

4. Build Trust with Users

Transparent security practices signal to users that you take their safety and privacy seriously:
  • Publish security metrics and response times
  • Communicate how you protect user data
  • Demonstrate continuous improvement in safety measures

5. Optimize Security vs. User Experience

By analyzing false positives in observability dashboards, you can:
  • Tune guardrails to reduce unnecessary blocking
  • Identify legitimate use cases that trigger safety flags
  • Balance security rigor with user experience
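One way to reason about this tradeoff is to compute false positive and false negative rates at candidate thresholds from a set of human-reviewed examples. The sketch below assumes you have such labeled data as `(score, is_actual_threat)` pairs; the helper name `rates_at_threshold` is hypothetical.

```python
from typing import List, Tuple


def rates_at_threshold(
    labeled: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) when blocking score > threshold.

    Each item is (score, is_actual_threat), where the label comes from human review.
    """
    benign = [s for s, threat in labeled if not threat]
    threats = [s for s, threat in labeled if threat]
    fp = sum(1 for s in benign if s > threshold)    # legitimate requests blocked
    fn = sum(1 for s in threats if s <= threshold)  # threats allowed through
    fpr = fp / len(benign) if benign else 0.0
    fnr = fn / len(threats) if threats else 0.0
    return fpr, fnr
```

Evaluating this at several thresholds shows how much additional blocking of legitimate requests each reduction in missed threats would cost.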

Get Started

Ready to secure your AI applications? Here’s how to begin:
Step 1: Start with out-of-the-box evaluators
  • Enable Prompt Safety for comprehensive threat detection
  • Enable PII Detection for privacy protection
  • Review evaluation results in Fiddler dashboards
Step 2: Deploy real-time guardrails for critical risks
  • Identify your highest-priority security requirements
  • Configure guardrails with appropriate thresholds
  • Test thoroughly before production deployment
Step 3: Monitor continuously with observability
  • Set up dashboards for key security metrics
  • Configure alerts for anomalies
  • Schedule regular security reviews
Step 4: Iterate and improve
  • Analyze patterns in flagged content
  • Refine guardrail thresholds based on false positive/negative rates
  • Add custom LLM-as-a-Judge evaluators for organization-specific policies
For step-by-step tutorials, see the quick start guides referenced in the sections above.
Security is not a one-time configuration; it is an ongoing practice. By combining real-time guardrails with continuous observability, you can protect users, maintain compliance, and build AI systems worthy of trust.