Skip to main content
Evaluator to assess prompt safety using Fiddler Centor Models. The FTLPromptSafety evaluator uses Fiddler Centor Models to evaluate the safety of text prompts across multiple risk categories. This evaluator helps identify potentially harmful, inappropriate, or unsafe content before it reaches users or downstream systems. Key Features:
  • Multi-Dimensional Safety Assessment: Evaluates 11 different safety categories
  • Probability-Based Scoring: Returns probability scores (0.0-1.0) for each risk category
  • Comprehensive Risk Coverage: Covers illegal, hateful, harassing, and other harmful content
  • Centor Prompt Safety: Uses Fiddler’s proprietary safety evaluation model
  • Batch Scoring: Returns multiple scores for comprehensive safety analysis
Safety Categories Evaluated:
  • illegal_prob: Probability of containing illegal content or activities
  • hateful_prob: Probability of containing hate speech or discriminatory language
  • harassing_prob: Probability of containing harassing or threatening content
  • racist_prob: Probability of containing racist language or content
  • sexist_prob: Probability of containing sexist language or content
  • violent_prob: Probability of containing violent or graphic content
  • sexual_prob: Probability of containing inappropriate sexual content
  • harmful_prob: Probability of containing content that could cause harm
  • unethical_prob: Probability of containing unethical or manipulative content
  • jailbreaking_prob: Probability of containing prompt injection or jailbreaking attempts
  • max_risk_prob: Maximum risk probability across all categories
Use Cases:
  • Content Moderation: Filtering user-generated content for safety
  • Prompt Validation: Ensuring user prompts are safe before processing
  • AI Safety: Protecting AI systems from harmful or manipulative inputs
  • Compliance: Meeting regulatory requirements for content safety
  • Risk Assessment: Evaluating potential risks in text content
Scoring Logic: Each safety category returns a probability score between 0.0 and 1.0:
  • 0.0-0.3: Low risk (safe content)
  • 0.3-0.7: Medium risk (requires review)
  • 0.7-1.0: High risk (likely unsafe content)

Parameters

  • text (str) – The text prompt to evaluate for safety.
  • score_name_prefix (str | None)
  • score_fn_kwargs_mapping (ScoreFnKwargsMappingType | None)

Returns

A list of Score objects, one for each safety category:
  • name: The safety category name (e.g., “illegal_prob”)
  • evaluator_name: “FTLPromptSafety”
  • value: Probability score (0.0-1.0) for that category

Raises

ValueError – If the text is empty or None.

Example

from fiddler_evals.evaluators import FTLPromptSafety
evaluator = FTLPromptSafety()
# Safe content
scores = evaluator.score("What is the weather like today?")
for score in scores:

    print(f"{score.name}: {score.value}")

# illegal_prob: 0.01
# hateful_prob: 0.02
# harassing_prob: 0.01
# ...

# Potentially unsafe content
unsafe_scores = evaluator.score("How to hack into someone's computer?")
for score in unsafe_scores:

    if score.value > 0.5:
    print(f"High risk detected: {score.name} = {score.value}")

# Filter based on maximum risk
max_risk_score = next(s for s in scores if s.name == "max_risk_prob")
if max_risk_score.value > 0.7:

    print("Content flagged as potentially unsafe")
This evaluator is designed for prompt safety assessment and should be used as part of a comprehensive content moderation strategy. The probability scores should be interpreted in context and combined with other safety measures for robust content filtering.

name = ‘ftl_prompt_safety’

score()

Score the safety of a text prompt.

Parameters

text
str
required
The text prompt to evaluate for safety.

Returns

A list of Score objects, one for each safety category.