LLM Observability Metrics Reference
Fiddler provides a comprehensive set of enrichments for monitoring LLM applications in production. Enrichments augment your application data with automatically generated trust, safety, and quality metrics during model onboarding. These metrics integrate directly with Fiddler's monitoring dashboards, alerting systems, and analytics tools.
Configure enrichments using the fdl.Enrichment() class in the Python Client SDK. For detailed configuration examples, see the Enrichments Guide. For help choosing the right enrichment, see Selecting Enrichments.
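As a quick orientation, an enrichment configuration might look like the sketch below. The parameter names shown (name, enrichment, columns) follow the Python Client SDK conventions described in the Enrichments Guide, but the exact signature may vary by SDK version, so treat this as an illustrative configuration fragment rather than authoritative code.

```python
# Illustrative configuration sketch only; see the Enrichments Guide
# for the authoritative signature in your SDK version.
import fiddler as fdl

enrichments = [
    # Flag PII in the prompt column
    fdl.Enrichment(
        name='pii_check',
        enrichment='pii',
        columns=['prompt'],
    ),
    # Evaluate prompt safety with the Fast Trust Model
    fdl.Enrichment(
        name='prompt_safety',
        enrichment='ftl_prompt_safety',
        columns=['prompt'],
    ),
]
```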
For ML model metrics (performance, drift, data integrity), see the ML Metrics Reference.
Safety metrics
Safety enrichments detect and flag unsafe, harmful, or policy-violating content in your LLM application's inputs and outputs.
| Enrichment key | LLM-based | Output | Description |
| --- | --- | --- | --- |
| ftl_prompt_safety | Yes (Fiddler FTL) | bool + float per dimension | Evaluates text safety across 11 dimensions using Fiddler's Fast Trust Model |
| pii | No | bool + matches + entities | Detects personally identifiable information using Presidio |
| topic_model | No | list[float] + string | Classifies text into user-defined topics using zero-shot classification |
Fast Safety
The Fast Safety enrichment evaluates text safety across 11 dimensions using Fiddler's proprietary Fast Trust Model. Each dimension produces a boolean flag and a confidence probability score.
Enrichment key: ftl_prompt_safety
| Dimension | Output columns | Score range | Description |
| --- | --- | --- | --- |
| illegal | illegal, illegal score | 0.0 – 1.0 | Content promoting illegal activities |
| hateful | hateful, hateful score | 0.0 – 1.0 | Hateful or discriminatory content |
| harassing | harassing, harassing score | 0.0 – 1.0 | Harassing or bullying content |
| racist | racist, racist score | 0.0 – 1.0 | Racist content |
| sexist | sexist, sexist score | 0.0 – 1.0 | Sexist content |
| violent | violent, violent score | 0.0 – 1.0 | Content promoting violence |
| sexual | sexual, sexual score | 0.0 – 1.0 | Sexually explicit content |
| harmful | harmful, harmful score | 0.0 – 1.0 | Generally harmful content |
| unethical | unethical, unethical score | 0.0 – 1.0 | Unethical content |
| jailbreaking | jailbreaking, jailbreaking score | 0.0 – 1.0 | Jailbreaking or prompt injection attempts |
| roleplaying | roleplaying, roleplaying score | 0.0 – 1.0 | Roleplaying attempts to bypass safety |
An aggregate max_risk_prob output is also generated, representing the maximum probability across all 11 dimensions.
For configuration details, see Enrichments: Fast Safety.
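The aggregation behaves like a simple maximum over the per-dimension probabilities. A minimal illustration (not Fiddler's implementation; the dimension scores below are made up):

```python
# Illustrative only: max_risk_prob is the maximum probability
# across the per-dimension safety scores.
dimension_scores = {
    'illegal': 0.02,
    'hateful': 0.01,
    'jailbreaking': 0.87,  # hypothetical high-risk score
    'violent': 0.03,
}

max_risk_prob = max(dimension_scores.values())
print(max_risk_prob)  # 0.87
```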
PII Detection
Detects and flags personally identifiable information using Presidio. Generates a boolean flag, matched text spans, and detected entity types.
Enrichment key: pii
Commonly used entity types: CREDIT_CARD, CRYPTO, DATE_TIME, EMAIL_ADDRESS, IBAN_CODE, IP_ADDRESS, LOCATION, PERSON, PHONE_NUMBER, URL, US_SSN, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT
Fiddler supports 32 entity types in total, including international identifiers for Australia, India, Singapore, and the UK. For the full list, see the Presidio supported entities.
For configuration details, see Enrichments: PII.
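Presidio combines pattern recognizers, context words, and NER models. A toy pattern-only recognizer for a single entity type (EMAIL_ADDRESS) conveys the shape of the output, but is not how Presidio actually works:

```python
import re

# Toy recognizer for one entity type. Presidio's real detection is
# far more robust (checksums, context scoring, NER models).
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def detect_pii(text: str) -> dict:
    matches = EMAIL_PATTERN.findall(text)
    return {
        'contains_pii': bool(matches),   # boolean flag
        'matches': matches,              # matched text spans
        'entities': ['EMAIL_ADDRESS'] * len(matches),  # entity types
    }

result = detect_pii('Contact jane.doe@example.com for details.')
print(result['contains_pii'])  # True
```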
Profanity
Flags offensive or inappropriate language using curated word lists from SurgeAI and Google.
Enrichment key: profanity
For configuration details, see Enrichments: Profanity.
Banned Keywords
Detects user-defined restricted terms in text inputs. The list of banned keywords is specified in the enrichment configuration.
Enrichment key: banned_keywords
For configuration details, see Enrichments: Banned Keywords.
Regex Match
Matches text against a user-defined regular expression pattern. Produces a categorical output of "Match" or "No Match".
Enrichment key: regex_match
For configuration details, see Enrichments: Regex Match.
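The categorical output can be reproduced with Python's re module. The pattern below (a US ZIP code) is just an example of a user-defined pattern:

```python
import re

def regex_match(text: str, pattern: str) -> str:
    # Return the categorical output the enrichment produces.
    return 'Match' if re.search(pattern, text) else 'No Match'

zip_pattern = r'\b\d{5}(-\d{4})?\b'
print(regex_match('Ship to 94107', zip_pattern))  # Match
print(regex_match('Ship to SOMA', zip_pattern))   # No Match
```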
Language Detection
Identifies the language of the source text using fasttext models. Produces the detected language and a confidence probability.
Enrichment key: language_detection
For configuration details, see Enrichments: Language Detection.
Topic Classification
Classifies text into user-defined topics using a zero-shot classification model. Produces per-topic probability scores and the top-scoring topic.
Enrichment key: topic_model
For configuration details, see Enrichments: Topic.
Quality and hallucination metrics
Quality enrichments assess the accuracy, groundedness, and relevance of LLM-generated responses.
| Enrichment key | LLM-based | Output | Description |
| --- | --- | --- | --- |
| ftl_response_faithfulness | Yes (Fiddler FTL) | bool + float | Evaluates factual groundedness using Fiddler's Fast Trust Model |
| faithfulness | Yes (OpenAI) | bool | Evaluates factual accuracy of responses against provided context |
| answer_relevance | Yes (OpenAI) | bool | Evaluates whether responses address the input prompt |
Fast Faithfulness
Evaluates the factual groundedness of AI-generated responses against provided context using Fiddler's proprietary Fast Trust Model. Produces a boolean faithfulness flag and a confidence probability score.
Enrichment key: ftl_response_faithfulness
The faithfulness threshold defaults to 0.5 and can be adjusted in the configuration to control scoring sensitivity. Higher thresholds result in stricter faithfulness detection (fewer responses labeled as faithful).
For configuration details, see Enrichments: Fast Faithfulness.
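The threshold simply binarizes the model's probability: raising it labels fewer responses as faithful. A sketch of that behavior (the exact boundary semantics, at-or-above versus strictly above, are an assumption here):

```python
def label_faithful(score: float, threshold: float = 0.5) -> bool:
    # Binarize a faithfulness probability. Assumes scores at or
    # above the threshold are labeled faithful.
    return score >= threshold

print(label_faithful(0.62))                 # True  (default threshold 0.5)
print(label_faithful(0.62, threshold=0.8))  # False (stricter threshold)
```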
RAG Faithfulness
Evaluates the accuracy and reliability of facts presented in AI-generated responses by checking whether the information aligns with the provided context documents. Uses an OpenAI LLM for evaluation.
Enrichment key: faithfulness
RAG Faithfulness vs Fast Faithfulness: This enrichment uses OpenAI for evaluation. Fast Faithfulness uses Fiddler's Fast Trust Model for lower latency. See LLM-Based Metrics for a detailed comparison.
For configuration details, see Enrichments: Faithfulness.
Answer Relevance
Evaluates whether AI-generated responses address the input prompt. Produces a binary relevant/not-relevant result.
Enrichment key: answer_relevance
For configuration details, see Enrichments: Answer Relevance.
Coherence
Assesses the logical flow and clarity of AI-generated responses, checking whether the content maintains a consistent theme and argument structure.
Enrichment key: coherence
For configuration details, see Enrichments: Coherence.
Conciseness
Evaluates whether AI-generated responses communicate their message efficiently without unnecessary elaboration or redundancy.
Enrichment key: conciseness
For configuration details, see Enrichments: Conciseness.
Text statistics metrics
Text statistics enrichments provide quantitative analysis of text properties, including readability, length, and n-gram-based evaluation scores.
Textstat
Generates text readability and complexity statistics using the textstat library. You can select specific statistics or use all 19 available metrics.
Enrichment key: textstat
| Statistic | Range | Description |
| --- | --- | --- |
| char_count | 0 – 64,000 | Character count |
| letter_count | 0 – 64,000 | Letter count (alphabetical characters) |
| miniword_count | 0 – 64,000 | Count of short words |
| words_per_sentence | 0 – 1,000 | Average words per sentence |
| polysyllabcount | 0 – 64,000 | Polysyllabic word count |
| lexicon_count | 0 – 64,000 | Word count |
| syllable_count | 0 – 96,000 | Total syllable count |
| sentence_count | 0 – 32,000 | Sentence count |
| flesch_reading_ease | -100 – 100 | Flesch Reading Ease score (higher = easier to read) |
| smog_index | 0 – 30 | SMOG readability index |
| flesch_kincaid_grade | -3.4 – 100 | Flesch-Kincaid Grade Level |
| coleman_liau_index | 0 – 20 | Coleman-Liau readability index |
| automated_readability_index | -3.4 – 100 | Automated Readability Index |
| dale_chall_readability_score | 0 – 10 | Dale-Chall readability score |
| difficult_words | 0 – 64,000 | Count of difficult words |
| linsear_write_formula | 0 – 20 | Linsear Write readability formula |
| gunning_fog | 0 – 20 | Gunning Fog readability index |
| long_word_count | 0 – 64,000 | Count of long words |
| monosyllabcount | 0 – 64,000 | Monosyllabic word count |
If no statistics are specified in the configuration, the default statistic is flesch_kincaid_grade.
For configuration details, see Enrichments: Textstat.
Evaluate
Computes n-gram-based evaluation metrics for comparing two text passages, such as an AI-generated response and a reference answer. These metrics score highest when the reference and generated texts contain overlapping sequences.
Enrichment key: evaluate
| Metric | Key | Range | Description |
| --- | --- | --- | --- |
| BLEU | bleu | 0.0 – 1.0 | Precision of word n-grams between generated and reference text |
| ROUGE-1 | rouge1 | 0.0 – 1.0 | Unigram recall between generated and reference text |
| ROUGE-2 | rouge2 | 0.0 – 1.0 | Bigram recall between generated and reference text |
| ROUGE-L | rougeL | 0.0 – 1.0 | Longest common subsequence between generated and reference text |
| ROUGE-Lsum | rougeLsum | 0.0 – 1.0 | ROUGE-L applied at the summary level |
| METEOR | meteor | 0.0 – 1.0 | Combines precision, recall, and semantic matching |
For configuration details, see Enrichments: Evaluate.
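To make the scoring concrete, ROUGE-1 recall is the fraction of reference unigrams that also appear in the generated text, with counts clipped. A minimal sketch (real implementations add tokenization rules and, optionally, stemming):

```python
from collections import Counter

def rouge1_recall(generated: str, reference: str) -> float:
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's credit at its reference count.
    overlap = sum(min(gen[w], ref[w]) for w in ref)
    return overlap / max(1, sum(ref.values()))

score = rouge1_recall(
    'the cat sat on the mat',
    'the cat is on the mat',
)
print(round(score, 3))  # 0.833 (5 of 6 reference unigrams matched)
```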
Sentiment
Provides sentiment analysis using NLTK's VADER lexicon. Produces a compound score and a categorical sentiment label.
Enrichment key: sentiment
| Output | Type | Description |
| --- | --- | --- |
| compound | float | Raw compound sentiment score |
| sentiment | string | One of positive, negative, or neutral |
For configuration details, see Enrichments: Sentiment.
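The categorical label is derived from the compound score. VADER's conventional cutoffs are plus or minus 0.05; whether the enrichment uses exactly these thresholds is an assumption in the sketch below:

```python
def sentiment_label(compound: float) -> str:
    # Conventional VADER cutoffs; the exact thresholds used by
    # the enrichment are an assumption here.
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(sentiment_label(0.62))   # positive
print(sentiment_label(-0.3))   # negative
print(sentiment_label(0.01))   # neutral
```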
Token Count
Counts the number of tokens in a string using the tiktoken library.
Enrichment key: token_count
For configuration details, see Enrichments: Token Count.
Text validation metrics
Text validation enrichments verify the structural correctness of generated text outputs such as SQL queries and JSON payloads.
SQL Validation
Validates SQL query syntax for a specified dialect. Supports 25+ SQL dialects including MySQL, PostgreSQL, Snowflake, BigQuery, and others.
Enrichment key: sql_validation
Query validation is syntax-based only; queries are not checked against an actual database or schema.
For configuration details, see Enrichments: SQL Validation.
JSON Validation
Validates JSON for correctness and optionally against a user-defined JSON Schema.
Enrichment key: json_validation
For configuration details, see Enrichments: JSON Validation.
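Basic correctness checking can be illustrated with the standard library; validating against a JSON Schema additionally requires a schema validator:

```python
import json

def is_valid_json(text: str) -> bool:
    # Checks syntactic correctness only (no schema validation).
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"answer": 42}'))  # True
print(is_valid_json('{"answer": 42'))   # False (unclosed brace)
```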
Embedding metrics
Embedding enrichments convert text into vector representations for drift detection and visualization.
| Enrichment | LLM-based | Output | Description |
| --- | --- | --- | --- |
| TextEmbedding | No | vector + float | Generates text embeddings for UMAP visualization and drift detection |
Text Embedding
Converts unstructured text into high-dimensional vector representations for semantic analysis. Enables Fiddler's 3D UMAP visualizations and embedding-based drift detection.
Class: fdl.TextEmbedding()
TextEmbedding is configured using fdl.TextEmbedding() rather than fdl.Enrichment(). See the Enrichments Guide for usage examples.
Centroid Distance
Measures the distance between a data point's embedding and the nearest cluster centroid. This metric is automatically generated when a TextEmbedding enrichment is created.
For configuration details, see Enrichments: Centroid Distance.
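The metric behaves like a nearest-centroid distance in embedding space. A small illustration with made-up 2D vectors (real embeddings are high-dimensional, and the distance function used internally is an assumption here):

```python
import math

def centroid_distance(point, centroids):
    # Distance from a point to its nearest cluster centroid
    # (Euclidean distance assumed for illustration).
    return min(math.dist(point, c) for c in centroids)

centroids = [(0.0, 0.0), (10.0, 10.0)]
print(centroid_distance((1.0, 1.0), centroids))  # 1.414... (sqrt(2))
```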
Related resources
ML Metrics Reference — Built-in metrics for ML model monitoring
Enrichments Guide — Configuration examples for all enrichments
Selecting Enrichments — Choosing the right enrichment for your use case
LLM-Based Metrics — Detailed comparison of LLM-based evaluation methods