Agentic Document Extraction
Build reliable, measurable document extraction pipelines using Fiddler's agentic observability (tracing), custom evaluators, and experiments to catch hallucinated fields, schema drift, and silent accuracy degradation. While this cookbook uses invoice extraction as a running example, the patterns apply to any pipeline that extracts structured data from documents — medical records, legal filings, research papers, support tickets, or any other source.
Use this cookbook when: You have an LLM-based agent extracting structured data from documents of any type and need observability, automated quality evaluation, and production monitoring.
Time to complete: ~30 minutes
Prerequisites
Fiddler account with API access
LLM credential configured in Settings > LLM Gateway
pip install fiddler-evals pandas
Understanding the Problem
Document extraction — pulling structured data from unstructured or semi-structured documents — is one of the most common enterprise AI applications. Whether the source is an invoice, a medical record, a legal filing, or a support ticket, the core challenge is the same. When an LLM-based agent handles this work, it introduces new failure modes that traditional rule-based parsers never had: hallucinated field values, inconsistent schemas, and silent accuracy degradation after model updates.
A document extraction agent typically follows a multi-step workflow:
Parse — normalize raw document text into a consistent format
Extract — use an LLM to pull structured fields (vendor name, invoice number, line items, totals) into JSON
Validate — check that the output is complete, correctly formatted, and mathematically consistent
Each step can fail in different ways, and the failures compound. A parsing step that drops a line item produces an extraction that looks correct but is missing data. An LLM that hallucinates a subtotal produces output that passes the schema check but fails the math check. Without observability at every step, these issues stay invisible until a downstream system, or a customer, catches them.
Common failure modes in document extraction:
Schema drift: The model starts omitting fields it previously extracted reliably
Numeric hallucination: Dollar amounts or quantities that don't match the source document
Date format inconsistency: Dates returned in varying formats despite explicit instructions
Math errors: Extracted totals that don't equal subtotal + tax
Silent degradation: Accuracy drops gradually after a model update, with no hard errors to trigger alerts
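Several of the failure modes above (missing fields, inconsistent date formats, math errors) can be caught by a deterministic validate step. The following is a minimal sketch in plain Python, not part of the Fiddler SDK. The function name `validate_extraction`, the ISO 8601 date format, and the 0.01 tolerance are assumptions chosen for illustration; the field names follow the invoice schema used in this cookbook.

```python
from datetime import datetime

# Assumed invoice schema, taken from the fields named in this cookbook.
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "date",
                   "line_items", "subtotal", "tax", "total")


def validate_extraction(extraction: dict, tolerance: float = 0.01) -> list:
    """Return validation error messages; an empty list means all checks passed."""
    errors = []

    # Completeness: every required field must be present and non-null.
    for field in REQUIRED_FIELDS:
        if extraction.get(field) is None:
            errors.append(f"missing field: {field}")

    # Date format: assumes the prompt asks for ISO 8601 (YYYY-MM-DD).
    date_value = extraction.get("date")
    if date_value is not None:
        try:
            datetime.strptime(str(date_value), "%Y-%m-%d")
        except ValueError:
            errors.append(f"bad date format: {date_value!r}")

    # Math consistency: total should equal subtotal + tax within a tolerance.
    amounts = [extraction.get(key) for key in ("subtotal", "tax", "total")]
    if all(isinstance(a, (int, float)) for a in amounts):
        subtotal, tax, total = amounts
        if abs(subtotal + tax - total) > tolerance:
            errors.append(f"math mismatch: {subtotal} + {tax} != {total}")

    return errors
```

Returning a list of messages rather than a single boolean keeps the failure reason available, so it can be attached to a trace or counted per error type later.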
How Fiddler Helps: Three Layers of Observability
Fiddler provides three complementary capabilities for document extraction pipelines:
1. Agentic Observability (Tracing)
By instrumenting your extraction pipeline with OpenTelemetry, every step — parse, extract, validate — appears as a span in Fiddler's trace view. This gives you:
Span hierarchy: A root chain span with child tool and llm spans, showing exactly how the agent orchestrates each step
LLM telemetry: Model name, token usage (input/output/total), and the full prompt and response captured on the extraction span
Error attribution: When an extraction fails, you can see which step failed and why, including error type and message
Custom attributes: Tag spans with metadata like document_type, invoice_id, or any business-relevant dimension for filtering in dashboards
Learn more: OpenTelemetry Quick Start
Other integration options: While this cookbook uses OpenTelemetry for maximum flexibility, Fiddler also provides dedicated SDKs with auto-instrumentation for popular agentic frameworks — including LangGraph, Strands, LangChain, and LiteLLM. These SDKs require minimal code changes (often a single instrument() call) and produce the same traces in Fiddler. For custom Python agents without a framework, the Fiddler OTel SDK (fiddler-otel) provides a @trace decorator for lightweight instrumentation. See the full integration guide to choose the right option for your stack.
2. Fiddler Experiments (Offline Evaluation)
Experiments let you systematically measure extraction quality against a ground-truth dataset. You define a dataset of test cases (source documents + expected extractions), run your pipeline against them, and score the results with custom evaluators. This gives you:
Repeatable benchmarks: Compare extraction accuracy across model versions, prompt changes, or schema updates
Per-test-case drill-down: See exactly which fields mismatched and why, for every test case
Side-by-side comparison: Run the same dataset against different model versions and compare field accuracy in a single view
Learn more: Experiments
3. Production Monitoring Signals
In production, you compute aggregate signals over rolling time windows and set alerts on threshold breaches. These signals act as early warning systems for extraction quality degradation.
Learn more: Agentic Monitoring
Definition: Built-In Evaluators (Fiddler Evals SDK)
The Fiddler Evals SDK (fiddler-evals) provides pre-built evaluators that deliver immediate, generalized assessments of LLM performance. For document extraction, they establish a baseline for output quality without requiring ground-truth data. Built-in evaluators include Coherence, Conciseness, AnswerRelevance, Sentiment, and more — each available as a Python class you instantiate and pass to the evaluate() function.
Learn more: Evals SDK Reference
Definition: Custom Evaluators
Custom evaluators allow you to encode domain-specific quality standards directly into the evaluation process. For document extraction, this means comparing extracted fields against known correct values, checking schema completeness, and validating mathematical consistency. The Fiddler Evals SDK provides three approaches for building custom evaluators:
CustomJudge: An LLM-as-a-Judge evaluator that uses a Jinja prompt template and structured output fields. Ideal for nuanced, qualitative assessments like per-field accuracy or math consistency checks.
EvalFn: Wraps any Python function as an evaluator. Best for deterministic checks like schema completeness or exact-match comparisons.
Subclass Evaluator: Extend the base Evaluator class for full control over scoring logic, input handling, and multi-score returns.
Learn more: CustomJudge Evaluators
Recommended Evaluators for Document Extraction
Coherence
Assesses the logical flow and clarity of the LLM's extraction output.
Output Quality: Catches garbled or malformed extraction responses before they reach downstream systems.
Conciseness
Evaluates whether the extraction output is focused and free of extraneous commentary.
Schema Discipline: Ensures the model returns structured data, not explanations or caveats mixed into the output.
"Field Accuracy" Custom Evaluator (See Deep Dive below)
Compares each extracted scalar field (vendor name, invoice number, date, subtotal, tax, total) against a ground-truth value.
Granular Quality Control: Pinpoints exactly which fields the model extracts reliably vs. which require prompt tuning or model changes.
"Schema Completeness" Custom Evaluator (See Deep Dive below)
Measures the fraction of required fields that are present and non-null in the extraction output.
Completeness Assurance: Catches schema drift — when a model starts silently omitting fields it previously extracted correctly.
"Per-Field Accuracy" CustomJudge (See Deep Dive below)
Uses a CustomJudge evaluator to assess extraction accuracy for each specific field against the source text.
Automated Review: Provides human-like accuracy assessment at scale without requiring manual comparison of every extraction.
"Math Consistency" CustomJudge (See Deep Dive below)
Checks whether extracted numeric fields are internally consistent (e.g., total == subtotal + tax).
Numeric Integrity: Catches hallucinated dollar amounts that pass schema validation but fail basic arithmetic.
Recommended Production Monitoring Signals
Beyond per-document evaluators, you should track aggregate signals over rolling time windows to detect systemic issues.
Success Rate
Fraction of extractions completing without exception.
Alert when < 95%. Catches API errors, timeouts, and crashes.
Validation Failure Rate
Fraction of extractions with at least one validation error (missing fields, bad date format, math mismatch).
Alert when > 20%. Catches silent quality degradation.
Field Completeness
Average fraction of required fields present and non-null across all extractions.
Alert when < 90%. Catches schema drift after model updates.
Math Accuracy
Fraction of extractions where total == subtotal + tax within a small tolerance.
Alert when < 90%. Catches numeric hallucination trends.
In production, these signals would be computed over rolling windows (e.g., hourly or daily) and configured as Fiddler alerts. A sudden drop in math accuracy after a model update, for example, would trigger an alert before the bad data propagates to downstream accounting systems.
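As a rough illustration of how the four signals and their thresholds fit together, here is a minimal sketch in plain Python. It does not use Fiddler's alerting API. The record shape (`succeeded`, `validation_errors`, `completeness`, `math_ok`) is hypothetical, and computing the quality signals only over extractions that completed is an assumption; the thresholds are the ones listed above.

```python
from statistics import mean

# Alert thresholds from this section: ("min", x) alerts below x, ("max", x) alerts above x.
THRESHOLDS = {
    "success_rate": ("min", 0.95),
    "validation_failure_rate": ("max", 0.20),
    "field_completeness": ("min", 0.90),
    "math_accuracy": ("min", 0.90),
}


def _fraction(flags) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def compute_signals(records: list) -> dict:
    """Aggregate one time window of per-extraction records (hypothetical shape)."""
    # Assumption: quality signals are computed over completed extractions only.
    completed = [r for r in records if r["succeeded"]]
    return {
        "success_rate": _fraction(r["succeeded"] for r in records),
        "validation_failure_rate": _fraction(bool(r["validation_errors"]) for r in completed),
        "field_completeness": mean(r["completeness"] for r in completed) if completed else 0.0,
        "math_accuracy": _fraction(r["math_ok"] for r in completed),
    }


def breached_alerts(signals: dict) -> list:
    """Return the names of signals that cross their threshold."""
    breached = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = signals[name]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breached.append(name)
    return breached
```

Calling `compute_signals` once per hourly or daily window and passing the result to `breached_alerts` gives one list of breached signal names per window.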
Deep Dive: Tracing an Extraction Pipeline with OpenTelemetry
Fiddler's agentic observability uses OpenTelemetry (OTEL) to capture the full execution of your extraction pipeline as a structured trace. Each trace consists of spans with a parent-child hierarchy that mirrors your agent's logic.
Span Hierarchy for Document Extraction
A typical extraction trace has the following structure:
extraction_pipeline: the root span, typed as chain. Carries agent-level metadata: gen_ai.agent.name, gen_ai.agent.id, and custom attributes like fiddler.span.user.document_type and fiddler.span.user.invoice_id.
parse_document: a tool span. Records the raw input and cleaned output of the normalization step.
extract_fields: an llm span. This is where the richest telemetry lives: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.input.messages, and gen_ai.output.messages.
validate_output: a tool span. Records the validation result: which checks passed, which failed, and the specific error messages.
Why This Matters
When an extraction produces incorrect data, the span hierarchy lets you pinpoint the root cause:
Parse failed? The parse_document span shows the raw input was garbled or the normalization dropped content.
LLM hallucinated? The extract_fields span shows the exact prompt and response, plus token usage that may indicate the model was truncating output.
Validation caught it? The validate_output span shows which checks failed, so you know whether the issue is a missing field, a bad date, or a math error.
Without this trace structure, you only see "extraction failed" — with it, you see why.
Error Handling in Traces
When any step throws an exception, the root span captures fiddler.error.message and fiddler.error.type, making it filterable in Fiddler dashboards. You can quickly find all traces where the LLM returned unparseable JSON, or where the OpenAI API timed out, without searching through logs.
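To make the span hierarchy and the error-attribution behavior concrete, the sketch below models them with a toy in-memory recorder. It is deliberately not the OpenTelemetry API or any Fiddler SDK: `TraceRecorder`, `run_pipeline`, and the `output_chars` and `errors` attributes are hypothetical names used only for illustration. The span names, span types, and `fiddler.*` attribute keys are the ones described in this section.

```python
from contextlib import contextmanager


class TraceRecorder:
    """Toy in-memory span recorder. An illustration only, NOT the OpenTelemetry API."""

    def __init__(self):
        self.spans = []    # every span, in start order
        self._stack = []   # currently open spans, root first

    @contextmanager
    def span(self, name, span_type, attributes=None):
        record = {
            "name": name,
            "type": span_type,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attributes": dict(attributes or {}),
        }
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        except Exception as exc:
            # Copy the error onto the root span so failed traces can be filtered.
            root = self._stack[0]["attributes"]
            root["fiddler.error.type"] = type(exc).__name__
            root["fiddler.error.message"] = str(exc)
            raise
        finally:
            self._stack.pop()


def run_pipeline(recorder, raw_text, extract, validate):
    """Run parse -> extract -> validate, with one child span per step."""
    root_attrs = {"fiddler.span.user.document_type": "invoice"}
    with recorder.span("extraction_pipeline", "chain", root_attrs):
        with recorder.span("parse_document", "tool") as parse_span:
            cleaned = " ".join(raw_text.split())  # collapse whitespace
            parse_span["attributes"]["output_chars"] = len(cleaned)
        with recorder.span("extract_fields", "llm"):
            fields = extract(cleaned)  # the LLM call would happen here
        with recorder.span("validate_output", "tool") as validate_span:
            validate_span["attributes"]["errors"] = validate(fields)
    return fields
```

If `extract` raises, the recorder holds a root span carrying the error type and message, plus only the child spans that actually started, which is the shape that lets you tell a parse failure from an LLM failure.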
Learn more: OpenTelemetry Integration
Deep Dive: Custom Evaluators for Extraction Quality
While built-in evaluators like Coherence catch general output quality issues, document extraction requires domain-specific evaluators that understand your schema and can compare against ground truth. The Fiddler Evals SDK provides multiple ways to build these: subclassing Evaluator for complex scoring logic, wrapping functions with EvalFn for simple checks, and using CustomJudge for LLM-based assessment.
Field Accuracy Evaluator (Subclass Evaluator)
This evaluator compares each extracted scalar field against a known correct value. It handles numeric fields with a tolerance (to account for rounding differences) and string fields with case-insensitive comparison.
What it checks: vendor_name, invoice_number, date, subtotal, tax, total
Scoring: Returns the fraction of fields that match (0.0–1.0). A score of 0.83 means 5 out of 6 fields matched.
Why it matters: Aggregate field accuracy tells you how reliable your pipeline is. But the per-field breakdown tells you where to focus improvement. If date is the field that most often mismatches, you know to adjust the date formatting instructions in your prompt — not rebuild the entire pipeline.
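The scoring logic described above can be sketched as a plain Python function. This is only the comparison core, under the assumption of a 0.01 numeric tolerance; it is not shown as an Evaluator subclass because the base class interface is not reproduced in this section. The names `field_matches` and `field_accuracy` are illustrative.

```python
SCALAR_FIELDS = ("vendor_name", "invoice_number", "date", "subtotal", "tax", "total")
NUMERIC_FIELDS = {"subtotal", "tax", "total"}


def field_matches(field, extracted, expected, tolerance=0.01) -> bool:
    """Compare one field: numeric with tolerance, strings case-insensitively."""
    if extracted is None or expected is None:
        return extracted is expected
    if field in NUMERIC_FIELDS:
        try:
            return abs(float(extracted) - float(expected)) <= tolerance
        except (TypeError, ValueError):
            return False  # e.g. the model returned "N/A" for an amount
    return str(extracted).strip().lower() == str(expected).strip().lower()


def field_accuracy(extracted: dict, expected: dict):
    """Return (fraction of matching fields, per-field match breakdown)."""
    per_field = {
        field: field_matches(field, extracted.get(field), expected.get(field))
        for field in SCALAR_FIELDS
    }
    return sum(per_field.values()) / len(per_field), per_field
```

Returning the per-field breakdown alongside the aggregate score is what makes it possible to see that, for example, date is the field that mismatches most often.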
Schema Completeness Evaluator (EvalFn)
This evaluator checks what fraction of required fields are present and non-null in the extraction output, independent of whether the values are correct. Wrapping a simple Python function with EvalFn is the most concise way to build deterministic evaluators.
What it checks: All required schema fields (vendor_name, invoice_number, date, line_items, subtotal, tax, total)
Scoring: Returns the fraction present (0.0–1.0). A score of 0.86 means 6 out of 7 required fields were populated.
Why it matters: Schema completeness is the earliest signal of extraction degradation. A model may still extract some fields correctly while silently dropping others. Tracking completeness separately from accuracy lets you distinguish between "the model is wrong" and "the model isn't even trying."
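The check itself is small enough to show as a plain Python function. This sketch covers only the function you would wrap; the EvalFn wrapping call is omitted because its exact signature is not reproduced in this section. Treating only `None` as "null" (so an empty list still counts as present) is an assumption.

```python
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "date",
                   "line_items", "subtotal", "tax", "total")


def schema_completeness(extraction: dict) -> float:
    """Fraction of required fields that are present and non-null."""
    present = sum(1 for field in REQUIRED_FIELDS if extraction.get(field) is not None)
    return present / len(REQUIRED_FIELDS)
```

An extraction that omits `tax` scores 6/7, which rounds to the 0.86 mentioned above, regardless of whether the other six values are correct.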
Per-Field Accuracy CustomJudge
For cases where you want a more nuanced assessment — or where ground-truth data is unavailable — you can use a CustomJudge to evaluate extraction quality directly from the source text. CustomJudge uses a Jinja prompt template with {{ placeholder }} syntax and structured output_fields to define what the LLM judge should return.
The judge takes the original source document and the extracted fields as inputs — no ground truth required.
It assesses each field independently, producing both per-field booleans and an overall accuracy classification.
By flagging "Mostly Incorrect" extractions, you can route them for human review or reprocessing.
This approach scales to production volumes where maintaining ground-truth datasets is impractical.
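Once the judge has produced its result, routing is a simple lookup. The sketch below assumes the judge output has been converted to a dict with an `overall_accuracy` label and a `field_checks` mapping of per-field booleans; both key names and the queue names are hypothetical. Only the "Mostly Incorrect" label comes from the description above.

```python
# Labels that should send an extraction to a human reviewer.
REVIEW_LABELS = {"Mostly Incorrect"}


def route_extraction(judge_result: dict) -> dict:
    """Decide where an extraction goes based on a judge result (hypothetical shape)."""
    field_checks = judge_result.get("field_checks", {})
    failed_fields = sorted(name for name, ok in field_checks.items() if not ok)
    needs_review = judge_result.get("overall_accuracy") in REVIEW_LABELS
    return {
        "queue": "human_review" if needs_review else "auto_accept",
        "failed_fields": failed_fields,
    }
```

Including the failed field names in the routing result means a reviewer can go straight to the fields the judge flagged instead of rechecking the whole document.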
Math Consistency CustomJudge
This CustomJudge evaluates whether the extracted numeric fields are internally consistent with the source document.
This judge catches a category of errors that schema validation alone cannot: values that are present and correctly formatted but numerically wrong.
A "Major Error" flag on math_consistency indicates the model hallucinated dollar amounts, a critical failure for any financial document extraction system.
Tracking the line_items_sum_correct field over time reveals whether the model struggles more with line-item aggregation than with reading individual totals.
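The arithmetic behind these checks can also be verified deterministically, which may be useful as a cross-check on the judge. The sketch below is plain Python, not the CustomJudge itself. It assumes each line item carries an `amount` key (a hypothetical schema detail) and reuses the `line_items_sum_correct` name from above; `total_correct` is an illustrative name.

```python
from decimal import Decimal


def _money(value) -> Decimal:
    # Go through str() so floats like 0.1 do not carry binary rounding noise.
    return Decimal(str(value))


def math_checks(extraction: dict, tolerance: str = "0.01") -> dict:
    """Check line-item aggregation and the subtotal + tax = total identity."""
    tol = Decimal(tolerance)
    line_sum = sum(
        (_money(item["amount"]) for item in extraction["line_items"]),
        Decimal("0"),
    )
    subtotal, tax, total = (
        _money(extraction[key]) for key in ("subtotal", "tax", "total")
    )
    return {
        "line_items_sum_correct": abs(line_sum - subtotal) <= tol,
        "total_correct": abs(subtotal + tax - total) <= tol,
    }
```

Keeping the two checks separate distinguishes an aggregation problem (line items do not add up to the subtotal) from a misread total.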
Deep Dive: Using Fiddler Experiments for Extraction Benchmarking
Fiddler Experiments provide a structured way to benchmark your extraction pipeline against a ground-truth dataset. This is especially valuable when you need to:
Compare model versions: Does your current model extract dates more accurately than a smaller or newer variant?
Test prompt changes: Does adding "Return 0 for tax if not applicable" reduce errors on tax-exempt invoices?
Validate schema changes: After adding a new required field, does the existing accuracy hold?
Set Up the Experiment
The Fiddler Evals SDK (fiddler-evals) provides the evaluate() function as the main entry point for running experiments. The workflow is:
Create a Dataset with source documents as inputs and ground-truth extractions as expected outputs
Define a Task — a Python function that runs your extraction pipeline on each input
Attach Evaluators: built-in evaluators, CustomJudge instances, EvalFn-wrapped functions, or Evaluator subclasses
Run with evaluate() and review per-test-case scores in the Fiddler UI
Replace url, token, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.
The experiment results show you not just aggregate scores, but individual test cases where your pipeline struggled. If 7 out of 8 invoices score 100% field accuracy but one scores 67%, you can drill into that specific case to see what went wrong — perhaps it was an invoice with an unusual date format or a tax-exempt order.
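Before wiring up the SDK, the dataset, task, and evaluator loop can be dry-run locally. The following is a plain-Python stand-in and does not call `evaluate()` or any other fiddler-evals function. The dataset shape (`id`, `inputs`, `expected`) and the names `run_local_experiment`, `weakest_cases`, and `exact_match_fraction` are assumptions made for illustration.

```python
def exact_match_fraction(output: dict, expected: dict) -> float:
    """Toy evaluator: fraction of expected fields reproduced exactly."""
    if not expected:
        return 0.0
    return sum(output.get(key) == value for key, value in expected.items()) / len(expected)


def run_local_experiment(dataset, task, evaluators):
    """Run `task` on every test case and score its output with each evaluator."""
    rows = []
    for case in dataset:
        output = task(case["inputs"])
        row = {"case_id": case["id"]}
        for name, evaluator in evaluators.items():
            row[name] = evaluator(output, case["expected"])
        rows.append(row)
    return rows


def weakest_cases(rows, metric, n=3):
    """Return the n lowest-scoring rows for one metric."""
    return sorted(rows, key=lambda row: row[metric])[:n]
```

Here `task` is any callable that takes the source document and returns the extracted dict, so the same function can later be reused as the experiment task.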
Iterate with Experiments
The real power of Experiments is iterative improvement. Because evaluate() returns structured results and each experiment is tracked in the Fiddler UI, you can run A/B comparisons:
Run a baseline experiment with your current prompt and model
Identify the weakest test cases (lowest field accuracy or schema completeness)
Adjust your prompt, model, or preprocessing to address those failures
Run a new experiment against the same dataset and compare side-by-side with the baseline
Repeat until accuracy meets your threshold
This workflow turns prompt engineering from guesswork into a measurable process.
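The comparison step can be sketched as a per-test-case diff between two runs. This is a plain-Python illustration, not the side-by-side view in the Fiddler UI. It assumes each run is a list of rows with a `case_id` and one score per metric; that shape and the name `compare_runs` are hypothetical.

```python
def compare_runs(baseline: list, candidate: list, metric: str) -> dict:
    """Diff one metric between two runs over the same dataset."""
    base = {row["case_id"]: row[metric] for row in baseline}
    # Only compare test cases present in both runs.
    deltas = {
        row["case_id"]: row[metric] - base[row["case_id"]]
        for row in candidate
        if row["case_id"] in base
    }
    return {
        "mean_delta": sum(deltas.values()) / len(deltas) if deltas else 0.0,
        "regressions": sorted(case for case, delta in deltas.items() if delta < 0),
        "improvements": sorted(case for case, delta in deltas.items() if delta > 0),
    }
```

Listing regressions by test case matters because a change can raise the average while breaking individual documents that previously extracted cleanly.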
Learn more: Experiments
How These Evaluators Can Help
1. Catching Silent Quality Degradation
Document extraction pipelines rarely fail loudly. More often, they degrade gradually — a model update causes date formatting to shift, or a prompt change inadvertently reduces field completeness. By tracking Field Accuracy and Schema Completeness over time, you catch these regressions before they reach downstream systems. A 5% drop in math accuracy after a model update is invisible in error logs but immediately visible in Fiddler dashboards.
2. Reducing Manual Review Burden
Without automated evaluation, every extracted document needs human review to verify accuracy. With Fiddler evaluators acting as an automated quality gate, your team reviews only the extractions flagged for low accuracy or missing fields. If 95% of extractions pass validation, your reviewers focus on the 5% that need attention — not the full volume.
3. Enabling Confident Model and Prompt Changes
Changing a model or prompt in a document extraction pipeline is risky without a way to measure the impact. Fiddler Experiments give you a controlled environment to test changes against a known dataset before deploying to production. You can prove that a prompt change improved date accuracy from 87% to 98% before it touches a single real document.
4. Building Trust with Downstream Consumers
Finance teams, compliance officers, and ERP systems that consume extracted data need to trust its accuracy. Fiddler's monitoring signals — success rate, field completeness, math accuracy — provide auditable evidence that your extraction pipeline is performing within acceptable bounds. When a downstream consumer questions a data point, you can trace it back to the specific span that produced it.
5. Detecting Document-Type-Specific Weaknesses
By tagging traces with custom attributes like document_type (invoice, receipt, contract) or vendor_name, you can segment your monitoring signals and discover that your pipeline handles standard invoices at 98% accuracy but struggles with receipts (85%) or international invoices with VAT (78%). This guides where to invest in prompt engineering or training data.
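Segmenting a signal by a custom attribute amounts to a group-by. The sketch below shows the idea in plain Python over exported per-extraction records. The record shape (a `document_type` tag plus a `field_accuracy` score), the function names, and the 0.90 threshold are assumptions for illustration, not a Fiddler API.

```python
from collections import defaultdict


def accuracy_by_segment(records: list, key: str = "document_type") -> dict:
    """Average field accuracy per segment value (e.g. invoice, receipt)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record.get(key, "unknown")].append(record["field_accuracy"])
    return {segment: sum(scores) / len(scores) for segment, scores in buckets.items()}


def weak_segments(records: list, threshold: float = 0.90, key: str = "document_type") -> list:
    """Segments whose average accuracy falls below the threshold."""
    averages = accuracy_by_segment(records, key)
    return sorted(segment for segment, avg in averages.items() if avg < threshold)
```

Passing `key="vendor_name"` produces the same breakdown per vendor.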
Next Steps
Building Custom Judge Evaluators: Deep-dive into CustomJudge capabilities
Running RAG Experiments at Scale: Structured experiments with Datasets and golden label validation
Monitoring Agentic Content Generation: Quality and brand compliance for content agents
Evaluator Rules: Deploy evaluators in production
Evals SDK Integration: Integration patterns for agentic workflows
Related: OpenTelemetry Quick Start | Experiments Quick Start | CustomJudge Evaluators