Monitoring & Alerting Overview

Connect Fiddler alerts to incident management, observability, and communication tools

Route Fiddler AI observability alerts to your existing incident management and communication tools. Integrate with observability platforms, on-call systems, and team collaboration tools to ensure AI issues are detected, triaged, and resolved quickly.

Why Alert Integration Matters

AI models fail in ways that infrastructure monitoring alone does not surface: drift, data quality issues, performance degradation, and safety violations. Fiddler's alert integrations route these failures into the tools your team already uses to respond:

  • Unified Incident Management - AI alerts flow into the same systems as infrastructure alerts

  • Faster Response Times - On-call engineers notified via existing escalation policies

  • Context-Rich Alerts - Model context, affected predictions, and root cause analysis included

  • Reduced Alert Fatigue - Intelligent grouping and deduplication across tools

  • Automated Remediation - Trigger workflows to roll back models or scale resources

Integration Categories

📊 Observability Platforms

Send Fiddler metrics and alerts to enterprise observability platforms for unified monitoring.

Supported Platforms:

  • Datadog - Application performance monitoring and infrastructure observability ✓ GA

Common Use Cases:

  • Correlate AI model issues with infrastructure metrics

  • Build unified dashboards combining Fiddler + infrastructure data

  • Use Datadog's anomaly detection on Fiddler metrics

  • Alert on compound conditions (model drift + high latency)

🚨 Incident Management

Connect alerts to on-call systems for immediate engineer notification.

Supported Platforms:

  • PagerDuty - Incident management and on-call scheduling ✓ GA

Common Use Cases:

  • Page on-call ML engineers for critical model failures

  • Escalate unresolved AI incidents automatically

  • Track MTTR (Mean Time To Resolution) for model issues

  • Integrate with incident runbooks and response workflows

💬 Team Collaboration

Send alerts to team communication tools for visibility and collaboration.

Supported Platforms:

  • Slack - Team messaging and collaboration (Coming Soon)

  • Microsoft Teams - Enterprise communication platform (Coming Soon)

Common Use Cases:

  • Notify ML team channel when drift is detected

  • Alert data science team on data quality issues

  • Share model performance reports automatically

  • Collaborative incident triage in team channels

Observability Platform Integrations

Datadog

Integrate Fiddler with Datadog for unified application and AI monitoring.

Why Datadog + Fiddler:

  • Unified Dashboards - Combine infrastructure, application, and AI model metrics

  • Correlated Alerts - Alert on compound conditions (e.g., "high model drift + high API latency")

  • Service Map Integration - See model health in Datadog service dependency graphs

  • Anomaly Detection - Leverage Datadog's ML-based alerting on Fiddler metrics

Key Features:

  • Metric Export - Send Fiddler drift, performance, and data quality metrics to Datadog

  • Event Streaming - Stream model events (predictions, drift detections) as Datadog events

  • Alert Forwarding - Route Fiddler alerts to Datadog for unified incident management

  • Tag Propagation - Maintain consistent tagging across platforms (model, environment, team)

Status: GA - Production-ready

Get Started with Datadog →

Quick Start:

Example Datadog Dashboard:

Incident Management Integrations

PagerDuty

Route critical AI alerts to on-call engineers via PagerDuty.

Why PagerDuty + Fiddler:

  • On-Call Escalation - Page the right ML engineer based on escalation policies

  • Incident Deduplication - Prevent alert storms from related model issues

  • Incident Timeline - Track when AI issues were detected, acknowledged, resolved

  • Postmortem Integration - Include model context in incident reports

Key Features:

  • Severity Mapping - Map Fiddler alert criticality to PagerDuty severity levels

  • Service Integration - Associate alerts with PagerDuty services (e.g., "Fraud Detection Service")

  • Custom Payloads - Include model metadata, drift scores, affected predictions

  • Bidirectional Updates - Acknowledge/resolve incidents in PagerDuty or Fiddler

Status: GA - Production-ready

Get Started with PagerDuty →

Quick Start:

Example PagerDuty Incident:
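
As an illustration of what reaches PagerDuty, the sketch below builds a trigger event in the shape the PagerDuty Events API v2 accepts (routing_key, event_action, dedup_key, and a payload with summary, source, and severity). The alert dictionary, its field names, and the criticality labels on the Fiddler side are assumptions made for this example, not Fiddler's documented output.

```python
# Hypothetical mapping from an alert's criticality label to the four severity
# values accepted by the PagerDuty Events API v2 (critical, error, warning, info).
SEVERITY_MAP = {"critical": "critical", "high": "error", "medium": "warning", "low": "info"}


def to_pagerduty_event(alert: dict, routing_key: str) -> dict:
    """Build an Events API v2 'trigger' event body from an alert dict."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Events that share a dedup_key are attached to the same open incident.
        "dedup_key": f"{alert['model']}:{alert['alert_name']}",
        "payload": {
            "summary": f"{alert['alert_name']} on {alert['model']}",
            "source": "fiddler",
            # Unknown criticality labels fall back to "warning" in this sketch.
            "severity": SEVERITY_MAP.get(alert.get("criticality", "").lower(), "warning"),
            "custom_details": {
                "value": alert.get("value"),
                "threshold": alert.get("threshold"),
            },
        },
    }
```

An event body like this would be sent as JSON in a POST to https://events.pagerduty.com/v2/enqueue; the native integration may populate different fields.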

Team Collaboration Integrations

Slack (Coming Soon)

Planned Features:

  • Channel Notifications - Post alerts to team Slack channels

  • Interactive Messages - Acknowledge, snooze, or resolve alerts from Slack

  • Scheduled Reports - Daily/weekly model performance summaries

  • Threaded Discussions - Collaborate on incident resolution in threads

Example Configuration:

Microsoft Teams (Coming Soon)

Planned Features:

  • Adaptive Cards - Rich, interactive alert notifications

  • Team Channels - Route alerts to relevant team channels

  • Bot Commands - Query model status from Teams chat

  • Integration with Workflows - Trigger Teams workflows on alerts

Alert Routing Patterns

Pattern 1: Severity-Based Routing

Route alerts to different channels based on severity:
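
A minimal sketch of severity-based routing, assuming three severity levels. The level names and destination identifiers are hypothetical and are not taken from Fiddler's configuration schema.

```python
# Hypothetical severity-to-destination table; names are illustrative only.
ROUTES = {
    "critical": ["pagerduty", "datadog"],  # page on-call and record the event
    "warning": ["datadog"],                # visible on dashboards, no page
    "info": ["webhook"],                   # low-priority sink
}


def route_by_severity(alert: dict) -> list:
    """Return the destinations for an alert based on its severity."""
    severity = alert.get("severity", "info").lower()
    # Unrecognized severities use the warning route so they are not dropped.
    return ROUTES.get(severity, ROUTES["warning"])
```

Keeping the table as data rather than branching logic makes it easy to review which severities can page someone.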

Pattern 2: Team-Based Routing

Different teams get different alerts:
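
One way to picture team-based routing is a list of subscription rules. The team names, alert types, and destinations below are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TeamRoute:
    """Hypothetical routing rule: which alert types a team receives, and where."""
    team: str
    alert_types: set = field(default_factory=set)
    destination: str = "webhook"


# Illustrative rules only.
TEAM_ROUTES = [
    TeamRoute("ml-engineering", {"drift", "performance"}, "pagerduty"),
    TeamRoute("data-science", {"data_quality"}, "datadog"),
]


def teams_for(alert_type: str) -> list:
    """Return (team, destination) pairs subscribed to the given alert type."""
    return [(r.team, r.destination) for r in TEAM_ROUTES if alert_type in r.alert_types]
```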

Pattern 3: Composite Alerting

Alert on compound conditions across multiple platforms:
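
The compound condition mentioned earlier (model drift plus high latency) reduces to a logical AND over two signals. The sketch below assumes both values have already been fetched; the parameter names and default thresholds are hypothetical.

```python
def composite_alert(drift_score: float, p95_latency_ms: float,
                    drift_threshold: float = 0.3,
                    latency_threshold_ms: float = 500.0) -> bool:
    """Fire only when drift and latency exceed their thresholds at the same time."""
    return drift_score > drift_threshold and p95_latency_ms > latency_threshold_ms
```

In Datadog, a condition of this kind is typically expressed as a composite monitor that combines two existing monitors.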

Metric Export Patterns

Export Fiddler Metrics to Datadog

Exported Metrics:

  • fiddler.model.drift.score - Overall drift score (0-1)

  • fiddler.model.drift.feature.<feature_name> - Per-feature drift

  • fiddler.model.performance.<metric> - Model performance metrics

  • fiddler.model.data_quality.score - Data quality score

  • fiddler.model.predictions.count - Prediction volume

  • fiddler.model.predictions.latency - Prediction latency percentiles

Query Fiddler Metrics in Datadog
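
Once exported, these metrics can be queried like any other Datadog metric, using the aggregator:metric{tag:value} form. The helper below is a small sketch that assembles such a query string; the tag keys (model, environment) follow the tag propagation bullet above, but the exact keys Fiddler emits are an assumption.

```python
def datadog_query(metric: str, tags: dict, aggregator: str = "avg") -> str:
    """Build a Datadog metric query string: <aggregator>:<metric>{tag:value,...}."""
    # Datadog uses "*" as the scope when no tag filter is applied.
    scope = ",".join(f"{k}:{v}" for k, v in sorted(tags.items())) or "*"
    return f"{aggregator}:{metric}{{{scope}}}"


query = datadog_query(
    "fiddler.model.drift.score",
    {"model": "fraud_detection", "environment": "production"},
)
```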

Alert Lifecycle Management

Alert States

Fiddler alerts transition through these states:
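
The state list itself is not reproduced on this page. As a rough sketch only, the code below assumes three states (triggered, acknowledged, resolved), inferred from the acknowledge and resolve actions mentioned elsewhere in this document; Fiddler's actual state names and transitions may differ.

```python
from enum import Enum


class AlertState(Enum):
    # Assumed states, inferred from the acknowledge/resolve actions on this page.
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Allowed transitions in this sketch; a resolved alert is terminal.
TRANSITIONS = {
    AlertState.TRIGGERED: {AlertState.ACKNOWLEDGED, AlertState.RESOLVED},
    AlertState.ACKNOWLEDGED: {AlertState.RESOLVED},
    AlertState.RESOLVED: set(),
}


def transition(current: AlertState, target: AlertState) -> AlertState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```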

Synchronization with External Tools:

  • PagerDuty: Bidirectional state sync (acknowledge, resolve)

  • Datadog: Event-based updates

  • Slack (Coming Soon): Interactive message updates

Alert Deduplication

Prevent alert storms with intelligent deduplication:
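
Time-window deduplication can be sketched as follows: an alert is suppressed if another alert with the same key was delivered within the window. The class name, the key format, and the default window are hypothetical.

```python
class Deduplicator:
    """Suppress repeats of the same alert key inside a time window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._last_sent = {}  # dedup key -> timestamp of the last delivered alert

    def should_send(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: suppress
        self._last_sent[key] = now
        return True
```

Because suppressed alerts do not refresh the timestamp, a persistent problem is re-announced once per window rather than silenced indefinitely.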

Custom Webhook Integrations

For platforms not natively supported, use generic webhooks:

Webhook Payload Example:
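
The payload schema is not shown on this page, so the fields below are hypothetical. The sketch illustrates the general shape of a JSON webhook delivery using only the Python standard library; it builds the request without sending it.

```python
import json
import urllib.request


def build_webhook_request(url: str, alert: dict) -> urllib.request.Request:
    """Wrap an alert in a JSON POST request (the request is built, not sent)."""
    body = json.dumps(alert, sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload fields, for illustration only.
example_alert = {
    "alert_name": "feature_drift_high",
    "model": "fraud_detection",
    "severity": "critical",
    "metric": "fiddler.model.drift.score",
    "value": 0.42,
    "threshold": 0.3,
}
req = build_webhook_request("https://example.com/hooks/fiddler", example_alert)
```

Passing req to urllib.request.urlopen would perform the delivery; a receiving endpoint should answer with a 2xx status so the sender can treat the alert as delivered.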

Monitoring Integration Health

Track Integration Status

Alerts on Integration Failures
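
A simple way to detect a failing integration is to count consecutive delivery failures and flag the integration once a threshold is reached. This is a sketch under that assumption; the class and its threshold are hypothetical and not part of Fiddler's API.

```python
class IntegrationHealth:
    """Count consecutive delivery failures per integration."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self._consecutive_failures = {}

    def record(self, integration: str, success: bool) -> bool:
        """Record one delivery attempt; return True if the integration is now unhealthy."""
        if success:
            # Any successful delivery resets the failure streak.
            self._consecutive_failures[integration] = 0
        else:
            previous = self._consecutive_failures.get(integration, 0)
            self._consecutive_failures[integration] = previous + 1
        return self._consecutive_failures[integration] >= self.failure_threshold
```

Alerts about a failing integration should go out through a different channel than the one that is failing.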

Best Practices

Alert Fatigue Prevention

1. Use Appropriate Severity Levels:

2. Implement Alert Throttling:

3. Use Alert Grouping:
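
Throttling and grouping (items 2 and 3 above) can be sketched together. The throttle caps how many notifications a key may produce per sliding window, and the grouping function bundles related alerts so that each group produces a single notification. All names and field keys here are hypothetical.

```python
from collections import defaultdict, deque


class Throttle:
    """Allow at most max_alerts per key inside a sliding time window."""

    def __init__(self, max_alerts: int, window_seconds: float):
        self.max_alerts = max_alerts
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # key -> timestamps of delivered alerts

    def allow(self, key: str, now: float) -> bool:
        sent = self._sent[key]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_alerts:
            return False
        sent.append(now)
        return True


def group_alerts(alerts: list) -> dict:
    """Group alerts by (model, alert type) so each group yields one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["model"], alert["type"])].append(alert)
    return dict(groups)
```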

Incident Response Runbooks

Include runbook links in alert payloads:
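
A minimal sketch of payload enrichment, assuming a lookup table keyed by alert type. The runbook_url field name and the URLs are placeholders, not a documented Fiddler payload field.

```python
# Hypothetical runbook index keyed by alert type; the URLs are placeholders.
RUNBOOKS = {
    "drift": "https://wiki.example.com/runbooks/model-drift",
    "data_quality": "https://wiki.example.com/runbooks/data-quality",
}


def with_runbook(payload: dict) -> dict:
    """Return a copy of the payload with a runbook_url field when one is known."""
    enriched = dict(payload)  # leave the caller's payload untouched
    url = RUNBOOKS.get(payload.get("type"))
    if url:
        enriched["runbook_url"] = url
    return enriched
```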

Security & Compliance

Secure Credential Management

Never hardcode credentials:
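
One common alternative to hardcoding is to read secrets from environment variables (or a secrets manager) at startup and fail loudly when one is missing. The helper and the variable name in the usage comment are hypothetical.

```python
import os


def load_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


# Usage (variable name is an assumption):
#   routing_key = load_secret("PAGERDUTY_ROUTING_KEY")
```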

Alert Data Privacy

PII Redaction in Alerts:
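
As a rough illustration, free-text alert fields can be passed through pattern-based redaction before they leave the platform. The two patterns below (email addresses and US SSN-shaped numbers) are examples only; they are not a complete PII detector and are not Fiddler's built-in redaction.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace email addresses and US SSN-shaped numbers with placeholders."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return US_SSN.sub("[REDACTED_SSN]", text)
```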

Audit Logging

Track alert delivery:
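
Delivery tracking can be as simple as one structured log line per delivery attempt. The sketch below uses Python's standard logging module; the logger name and record fields are assumptions.

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("alert_audit")


def audit_record(alert_id: str, destination: str, status: str) -> str:
    """Build one JSON audit line for a delivery attempt, log it, and return it."""
    record = json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "alert_id": alert_id,
        "destination": destination,
        "status": status,
    }, sort_keys=True)
    audit_log.info(record)
    return record
```

Writing one JSON object per line keeps the audit trail easy to ship to a log aggregator and to query later.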

Troubleshooting

Common Issues

Alerts Not Delivered:

  • Verify integration credentials are valid and not expired

  • Check network connectivity from Fiddler to external platform

  • Ensure webhook endpoints are reachable (not blocked by firewall)

  • Confirm that the alert thresholds are actually being crossed

Duplicate Alerts:

  • Enable alert deduplication with appropriate time windows

  • Check whether multiple notification channels are configured for the same alert

  • Verify the integration isn't configured twice

Missing Alert Context:

  • Ensure include_context=True is set in the alert configuration

  • Check payload template includes necessary fields

  • Verify external platform supports rich payloads (some SMS gateways don't)

Integration Selector

Choose the right integration for your use case:

| Your Need | Recommended Integration | Why |
| --- | --- | --- |
| On-call engineer paging | PagerDuty | Escalation policies, incident management |
| Infrastructure correlation | Datadog | Unified metrics, correlated dashboards |
| Team notifications | Slack (Coming Soon) | Channel-based, collaborative triage |
| Custom internal tools | Generic Webhooks | Flexible, integrate with any HTTP endpoint |
| Multi-tool strategy | Datadog + PagerDuty | Metrics + incidents in one workflow |

