Why Data Integration Matters
AI observability requires continuous data flow from your ML pipelines and applications. Fiddler’s data platform integrations enable:
- Automated Data Ingestion - Pull training datasets and production events without manual uploads
- Real-Time Monitoring - Stream prediction events for immediate drift and performance detection
- Unified Data Pipeline - Single integration point for all your ML data sources
- Ground Truth Enrichment - Automatically join production predictions with delayed labels
- Historical Analysis - Query data warehouse for model performance over time
Integration Categories
🏢 Data Warehouses
Connect Fiddler to your cloud data warehouse for batch data ingestion and historical analysis. Supported Platforms:
- Snowflake - Cloud data warehouse with zero-copy data sharing ✓ GA
- Google BigQuery - Serverless data warehouse with SQL analytics ✓ GA
Common Use Cases:
- Import training datasets from data warehouse tables
- Query historical model predictions for performance analysis
- Join production events with delayed ground truth labels
- Export Fiddler metrics back to the warehouse for BI tools
📊 Data Streaming
Stream real-time prediction events and feedback directly to Fiddler for immediate observability. Supported Platforms:
- Apache Kafka - Distributed event streaming platform ✓ GA
- Amazon S3 - Object storage with event notifications ✓ GA
Common Use Cases:
- Stream model predictions in real-time from production services
- Monitor agentic AI interactions as they occur
- Trigger alerts on data quality issues within seconds
- Capture ground truth feedback from user interactions
🔄 Orchestration & Pipelines
Integrate Fiddler into your ML workflow orchestration for automated monitoring at every pipeline stage. Supported Platforms:
- Apache Airflow - Workflow orchestration platform ✓ GA
- AWS SageMaker Pipelines - Managed ML pipeline service ✓ GA
Common Use Cases:
- Automatically upload datasets when training pipelines complete
- Trigger model evaluation as part of CI/CD workflows
- Schedule periodic drift checks and performance reports
- Orchestrate ground truth label collection and enrichment
Data Warehouse Integrations
Snowflake
Why Snowflake + Fiddler:
- Zero-Copy Data Sharing - No data duplication, direct queries to Snowflake
- Secure Data Access - OAuth 2.0 and key-pair authentication
- Scalable Analytics - Leverage Snowflake’s compute for large datasets
- Cost-Effective - Pay only for queries executed, no data transfer fees
Google BigQuery
Why BigQuery + Fiddler:
- Serverless Architecture - No infrastructure management
- SQL-Based Queries - Familiar interface for data teams
- Federated Queries - Join Fiddler data with other GCP sources
- Machine Learning - BigQuery ML model monitoring integration
Streaming Integrations
Apache Kafka
Why Kafka + Fiddler:
- Real-Time Monitoring - Sub-second latency from prediction to observability
- High Throughput - Handle millions of events per second
- Event Replay - Replay historical events for testing and validation
- Exactly-Once Semantics - Guaranteed delivery for critical predictions
Amazon S3
Why S3 + Fiddler:
- Batch Processing - Ingest large datasets efficiently
- Event Notifications - Automatic processing when new files arrive
- Data Lake Integration - Monitor models trained on S3 data lakes
- Cost-Effective Storage - Archive historical predictions in S3
Orchestration & Pipeline Integrations
Apache Airflow
Why Airflow + Fiddler:
- Automated Workflows - Trigger Fiddler operations as DAG tasks
- Dependency Management - Ensure data quality before model training
- Scheduling - Periodic drift checks and model evaluations
- Observability - Monitor ML pipelines and models in one platform
AWS SageMaker Pipelines
Why SageMaker Pipelines + Fiddler:
- Native AWS Integration - Seamless with SageMaker Partner AI App
- End-to-End ML Workflows - From data prep to model monitoring
- Model Registry Integration - Automatic monitoring setup for registered models
- Cost Optimization - Leverage existing SageMaker infrastructure
Integration Selector
Not sure which data integration to use? Here’s a quick decision guide:

| Your Data Source | Recommended Integration | Why |
|---|---|---|
| Snowflake data warehouse | Snowflake connector | Zero-copy sharing, direct queries |
| BigQuery tables | BigQuery connector | Serverless, SQL-based, GCP-native |
| Real-time prediction streams | Kafka integration | Sub-second latency, high throughput |
| S3 data lake | S3 integration | Batch processing, event-driven uploads |
| Airflow ML pipelines | Airflow operators | Automated workflows, task dependencies |
| SageMaker workflows | SageMaker Pipelines | Native AWS integration, model registry |
Getting Started
Prerequisites
Before setting up data integrations, ensure you have:
- Fiddler Account - Cloud or on-premises instance
- API Key - Generate from Fiddler UI Settings
- Data Source Access - Credentials with read permissions
- Network Connectivity - Firewall rules allowing Fiddler → Data Source
General Setup Pattern
All data integrations follow this pattern:
1. Configure Connection
Advanced Patterns
Pattern 1: Multi-Source Data Enrichment
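As a minimal sketch of multi-source enrichment, the snippet below left-joins prediction rows (for example, from a stream) with delayed ground-truth rows (for example, from a warehouse table) on a shared event ID. It uses plain Python dictionaries only; the function name `enrich_with_labels` and the field names `event_id`, `prediction`, and `label` are illustrative assumptions, not part of any Fiddler API.

```python
def enrich_with_labels(predictions, labels, key="event_id"):
    """Left-join prediction rows with delayed ground-truth rows on `key`.

    All names here are hypothetical; adapt them to your own schema.
    """
    labels_by_id = {row[key]: row for row in labels}
    enriched = []
    for pred in predictions:
        merged = dict(pred)
        label_row = labels_by_id.get(pred[key])
        # Keep the prediction even when its label has not arrived yet.
        merged["label"] = label_row["label"] if label_row else None
        enriched.append(merged)
    return enriched


# Illustrative data: two predictions, only one label has arrived so far.
predictions = [
    {"event_id": "e1", "prediction": 0.91},
    {"event_id": "e2", "prediction": 0.12},
]
labels = [{"event_id": "e1", "label": 1}]

enriched = enrich_with_labels(predictions, labels)
```

A left join is used so that predictions without labels are still monitored; their `label` stays `None` until the ground truth shows up in a later run.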
Combine data from multiple sources for comprehensive monitoring.
Pattern 2: Data Quality Validation
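One way to approach pre-ingestion validation is to split rows into accepted and rejected sets before anything is sent onward. The sketch below checks for missing required fields and out-of-range numeric values. The field list, the `validate_rows` helper, and the range for `prediction` are assumptions chosen for illustration.

```python
# Hypothetical required fields; replace with your model's schema.
REQUIRED_FIELDS = ("event_id", "timestamp", "prediction")


def validate_rows(rows, required=REQUIRED_FIELDS, ranges=None):
    """Split rows into (valid, rejected); rejected rows carry their problems."""
    ranges = ranges or {}
    valid, rejected = [], []
    for row in rows:
        # A field counts as missing when it is absent or explicitly None.
        problems = [f"missing {name}" for name in required if row.get(name) is None]
        for name, (low, high) in ranges.items():
            value = row.get(name)
            if value is not None and not (low <= value <= high):
                problems.append(f"{name} out of range")
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"event_id": "e1", "timestamp": "2024-01-01T00:00:00+00:00", "prediction": 0.4},
    {"event_id": "e2", "timestamp": None, "prediction": 0.7},
    {"event_id": "e3", "timestamp": "2024-01-01T00:05:00+00:00", "prediction": 1.8},
]
valid, rejected = validate_rows(rows, ranges={"prediction": (0.0, 1.0)})
```

Returning the rejected rows together with their reasons, rather than silently dropping them, makes it easier to alert on or audit data quality problems later.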
Validate data quality before ingestion.
Pattern 3: Incremental Updates
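Incremental updates are commonly implemented with a watermark: remember the newest timestamp already processed and select only rows newer than it on the next run. The following sketch shows that idea in plain Python; `select_new_rows` and the row layout are hypothetical, and in practice the watermark would be persisted between runs (for example, in a file or a pipeline variable).

```python
from datetime import datetime, timezone


def select_new_rows(rows, watermark, ts_field="timestamp"):
    """Return rows strictly newer than `watermark` and the advanced watermark."""
    new_rows = [row for row in rows if row[ts_field] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((row[ts_field] for row in new_rows), default=watermark)
    return new_rows, new_watermark


rows = [
    {"event_id": "e1", "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"event_id": "e2", "timestamp": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"event_id": "e3", "timestamp": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]

# First run: everything after January 1 is new.
watermark = datetime(2024, 1, 1, tzinfo=timezone.utc)
batch, watermark = select_new_rows(rows, watermark)

# Second run over the same source: nothing new, watermark unchanged.
second_batch, watermark = select_new_rows(rows, watermark)
```

The strict `>` comparison avoids re-sending the row that set the watermark; if several rows can share the exact same timestamp, a tie-breaker such as the event ID would also be needed.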
Efficiently update datasets with only new data.
Data Format Requirements
Baseline/Training Data
Must include:
- Features - All model input features
- Predictions - Model outputs (for validation)
- Metadata (optional) - Additional context fields
Production Event Data
Must include:
- Event ID - Unique identifier
- Timestamp - Event time
- Features - Model inputs
- Predictions - Model outputs
- Model Version (optional) - For multi-model monitoring
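To make the field list above concrete, here is a hedged sketch of a production event record built with a Python dataclass and serialized to JSON. The class name `ProductionEvent`, the JSON key names, and the example feature and prediction names are assumptions for illustration; the exact schema expected by your Fiddler deployment may differ.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class ProductionEvent:
    """Hypothetical container for one production event."""

    features: Dict[str, Any]       # model inputs
    predictions: Dict[str, Any]    # model outputs
    # Unique identifier, generated if the caller does not supply one.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Event time as an ISO 8601 string in UTC.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    model_version: Optional[str] = None  # optional, for multi-model monitoring

    def to_json(self) -> str:
        return json.dumps(asdict(self))


event = ProductionEvent(
    features={"age": 42, "plan": "premium"},
    predictions={"churn_probability": 0.3},
    model_version="v2",
)
payload = json.loads(event.to_json())
```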
Security & Compliance
Authentication Methods
Snowflake:
- Username/Password
- Key Pair Authentication (recommended for production)
- OAuth 2.0
BigQuery:
- Service Account JSON key
- Application Default Credentials
- Workload Identity (GKE)
Kafka:
- SASL/PLAIN
- SASL/SCRAM
- mTLS
Amazon S3:
- IAM Role (recommended for AWS deployments)
- Access Key / Secret Key
- Cross-account access via IAM role assumption
Data Privacy
- Encryption in Transit - TLS 1.3 for all data transfers
- Encryption at Rest - Data encrypted in Fiddler storage
- PII Redaction - Automatically detect and redact sensitive fields
- Data Retention - Configurable retention policies per dataset
Network Security
Firewall Rules:
Private Connectivity:
- AWS PrivateLink - For SageMaker Partner AI App
- VPC Peering - Direct connection to data sources
- VPN Tunnels - Secure connectivity for on-premises sources
Monitoring Data Pipeline Health
Connection Health Checks
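A basic health check can start with network reachability: can a TCP connection to the data source be opened at all? The sketch below uses only Python's standard `socket` module. The helper name `is_reachable` and the example host are hypothetical, and a successful TCP connection does not prove that credentials or permissions are valid.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example call (hypothetical broker address):
# is_reachable("kafka.example.internal", 9092)
```

Running a check like this from the same network as the component that connects to the source helps separate firewall problems from authentication problems.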
Data Ingestion Metrics
Monitor data pipeline performance:
- Ingestion Latency - Time from source to Fiddler
- Throughput - Events per second processed
- Error Rate - Failed ingestion attempts
- Data Freshness - Time since last successful update
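The four metrics above can be computed from a simple log of ingestion attempts. The following sketch assumes each attempt records when the event was produced, when ingestion was attempted, and whether it succeeded; the `IngestionRecord` type and `summarize` function are illustrative and not part of any Fiddler API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class IngestionRecord:
    source_time: datetime    # when the event was produced at the source
    ingested_time: datetime  # when ingestion was attempted
    succeeded: bool


def summarize(records: List[IngestionRecord], now: datetime) -> Dict[str, Optional[float]]:
    """Compute latency, throughput, error rate, and freshness for a batch."""
    ok = [r for r in records if r.succeeded]
    latencies = [(r.ingested_time - r.source_time).total_seconds() for r in ok]
    times = [r.ingested_time for r in records]
    window = (max(times) - min(times)).total_seconds() if times else 0.0
    return {
        # Mean time from source to ingestion, successful events only.
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else None,
        # Successful events per second over the observed window.
        "throughput_eps": len(ok) / window if window > 0 else None,
        # Share of attempts that failed.
        "error_rate": (len(records) - len(ok)) / len(records) if records else None,
        # Seconds since the last successful ingestion.
        "freshness_s": (now - max(r.ingested_time for r in ok)).total_seconds() if ok else None,
    }


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    IngestionRecord(t0, t0 + timedelta(seconds=2), True),
    IngestionRecord(t0 + timedelta(seconds=1), t0 + timedelta(seconds=5), True),
    IngestionRecord(t0 + timedelta(seconds=2), t0 + timedelta(seconds=12), False),
]
metrics = summarize(records, now=t0 + timedelta(seconds=30))
```

Thresholds on `error_rate` or `freshness_s` are natural triggers for pipeline-failure alerts.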
Alerts on Pipeline Failures
Troubleshooting
Common Issues
Connection Timeouts:
- Check network connectivity and firewall rules
- Verify credentials are current and have proper permissions
- Ensure data source is reachable from Fiddler’s network
Schema Mismatches:
- Validate data types match Fiddler’s expected schema
- Check for null values in required fields
- Ensure timestamp fields use supported formats (ISO 8601)
Slow Ingestion:
- For Kafka: Check consumer lag and partition count
- For Data Warehouses: Optimize queries, add indexes
- For S3: Use Parquet or ORC instead of CSV
Data Quality Issues:
- Enable data validation rules before ingestion
- Set up alerts for out-of-range values
- Configure automatic PII redaction
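For the timestamp-format issue listed above, normalizing every timestamp to ISO 8601 in UTC before ingestion removes one common source of schema errors. The helper below is a sketch using only the standard library; `to_iso8601_utc` is a hypothetical name, and treating naive datetimes as UTC is an assumption you should adjust to match your source system.

```python
from datetime import datetime, timezone


def to_iso8601_utc(value):
    """Normalize epoch seconds, a datetime, or an ISO 8601 string to UTC."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    elif isinstance(value, datetime):
        dt = value
    else:
        text = str(value)
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


from_epoch = to_iso8601_utc(0)
from_zulu = to_iso8601_utc("2024-03-01T12:00:00Z")
from_offset = to_iso8601_utc("2024-03-01T14:00:00+02:00")
from_naive = to_iso8601_utc(datetime(2024, 3, 1, 12, 0, 0))
```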
Related Integrations
- Cloud Platforms - Deploy Fiddler on AWS, Azure, GCP
- ML Platforms - Integrate with Databricks, MLflow
- Agentic AI - Monitor LangGraph and Strands Agents
- Monitoring & Alerting - Send alerts to incident management tools