Integration With S3

This guide explains how to integrate AWS S3 with Fiddler to retrieve baseline or production data for model monitoring. You'll learn how to:

  • Extract data from S3 buckets using different authentication methods

  • Load data efficiently based on your needs

  • Connect the extracted data with Fiddler's monitoring capabilities

How to Integrate Fiddler with AWS S3

Prerequisites

Before getting started, ensure you have:

  • An AWS account with access to the required S3 bucket

  • Required Python packages installed: boto3, pandas, and fiddler-client

  • Appropriate AWS credentials or profile configuration

  • Basic familiarity with Python and AWS S3 concepts

AWS Authentication Methods

Method 1: Using AWS Access Keys

If you're using AWS access keys for authentication, use this approach:

import boto3
import pandas as pd

# AWS Configuration
S3_BUCKET = 'your_bucket_name'
S3_FILENAME = 'path/to/your/file.csv'
AWS_ACCESS_KEY_ID = 'your_access_key'
AWS_SECRET_ACCESS_KEY = 'your_secret_key'
AWS_REGION = 'your_region' 

# Create AWS session
session = boto3.session.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION,
)

# Initialize S3 client
s3 = session.client('s3')

# Read data into pandas DataFrame
s3_data = s3.get_object(Bucket=S3_BUCKET, Key=S3_FILENAME)['Body']
df = pd.read_csv(s3_data)

Method 2: Using AWS Profiles (Recommended)

For enhanced security, we recommend using an AWS profile instead of hardcoding credentials:

import boto3
import pandas as pd

# Configuration
S3_BUCKET = 'your_bucket_name'
S3_FILENAME = 'path/to/your/file.csv'
AWS_PROFILE = 'your_profile_name'

# Create session using profile
session = boto3.session.Session(profile_name=AWS_PROFILE)
s3 = session.client('s3')

# Read data
s3_data = s3.get_object(Bucket=S3_BUCKET, Key=S3_FILENAME)['Body']
df = pd.read_csv(s3_data)
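
If your code runs on AWS infrastructure with an IAM role attached (for example, EC2, Lambda, or SageMaker), boto3 can also resolve credentials automatically through its default credential chain, so no keys or profile name are needed. A minimal sketch under that assumption, reusing the placeholder bucket and key names from above:

import boto3
import pandas as pd

# Configuration
S3_BUCKET = 'your_bucket_name'
S3_FILENAME = 'path/to/your/file.csv'

# With no explicit credentials, boto3 falls back to its default chain:
# environment variables, shared config/credentials files, then the attached IAM role
s3 = boto3.client('s3')

# Read data into pandas DataFrame
s3_data = s3.get_object(Bucket=S3_BUCKET, Key=S3_FILENAME)['Body']
df = pd.read_csv(s3_data)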

Data Loading Options

Option 1: Direct Memory Loading

For smaller datasets that fit in memory, load directly into a pandas DataFrame as shown in the examples above.

Option 2: File System Loading

For larger datasets or when memory constraints exist, save to disk first:

import boto3

# AWS Configuration
S3_BUCKET = 'your_bucket_name'
S3_FILENAME = 'path/to/your/file.csv'
OUTPUT_PATH = 'local/path/to/output.csv'

# Initialize S3 client (using either authentication method)
session = boto3.session.Session(profile_name='your_profile_name')
s3 = session.client('s3')

# Download file
s3.download_file(
    Bucket=S3_BUCKET,
    Key=S3_FILENAME,
    Filename=OUTPUT_PATH
)
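
Once the file is on disk, you can process it incrementally instead of loading it all at once. A minimal sketch using pandas' chunked CSV reading (the chunk size here is an arbitrary example value):

import pandas as pd

OUTPUT_PATH = 'local/path/to/output.csv'

# Iterate over the downloaded file in fixed-size chunks to keep memory usage bounded
for chunk in pd.read_csv(OUTPUT_PATH, chunksize=100_000):
    # Replace this with your own per-chunk processing or publishing logic
    print(f'Loaded chunk with {len(chunk)} rows')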

Using AWS S3 Data with Fiddler

For Baseline Datasets

After loading your data, you can use it to create a baseline dataset in Fiddler. See the Creating a Baseline Dataset guide for more details.

import fiddler as fdl

# Assumes an initialized Python client session and instantiated Model
job = model.publish(
    source=df,  # DataFrame loaded from S3 in the examples above
    environment=fdl.EnvType.PRE_PRODUCTION,
    dataset_name='your_baseline_name',
)
print(
    f'Initiated pre-production dataset upload with Job ID = {job.id}'
)
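
Publishing runs asynchronously. If you want your script to block until the upload finishes, the returned Job object can be waited on (assuming the Job.wait() method available in recent versions of the Python client):

# Optionally block until the pre-production upload completes
job.wait()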

For Production Traffic

To publish production data for monitoring, use the same publish method with the production environment. Refer to the batch publishing guide for more details, and see the additional publishing guides for other publishing options.

import fiddler as fdl

# Assumes an initialized Python client session and instantiated Model
job = model.publish(
    source=df,  # DataFrame loaded from S3 in the examples above
    environment=fdl.EnvType.PRODUCTION,
)
print(
    f'Initiated production data publish with Job ID = {job.id}'
)
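
If you saved a large file to disk (Option 2 above), you can combine chunked reading with batch publishing so the whole file never has to sit in memory. A minimal sketch that mirrors the publish call above; the chunk size is an arbitrary example value:

import fiddler as fdl
import pandas as pd

OUTPUT_PATH = 'local/path/to/output.csv'

# Assumes an initialized Python client session and instantiated Model
for chunk in pd.read_csv(OUTPUT_PATH, chunksize=100_000):
    job = model.publish(
        source=chunk,
        environment=fdl.EnvType.PRODUCTION,
    )
    print(f'Published chunk with Job ID = {job.id}')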

Best Practices

  • Always use AWS profiles instead of hardcoded credentials in production environments

  • Implement proper error handling around S3 operations (see the sketch after this list)

  • Consider data size when choosing between memory and file system loading

  • Use appropriate AWS IAM roles and permissions

  • Monitor memory usage when working with large datasets
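
For example, wrapping S3 reads in error handling lets you surface missing objects or permission problems clearly instead of failing with an opaque traceback. A minimal sketch using botocore's ClientError and the profile-based session from above:

import boto3
import pandas as pd
from botocore.exceptions import ClientError

S3_BUCKET = 'your_bucket_name'
S3_FILENAME = 'path/to/your/file.csv'

session = boto3.session.Session(profile_name='your_profile_name')
s3 = session.client('s3')

try:
    s3_data = s3.get_object(Bucket=S3_BUCKET, Key=S3_FILENAME)['Body']
    df = pd.read_csv(s3_data)
except ClientError as err:
    # Errors such as NoSuchKey or AccessDenied surface here with an error code
    error_code = err.response['Error']['Code']
    print(f'Failed to read s3://{S3_BUCKET}/{S3_FILENAME}: {error_code}')
    raise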
