Creating a Baseline Dataset

This document explains the importance of setting up a baseline dataset for monitoring data integrity in production. It provides examples of creating different types of baselines such as static pre-production, static production, and rolling production.

Set up Baseline

To monitor drift or data integrity issues in production data, you need baseline data to compare against. A baseline dataset is a representative sample of the data you expect to see in production. It represents the ideal data that your model works best on. For this reason, a baseline dataset should be sampled from your model’s training set.

A few things to keep in mind when designing a baseline dataset:

  • It’s important to include enough data to ensure you have a representative sample of the training set.
  • Consider including the extreme values (min/max) of each column in your training set so that range violations in production data can be monitored properly. If you choose not to, you can specify these ranges manually before uploading; see customizing your dataset schema.
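
The two points above can be sketched in plain Python: take a random sample of the training rows, then add the rows that hold each numeric column's minimum and maximum. This is a minimal illustration, not part of the Fiddler client; the helper `build_baseline`, the column names, and the generated data are all hypothetical.

```python
import random


def build_baseline(rows, sample_size, numeric_columns, seed=0):
    """Sample training rows and add the rows holding each column's min and max.

    Hypothetical helper for illustration only; `rows` is a list of dicts.
    """
    rng = random.Random(seed)
    indices = range(len(rows))
    # Random sample of row indices, capped at the size of the training set.
    chosen = set(rng.sample(indices, min(sample_size, len(rows))))
    # Make sure the extreme values of every numeric column are represented.
    for col in numeric_columns:
        chosen.add(min(indices, key=lambda i: rows[i][col]))
        chosen.add(max(indices, key=lambda i: rows[i][col]))
    # Return the selected rows in their original order.
    return [rows[i] for i in sorted(chosen)]


# Hypothetical training data with two numeric columns.
training_rows = [{"age": 18 + i % 60, "income": 20_000 + 750 * i} for i in range(1000)]
baseline_rows = build_baseline(
    training_rows, sample_size=100, numeric_columns=["age", "income"]
)
```

The result holds at least `sample_size` rows and at most two extra rows per numeric column, so the observed ranges in the baseline match those of the full training set.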

Baseline Type: Static Pre-production

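import fiddler as fdl

# Assumes the client is already initialized and `model` refers to an existing model.
# Use the first dataset attached to the model as the baseline source.
dataset = next(fdl.Dataset.list(model_id=model.id))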

static_pre_prod_baseline = fdl.Baseline(
    name='static_preprod_1',
    model_id=model.id,
    environment=fdl.EnvType.PRE_PRODUCTION,
    type_=fdl.BaselineType.STATIC,
    dataset_id=dataset.id,
)
static_pre_prod_baseline.create()

print(f'Static pre-production baseline created with id - {static_pre_prod_baseline.id}')

Baseline Type: Static Production

from datetime import datetime, timedelta

static_prod_baseline = fdl.Baseline(
    name='static_prod_1',
    model_id=model.id,
    environment=fdl.EnvType.PRODUCTION,
    type_=fdl.BaselineType.STATIC,
    # Fixed window of production data: from 12 hours ago to 6 hours ago.
    start_time=(datetime.now() - timedelta(days=0.5)).timestamp(),
    end_time=(datetime.now() - timedelta(days=0.25)).timestamp(),
)
static_prod_baseline.create()

print(f'Static production baseline created with id - {static_prod_baseline.id}')

Baseline Type: Rolling Production

rolling_prod_baseline = fdl.Baseline(
    name='rolling_prod_1',
    model_id=model.id,
    environment=fdl.EnvType.PRODUCTION,
    type_=fdl.BaselineType.ROLLING,
    window_bin_size=fdl.WindowBinSize.WEEK,
    offset_delta=4,
)
rolling_prod_baseline.create()

print(f'Rolling production baseline created with id - {rolling_prod_baseline.id}')
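
To make the rolling parameters more concrete, the sketch below computes which slice of production data would serve as the reference for a given time bin. It assumes that the offset equals `offset_delta` multiplied by the window size and that the reference window is one window long; these semantics are an assumption for illustration and are not confirmed by this page. The helper `rolling_reference_window` is hypothetical and not part of the Fiddler client.

```python
from datetime import datetime, timedelta, timezone


def rolling_reference_window(bin_start, bin_size, offset_delta):
    """Return (start, end) of the reference window for a production bin.

    Assumed semantics: the window is shifted back by offset_delta * bin_size
    and spans exactly one bin_size.
    """
    offset = bin_size * offset_delta
    ref_start = bin_start - offset
    return ref_start, ref_start + bin_size


# Weekly window with an offset of 4, mirroring the values used above.
bin_start = datetime(2024, 5, 27, tzinfo=timezone.utc)
ref_start, ref_end = rolling_reference_window(bin_start, timedelta(weeks=1), 4)
print(ref_start.date(), ref_end.date())
```

Under these assumptions, production data for the week starting 27 May 2024 would be compared against the week starting 29 April 2024, and the reference window moves forward as new production bins arrive.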

List baselines

for baseline in fdl.Baseline.list(model_id=model.id):
    print(f'Baseline: {baseline.id} - {baseline.name}')