
Designing a Baseline Dataset

For Fiddler to detect drift or data integrity issues in incoming production data, it needs a reference to compare that data against.

A baseline dataset is a representative sample of the kind of data you expect to see in production. It represents the ideal form of data that your model works best on.

For this reason, it should be sampled from your model’s training set.

A few things to keep in mind when designing a baseline dataset:

It’s best to include 10,000 to 50,000 rows of data to ensure a representative sample, though you can add more if you like. Keep in mind that very large baseline datasets will likely give diminishing returns and slower performance.

Consider including the rows that hold the extreme values (min/max) of each column in your training set, so you can properly monitor out-of-range violations in production data. If you choose not to, you can manually specify these ranges before upload.
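As a rough illustration of the two points above, the following sketch uses pandas to draw a random sample from a training set and then append the rows holding each numeric column's minimum and maximum. The function name build_baseline, the column names, and the synthetic data are all hypothetical, and the sketch deliberately stops before upload: it does not use any Fiddler API. It also computes the per-column ranges, which may be useful if you prefer to specify them manually instead.

```python
import numpy as np
import pandas as pd


def build_baseline(train: pd.DataFrame, n_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Sample a baseline and add the rows holding each numeric column's min and max.

    Hypothetical helper for illustration; not part of any Fiddler client.
    """
    # Use a fresh 0..n-1 index so row labels are unique.
    train = train.reset_index(drop=True)

    # Never ask for more rows than the training set has.
    n = min(n_rows, len(train))
    sample = train.sample(n=n, random_state=seed)

    # Collect the labels of the rows with the extreme value of each numeric column.
    numeric = train.select_dtypes(include="number")
    extreme_idx = set()
    for col in numeric.columns:
        if numeric[col].notna().any():  # skip columns that are entirely missing
            extreme_idx.add(numeric[col].idxmin())
            extreme_idx.add(numeric[col].idxmax())

    extremes = train.loc[sorted(extreme_idx)]

    # A row that was both sampled and extreme should appear only once.
    baseline = pd.concat([sample, extremes])
    return baseline[~baseline.index.duplicated(keep="first")]


# Synthetic training data, made up for this example.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "age": rng.integers(18, 90, size=100_000),
    "income": rng.normal(50_000, 15_000, size=100_000),
})

baseline = build_baseline(train, n_rows=10_000)

# Per-column min/max of the training set, in case you specify ranges by hand.
ranges = train.agg(["min", "max"])
```

Because the extreme rows are appended after sampling, the baseline may contain a few more rows than n_rows. Note that this sketch only handles numeric columns; categorical columns would need their own treatment.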
