
Data Integrity

(Chart: Monitor Data Integrity, data integrity violations over time)

ML models are increasingly driven by complex feature pipelines and automated workflows that involve dynamic data. As data is transformed from source to model input, inconsistencies and errors can be introduced.

There are three types of violations that can occur at model inference: missing feature values, type mismatches (e.g., sending a float input for a categorical feature), and range mismatches (e.g., sending an unknown US state for a State categorical feature).

You can track all of these violations in the Data Integrity tab.

How does this work?

It can be tedious to set up constraints for individual features when there are tens or hundreds of them. To avoid this, you can provide Fiddler with a baseline dataset that is representative of the data your model will infer on when deployed; for most teams this is the training set. The dataset can be uploaded to Fiddler using the Python package and tied to the model you want to monitor.
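
For instance, an upload might look like the minimal sketch below. It assumes an older (1.x/2.x-style) Fiddler Python client; the URL, project, and dataset names are placeholders, and method names can differ across client versions, so check the documentation for the version you have installed.

```python
# Minimal sketch, assuming an older (1.x/2.x-style) Fiddler Python client.
# Exact class and method names differ across client versions, so treat them
# as placeholders and check the docs for your installed version.
import pandas as pd
import fiddler as fdl

# A representative sample of the data the model will see in production,
# typically the training set (or a sample of it).
baseline_df = pd.read_csv("training_data.csv")

client = fdl.FiddlerApi(
    url="https://your-org.fiddler.ai",  # your Fiddler deployment URL (placeholder)
    org_id="your_org",                  # your organization id (placeholder)
    auth_token="YOUR_API_TOKEN",        # API token from your Fiddler account
)

# Infer column types and ranges from the DataFrame, then upload it as the
# baseline dataset that the monitored model will be tied to.
dataset_info = fdl.DatasetInfo.from_dataframe(baseline_df)
client.upload_dataset(
    project_id="my_project",            # hypothetical project name
    dataset_id="baseline",
    dataset={"baseline": baseline_df},
    info=dataset_info,
)
```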

Fiddler automatically generates constraints based on the distribution of data in this dataset; a sketch of this logic in code follows the list below. For example:

  • Missing value: If the feature has no missing values, then the data integrity constraint will be set up to trigger when it sees any missing value. Similarly, if the feature has 50% of its values missing, then the data integrity constraint will be set up to trigger when it sees more than 50% of values missing in the specified time range.
  • Type mismatch: The data integrity constraint will be set up to expect the same type specified in this baseline dataset.
  • Range mismatch: For a categorical feature, the data integrity constraint will be set up to trigger when it sees any value besides the ones specified in the baseline. Similarly for continuous variables, the constraint will be set up to trigger if the values are outside of the range specified in the baseline.
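
The snippet below is an illustrative-only pandas sketch of the constraint logic just described; Fiddler derives these constraints automatically from the uploaded baseline, and the function and field names here are invented for illustration. Note that the missing-value rule is simplified to a per-event check, whereas Fiddler applies its threshold over a time range.

```python
# Illustrative-only sketch of the constraint logic described above; Fiddler
# derives these constraints automatically from the uploaded baseline.
import pandas as pd

def build_constraints(baseline: pd.DataFrame) -> dict:
    """Derive per-feature constraints from a baseline DataFrame."""
    constraints = {}
    for col in baseline.columns:
        series = baseline[col]
        rules = {
            # Missing value: allow at most the missing rate seen in the baseline
            # (Fiddler applies this threshold over a time range, not per event).
            "max_missing_fraction": series.isna().mean(),
            # Type mismatch: expect the same type as the baseline column.
            "dtype": series.dtype,
        }
        if pd.api.types.is_numeric_dtype(series):
            # Range mismatch (continuous): stay within the baseline min/max.
            rules["min"], rules["max"] = series.min(), series.max()
        else:
            # Range mismatch (categorical): only values seen in the baseline.
            rules["allowed_values"] = set(series.dropna().unique())
        constraints[col] = rules
    return constraints

def check_event(event: dict, constraints: dict) -> list:
    """Return (feature, violation_type) pairs for a single inference event."""
    violations = []
    for col, rules in constraints.items():
        value = event.get(col)
        if value is None:
            if rules["max_missing_fraction"] == 0:
                violations.append((col, "missing_value"))
            continue
        if pd.api.types.is_numeric_dtype(rules["dtype"]):
            if not isinstance(value, (int, float)):
                violations.append((col, "type_mismatch"))
            elif not rules["min"] <= value <= rules["max"]:
                violations.append((col, "range_mismatch"))
        elif value not in rules["allowed_values"]:
            violations.append((col, "range_mismatch"))
    return violations
```

For instance, an event whose state value never appeared in the baseline would come back from check_event with a ("state", "range_mismatch") violation.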

What is being tracked?

The time series above tracks the violations of data integrity constraints set up for this model.

  • Missing values – This indicates the percentage of missing value violations over all features for a given period of time.
  • Type mismatch – This indicates the percentage of data type mismatch violations over all features for a given period of time.
  • Range mismatch – This indicates the percentage of range mismatch violations over all features for a given period of time.
  • All violating events – This indicates the total of all the data integrity violation types above for a given period of time (illustrated in the sketch below).
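
As a rough illustration of how these percentages relate to one another, the sketch below aggregates a hypothetical log of per-event violations into hourly rates; the column names and the per-hour feature-value count are assumptions, not the actual Fiddler schema.

```python
# Hypothetical sketch: turn a log of per-event violations into the hourly
# percentages described above. Column names and counts are assumed.
import pandas as pd

# One row per (event, feature) violation.
violations = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 00:05", "2023-01-01 00:40", "2023-01-01 01:10",
    ]),
    "feature": ["age", "state", "income"],
    "violation_type": ["missing_value", "range_mismatch", "type_mismatch"],
})

# Total feature values per hour = events per hour x number of features (assumed).
feature_values_per_hour = 1_000

hourly = (
    violations
    .groupby([pd.Grouper(key="timestamp", freq="1h"), "violation_type"])
    .size()
    .unstack(fill_value=0)
)
# "All violating events" is the total of the individual violation types.
hourly["all_violating_events"] = hourly.sum(axis=1)
hourly_pct = 100 * hourly / feature_values_per_hour
print(hourly_pct)
```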

Why is it being tracked?

  • Data integrity issues can cause incorrect data to flow into the model, which can lead to poor model performance and have a negative impact on the business or end-user experience.

What steps should I take with this information?

  • The drill-down below shows the feature-wise breakdown of the violations. The raw counts of the violations are shown in parentheses.
  • If there is a spike in violations, or an unexpected violation occurs (such as a missing value for a feature that doesn't accept missing values), a deeper examination of the feature pipeline may be required.
  • You can also drill down further by examining the data in the Analyze tab, where you can use SQL to slice and dice the data and find the root cause of the issues. A generic example of this kind of slicing follows the list.
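
The generic pandas sketch below shows the kind of slicing you might do; in the Analyze tab you would express the same questions in SQL, and the file, feature, and segment column names here are hypothetical.

```python
# Generic pandas sketch of root-cause slicing; in the Analyze tab the same
# questions would be asked in SQL. File and column names are hypothetical.
import pandas as pd

events = pd.read_csv("production_events.csv", parse_dates=["timestamp"])

# Restrict to the window where the violation spike occurred.
spike = events[(events["timestamp"] >= "2023-01-01") &
               (events["timestamp"] < "2023-01-02")]

# Feature-wise breakdown of missing values in that window.
print(spike.isna().sum().sort_values(ascending=False).head(10))

# Slice further to see whether one segment of traffic drives the violations
# ("age" and "channel" are hypothetical column names).
print(spike[spike["age"].isna()].groupby("channel").size())
```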

