Supervised machine learning involves identifying a predictive task, finding data to enable that task, and building a model using that data.
Fiddler captures this workflow with project, dataset, and model entities.
In Fiddler, a project is essentially a parent folder that hosts one or more models for an ML task (e.g. a project HousePredict for predicting house prices might contain the models LinearRegression-HousePredict and RandomForest-HousePredict).
A model in Fiddler represents a placeholder for a machine-learning model. It's a placeholder because we may not need the model artifacts. Instead, we may just need adequate information about the model in order to monitor model-specific data.
You can upload your model artifacts to Fiddler to unlock high-fidelity explainability for your model. However, it is not required. If you do not wish to upload your artifact but want to explore explainability with Fiddler, we can build a surrogate model on the backend to be used in place of your artifact.
A dataset in Fiddler is a data table containing the features, model outputs, and target for a machine learning model. Optionally, you can also upload metadata and “decision” columns, which can be used to segment the dataset for analyses, track business decisions, and work as protected attributes in bias-related workflows.
In order to monitor production data, a dataset must be uploaded to be used as a baseline for making comparisons. This baseline dataset should be sampled from your model's training data. The sample should be unbiased and should faithfully capture moments of the parent distribution. Further, values appearing in the baseline dataset's columns should be representative of their entire ranges within the complete training dataset.
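As a rough illustration of the sampling guidance above, the following sketch draws a uniform random sample from a training table and reports how much of each numeric column's range the sample spans. It uses pandas and NumPy rather than any Fiddler API. The helper names (`sample_baseline`, `range_coverage`), the column names, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def sample_baseline(train_df: pd.DataFrame, n_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Draw a uniform random sample of the training data to use as a baseline."""
    n = min(n_rows, len(train_df))
    return train_df.sample(n=n, random_state=seed).reset_index(drop=True)


def range_coverage(train_df: pd.DataFrame, baseline_df: pd.DataFrame) -> pd.Series:
    """Fraction of each numeric column's training range spanned by the baseline."""
    numeric = train_df.select_dtypes(include="number").columns
    train_span = train_df[numeric].max() - train_df[numeric].min()
    base_span = baseline_df[numeric].max() - baseline_df[numeric].min()
    # Constant columns have zero span; treat them as fully covered.
    return (base_span / train_span.replace(0, np.nan)).fillna(1.0)


# Hypothetical training data, generated only for illustration.
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "sqft": rng.normal(1800, 400, size=50_000),
    "bedrooms": rng.integers(1, 6, size=50_000),
    "price": rng.normal(350_000, 80_000, size=50_000),
})

baseline_df = sample_baseline(train_df)
coverage = range_coverage(train_df, baseline_df)
print(coverage.round(3))
```

A coverage value well below 1 for a column suggests the sample misses part of that column's range, in which case a larger or stratified sample may be worth considering.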
Datasets are used by Fiddler in the following ways:
- As a reference for drift calculations and data integrity violations on the Monitor page
- To train a model to be used as a surrogate when using
- For computing model performance metrics globally on the Evaluate page, or on slices on the Analyze page
- As a reference for explainability algorithms (e.g. partial dependence plots, permutation feature impact, approximate Shapley values, and ICE plots).
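To make the "reference" role of a dataset more concrete, here is a minimal sketch of one common drift measure, the Jensen-Shannon divergence between a baseline sample and a production sample of a single numeric feature. The text above does not say which drift metric Fiddler computes, so the choice of metric, the binning scheme, and the function name `js_divergence` are assumptions for illustration only.

```python
import numpy as np


def _kl(p: np.ndarray, m: np.ndarray) -> float:
    """Kullback-Leibler divergence KL(p || m) in bits, skipping empty bins of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / m[mask])))


def js_divergence(baseline: np.ndarray, production: np.ndarray, bins: int = 20) -> float:
    """Jensen-Shannon divergence (base 2) between two samples of one feature."""
    # Shared bin edges over both samples so no production value is dropped.
    edges = np.histogram_bin_edges(np.concatenate([baseline, production]), bins=bins)
    p = np.histogram(baseline, bins=edges)[0].astype(float)
    q = np.histogram(production, bins=edges)[0].astype(float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=10_000)
same_dist = rng.normal(0.0, 1.0, size=10_000)
shifted = rng.normal(2.0, 1.0, size=10_000)

drift_same = js_divergence(baseline, same_dist)
drift_shifted = js_divergence(baseline, shifted)
print(f"same distribution: {drift_same:.4f}")
print(f"shifted mean:      {drift_shifted:.4f}")
```

With base-2 logarithms the value lies between 0 (identical histograms) and 1 (no overlap), so the shifted sample should score noticeably higher than the sample drawn from the baseline's own distribution.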
Based on the above uses, datasets with sizes much in excess of 10K rows are often unnecessary and can lead to excessive upload, precomputation, and query times. That being said, here are some situations where larger datasets may be desirable:
- Auto-modeling for tasks with significant class imbalance, or with strong and complex feature interactions, possibly with deeply encoded semantics
  - However, in use cases like these, most users opt to upload carefully engineered model artifacts tailored to the specific application.
- Deep segmentation analysis
  - If it’s desirable to perform model analyses on very specific subpopulations (e.g. “55-year-old Canadian home-owners who have been customers between 18 and 24 months”), large datasets may be necessary to have sufficient reference representation to drive model analytics.
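One way to judge whether a dataset is large enough for deep segmentation is to count how many rows actually fall inside the segment of interest. The sketch below does this with pandas for the example subpopulation mentioned above. The table, its column names, and the helper `segment_mask` are hypothetical and are not part of Fiddler.

```python
import pandas as pd


def segment_mask(df: pd.DataFrame) -> pd.Series:
    """Boolean mask for 55-year-old Canadian home-owners with 18 to 24 months of tenure."""
    return (
        (df["age"] == 55)
        & (df["country"] == "CA")
        & df["home_owner"]
        & df["tenure_months"].between(18, 24)  # inclusive on both ends
    )


# Hypothetical customer table; values are illustrative only.
customers = pd.DataFrame({
    "age": [55, 55, 42, 55, 31, 55],
    "country": ["CA", "CA", "US", "CA", "CA", "US"],
    "home_owner": [True, True, False, False, True, True],
    "tenure_months": [20, 30, 19, 22, 18, 21],
})

n_segment = int(segment_mask(customers).sum())
share = n_segment / len(customers)
print(f"{n_segment} of {len(customers)} rows ({share:.1%}) fall in the segment")
```

If a segment covers only a tiny share of the rows, a 10K-row dataset may contain too few matching rows to support stable analytics for it, which is the situation where a larger upload can be justified.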
Datasets can be uploaded to Fiddler using the Python API client.