Supervised machine learning involves identifying a predictive task, finding data to enable that task, and building a model using that data. Fiddler captures this workflow with project, dataset, and model entities.
You can access your projects from the left-side menu.
A project represents a machine learning task (e.g. predicting house prices, assessing creditworthiness, or detecting fraud).
A project can contain one or more models for the ML task (e.g. LinearRegression-HousePredict, RandomForest-HousePredict).
Create a project by clicking on Projects and then clicking on Add Project.
- Create New Project: a window will pop up where you can enter the project name and click Create. Once created, the project will be displayed on the Projects page.
A dataset in Fiddler is a data table containing features, model outputs, and a target for machine learning models. Optionally, you can also upload metadata and “decision” columns, which can be used to segment the dataset for analyses, track business decisions, and work as protected attributes in bias-related workflows.
To monitor production data, you must upload a dataset to serve as a baseline for comparison. This baseline dataset should be sampled from your model's training data. The sample should be unbiased and should faithfully capture the moments of the parent distribution. Further, the values appearing in each of the baseline dataset's columns should be representative of their full ranges within the complete training dataset.
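The sampling step above can be sketched in plain Python. This is a minimal illustration, not a Fiddler API: the toy rows stand in for your real training set, and the range check mirrors the "representative of their full ranges" guidance.

```python
import random

# Toy stand-in for a training set (in practice, rows from your training data).
random.seed(7)
training_rows = [
    {"sqft": random.randint(500, 4000), "price": random.randint(100_000, 900_000)}
    for _ in range(100_000)
]

# Draw an unbiased (uniform) sample to use as the baseline dataset.
BASELINE_SIZE = 10_000  # roughly 10K rows is usually sufficient
baseline = random.sample(training_rows, BASELINE_SIZE)

# Sanity-check that sampled column values stay representative of the full range.
full_min = min(r["sqft"] for r in training_rows)
full_max = max(r["sqft"] for r in training_rows)
sample_min = min(r["sqft"] for r in baseline)
sample_max = max(r["sqft"] for r in baseline)
coverage = (sample_max - sample_min) / (full_max - full_min)
```

A uniform random sample preserves the parent distribution in expectation; for heavily imbalanced data you may need stratified sampling instead.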
Datasets are used by Fiddler in the following ways:
- As a reference for drift calculations and data integrity violations on the Monitor page
- To train a surrogate model on the backend when no model artifact is uploaded
- For computing model performance metrics globally on the Evaluate page, or on slices on the Analyze page
- As a reference for explainability algorithms (e.g. partial dependence plots, permutation feature impact, approximate Shapley values, and ICE plots).
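To make the last use concrete, here is a minimal sketch of permutation feature impact, one of the explainability algorithms listed above: shuffle one feature's column in the reference data and measure how much model performance drops. The model and data below are toy stand-ins, not Fiddler APIs.

```python
import random

def accuracy(model, rows, labels):
    preds = [model(r) for r in rows]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def permutation_impact(model, rows, labels, feature, seed=0):
    # Impact = performance on original data minus performance after
    # shuffling a single feature's values across rows.
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
    return base - accuracy(model, permuted, labels)

# Toy model: predicts 1 when "income" exceeds a threshold; ignores "noise".
def model(row):
    return int(row["income"] > 50)

rows = [{"income": i, "noise": i % 3} for i in range(100)]
labels = [int(i > 50) for i in range(100)]

impact_income = permutation_impact(model, rows, labels, "income")
impact_noise = permutation_impact(model, rows, labels, "noise")
# Shuffling the informative feature hurts accuracy; shuffling noise does not.
```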
Given the uses above, datasets much larger than 10K rows are often unnecessary and can lead to excessive upload, precomputation, and query times. That said, there are situations where larger datasets may be desirable:
- Auto-modeling for tasks with significant class imbalance, or with strong and complex feature interactions, possibly with deeply encoded semantics.
  - However, in use cases like these, most users opt to upload carefully engineered model artifacts tailored to the specific application.
- Deep segmentation analysis.
  - If it’s desirable to perform model analyses on very specific subpopulations (e.g. “55-year-old Canadian home-owners who have been customers between 18 and 24 months”), large datasets may be necessary to ensure sufficient reference representation to drive model analytics.
Datasets can be uploaded to Fiddler using the Python API client.
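A typical workflow is to materialize the baseline sample as a tabular file and then push it with the client. The CSV preparation below is plain Python; the client calls in the trailing comments are illustrative placeholders, since exact names and signatures vary by client version — check your client's reference docs.

```python
import csv
import io

# Baseline rows containing features, the target, and the model output.
rows = [
    {"sqft": 1200, "beds": 2, "price": 250_000, "predicted_price": 245_000},
    {"sqft": 2400, "beds": 4, "price": 480_000, "predicted_price": 495_000},
]

# Serialize to CSV (a common interchange format for dataset upload).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Hypothetical upload with the Fiddler Python client (names are placeholders):
#   import fiddler as fdl
#   client = fdl.FiddlerApi(url="https://your-org.fiddler.ai",
#                           org_id="your_org", auth_token="YOUR_TOKEN")
#   client.upload_dataset(project_id="house_prices",
#                         dataset_id="baseline",
#                         dataset={"baseline": baseline_df})
```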
A model in Fiddler represents a machine learning model. A project will have one or more models for the ML task (e.g. a project to predict house prices might contain LinearRegression-HousePredict and RandomForest-HousePredict).
You can upload your model artifact to Fiddler to unlock high-fidelity explainability for your model. However, it is not required. If you do not upload your artifact, Fiddler will build a surrogate model on the backend to be used in its place.
At its most basic level, a model in Fiddler is simply a directory that contains three key components:
- The model file: the serialized model artifact itself
- model.yaml: a YAML file containing all the metadata needed to describe the model, what goes into it, and what should come out of it. This metadata is used in Fiddler’s explanations, analytics, and UI.
- package.py: a wrapper script containing all of the code needed to standardize the execution of the model.
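The wrapper pattern behind package.py can be sketched as follows. This is a hypothetical, self-contained illustration: the names here (get_model, ModelWrapper.predict) follow a common loader-plus-predict convention, but the exact interface Fiddler expects varies by client version (and real deployments typically take and return pandas DataFrames), so consult your client's docs.

```python
class ModelWrapper:
    """Standardizes how the model is loaded and executed."""

    def __init__(self, coef, intercept):
        # Stand-in for deserializing the model file in this directory.
        self.coef = coef
        self.intercept = intercept

    def predict(self, rows):
        # rows: list of feature dicts; returns one prediction per row.
        return [
            self.intercept + sum(self.coef[k] * v for k, v in row.items())
            for row in rows
        ]

def get_model():
    # Entry point the platform would invoke to obtain the model object.
    return ModelWrapper(coef={"sqft": 150.0, "beds": 10_000.0}, intercept=50_000.0)
```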
You can collate specific visualizations under the Project Dashboard. After visualizations are created using the Model Analytics tool, you can pin them to the dashboard, which can then be shared with others.
[^1]: Join our community Slack to ask any questions