Supervised machine learning involves identifying a predictive task, finding data to enable that task, and building models from the data. Fiddler captures this workflow with Project, Dataset, and Model entities. Models belong to a Project. In addition, the Dashboard entity represents visualization insights for a Project or for the entire team.
You can access Datasets and Projects from the left menu at all times.
A project represents a machine learning task, e.g. predicting house prices, assessing creditworthiness, or detecting fraud.
A project can contain one or more models for the ML task, e.g. LinearRegression-HousePredict, RandomForest-HousePredict.
Create a project by clicking on Projects and then clicking on Add Project.
- 'Create New Project': A window will pop up where the user enters the project name and clicks Create. Once the project is created, it will be displayed on the Projects page as well as in the project pulldown menu on the left side.
A dataset in Fiddler is a data table containing features and targets for machine learning models, and possibly metadata and “decisions” columns, which can be used to segment the dataset for analyses, track business decisions, or serve as protected attributes in bias-related workflows. Typically users upload a representative sample of their model’s training data. Often a holdout test set is also included.
The sample should be unbiased, faithfully capturing the moments of the parent distribution. Further, the values appearing in each dataset column should be representative of its entire range (or, for categorical variables, of all possible values).
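For example, one common way to draw such a sample is to stratify on the target so class proportions (and, approximately, the per-class column distributions) match the parent data. A minimal sketch with pandas; the column names here are hypothetical:

```python
import pandas as pd

# Full training data (columns are hypothetical examples).
df = pd.DataFrame({
    "sqft": range(1000),
    "is_fraud": [i % 10 == 0 for i in range(1000)],  # ~10% positive class
})

# Draw a 20% sample stratified on the target so the class balance
# of the parent data is preserved in the uploaded dataset.
sample = df.groupby("is_fraud", group_keys=False).sample(frac=0.2, random_state=7)

print(len(sample))                # 200 rows
print(sample["is_fraud"].mean())  # 0.10, matching the parent
```

A plain `df.sample(...)` also works for large, roughly balanced data; stratifying matters most when a class or segment is rare.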
Datasets are used by Fiddler in the following ways:
- To train a surrogate model, either via the UI auto-model feature or when the register_model method of the Fiddler Client generates one
- For computing model performance metrics in Evaluate, or on segments in Analyze
- As a reference for explainability algorithms, e.g. partial dependence plots (1K samples), permutation feature impact (10K samples), and approximate Shapley values (1K samples), and to set the grid boundaries for ICE plots.
- As a reference for drift calculations, which are carried out on univariate distributions, and to compute outliers in Monitoring
Based on the above uses, datasets with sizes much in excess of 10K rows are often unnecessary and can lead to excessive upload, precomputation, and query times. That being said, here are some situations where larger datasets may be desirable:
- Auto-modeling for tasks with significant class imbalance, or with strong and complex feature interactions, possibly with deeply encoded semantics. However, in use cases like these, most users opt to upload carefully engineered model artifacts tailored to the specific application.
- Deep segmentation analysis. If it’s desirable to perform model analyses on very specific subpopulations, e.g. “55-year-old Canadian home-owners who have been customers between 18 and 24 months”, large datasets may be necessary to have sufficient reference representation to drive model analytics (e.g. PDP, Global Feature Impact).
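Before uploading a larger dataset for deep segmentation, it can help to check how many reference rows actually fall inside the subpopulation of interest. A small sketch with pandas; the column names and the minimum-row threshold are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [55, 55, 40, 55, 62],
    "country": ["CA", "CA", "CA", "US", "CA"],
    "tenure_months": [20, 30, 19, 22, 23],
})

# The subpopulation of interest, e.g. 55-year-old Canadian customers
# with 18-24 months of tenure (all column names are hypothetical).
segment = df[
    (df["age"] == 55)
    & (df["country"] == "CA")
    & df["tenure_months"].between(18, 24)
]

# If the segment is too small, reference-driven analytics on it
# (e.g. PDP, feature impact) will be noisy; a larger dataset helps.
MIN_ROWS = 100  # hypothetical threshold
if len(segment) < MIN_ROWS:
    print(f"only {len(segment)} reference rows for this segment")
```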
Upload a dataset to Fiddler using the Fiddler Python client. This is a two-step process:
1. Use fiddler_client.DatasetInfo.from_dataframe() to infer a dataset schema from a Pandas DataFrame. This produces a DatasetInfo object, which can be edited if necessary before uploading (e.g. for Data Integrity in Monitoring, one might modify numeric ranges or the available options for categorical columns).
2. Use fiddler_api.upload_dataset() to upload the above schema and one or more named dataset splits (“train”, “test”, etc.) to the Fiddler platform.
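The two steps can be sketched as follows. This is a hedged illustration, not a definitive recipe: it assumes the fiddler-client package is installed and that `fiddler_api` is an authenticated client connection, and the dataset name is hypothetical; exact keyword arguments may vary by client version.

```python
import pandas as pd

def upload_reference_data(fiddler_api, df_train: pd.DataFrame, df_test: pd.DataFrame):
    """Sketch of the two-step dataset upload. `fiddler_api` is assumed to be
    an authenticated Fiddler client; argument names may vary by version."""
    import fiddler as fdl  # requires the fiddler-client package

    # Step 1: infer a schema from the training DataFrame. The resulting
    # DatasetInfo can be edited before upload (e.g. widen numeric ranges
    # or add categorical options for Data Integrity monitoring).
    dataset_info = fdl.DatasetInfo.from_dataframe(df_train)

    # Step 2: upload the schema together with one or more named splits.
    fiddler_api.upload_dataset(
        dataset={"train": df_train, "test": df_test},
        dataset_id="house_prices",  # hypothetical dataset name
        info=dataset_info,
    )

# Preparing the splits themselves needs only pandas:
df_train = pd.DataFrame({"sqft": [800, 1200, 1500], "price": [150, 230, 310]})
df_test = pd.DataFrame({"sqft": [1000], "price": [190]})
```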
A model in Fiddler represents a machine learning model. A project will have one or more models for the ML task (e.g. a project to predict house prices might contain LinearRegression-HousePredict and RandomForest-HousePredict).
You can use Fiddler to import your existing model or generate a model from a dataset.
'Fiddler Python Client': Users can use the Fiddler API to import their custom model into Fiddler. Refer to the Fiddler Python Client documentation for details.
At its most basic level, a model in Fiddler is simply a directory that contains three key components:
- The model file (e.g. *.pkl)
- model.yaml: the YAML file containing all the metadata needed to describe the model, what goes into the model, and what should come out of it. This model metadata is used in Fiddler’s explanations, analytics, and UI. See the sample below.
- package.py: contains all of the code needed to standardize the execution of the model.
Please see the Model Package documentation below for details.
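Fiddler’s exact package.py contract is defined in the Model Package documentation; as a rough illustration only (the class and function names here are hypothetical, not Fiddler’s API), such a wrapper typically loads the serialized model once and standardizes prediction around DataFrames:

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

class ModelWrapper:
    """Hypothetical sketch of execution standardization: load the pickled
    artifact once, expose predict(DataFrame) -> DataFrame."""

    def __init__(self, model_path: Path):
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def predict(self, input_df: pd.DataFrame) -> pd.DataFrame:
        # Normalize whatever the underlying model returns into a
        # single-column DataFrame of predictions.
        preds = self.model.predict(input_df)
        return pd.DataFrame({"predicted_price": list(preds)})

# --- demo with a stand-in model (real use would pickle a trained model) ---
class MeanModel:
    def predict(self, df):
        return [df["sqft"].mean()] * len(df)

model_dir = Path(tempfile.mkdtemp())
with open(model_dir / "model.pkl", "wb") as f:
    pickle.dump(MeanModel(), f)

wrapper = ModelWrapper(model_dir / "model.pkl")
out = wrapper.predict(pd.DataFrame({"sqft": [1000, 2000]}))
```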
You can collate specific visualizations under Dashboards. There is a universal dashboard at the team level, accessible from the left menu, alongside project-specific dashboards. After you create visualizations with the Model Analytics tool, you can pin them to these dashboards, which can then be shared with others.