
Upload a scikit-learn Regression Model


This tutorial shows how to train a wine quality model with scikit-learn, upload the dataset and model to the Fiddler platform using the Fiddler client API, and serve predictions from that model.

Initialize Fiddler Client

We begin this section as usual by establishing a connection to our Fiddler instance. We can establish this connection either by specifying our credentials directly, or by utilizing our fiddler.ini file. More information can be found in the setup section.

import fiddler as fdl

# client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=auth_token)
client = fdl.FiddlerApi()
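
If you go the fiddler.ini route, the file holds the same credentials as the direct call above. The sketch below is only an illustration of what such a file might contain; the section and field names here are assumptions, so refer to the setup section for the exact format your deployment expects.

# fiddler.ini (illustrative sketch; section and field names are assumptions -- see the setup section)
[FIDDLER]
url = https://your-org.fiddler.ai
org_id = your_org_id
auth_token = your_auth_token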

Load Dataset

Here we will load our baseline dataset from a CSV file called train.csv. We will also create a schema from this data.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/fiddler-labs/fiddler-samples/master/content_root/samples/datasets/winequality/train.csv')
df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)

Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case. Uploading our dataset in the next step will depend on our created project's project_id.

project_id = 'sklearn_tabular'
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

Upload Dataset

To upload a model, you first need to upload a sample of the model's input data, targets, and any additional metadata that might be useful for model analysis. This data sample helps us (among other things) infer the model schema, the data types, and the value range of each feature.

if 'wine_quality' not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'train': df},
        dataset_id='wine_quality')
Heads up! We are inferring the details of your dataset from the dataframe(s) provided. Please take a second to check our work.

If the following DatasetInfo is an incorrect representation of your data, you can construct a DatasetInfo with the DatasetInfo.from_dataframe() method and modify that object to reflect the correct details of your dataset.

After constructing a corrected DatasetInfo, please re-upload your dataset with that DatasetInfo object explicitly passed via the `info` parameter of FiddlerApi.upload_dataset().

You may need to delete the initially uploaded version via FiddlerApi.delete_dataset('wine_quality').

Inferred DatasetInfo to check:
  DatasetInfo:
    display_name:
    files: []
    columns:
                        column    dtype  count(possible_values)  is_nullable    value_range
      0                 row_id  INTEGER                                False      0 - 1,597
      1          fixed acidity    FLOAT                                False     4.7 - 15.9
      2       volatile acidity    FLOAT                                False    0.12 - 1.58
      3            citric acid    FLOAT                                False      0.0 - 1.0
      4         residual sugar    FLOAT                                False     0.9 - 15.5
      5              chlorides    FLOAT                                False  0.012 - 0.611
      6    free sulfur dioxide    FLOAT                                False     1.0 - 72.0
      7   total sulfur dioxide    FLOAT                                False    6.0 - 289.0
      8                density    FLOAT                                False   0.99 - 1.004
      9                     pH    FLOAT                                False    2.74 - 4.01
      10             sulphates    FLOAT                                False     0.37 - 2.0
      11               alcohol    FLOAT                                False     8.4 - 14.9
      12               quality  INTEGER                                False          3 - 8
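
If the inferred DatasetInfo needs corrections, you can follow the instructions in the message above: rebuild the schema with DatasetInfo.from_dataframe(), adjust it, delete the first upload, and re-upload with the corrected schema passed through the info parameter. A minimal sketch:

# rebuild and adjust the schema if the inferred version is wrong
corrected_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
# ... modify corrected_schema here to reflect the correct details of your dataset ...

# remove the initial upload, then re-upload with the explicit schema
client.delete_dataset('wine_quality')
upload_result = client.upload_dataset(
    project_id=project_id,
    dataset={'train': df},
    dataset_id='wine_quality',
    info=corrected_schema)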

Create Model Schema

As you may have noted, the dataset upload step did not ask for the model's features, targets, or any other model-specific information. That's because Fiddler allows multiple models to be linked to a given dataset schema. Hence the need for an infer model schema step, which tells Fiddler which features are relevant to the model and what the model task is. Here you can specify the input features, the target column, decision columns, metadata columns, and the type of model.

target = 'quality'
train_input = df.drop(columns=['row_id', 'quality'])
train_target = df[target]

feature_columns = list(train_input.columns)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, 'wine_quality'),
    target=target,
    features=feature_columns,
    display_name='sklearn model',
    description='this is a sklearn model from tutorial'
)
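
The paragraph above also mentions decision and metadata columns. The sketch below illustrates how such columns could be declared when building the ModelInfo; the metadata_cols parameter name is an assumption, so verify it against ModelInfo.from_dataset_info in your client version before relying on it.

# illustrative only -- metadata_cols is an assumed parameter name; check the
# ModelInfo.from_dataset_info signature in your fiddler client version
model_info_with_metadata = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, 'wine_quality'),
    target=target,
    features=feature_columns,
    metadata_cols=['row_id'],
    display_name='sklearn model',
    description='this is a sklearn model from tutorial'
)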

Train Model

Install Scikit-learn

Scikit-Learn v0.21.2

Fiddler currently supports Scikit-learn v0.21.2; install that version to proceed. If you have another version, please contact Fiddler for assistance.
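
If you need to switch versions, it can be installed from a notebook cell (a minimal example; adjust to your environment, ideally inside a dedicated virtual environment):

!pip install scikit-learn==0.21.2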

import sklearn

assert sklearn.__version__ == '0.21.2', 'Please change sklearn version to 0.21.2'

Build and train your model.

import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing


regressor = sklearn.linear_model.LinearRegression()

full_model = sklearn.pipeline.Pipeline(steps=[
        ('standard_scaling', sklearn.preprocessing.StandardScaler()),
        ('model_name', regressor),
    ])

full_model.fit(train_input, train_target)
full_model.predict(train_input)

Save model and schema

Next, we need to save the model along with any pre-processing steps applied to the input features (for example, categorical encoders or tokenization).

import pathlib
import shutil
import pickle
import yaml

# reuse the project_id created earlier ('sklearn_tabular')
model_id = 'wine_quality_model'

# create temp dir
model_dir = pathlib.Path(model_id)
shutil.rmtree(model_dir, ignore_errors=True)
model_dir.mkdir()

# save model
with open(model_dir / 'model.pkl', 'wb') as pkl_file:
    pickle.dump(full_model, pkl_file)

# save model schema
with open(model_dir / 'model.yaml', 'w') as yaml_file:
    yaml.dump({'model': model_info.to_dict()}, yaml_file)

Write package.py Wrapper

A wrapper is needed between Fiddler and the model. This wrapper translates the inputs and outputs between what the model expects and what Fiddler is able to consume. More information can be found here.

%%writefile wine_quality_model/package.py

import pickle
from pathlib import Path
import pandas as pd

PACKAGE_PATH = Path(__file__).parent

class SklearnModelPackage:

    def __init__(self):
        self.is_classifier = False
        self.is_multiclass = False
        self.output_columns = ['predicted_quality']
        with open(PACKAGE_PATH / 'model.pkl', 'rb') as infile:
            self.model = pickle.load(infile)

    def predict(self, input_df):
        if self.is_classifier:
            if self.is_multiclass:
                predict_fn = self.model.predict_proba
            else:
                def predict_fn(x):
                    return self.model.predict_proba(x)[:, 1]
        else:
            predict_fn = self.model.predict
        return pd.DataFrame(predict_fn(input_df), columns=self.output_columns)

def get_model():
    return SklearnModelPackage()
Writing wine_quality_model/package.py

Validate Model Package

This verifies consistency between df_schema, model_info, and package.py, and performs local functional tests on the wrapped model.

from fiddler import PackageValidator
validator = PackageValidator(model_info, df_schema, model_dir)
passed, errors = validator.run_chain()
Validation Result: PASS
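
If validation does not pass, the returned errors can be inspected before moving on; a minimal sketch using the passed and errors values returned above:

# surface any validation failures before uploading the model package
if not passed:
    for error in errors:
        print(error)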

Upload Model

Now that we have all the parts we need, we can upload the model to the Fiddler platform. You can use upload_model_package to upload the entire directory in one shot. We need the following for uploading a model:

- The path to the model directory
- The project_id to which the model belongs
- The model_id, which is the name you want to give the model; you can access it in Fiddler henceforth via this ID
- The dataset the model is linked to (optional)

In total, we will have a model.yaml, a *.pkl, and a package.py file within our model directory.

if project_id not in client.list_projects():
    client.create_project(project_id)
client.delete_model(project_id, model_id)
client.upload_model_package(model_dir, project_id, model_id)

Run Model

Now, let's test our model by interfacing with the client and calling run_model.

prediction_input = train_input[0: 10]
result = client.run_model(project_id, model_id, prediction_input, log_events=True)
result

Get Explanation

Let's get an explanation on a selected data point to better understand how our model came to its conclusion. We can do so by calling the run_explanation method. In this case, we will request an explanation using 'fiddler_shapley_values'. More information on this method can be found here.

selected_point = train_input.head(1)
ex_fiddler = client.run_explanation(
    project_id=project_id,
    model_id=model_id,
    df=selected_point,
    dataset_id='wine_quality',
    explanations='fiddler_shapley_values')
ex_fiddler
AttributionExplanation(algorithm='fiddler_shapley_values', inputs=['fixed acidity', 'volatile acidity', 'citric acid',
'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates',
'alcohol'], attributions=[0.009804465365594747, 0.059509307478706634, -0.00454374668714475, -0.007885060597416912,
-0.05091310525155545, 0.007887581453356447, -0.2659775676442224, -0.013162502094803026, 0.15488637079076337,
1.146266145811042, -0.17300973252391694], misc={'model_prediction': 6.505067816095052, 'baseline_prediction':
5.642205659994647, 'explanation_ci': {'fixed acidity': 0.004575048611312511, 'volatile acidity': 0.029510279429877637,
'citric acid': 0.002446049161334008, 'residual sugar': 0.003592716900558121, 'chlorides': 0.009712755556082283, 'free
sulfur dioxide': 0.0028662348793511926, 'total sulfur dioxide': 0.013654930329826967, 'density': 0.004373296419055475,
'pH': 0.008998112643280863, 'sulphates': 0.016525181778215443, 'alcohol': 0.042744136519003516},
'explanation_ci_level': 0.95})