
PyTorch for Tabular Data with IG


Initialize Fiddler Client

The Fiddler Python client is a powerful way to:

  • Upload the dataset and model to Fiddler
  • Ingest production events to Fiddler

This can be done from a Jupyter notebook or any Python environment that you use to load data and build models.

First, we need to initialize the client object by specifying:

  • url: the Fiddler URL that you have been provided to access, usually of the form https://xxxxx.fiddler.ai. Contact Fiddler if you don’t have it.
  • org_id: an identifier for your account. See Fiddler_URL/settings/general to find this id (listed as "Organization ID").
  • auth_token: this token is used to authenticate access. See Fiddler_URL/settings/credentials to find, create, or change this token.

You can also save this config as a file called fiddler.ini in the same folder as the notebook/script. That saves you from specifying the parameters in every notebook and script.
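For reference, here is a minimal sketch of what fiddler.ini might contain (the [FIDDLER] section name and keys are an assumption based on Fiddler's examples; check your version's documentation):

[FIDDLER]
url = https://xxxxx.fiddler.ai
org_id = xxxx
auth_token = xxxxxxxx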

import fiddler as fdl

url = 'https://xxxxx.fiddler.ai'
token = 'xxxxxxxx'
org_id = 'xxxx'

client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=token)
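
A quick way to confirm the client can reach the server is to list the projects visible to this account:

# Sanity check: list existing projects for this account
client.list_projects()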

Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case. Uploading our dataset in the next step will depend on our created project's project_id.

project_id = 'pytorch_tabular'
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

Load Dataset

Load the data you are going to use for training your model.

import pandas as pd

df = pd.read_csv('/app/fiddler_samples/samples/datasets/heart_disease/data.csv')
df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
df.head()
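
Before uploading, it can help to review what was inferred, for example to confirm that low-cardinality columns were detected as categorical:

# Inspect the inferred column types and value ranges
df_schema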

Upload Dataset

To upload a model, you first need to upload a sample of data containing the model’s inputs, targets, and any additional metadata that might be useful for model analysis. This data sample helps us (among other things) infer the model schema, and the data types and value ranges of each feature.

  • This sample has to be a flat table that can be loaded as a pandas DataFrame (upload_dataset()) or saved as a csv (upload_dataset_from_dir()).
  • In this example age, sex, trestbps, chol, fbs, thalach, exang, oldpeak, slope are input features, and target is the target column for the model.
  • This input data sample is used for many downstream functions in Fiddler
    • Shapley value methods - background data to simulate the absence of features
    • What-if (ICE) plots - background data
    • PDP plots - background data
    • Drift - to serve as a baseline
    • Outliers - to serve as a baseline
    • Data integrity - to serve as a baseline
  • We suggest uploading a sample of the model’s training data, as it’s the most meaningful for the tasks listed above. For example, model outliers should ideally be based on the training data, as that’s the data the model has seen.
  • You can upload multiple datasets with string identifiers, but we currently do not ascribe any meaning to those. For example: dataset={'data': df} or dataset={'train': train_df, 'test': test_df}.
  • Currently we support two input types:
    • Tabular
    • Single string text, meaning text data in a single column
if 'heart_disease' not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'data': df}, 
        dataset_id='heart_disease')
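
As noted in the list above, multiple named splits can also be uploaded in a single call. A minimal sketch, assuming we split the same DataFrame into train and test portions (the split and the dataset_id are illustrative):

# Hypothetical variant: upload separate train/test splits under one dataset
train_df = df.sample(frac=0.8, random_state=0)
test_df = df.drop(train_df.index)
client.upload_dataset(
    project_id=project_id,
    dataset={'train': train_df, 'test': test_df},
    dataset_id='heart_disease_split')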

Create Model Schema

As you may have noted, in the dataset upload step we did not ask for the model’s features and targets, or any model-specific information. That’s because we allow multiple models to be linked to a given dataset schema. Hence we require an infer-model-schema step, which tells us which features are relevant to the model and what the model task is. Here you can specify the input features, the target column, decision columns and metadata columns, and also the type of model.

  • Currently we support only one target column. This is not to be confused with output columns, which can be more than one.
  • Decision columns specify the decisions made on the basis of the model’s predictions. For example, in a credit lending scenario, the business decision would be whether or not to grant a loan based on the model’s output. This is helpful while monitoring models after deployment, to keep track of the business impact of the model.
  • Metadata is data that is not used by the model, but can be relevant for understanding the model’s behavior on different segments of the data. For example, gender, race, age and other such sensitive features may not be used in the model, but we can analyze the model along these dimensions post facto to understand whether it is biased.
  • We can infer the model task from the target column, or it can be set explicitly. Currently we support three model tasks:

    • Regression
    • Binary Classification
    • Multi-class Classification
target = 'target'
train_input = df.drop(columns=['target'])
train_target = df[target]

feature_columns = list(train_input.columns)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, 'heart_disease'),
    target=target, 
    features=feature_columns,
    display_name='PyTorch Tabular IG',
    description='This is a PyTorch model using tabular data and IG enabled from tutorial',
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION
)
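
If your dataset also contained decision or metadata columns (as in the credit-lending example above), they could be declared at this step. A hedged sketch, assuming this client version accepts decision_cols and metadata_cols keyword arguments; the column names below are hypothetical and do not exist in the heart disease data, so this is for illustration only:

# Hypothetical: declaring decision and metadata columns alongside the features
# (uncomment and adapt the assumed keyword arguments and column names to your data)
# model_info = fdl.ModelInfo.from_dataset_info(
#     dataset_info=client.get_dataset_info(project_id, 'heart_disease'),
#     target=target,
#     features=feature_columns,
#     decision_cols=['loan_approved'],   # assumed keyword and column name
#     metadata_cols=['gender'],          # assumed keyword and column name
#     model_task=fdl.ModelTask.BINARY_CLASSIFICATION
# )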

Train Model

Install PyTorch

PyTorch v1.x

Fiddler currently supports PyTorch version 1.x. If you have another version, please contact Fiddler for assistance.

import torch
from distutils.version import LooseVersion

assert LooseVersion(torch.__version__) >= LooseVersion("1.0.0"), 'Please use a pytorch version 1.x'

Build and train your model.

#https://github.com/jcjohnson/pytorch-examples
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


class TwoLayerModel(nn.Module):

    def __init__(self, D_in, H1, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerModel, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H1)
        self.output = torch.nn.Linear(H1, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        #l1 = F.relu(self.linear1(x))
        l1 = self.linear1(x)
        output = F.sigmoid(self.output(l1))
        return output

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, features, target):
        # Store the feature DataFrame, target Series, and column names
        self.X = X
        self.y = y
        self.features = features
        self.target = target

    def __getitem__(self, index):
        # Return one row as a (features, label) pair of tensors
        X_ind = self.X.iloc[index:index+1]
        return torch.tensor(X_ind[self.features].values, dtype=torch.float), \
               torch.tensor(self.y.iloc[index:index+1].values, dtype=torch.long)

    def __len__(self):
        # Total number of rows in the dataset
        return len(self.X)

# You can then use the prebuilt data loader. 
custom_dataset_train = CustomDataset(X=train_input, y=train_target, features=feature_columns, target=target)
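# Quick sanity check on the wrapper: one item should be a (1, 9) float feature
# tensor and a (1,) long label tensor, matching __getitem__ above
sample_X, sample_y = custom_dataset_train[0]
print(sample_X.shape, sample_y.shape)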
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 32
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

training_loader = DataLoader(custom_dataset_train, **train_params)

model = TwoLayerModel(D_in=9, H1=2, D_out=2)

model.train()
model
NUM_EPOCHS = 10
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
count = 0

def binary_accuracy(preds, y):
    # Predicted class is the arg-max over the two class scores
    predicted_classes = preds.argmax(axis=1)
    correct = (predicted_classes == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc

def print_accuracy():
    train_preds = model(torch.from_numpy(train_input.values).float())
    train_y = torch.from_numpy(train_target.values).float()

    print(f'Train accuracy {binary_accuracy(train_preds, train_y)}')


for epoch in range(NUM_EPOCHS):
    print_accuracy()
    for X, y in training_loader:
        X = torch.squeeze(X)
        y = torch.squeeze(y)
        optimizer.zero_grad()
        output = model(X).squeeze()
        loss = loss_function(output, y)
        if count % 500==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        count += 1
        loss.backward()
        optimizer.step()

print_accuracy()

Save Model And Schema

Next, we need to save the model and any pre-processing steps applied to the input features (for example, categorical encoders or tokenizers).
We currently support the following stored model formats:

  • For sklearn API based models, pickled models, or any storage format that you can load in the package.py (details below).
  • For TF, we support TF Saved Model and Keras .h5
  • For PyTorch, we support any model format you can load in the package.py
import pathlib
import shutil
import yaml

# Let's save the model
model_id = 'heart_disease_pytorch'

# create temp dir
model_dir = pathlib.Path(model_id)
shutil.rmtree(model_dir, ignore_errors=True)
model_dir.mkdir()

# save model
torch.save(model.state_dict(), f'{model_dir}/heart_disease.pt')
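
The yaml import above suggests saving the model schema alongside the weights. A minimal sketch, assuming ModelInfo.to_dict() and the model.yaml convention referenced in the package.py discussion below:

# Save the model schema next to the weights (the model.yaml filename and the
# {'model': ...} wrapping are assumptions based on Fiddler's conventions)
with open(model_dir / 'model.yaml', 'w') as yaml_file:
    yaml.dump({'model': model_info.to_dict()}, yaml_file)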

In the following section, we provide the code needed to wrap and upload your saved model.

Package.py is the interface between Fiddler’s backend and your model. This code helps Fiddler to understand the model, its inputs and outputs. It needs to:

  • Load the model, and any associated files such as feature transformers or tokenizers.
  • Transform the data into a format that the model recognizes.
  • Make batch predictions using the model.
  • Understand the differentiable tensors of the model, in case we want to enable Integrated Gradients.

For certain common highly standardized frameworks, the Fiddler client provides helper upload methods to auto-generate this module (e.g. for scikit-learn models).

Writing the package.py file:

  • package.py will be invoked within the model’s specific assets directory and must implement a get_model() function which takes no arguments and returns an instance of a model class implementing the following methods:

    • The initialization parameters for PyTorch models:
      • self.max_allowed_error: float specifying a percentage value for the maximum allowed integral approximation error for IG computation. If None, IG will be calculated for a pre-determined number of steps. Otherwise, the number of steps will be increased until the error is within the specified limit.
      • self.model: the code to load the model in the given session; essentially, you need to specify the saved model’s file name.
      • self.output_columns: a list of names of the output columns for the model.
      • self.batch_size: a batch size for the model that will not cause OOM errors on the machine(s) the Fiddler cluster is hosted on. For the machine’s configuration, please check with Fiddler.
      • self.ig_enabled: whether you want the Integrated Gradients explanation method for your model. If False, you can skip all the parameters below.
    • transform_input(input_df): Accepts a pandas DataFrame object containing rows of raw feature vectors. The output of this method can be any Python object. This function can also be used to deserialize complex data types stored in dataset columns (e.g. images stored in a field in UTF-8 format). This function is typically called by predict, but the platform may also need to invoke it directly for certain operations (e.g. computing path integral steps in the Integrated Gradients explanation method).
    • generate_baseline(input_df): Generates a DataFrame specifying a baseline that is required for calculating Integrated Gradients. The baseline is a certain 'informationless' input relative to which attributions must be computed. For instance, in a text classification model, the baseline could be the empty text. The baseline could be the same for all inputs or could be specific to the input at hand. The choice of baseline is important as explanations are contextual to a baseline. For more information, please refer to https://github.com/ankurtaly/Integrated-Gradients/blob/master/howto.md
    • predict(input_df): Accepts a pandas DataFrame object containing rows of raw feature vectors. Outputs a pandas DataFrame object containing the model predictions whose column labels must match the output column names in model.yaml. Typically this function invokes transform_input explicitly.
%%writefile heart_disease_pytorch/package.py

import pandas as pd
import pathlib
import torch
import torch.nn as nn
import torch.nn.functional as F

PACKAGE_PATH = pathlib.Path(__file__).parent


class TwoLayerModel(nn.Module):

    def __init__(self, D_in, H1, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerModel, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H1)
        self.output = torch.nn.Linear(H1, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        #l1 = F.relu(self.linear1(x))
        l1 = self.linear1(x)
        output = F.sigmoid(self.output(l1))
        return output

class MyModel:

    def __init__(self, max_allowed_error=None):

        # Tells us it is a PyTorch model for which ig is enabled
        self.ig_enabled = True
        self.model_framework = 'Pytorch'


        # Load the model to device
        self.device = torch.device('cuda:4' if torch.cuda.is_available()
                                   else 'cpu')

        # Modify these lines
        # -------------- Required User Input Starts  --------------------------
        # max allowed percentage error, override to increase or decrease
        # accuracy. Higher accuracy comes at a time cost
        self.max_allowed_error = max_allowed_error


        # Load the saved model
        self.model = TwoLayerModel(D_in=9, H1=2, D_out=2)
        self.model.load_state_dict(torch.load(PACKAGE_PATH/'heart_disease.pt', map_location=self.device))


        # the output column names of the model, as specified in the YAML
        self.output_columns = ['probability_target_True']

        # The layer in the model to attribute. 
        self.layer_to_attribute = self.model.linear1

        # If we want to attribute to the layer input
        self.attribute_to_layer_input = True

        # if we want to attribute to a particular index. For multi-class,
        # we will set it to None if we want to attribute to the arg-max output
        # Setting to 1 here to always attribute to the positive (heart disease) class
        self.target_index = 1

        # ----------- Required User Input Ends --------------------------------
        self.model.eval()
        self.model.to(self.device)

    # -------------------- User Defined Functions Start  ---------------------

    def transform_input(self, input_df):
        """
        Transforms the provided dataframe into a dictionary mapping the keys
        'inputs' and 'auxiliary_inputs' to their corresponding tensors.
        'inputs': Are the tensors that correspond to the
                layers for which layer integrated gradients are computed. If
                the model's forward_func takes a single tensor as input,
                a single input tensor should be provided. If forward_func
                takes multiple tensors as input, a tuple of the input tensors
                should be provided.
        'auxiliary_inputs': If the forward function requires additional
                arguments other than the inputs for which attributions
                should not be computed, this argument can be provided. It
                must be either a single additional argument of a Tensor or
                arbitrary (non-tuple) type or a tuple containing multiple
                additional arguments including tensors or any arbitrary
                python types.
        """
        return {'inputs': torch.tensor(input_df.values.tolist())}

    def generate_baseline(self, input_df):
        """
        Creates the baseline for Integrated Gradients attributions
        from the provided dataframe into a dictionary mapping the keys
        'inputs' and 'auxiliary_inputs' to their corresponding tensors.
        'inputs': Are the tensors that correspond to the
                layers for which layer integrated gradients are computed. If
                the model's forward_func takes a single tensor as input,
                a single input tensor should be provided. If forward_func
                takes multiple tensors as input, a tuple of the input tensors
                should be provided.
        'auxiliary_inputs': If the forward function requires additional
                arguments other than the inputs for which attributions
                should not be computed, this argument can be provided. It
                must be either a single additional argument of a Tensor or
                arbitrary (non-tuple) type or a tuple containing multiple
                additional arguments including tensors or any arbitrary
                python types.

        The choice of baseline is important as explanations are contextual to a
        baseline. For more information please refer to the following document:
        https://github.com/ankurtaly/Integrated-Gradients/blob/master/howto.md
        """
        baseline = input_df * 0
        return {'inputs': torch.tensor(baseline.values.tolist())}

    def project_attributions(self, attributions, input_df):
        """
        Maps the attributions to the original input.

        This method returns a dictionary mapping features of the untransformed
        input to the untransformed feature value, and (projected) attributions
        computed for that feature.


        This method guarantees that for each feature the projected attributions
        have the same shape as the (returned) untransformed feature value. The
        specific projection being applied is left as an implementation detail.
        Below we provided some guidance on the projections that should be
        applied for three different transformations

        Identity transformation
        This is the simplest case. Since the transformation is identity, the
        projection would also be the identity function.

        One-hot transformation for categorical features
        Here the original feature is categorical, and the transformed feature
        is a one-hot encoding. In this case, the returned untransformed feature
        value is the specific input category, and the projected attribution is
        the sum of the attribution across all fields of the one-hot encoding.

        Token ID transformation for text features
        Here the original feature is a sentence, and the transformed feature is a
        vector of token ids (w.r.t. a certain vocabulary). Here the
        untransformed feature value would be a vector of tokens corresponding
        to the token ids, and the projected attribution vector would be the
        same as the one provided to this method. In some cases, token ids
        corresponding to dummy tokens such as padding tokens, start tokens, end
        tokens, etc. may be ignored during the projection. In that case, the
        attribution values corresponding to these tokens must be dropped from
        the projected attributions vector.

        :param attributions: numpy array of attribution values for each of
        the 'input' tensors as provided by the transform_input function
        :param input_df: the original, raw input DataFrame

        :returns: projected_inputs: dictionary with keys being the features
            of the original untransformed input. The features are specified
            in the model.yaml. The keys are mapped to a pair containing the
            original untransformed input and the projected attribution.
        """
        return {col: attributions.tolist()[i]
                for i, col in enumerate(list(input_df.columns))}

    # -------------------- User Defined Functions End  ---------------------

    # -------------------- Override these if necessary  -----------------------


    def model_function(self, *inputs):
        out = self.model(*inputs)
        out = torch.softmax(out, dim=-1)
        return out

    def get_inputs_as_list(self, transformed_input):
        list_of_inputs = []
        list_of_inputs.append(transformed_input['inputs'])
        if 'auxiliary_inputs' in transformed_input.keys():
            list_of_inputs.append(transformed_input['auxiliary_inputs'])

        return list_of_inputs

    def predict(self, input_df):
        transformed_input = self.transform_input(input_df)
        list_of_inputs = self.get_inputs_as_list(transformed_input)

        with torch.no_grad():
            prediction = self.model_function(*list_of_inputs).detach().numpy()

        if self.target_index is not None:
            print(prediction.shape)
            prediction = prediction[:, self.target_index]

        return pd.DataFrame(data=prediction, columns=self.output_columns)



def get_model():
    model = MyModel(
        max_allowed_error=99)

    return model

Validate Model Package

This verifies the model package for consistency between df_schema, model_info, and package.py; and performs local functional tests on the wrapped model.

# Validate the PyTorch model package
from fiddler import PackageValidator
validator = PackageValidator(model_info, df_schema, model_dir)
passed, errors = validator.run_chain()
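
The returned flag and error list can be checked directly:

# Report the validation outcome
if passed:
    print('Model package validated successfully')
else:
    print('Validation errors:', errors)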

Upload Model

Now that we have all the parts that we need, we can go ahead and upload the model to the Fiddler platform. You can use upload_model_custom to upload this entire directory in one shot. We need the following to upload a model:

  • The path to the directory
  • The model_info that we created above, which is essentially the model schema
  • The project to which the model belongs
  • The model ID, which is the name you want to give the model. You can access it in Fiddler henceforth via this ID
  • The dataset which the model is linked to (optional)
# Let's first delete the model if it already exists in the project
if model_id in client.list_models(project_id):
    client.delete_model(project_id, model_id)
    print('Model deleted')

client.upload_model_custom(model_dir, model_info, project_id, model_id)

Run Model

prediction_input = train_input[:10]
result = client.run_model(project_id, model_id, prediction_input, log_events=True)
result
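
As an optional sanity check, you can compare the returned probabilities against the local PyTorch model’s output, applying the same softmax and positive-class selection that package.py uses (this assumes the uploaded weights match the model trained above):

# Reproduce package.py's post-processing locally for comparison
local_out = model(torch.tensor(prediction_input.values, dtype=torch.float))
local_probs = torch.softmax(local_out, dim=-1)[:, 1].detach().numpy()
local_probs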

Get Explanation

selected_point = df.head(1)
ex_ig = client.run_explanation(
    project_id=project_id,
    model_id=model_id, 
    df=selected_point, 
    dataset_id='heart_disease',
    explanations='ig')
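
Before plotting, you can take a quick look at the raw attribution payload returned by the client:

# Feature names and their corresponding IG attributions
print(ex_ig.inputs)
print(ex_ig.attributions)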
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

fig = plt.figure(figsize=(12, 6))
num_features = selected_point.shape[1] - 1
sorted_att_list = sorted(list(zip(np.abs(ex_ig.attributions), ex_ig.inputs, ex_ig.attributions)),
                         reverse=True)
out_list = [[f[1], f[2]] for f in sorted_att_list]
out_list = np.asarray(out_list[::-1])

plt.barh(list(range(num_features)), out_list[:,1].astype('float'))
plt.yticks(list(range(num_features)), out_list[:,0]);
plt.xlabel('Attribution')
plt.title(f'Top IG attributions for heart disease model')
plt.show()