
TensorFlow For Tabular Data with IG


Initialize Fiddler Client

This Python client is a powerful way to:

  • Upload the dataset and model to Fiddler
  • Ingest production events to Fiddler

This can be done from a Jupyter Notebook or any Python editor that you use to load data and build models.

First, we need to initialize the client object by specifying:

  • url: the Fiddler URL that you have been provided to access, usually of the form ‘https://xxxxx.fiddler.ai’. Contact Fiddler if you don’t have it.
  • org_id: an identifier for the account. See Fiddler_URL/settings/general to find this id (listed as "Organization ID").
  • auth_token: this token is used to authenticate access. See Fiddler_URL/settings/credentials to find, create, or change this token.

You can also save this config as a file called fiddler.ini in the same folder as the notebook/script. That saves you from specifying the parameters in every notebook and script.
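For reference, fiddler.ini is a standard INI file; a plausible sketch is shown below (the section and key names here are assumptions — confirm the exact format with your Fiddler documentation):

```ini
[FIDDLER]
url = https://xxxxx.fiddler.ai
org_id = my_org_id
auth_token = my_token
```

With this file in place, the client can be initialized without passing the parameters explicitly in each notebook.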

import fiddler as fdl

url = 'https://xxx.fiddler.ai'
token = 'my_token'
org_id = 'my_org_id'

client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=token)

Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case.

project_id = 'tf_tabular'
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

Load Dataset

Load the data you are going to use for training your model.

import pandas as pd

df = pd.read_csv('/app/fiddler_samples/samples/datasets/heart_disease/data.csv')
df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
df.head()
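The `max_inferred_cardinality` argument controls how many distinct values a column may have and still be inferred as categorical. A toy sketch of that idea (hypothetical logic for illustration, not Fiddler's actual inference):

```python
import pandas as pd

def infer_column_kinds(df, max_inferred_cardinality=1000):
    """Toy schema inference: numeric columns stay numeric; low-cardinality
    non-numeric columns are treated as categorical, the rest as text."""
    kinds = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            kinds[col] = 'numeric'
        elif df[col].nunique() <= max_inferred_cardinality:
            kinds[col] = 'categorical'
        else:
            kinds[col] = 'text'
    return kinds

demo = pd.DataFrame({'age': [63, 37, 41], 'cp': ['typical', 'atypical', 'atypical']})
infer_column_kinds(demo)  # {'age': 'numeric', 'cp': 'categorical'}
```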

Upload Dataset

To upload a model, you first need to upload a sample of the data of the model’s inputs, targets, and additional metadata that might be useful for model analysis. This data sample helps us (among other things) to infer the model schema and the data types and value ranges of each feature.

  • This sample has to be a flat table that can be loaded as a pandas DataFrame (upload_dataset()) or saved as a CSV (upload_dataset_from_dir()).
  • In this example age, sex, trestbps, chol, fbs, thalach, exang, oldpeak, slope are input features, and target is the target column for the model.
  • This input data sample is used for many downstream functions in Fiddler
    • Shapley value methods - background data to simulate the absence of features
    • What-if (ICE) plots - background data
    • PDP plots - background data
    • Drift - to serve as a baseline
    • Outliers - to serve as a baseline
    • Data integrity - to serve as a baseline
  • We suggest uploading a sample of the model’s training data, as it’s the most meaningful for the tasks listed above. For example, model outliers should ideally be based on the training data, as that’s the data the model has seen.
  • You can upload multiple datasets with string identifiers, but we currently do not ascribe any meaning to those. For example: dataset={'data': df} or dataset={'train': train_df, 'test': test_df}.
  • Currently we support two input types:
    • Tabular
    • Single string text, meaning text data in a single column
if 'heart_disease' not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'data': df}, 
        dataset_id='heart_disease')

Create Model Schema

As you may have noted, in the dataset upload step we did not ask for the model’s features and targets, or any model-specific information. That’s because we allow multiple models to be linked to a given dataset schema. Hence we require an Infer model schema step, which tells us the features relevant to the model and the model task. Here you can specify the input features, the target column, decision columns and metadata columns, and also the type of model.

  • Currently we support only one target column. This is not to be confused with output columns, which can be more than one.
  • Decision columns specify the decisions made on the basis of the model’s predictions. For example, in a credit lending scenario, the business decision to give or not to give a loan based on the model’s output. This is helpful while monitoring models after deployment, to keep track of the business impact of the model.
  • Metadata is data that is not used by the model, but can be relevant for understanding the model’s behavior on different segments of the data. For example, gender, race, age and other such sensitive features may not be used in the model, but we can analyze along these dimensions post facto to understand if the model is biased.
  • We can infer the model task from the target column, or it can be set explicitly. Currently we support three model types:

    • Regression
    • Binary Classification
    • Multi-class Classification
target = 'target'
train_input = df.drop(columns=['target'])
train_target = df[target]

feature_columns = list(train_input.columns)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, 'heart_disease'),
    target=target, 
    features=feature_columns,
    display_name='Keras Tabular IG',
    description='this is a keras model using tabular data and IG enabled from tutorial',
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION
)

Train Model

Install TensorFlow

TensorFlow v1.14

Fiddler currently supports TensorFlow v1.14. Install TensorFlow v1.14 to proceed. If you have another version, please contact Fiddler for assistance.

import tensorflow as tf

assert tf.__version__=='1.14.0', 'Please change tensorflow version to 1.14.0'
# !pip install tensorflow==1.14

Build and train your model.

import tensorflow as tf

inputs = tf.keras.Input(shape=(train_input.shape[1], ))
activations = tf.keras.layers.Dense(32, activation='linear', use_bias=True)(inputs)
activations = tf.keras.layers.Dense(128, activation=tf.nn.relu, use_bias=True)(activations)
activations = tf.keras.layers.Dense(128, activation=tf.nn.relu, use_bias=True)(activations)
activations = tf.keras.layers.Dense(1, activation='sigmoid', use_bias=True)(activations)
model = tf.keras.Model(inputs=inputs, outputs=activations, name='keras_model')

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(train_input, train_target.values, batch_size=32, epochs=8)
model.evaluate(train_input, train_target) 

Save Model And Schema

Next step, we need to save the model and any pre-processing steps applied to the input features (for example categorical encoding, tokenization, ...).
We currently support the following stored model formats:

  • For sklearn API based models, pickled models, or any storage format that you can load in the package.py (details below).
  • For TF, we support TF Saved Model and Keras .h5

Note

  • Keras models have to have their input tensor differentiable if Integrated Gradients support is desired
  • We also need to save the data preprocessing pipeline code, if any. This will be accessed in the package.py
import pathlib
import shutil
import yaml

# Let's save the model in the TF Saved Model format
model_id_tf = 'heart_disease_tf'

# create temp dir
model_dir_tf = pathlib.Path(model_id_tf)
shutil.rmtree(model_dir_tf, ignore_errors=True)
model_dir_tf.mkdir()

# save model
tf.saved_model.save(model, str(model_dir_tf / 'saved_model'))
# For demo purposes, let's also save this model in the Keras .h5 format.
model_id_keras = 'heart_disease_keras'

# create temp dir
model_dir_keras = pathlib.Path(model_id_keras)
shutil.rmtree(model_dir_keras, ignore_errors=True)
model_dir_keras.mkdir()

# save model
model.save(str(model_dir_keras / 'model.h5'), include_optimizer=False)

In the following section, we provide the code to upload your model in the two supported saved formats. Please refer to the appropriate section.

1. For TF Saved Model

We need to import two wrappers for TensorFlow. These files are stored in the utils directory.

  • The tf_saved_model_wrapper.py file contains a wrapper to load and run a TF model from a saved_model path.
  • The tf_saved_model_wrapper_ig.py file contains a wrapper to support Integrated Gradients (IG) computation for a TF model loaded from a saved_model path.
files = ['utils/tf_saved_model_wrapper.py', 'utils/tf_saved_model_wrapper_ig.py']
for f in files:
    shutil.copy(f, model_dir_tf)

Package.py is the interface between Fiddler’s backend and your model. This code helps Fiddler to understand the model, its inputs and outputs.

  • Load the model, and any associated files such as feature transformers or tokenizers.
  • Transform the data into a format that the model recognizes.
  • Make batch predictions using the model.
  • Understand the differentiable tensors of the model, in case we want to enable Integrated Gradients.

For certain common highly standardized frameworks, the Fiddler client provides helper upload methods to auto-generate this module (e.g. for scikit-learn models).

Writing the package.py file:

  • package.py will be invoked within the model’s specific assets directory and must implement a get_model() function which takes no arguments and returns an instance of a model class implementing the following methods:

    • The initialization parameters. For TF models:

      • self.max_allowed_error: Float specifying a percentage value for the maximum allowed integral approximation error for IG computation. If None then IG will be calculated for a pre-determined number of steps. Otherwise, the number of steps will be increased till the error is within the specified limit.
      • self.model: the code to load the model in the given session; essentially, you need to specify the file name.
      • self.output_columns: a list of names of the output columns for the model.
      • self.batch_size: set a batch size for the model which will not cause OOM errors on the machine(s) the Fiddler cluster is hosted on. For the machine’s configuration, please check with Fiddler.
      • self.ig_enabled: set to True if you want the Integrated Gradients explanation method for your model. If False, you can skip all the parameters below.
    • transform_input(input_df): Accepts a pandas DataFrame object containing rows of raw feature vectors. The output of this method can be any Python object. This function can also be used to deserialize complex data types stored in dataset columns (e.g. images stored in a field in UTF-8 format). This function is typically called by predict, but the platform may also need to invoke it directly for certain operations (e.g. computing path integral steps in the Integrated Gradients explanation method).

    • generate_baseline(input_df): Generates a DataFrame specifying a baseline that is required for calculating Integrated Gradients. The baseline is a certain 'informationless' input relative to which attributions must be computed. For instance, in a text classification model, the baseline could be the empty text. The baseline could be the same for all inputs or could be specific to the input at hand. The choice of baseline is important as explanations are contextual to a baseline. For more information please refer to this link
    • predict(input_df): Accepts a pandas DataFrame object containing rows of raw feature vectors. Outputs a pandas DataFrame object containing the model predictions whose column labels must match the output column names in model.yaml. Typically this function invokes transform_input explicitly.
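To make generate_baseline's role concrete, here is a self-contained toy sketch of Integrated Gradients against a zero baseline, approximating the path integral with a midpoint Riemann sum (illustrative only — Fiddler performs this computation for you):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=100):
    """IG_i = (x_i - b_i) * integral over a in [0, 1] of
    dF/dx_i evaluated at b + a*(x - b), via a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model F(x) = sum(w * x**2), whose gradient is 2*w*x.
w = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 1.0, 1.0])
attrs = integrated_gradients(lambda z: 2 * w * z, x, np.zeros_like(x))
# Completeness axiom: attributions sum to F(x) - F(baseline) = 6.0
```

Increasing `steps` tightens the integral approximation, which is exactly what the `max_allowed_error` parameter automates.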
%%writefile heart_disease_tf/package.py

import pathlib
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import load_model
from .tf_saved_model_wrapper_ig import TFSavedModelWrapperIg

tf.compat.v1.disable_eager_execution()

PACKAGE_PATH = pathlib.Path(__file__).parent
SAVED_MODEL_PATH = PACKAGE_PATH / 'saved_model'

class MyModel(TFSavedModelWrapperIg):
    """
    :param saved_model_path: Path to the directory containing the TF
       model in SavedModel format.
       See: https://www.tensorflow.org/guide/saved_model#build_and_load_a_savedmodel

    :param sig_def_key: Key for the specific SignatureDef to be used for
       executing the model.
       See: https://www.tensorflow.org/tfx/serving/signature_defs#signaturedef_structure

    :param output_columns: List containing the names of the output
       column(s) that corresponds to the output of the model. If the
       model is a binary classification model then the number of output
       columns is one, otherwise, the number of columns must match the
       shape of the output tensor corresponding to the output key
       specified.

    :param is_binary_classification [optional]: Boolean specifying if the
       model is a binary classification model. If True, the number of
       output columns is one. The default is False.

    :param output_key [optional]: Key for the specific output tensor (
       specified in the SignatureDef) whose predictions must be explained.
       The output tensor must specify a differentiable output of the
       model. Thus, output tensors that are generated as a result of
       discrete operations (e.g., argmax) are disallowed. The default is
       None, in which case the first output listed in the SignatureDef is
       used. The 'saved_model_cli' can be used to view the output tensor
       keys available in the signature_def.
       See: https://www.tensorflow.org/guide/saved_model#cli_to_inspect_and_execute_savedmodel

    :param batch_size [optional]: the batch size for input into the model.
       Depends on model and instance config.

    :param input_tensor_to_differentiable_layer_mapping [optional]:
       Dictionary that maps input tensors to the first differentiable
       layer/tensor in the graph they are attached to. For instance,
       in a text model, an input tensor containing token ids
       may not be differentiable but may feed into an embedding tensor.
       Such an input tensor must be mapped to the corresponding
       embedding tensor in this dictionary.

       All input tensors must be mentioned in the dictionary. An input
       tensor that is directly differentiable may be mapped to itself.

       For each differentiable tensor, the first dimension must be the
       batch dimension. If <k1, …, kn> is the shape of the input then the
       differentiable tensor must either have the same shape or the shape
       <k1, …, kn, d>.

       The default is None, in which case all input tensors are assumed
       to be differentiable.

    :param max_allowed_error: Float specifying a percentage value
       for the maximum allowed integral approximation error for IG
       computation. If None then IG will be calculated for a
       pre-determined number of steps. Otherwise, the number of steps
       will be increased till the error is within the specified limit.
    """
    def __init__(self, saved_model_path, sig_def_key,
                 is_binary_classification=False,
                 output_key=None,
                 batch_size=8,
                 output_columns=[],
                 input_tensor_to_differentiable_layer_mapping={},
                 max_allowed_error=None):

        super().__init__(saved_model_path, sig_def_key,
                         is_binary_classification=is_binary_classification,
                         output_key=output_key,
                         batch_size=batch_size,
                         output_columns=output_columns,
                         input_tensor_to_differentiable_layer_mapping=
                         input_tensor_to_differentiable_layer_mapping,
                         max_allowed_error=max_allowed_error)


    def transform_input(self, input_df):
        """
        Transform the provided pandas DataFrame into one that complies with
        the input interface of the model. This method returns a pandas
        DataFrame with columns corresponding to the input tensor keys in the
        SavedModel SignatureDef. The contents of each column match the input
        tensor shape described in the SignatureDef.

        Args:
        :param input_df: DataFrame corresponding to the dataset yaml
            associated with the project. Specifically, the columns in the
            DataFrame must correspond to the feature names mentioned in the
            yaml.

        Returns:
        - transformed_input_df: DataFrame with columns corresponding to the
            input tensor keys in the saved model SignatureDef. The contents
            of the columns must match the corresponding shape of the input
            tensor described in the SignatureDef. For instance, if the
            input to the model is a serialized tf.Example then the returned
            DataFrame would have a single column containing serialized
            examples.

        """
        return pd.DataFrame({'input_1': input_df.values.tolist()})

    def generate_baseline(self, input_df):
        """
        Generates a DataFrame specifying a baseline that is required for
        calculating Integrated Gradients.

        The Baseline is a certain 'informationless' input relative to which
        attributions must be computed. For instance, in a text
        classification model, the baseline could be the empty text.

        The baseline could be the same for all inputs or could be specific
        to the input at hand. 

        The choice of baseline is important as explanations are contextual to a
        baseline. For more information please refer to the following document:
        https://github.com/ankurtaly/Integrated-Gradients/blob/master/howto.md
        """
        baseline = input_df * 0
        return pd.DataFrame({'input_1': baseline.values.tolist()})

    def project_attributions(self, input_df, transformed_input_df,
                             attributions):
        """
        Maps the attributions for the provided transformed_input to
        the original untransformed input.

        This method returns a dictionary mapping features of the untransformed
        input to the untransformed feature value, and (projected) attributions
        computed for that feature.

        This method guarantees that for each feature the projected attributions
        have the same shape as the (returned) untransformed feature value. The
        specific projection being applied is left as an implementation detail.
        Below we provided some guidance on the projections that should be
        applied for three different transformations

        Identity transformation
        This is the simplest case. Since the transformation is identity, the
        projection would also be the identity function.

        One-hot transformation for categorical features
        Here the original feature is categorical, and the transformed feature
        is a one-hot encoding. In this case, the returned untransformed feature
        value is the specific input category, and the projected attribution is
        the sum of the attribution across all fields of the one-hot encoding.

        Token ID transformation for text features
        Here the original feature is a sentence, and transformed feature is a
        vector of token ids (w.r.t. a certain vocabulary). Here the
        untransformed feature value would be a vector of tokens corresponding
        to the token ids, and the projected attribution vector would be the
        same as the one provided to this method. In some cases, token ids
        corresponding to dummy tokens such as padding tokens, start tokens,
        end tokens, etc. may be ignored during the projection. In that case,
        the attribution values corresponding to these tokens must be dropped
        from the projected attributions vector.

        :param input_df: Pandas DataFrame specifying the input whose prediction
            is being attributed. Its columns must correspond to the dataset
            yaml associated with the project. Specifically, the columns must
            correspond to the feature names mentioned in the yaml.

        :param transformed_input_df: Pandas DataFrame returned by the
            transform_input method extended in package.py. It has exactly
            one row as currently only instance explanations are supported.

        :param attributions: dictionary mapping each column of the
            transformed_input to the corresponding attributions tensor. The
            attribution tensor must have the same shape as corresponding
            column in transformed_input.

        Returns:
        - projected_inputs: dictionary with keys being the features of the
            original untransformed input. The features are specified in the
            model.yaml. The keys are mapped to a pair containing the original
            untransformed input and the projected attribution.
        """
        return {col: attributions['input_1'][0][i].tolist()
                for i, col in enumerate(list(input_df.columns))}


def get_model():
    model = MyModel(
        SAVED_MODEL_PATH,
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY,
        is_binary_classification=True,
        batch_size=32,
        output_columns=['predicted_target'],
        input_tensor_to_differentiable_layer_mapping=
        {'input_1': 'serving_default_input_1:0'},
        max_allowed_error=5
    )
    model.load_model()
    return model
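The one-hot projection rule described in the project_attributions docstring (sum the attributions across all fields of a one-hot encoding) can be illustrated standalone. The feature name, category value, and attribution numbers below are hypothetical, not from this model:

```python
import numpy as np

# Attributions computed on the one-hot expansion of a categorical feature
# with three categories; the input's category was, say, 'atypical'.
onehot_attributions = np.array([0.10, -0.05, 0.20])
category_value = 'atypical'  # hypothetical untransformed feature value

# Projected attribution for the original categorical feature is the sum
# across the one-hot fields; it is returned alongside the raw value.
projected = float(onehot_attributions.sum())
result = {'chest_pain_type': (category_value, projected)}
```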

2. For Keras .h5 Model

To reiterate the point made above, Keras models have to have their input tensor (model.input) differentiable if Integrated Gradients support is desired. If that's not the case, please save the model as a TF Saved Model.

%%writefile heart_disease_keras/package.py

import pathlib
import pandas as pd
from tensorflow.keras.models import load_model
import tensorflow as tf
tf.compat.v1.disable_eager_execution()


class MyModel:
    def __init__(self, max_allowed_error=None,
                 output_columns=['predicted_target']):
        self.max_allowed_error = max_allowed_error

        model_dir = pathlib.Path(__file__).parent

        self.sess = tf.Session()
        with self.sess.as_default():
            self.model = load_model(pathlib.Path(model_dir) /
                                    'model.h5')
        self.ig_enabled = True
        self.is_input_differentiable = True
        self.batch_size = 32
        self.output_columns = output_columns
        self.input_tensors = self.model.input
        self.output_tensor = self.model.output
        self.gradient_tensors = \
            {self.output_columns[0]:
                 {self.input_tensors:
                      tf.gradients(self.output_tensor, self.input_tensors)}}
        self.input_tensor_to_differentiable_layer_mapping = {
            self.input_tensors: self.input_tensors}
        self.differentiable_tensors = {self.input_tensors: self.input_tensors}

    def get_feed_dict(self, input_df):
        """
        Returns the input dictionary to be fed to the TensorFlow graph given
        input_df which is a pandas DataFrame. The input_df DataFrame is
        obtained after applying transform_input on the raw input. The
        transform_input function is extended in package.py.
        """

        feed = {self.input_tensors: input_df.values}
        return feed

    def transform_input(self, input_df):
        return input_df

    def generate_baseline(self, input_df):
        return input_df*0

    def predict(self, input_df):
        transformed_input_df = self.transform_input(input_df)
        predictions = []
        for ind in range(0, len(transformed_input_df), self.batch_size):
            df_chunk = transformed_input_df.iloc[ind: ind + self.batch_size]
            feed = self.get_feed_dict(df_chunk)
            with self.sess.as_default():
                predictions += self.sess.run(self.output_tensor, feed).tolist()
        return pd.DataFrame(predictions, columns=self.output_columns)

    def project_attributions(self, input_df, transformed_input_df,
                             attributions):
        return {col: attributions[self.input_tensors][0][i].tolist()
                for i, col in enumerate(input_df.columns)}


def get_model():
    model = MyModel(max_allowed_error=1)
    return model
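The chunked loop in predict above follows a generic batching pattern. A minimal standalone version, with a plain function standing in for sess.run (a sketch, not the Fiddler API):

```python
import pandas as pd

def batched_predict(df, predict_fn, batch_size=32):
    """Slice df into batch_size-row chunks, run predict_fn on each chunk,
    and concatenate the per-chunk predictions in order."""
    predictions = []
    for ind in range(0, len(df), batch_size):
        chunk = df.iloc[ind: ind + batch_size]
        predictions += predict_fn(chunk)
    return predictions

demo = pd.DataFrame({'x': [0, 1, 2, 3, 4]})
preds = batched_predict(demo, lambda c: (c['x'] * 2).tolist(), batch_size=2)
# preds == [0, 2, 4, 6, 8]
```

Choosing batch_size this way bounds peak memory per forward pass, which is why the package.py interface exposes it explicitly.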

Validate Model Package

This step finds issues with the package.py composed above to enable easy debugging.

# Validate Keras model package
from fiddler import PackageValidator
validator = PackageValidator(model_info, df_schema, model_dir_keras)
passed, errors = validator.run_chain()

Upload Model

Now that we have all the parts that we need, we can go ahead and upload the model to the Fiddler platform. You can use upload_model_custom to upload this entire directory in one shot. We need the following for uploading a model:

  • The path to the directory
  • The model_info that we created above, which is essentially the model schema
  • The project to which the model belongs
  • The model ID, which is the name you want to give the model. You can access it in Fiddler henceforth via this ID
  • The dataset which the model is linked to (optional)

Note: this step and all the following ones are exactly the same for a Keras .h5 model. For our demo, we are going to upload and run the TF Saved Model.

# Let's first delete the model if it already exists in the project
if model_id_tf in client.list_models(project_id):
    client.delete_model(project_id, model_id_tf)
    print('Model deleted')

client.upload_model_custom(model_dir_tf, model_info, project_id, model_id_tf)

Run Model

prediction_input = train_input[:10]
result = client.run_model(project_id, model_id_tf, prediction_input, log_events=True)
result

Get Explanation

selected_point = df.head(1)
ex_ig = client.run_explanation(
    project_id=project_id,
    model_id=model_id_tf, 
    df=selected_point, 
    dataset_id='heart_disease',
    explanations='ig')
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

fig = plt.figure(figsize=(12, 6))
num_features = selected_point.shape[1] - 1
sorted_att_list = sorted(list(zip(np.abs(ex_ig.attributions), ex_ig.inputs, ex_ig.attributions)),
                         reverse=True)
out_list = [[f[1], f[2]] for f in sorted_att_list]
out_list = np.asarray(out_list[::-1])

plt.barh(list(range(num_features)), out_list[:,1].astype('float'))
plt.yticks(list(range(num_features)), out_list[:,0]);
plt.xlabel('Attribution')
plt.title('Top IG attributions for heart disease model')
plt.show()