
TensorFlow For Text Data With IG


Initialize Fiddler Client

We begin this section as usual by establishing a connection to our Fiddler instance. We can do so either by specifying our credentials directly or by using our fiddler.ini file. More information can be found in the setup section.

import fiddler as fdl

# client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=auth_token)
client = fdl.FiddlerApi()
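
For reference, a minimal fiddler.ini typically looks like the following (placeholder values shown; see the setup section for the exact format expected by your deployment):

[FIDDLER]
url = https://your-org.fiddler.ai
org_id = your_org
auth_token = your_auth_token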

Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case. The dataset upload in the next step will reference this project's project_id.

project_id = 'tf_text'
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

Load Dataset

Here we will load our baseline dataset from a CSV file called imdb_rnn.csv. We will also infer a dataset schema from this dataframe.

import pandas as pd
df = pd.read_csv('/app/fiddler_samples/samples/datasets/imdb_rnn/imdb_rnn.csv')
df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)
df.head()

Upload Dataset

To upload a model, you first need to upload a sample of the model's input data, targets, and any additional metadata that might be useful for model analysis. This data sample helps us (among other things) infer the model schema, including the data type and value range of each feature.

if 'imdb_rnn' not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'train': df},
        dataset_id='imdb_rnn')
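
Note that upload_dataset infers a schema from the dataframe on its own, so the df_schema we built above is not strictly required here. If you want to edit the inferred schema and pass it explicitly, the client's upload_dataset call typically also accepts a DatasetInfo object; a sketch, assuming an info parameter (verify the parameter name against your client version):

if 'imdb_rnn' not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'train': df},
        dataset_id='imdb_rnn',
        info=df_schema)  # 'info' parameter name assumed; check your fiddler client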

Create Model Schema

As you may have noticed, the dataset upload step did not ask for the model's features, targets, or any other model-specific information. That's because multiple models can be linked to a given dataset schema. Hence we need a separate model schema step, which tells Fiddler which features the model uses and what task it performs. Here you can specify the input features, the target column, decision and metadata columns, and the type of model.

target = 'polarity'
feature_columns = ['sentence']
train_input = df[feature_columns]
train_target = df[target]

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, 'imdb_rnn'),
    target=target,
    features=feature_columns,
    display_name='Text IG',
    description='this is a tensorflow model using text data and IG enabled from tutorial',
    input_type=fdl.ModelInputType.TEXT
)

Train Model

Install Scikit-Learn & Tensorflow

Scikit-Learn v0.21.2 & TensorFlow v1.14

Fiddler currently supports Scikit-learn v0.21.2 and TensorFlow v1.14. Install these versions to proceed. If you have another version, please contact Fiddler for assistance.

import tensorflow as tf

assert tf.__version__=='1.14.0', 'Please change tensorflow version to 1.14.0'
import sklearn

assert sklearn.__version__=='0.21.2', 'Please change sklearn version to 0.21.2'
#!pip install tensorflow==1.14
#!pip install scikit-learn==0.21.2

Build and train your model.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_target = le.fit_transform(train_target)
train_target = train_target.reshape(-1,1)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

vocab_size = 1000
max_seq_length = 150
tok = Tokenizer(num_words=vocab_size)
tok.fit_on_texts(train_input['sentence'])
sequences = tok.texts_to_sequences(train_input['sentence'])
sequences_matrix = sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.models import Model

def RNN():
    inputs = Input(name='inputs', shape=[max_seq_length])
    layer = Embedding(vocab_size, 64, input_length=max_seq_length)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256, name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.2)(layer)
    layer = Dense(1, name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs, outputs=layer)
    return model
from tensorflow.keras.optimizers import RMSprop

model = RNN()
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
inputs (InputLayer)          [(None, 150)]             0
_________________________________________________________________
embedding (Embedding)        (None, 150, 64)           64000
_________________________________________________________________
lstm (LSTM)                  (None, 64)                33024
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640
_________________________________________________________________
activation (Activation)      (None, 256)               0
_________________________________________________________________
dropout (Dropout)            (None, 256)               0
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0
=================================================================
Total params: 113,921
Trainable params: 113,921
Non-trainable params: 0
_________________________________________________________________
from tensorflow.keras.callbacks import EarlyStopping
model.fit(sequences_matrix, train_target, batch_size=128, epochs=5,
          validation_split=0.1, callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.001)])
Train on 22500 samples, validate on 2500 samples
Epoch 1/5
22500/22500 [==============================] - 18s 817us/sample - loss: 0.6660 - acc: 0.5887 - val_loss: 0.5008 - val_acc: 0.7616
Epoch 2/5
22500/22500 [==============================] - 17s 766us/sample - loss: 0.4221 - acc: 0.8160 - val_loss: 0.4245 - val_acc: 0.8088
Epoch 3/5
22500/22500 [==============================] - 17s 757us/sample - loss: 0.3724 - acc: 0.8448 - val_loss: 0.3524 - val_acc: 0.8480
Epoch 4/5
22500/22500 [==============================] - 17s 759us/sample - loss: 0.3549 - acc: 0.8520 - val_loss: 0.3649 - val_acc: 0.8380


<tensorflow.python.keras.callbacks.History at 0x7f9c3ffc9810>

Save Model and Schema

Next, we need to save the model along with any pre-processing steps applied to the input features (for example, categorical encoders or tokenization).

import pathlib
import shutil
import pickle
import yaml
import tensorflow as tf

model_id = 'tf_ig_imdb'

# create temp dir
model_dir = pathlib.Path(model_id)
shutil.rmtree(model_dir, ignore_errors=True)
model_dir.mkdir()

# save model
tf.keras.experimental.export_saved_model(model, str(model_dir / 'saved_model'))

# save model schema
with open(model_dir / 'model.yaml', 'w') as yaml_file:
    yaml.dump({'model': model_info.to_dict()}, yaml_file)

# save tokenizer
with open(model_dir / 'tokenizer.pkl', 'wb') as tok_file:
    tok_file.write(pickle.dumps(tok))
INFO:tensorflow:SavedModel written to: tf_ig_imdb/saved_model/saved_model.pb
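
Before packaging the directory, it can be worth sanity-checking that the exported artifacts load back correctly. A minimal sketch, assuming TF 1.14's tf.keras.experimental.load_from_saved_model (the counterpart of the export_saved_model call above):

# Reload the exported model and tokenizer and confirm predictions still work
reloaded_model = tf.keras.experimental.load_from_saved_model(str(model_dir / 'saved_model'))
with open(model_dir / 'tokenizer.pkl', 'rb') as f:
    reloaded_tok = pickle.load(f)

check_seqs = reloaded_tok.texts_to_sequences(train_input['sentence'][:5])
check_matrix = sequence.pad_sequences(check_seqs, maxlen=max_seq_length, padding='post')
print(reloaded_model.predict(check_matrix))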

We need to copy two wrapper files for TensorFlow into the model directory. These files are stored in the utils directory.

  • The tf_saved_model_wrapper.py file contains a wrapper to load and run a TF model from a saved_model path.
  • The tf_saved_model_wrapper_ig.py file contains a wrapper to support Integrated Gradients (IG) computation for a TF model loaded from a saved_model path.
files = ['utils/tf_saved_model_wrapper.py', 'utils/tf_saved_model_wrapper_ig.py']
for f in files:
    shutil.copy(f, model_dir)

Write package.py file

A wrapper is needed between Fiddler and the model. This wrapper translates the inputs and outputs to match what the model expects and what Fiddler can consume. The package.py file contains functions to transform the input, generate the baseline, and project the attributions back onto the original text. More information can be found here.

%%writefile tf_ig_imdb/package.py

import numpy as np
import re
import pathlib
import pickle
import logging
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from .tf_saved_model_wrapper_ig import TFSavedModelWrapperIg


PACKAGE_PATH = pathlib.Path(__file__).parent
SAVED_MODEL_PATH = PACKAGE_PATH / 'saved_model'
TOKENIZER_PATH = PACKAGE_PATH / 'tokenizer.pkl'

LOG = logging.getLogger(__name__)


class MyModel(TFSavedModelWrapperIg):
    def __init__(self, saved_model_path, sig_def_key, tokenizer_path,
                 target,
                 is_binary_classification=False,
                 output_key=None,
                 batch_size=8,
                 output_columns=[],
                 input_tensor_to_differentiable_layer_mapping={},
                 max_allowed_error=None):
        """
        Class to load and run the IMDB RNN model.
        See: TFSavedModelWrapper

        """
        super().__init__(saved_model_path, sig_def_key,
                         is_binary_classification=is_binary_classification,
                         output_key=output_key,
                         batch_size=batch_size,
                         output_columns=output_columns,
                         input_tensor_to_differentiable_layer_mapping=
                         input_tensor_to_differentiable_layer_mapping,
                         max_allowed_error=max_allowed_error)
        with open(tokenizer_path, 'rb') as handle:
            self.tokenizer = pickle.load(handle)
        self.max_seq_length = 150
        self.target = target

    def transform_input(self, input_df):
        """
        Transform the provided dataframe into one that complies with the input
        interface of the model.

        Overrides the transform_input method of TFSavedModelWrapper.
        """

        sequences = self.tokenizer.texts_to_sequences(input_df[self.target])
        sequences_matrix = sequence.pad_sequences(sequences,
                                                  maxlen=self.max_seq_length,
                                                  padding='post')

        return pd.DataFrame({'inputs': sequences_matrix.tolist()})

    def generate_baseline(self, input_df):
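        """
        Generate the Integrated Gradients baseline input: an empty string for
        each row, which tokenizes to an all-padding sequence with the same
        shape as the transformed input.
        """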

        input_tokens = input_df[self.target].apply(lambda x: '')
        sequences = self.tokenizer.texts_to_sequences(input_tokens)
        sequences_matrix = sequence.pad_sequences(sequences,
                                                  maxlen=self.max_seq_length,
                                                  padding='post')

        return pd.DataFrame({'inputs': sequences_matrix.tolist()})

    def project_attributions(self, input_df, transformed_input_df,
                             attributions):
        """
        Maps the transformed input to original input space so that the
        attributions correspond to the features of the original input.
        Overrides the project_attributions method of TFSavedModelWrapper.
        """
        segments = re.split(r'([ ' + self.tokenizer.filters + '])',
                            input_df[self.target].iloc[0])
        unpadded_input = [self.tokenizer.texts_to_sequences([x])[0]
                          for x in input_df[self.target].values]
        word_tokens = self.tokenizer.sequences_to_texts(
            [[x] for x in unpadded_input[0]])
        word_attributions = attributions['inputs'][0].astype('float').tolist()[:len(word_tokens)]

        # Walk the original text segments and assign each one the attribution
        # of the matching token in word_tokens (the token sequence consumed by
        # the model); segments without a matching token get an attribution of 0.
        i = 0
        final_attributions = []
        final_segments = []
        for segment in segments:
            if segment != '':
                final_segments.append(segment)
                seg_low = segment.lower()
                if len(word_tokens) > i and seg_low == word_tokens[i]:
                    final_attributions.append(word_attributions[i])
                    i += 1
                else:
                    final_attributions.append(0)
        return {"embedding_input": [final_segments, final_attributions]}


def get_model():
    model = MyModel(
        SAVED_MODEL_PATH,
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY,
        TOKENIZER_PATH,
        target='sentence',
        is_binary_classification=True,
        batch_size=128,
        output_columns=['inputs'],
        input_tensor_to_differentiable_layer_mapping=
        {'inputs': 'embedding/embedding_lookup:0'},
        max_allowed_error=5)
    model.load_model()
    return model
Writing tf_ig_imdb/package.py

Upload Model

Now that we have all the parts that we need, we can go ahead and upload the model to the Fiddler platform. You can use the upload_model_package method to upload this entire directory in one shot. We need the following to upload a model:

  • The path to the directory
  • The project_id to which the model belongs
  • The model_id, which is the name you want to give the model. You can access it in Fiddler henceforth via this ID
  • The dataset which the model is linked to (optional)

In total, our model directory will contain the model.yaml schema, the tokenizer .pkl file, the exported saved_model directory, the two wrapper files, and a package.py file.

if model_id in client.list_models(project_id):
    client.delete_model(project_id, model_id)
client.upload_model_package(model_dir, project_id, model_id)

Run Model

Now, let's test our model by interfacing with the client and calling run_model.

prediction_input = train_input[:10]
result = client.run_model(project_id, model_id, prediction_input)
result
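
As a quick sanity check, you can compare the served predictions in result against the local Keras model that is still in memory (a small sketch; the exact column names of result depend on the client version):

# Local predictions for the same 10 rows, using the tokenized training matrix
local_preds = model.predict(sequences_matrix[:10]).flatten()
print(local_preds)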

Get Explanation

Let's get an explanation for a selected data point to better understand how our model came to the conclusion it did. We can do so by calling the run_explanation method. In this case, we will request an explanation using 'ig' (Integrated Gradients). More information on this method can be found here.

selected_point = df.head(1)

ex_ig = client.run_explanation(
    project_id=project_id,
    model_id=model_id,
    df=selected_point,
    dataset_id='imdb_rnn',
    explanations='ig')
ex_ig
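
The explanation carries word-level attributions: each segment of the original sentence paired with its IG attribution, with 0 assigned to segments that do not map to a model token (as implemented in project_attributions above). A purely illustrative way to tabulate them, assuming the (segments, attributions) pair can be read back from the response as below (the accessor is hypothetical; inspect ex_ig to find the actual structure for your client version):

# Hypothetical accessor: adapt to the actual structure of ex_ig
segments, attrs = ex_ig.attributions['embedding_input']
attribution_df = pd.DataFrame({'segment': segments, 'attribution': attrs})
attribution_df.sort_values('attribution', ascending=False).head(10)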