Vector Monitoring for Unstructured Data
While Fiddler calculates data drift at deployment time for numerical features stored in columns of the baseline dataset, many modern machine learning systems use input features that cannot be represented as a single number (e.g., text or image data). Such complex features are usually represented by high-dimensional vectors obtained by applying a vectorization method (e.g., text embeddings generated by NLP models). Furthermore, Fiddler users might want to monitor a group of univariate features together and detect data drift in multi-dimensional feature spaces.
To address these needs, Fiddler provides a vector monitoring capability, which lets users define custom features and applies a novel method for monitoring data drift in multi-dimensional spaces.
Defining Custom Features
Users can use the Fiddler client to define one or more custom features. Each custom feature is specified by a group of dataset columns that need to be monitored together as a vector. Once a list of custom features is defined and passed to Fiddler (the steps for doing so with the Fiddler client are described below), Fiddler runs a clustering-based data drift detection algorithm for each custom feature and calculates a corresponding drift value between the baseline and the events published during the selected time period.
CF1 = fdl.CustomFeature.from_columns(['f1', 'f2', 'f3'], custom_name='vector1')
CF2 = fdl.CustomFeature.from_columns(['f1', 'f2', 'f3'], n_clusters=5, custom_name='vector2')
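To give an intuition for what a clustering-based drift value can look like, the following is a minimal, self-contained sketch and not Fiddler's actual implementation. It assumes k-means clustering fitted on the baseline vectors and a Jensen-Shannon divergence between the cluster histograms of baseline and production data; both choices, along with every function name and the synthetic data, are assumptions made for illustration only.

```python
import math
import random


def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def histogram(points, centroids):
    """Fraction of points assigned to each centroid."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    return [c / len(points) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def vector_drift(baseline, production, n_clusters=5):
    """Assumed recipe: cluster the baseline, then compare cluster histograms."""
    centroids = kmeans(baseline, n_clusters)
    return js_divergence(histogram(baseline, centroids),
                         histogram(production, centroids))


# Synthetic 3-dimensional vectors standing in for columns f1, f2, f3.
rng = random.Random(42)
baseline = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
same = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
shifted = [[rng.gauss(3, 1) for _ in range(3)] for _ in range(500)]

drift_same = vector_drift(baseline, same)        # same distribution: small value
drift_shifted = vector_drift(baseline, shifted)  # shifted mean: larger value
```

In this sketch, the number of clusters plays the role that the n_clusters argument appears to play in the custom feature definition above: more clusters give a finer-grained histogram, at the cost of noisier estimates when few events fall in each cluster.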
Passing Custom Features List to Model Info
model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    dataset_id=DATASET_ID,
    features=data_cols,
    target='target',
    outputs='predicted_score',
    custom_features=[CF1, CF2],
)
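Because a custom feature is defined over a group of dataset columns, a vector held as a single list-valued field would first need to be spread across one numeric column per dimension. The sketch below shows one way this could be done in plain Python. The helper expand_vectors is hypothetical and not part of the Fiddler client, and the embedding values are made up for illustration.

```python
def expand_vectors(rows, vector_key, prefix):
    """Replace a list-valued field with one numeric column per dimension.

    Hypothetical helper for illustration; not part of the Fiddler client.
    """
    expanded = []
    for row in rows:
        # Copy every field except the list-valued vector field.
        new_row = {k: v for k, v in row.items() if k != vector_key}
        # Add one column per dimension: prefix1, prefix2, ...
        for i, value in enumerate(row[vector_key], start=1):
            new_row[f"{prefix}{i}"] = float(value)
        expanded.append(new_row)
    return expanded


# Made-up records with a 3-dimensional embedding each.
rows = [
    {"text": "great product", "embedding": [0.12, -0.40, 0.33]},
    {"text": "arrived late", "embedding": [-0.25, 0.08, 0.91]},
]

flat_rows = expand_vectors(rows, vector_key="embedding", prefix="f")

# Column names that could then be listed when defining a custom feature.
vector_columns = [k for k in flat_rows[0] if k.startswith("f")]
```

With the sample records above, the resulting column names are f1, f2, and f3, matching the names used in the custom feature definitions earlier on this page.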
Quick Start for NLP Monitoring
Check out our Quick Start guide for NLP monitoring for a fully functional notebook example.