Fiddler Objects

fdl.DatasetInfo

For information on how to customize these objects, see Customizing Your Dataset Schema.

import fiddler as fdl

columns = [
    fdl.Column(
        name='feature_1',
        data_type=fdl.DataType.FLOAT
    ),
    fdl.Column(
        name='feature_2',
        data_type=fdl.DataType.INTEGER
    ),
    fdl.Column(
        name='feature_3',
        data_type=fdl.DataType.BOOLEAN
    ),
    fdl.Column(
        name='output_column',
        data_type=fdl.DataType.FLOAT
    ),
    fdl.Column(
        name='target_column',
        data_type=fdl.DataType.INTEGER
    )
]

dataset_info = fdl.DatasetInfo(
    display_name='Example Dataset',
    columns=columns
)

fdl.DatasetInfo.from_dataframe

import pandas as pd
import fiddler as fdl

df = pd.read_csv('example_dataset.csv')

dataset_info = fdl.DatasetInfo.from_dataframe(df=df, max_inferred_cardinality=100)

fdl.DatasetInfo.from_dict

import pandas as pd
import fiddler as fdl

df = pd.read_csv('example_dataset.csv')

dataset_info = fdl.DatasetInfo.from_dataframe(df=df, max_inferred_cardinality=100)

dataset_info_dict = dataset_info.to_dict()

new_dataset_info = fdl.DatasetInfo.from_dict(
    deserialized_json={
        'dataset': dataset_info_dict
    }
)

fdl.DatasetInfo.to_dict

import pandas as pd
import fiddler as fdl

df = pd.read_csv('example_dataset.csv')

dataset_info = fdl.DatasetInfo.from_dataframe(df=df, max_inferred_cardinality=100)

dataset_info_dict = dataset_info.to_dict()
{
    'name': 'Example Dataset',
    'columns': [
        {
            'column-name': 'feature_1',
            'data-type': 'float'
        },
        {
            'column-name': 'feature_2',
            'data-type': 'int'
        },
        {
            'column-name': 'feature_3',
            'data-type': 'bool'
        },
        {
            'column-name': 'output_column',
            'data-type': 'float'
        },
        {
            'column-name': 'target_column',
            'data-type': 'int'
        }
    ],
    'files': []
}


fdl.ModelInfo

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| display_name | str | | A display name for the model. |
| input_type | fdl.ModelInputType | | A ModelInputType object containing the input type of the model. |
| model_task | fdl.ModelTask | | A ModelTask object containing the model task. |
| inputs | list | | A list of Column objects corresponding to the inputs (features) of the model. |
| outputs | list | | A list of Column objects corresponding to the outputs (predictions) of the model. |
| metadata | Optional[list] | None | A list of Column objects corresponding to any metadata fields. |
| decisions | Optional[list] | None | A list of Column objects corresponding to any decision fields (post-prediction business decisions). |
| targets | Optional[list] | None | A list of Column objects corresponding to the targets (ground truth) of the model. |
| framework | Optional[str] | None | A string providing information about the software library and version used to train and run this model. |
| description | Optional[str] | None | A description of the model. |
| datasets | Optional[list] | None | A list of the dataset IDs used by the model. |
| mlflow_params | Optional[fdl.MLFlowParams] | None | An MLFlowParams object containing information about MLflow parameters. |
| model_deployment_params | Optional[fdl.ModelDeploymentParams] | None | A ModelDeploymentParams object containing information about model deployment. |
| artifact_status | Optional[fdl.ArtifactStatus] | None | An ArtifactStatus object containing information about the model artifact. |
| preferred_explanation_method | Optional[fdl.ExplanationMethod] | None | An ExplanationMethod object that specifies the default explanation algorithm to use for the model. |
| custom_explanation_names | Optional[list] | [] | A list of names that can be passed to the explanation_name argument of the optional user-defined explain_custom method of the model object defined in package.py. |
| binary_classification_threshold | Optional[float] | 0.5 | The threshold used for classifying inferences for binary classifiers. |
| ranking_top_k | Optional[int] | 50 | Used only for ranking models. Sets the top k results to take into consideration when computing performance metrics like MAP and NDCG. |
| group_by | Optional[str] | None | Used only for ranking models. The column by which to group events for certain performance metrics like MAP and NDCG. |
| fall_back | Optional[dict] | None | A dictionary mapping a column name to custom missing value encodings for that column. |
| target_class_order | Optional[list] | None | A list denoting the order of classes in the target. Required in the following cases: (1) binary classification with a string target — provide a two-element list, where the 0th element is by convention the negative class and the 1st element is the positive class (for boolean targets, Fiddler considers True the positive class by default; for numerical targets, the higher of the two values); (2) multi-class classification — provide an ordered list of classes matching the order of the outputs; (3) ranking with a string target — provide all possible target values ordered from least relevant (first element) to most relevant (last element); for numerical targets, Fiddler considers the smallest value the least relevant grade and the largest value the most relevant grade. |
| **kwargs | | | Additional arguments to be passed. |

inputs = [
    fdl.Column(
        name='feature_1',
        data_type=fdl.DataType.FLOAT
    ),
    fdl.Column(
        name='feature_2',
        data_type=fdl.DataType.INTEGER
    ),
    fdl.Column(
        name='feature_3',
        data_type=fdl.DataType.BOOLEAN
    )
]

outputs = [
    fdl.Column(
        name='output_column',
        data_type=fdl.DataType.FLOAT
    )
]

targets = [
    fdl.Column(
        name='target_column',
        data_type=fdl.DataType.INTEGER
    )
]

model_info = fdl.ModelInfo(
    display_name='Example Model',
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION,
    inputs=inputs,
    outputs=outputs,
    targets=targets
)
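
For a binary classification model whose target column holds strings, the table above requires target_class_order; here is a hedged sketch extending the example (the class labels are illustrative):

model_info_with_classes = fdl.ModelInfo(
    display_name='Example Model',
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION,
    inputs=inputs,
    outputs=outputs,
    targets=[
        fdl.Column(
            name='target_column',
            data_type=fdl.DataType.STRING
        )
    ],
    # By convention, the 0th element is the negative class
    # and the 1st element is the positive class.
    target_class_order=['not_fraud', 'fraud']  # illustrative labels
)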

fdl.ModelInfo.from_dataset_info

import pandas as pd
import fiddler as fdl

df = pd.read_csv('example_dataset.csv')

dataset_info = fdl.DatasetInfo.from_dataframe(
    df=df
)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    features=[
        'feature_1',
        'feature_2',
        'feature_3'
    ],
    outputs=[
        'output_column'
    ],
    target='target_column',
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION
)

fdl.ModelInfo.from_dict

import pandas as pd
import fiddler as fdl

df = pd.read_csv('example_dataset.csv')

dataset_info = fdl.DatasetInfo.from_dataframe(
    df=df
)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    features=[
        'feature_1',
        'feature_2',
        'feature_3'
    ],
    outputs=[
        'output_column'
    ],
    target='target_column',
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION
)

model_info_dict = model_info.to_dict()

new_model_info = fdl.ModelInfo.from_dict(
    deserialized_json={
        'model': model_info_dict
    }
)

fdl.ModelInfo.to_dict

import pandas as pd
import fiddler as fdl

df = pd.read_csv('example_dataset.csv')

dataset_info = fdl.DatasetInfo.from_dataframe(
    df=df
)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    features=[
        'feature_1',
        'feature_2',
        'feature_3'
    ],
    outputs=[
        'output_column'
    ],
    target='target_column',
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION
)

model_info_dict = model_info.to_dict()
{
    'name': 'Example Model',
    'input-type': 'structured',
    'model-task': 'binary_classification',
    'inputs': [
        {
            'column-name': 'feature_1',
            'data-type': 'float'
        },
        {
            'column-name': 'feature_2',
            'data-type': 'int'
        },
        {
            'column-name': 'feature_3',
            'data-type': 'bool'
        },
        {
            'column-name': 'target_column',
            'data-type': 'int'
        }
    ],
    'outputs': [
        {
            'column-name': 'output_column',
            'data-type': 'float'
        }
    ],
    'datasets': [],
    'targets': [
        {
            'column-name': 'target_column',
            'data-type': 'int'
        }
    ],
    'custom-explanation-names': []
}


fdl.WeightingParams

Holds weighting information for class-imbalanced models, which can then be passed into a fdl.ModelInfo object. Note that using weighting parameters requires the presence of model outputs in the baseline dataset.

import numpy as np
import pandas as pd
import sklearn.utils
import fiddler as fdl

TARGET_COLUMN = 'target_column'  # name of the target column in the dataset

df = pd.read_csv('example_dataset.csv')

computed_weight = sklearn.utils.class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df[TARGET_COLUMN]),
    y=df[TARGET_COLUMN]
).tolist()

weighting_params = fdl.WeightingParams(class_weight=computed_weight)
dataset_info = fdl.DatasetInfo.from_dataframe(df=df)

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    features=[
        'feature_1',
        'feature_2',
        'feature_3'
    ],
    outputs=['output_column'],
    target='target_column',
    weighting_params=weighting_params,
    input_type=fdl.ModelInputType.TABULAR,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION
)


fdl.ModelInputType

Represents supported model input types.

model_input_type = fdl.ModelInputType.TABULAR


fdl.ModelTask

Represents supported model tasks.

model_task = fdl.ModelTask.BINARY_CLASSIFICATION


fdl.DataType

Represents supported data types.

data_type = fdl.DataType.FLOAT


fdl.Column

Represents a column of a dataset.

column = fdl.Column(
    name='feature_1',
    data_type=fdl.DataType.FLOAT,
    value_range_min=0.0,
    value_range_max=80.0
)
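
Categorical columns can also enumerate their categories; a minimal sketch, assuming your client version supports the possible_values argument:

categorical_column = fdl.Column(
    name='feature_4',  # hypothetical column name
    data_type=fdl.DataType.CATEGORY,
    possible_values=['bronze', 'silver', 'gold']  # assumed argument; verify against your client version
)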


fdl.DeploymentParams

Supported from server version 23.1 and above, with the Model Deployment feature enabled.

deployment_params = fdl.DeploymentParams(
    image_uri="md-base/python/machine-learning:1.1.0",
    cpu=250,
    memory=512,
    replicas=1,
)

📘 What parameters should I set for my model?

Setting the right parameters might not be straightforward, and Fiddler is here to help you.

The parameters might vary depending on the number of input features used, the pre-processing steps used, and the model itself.

These guides help you define the right parameters:

  1. Surrogate Models guide

  2. User Uploaded guide

When uploading your own model artifact, refer to the guides above and increase the memory value depending on your model framework and complexity. Surrogate models use the LightGBM framework.

For example, an NLP model for a TEXT input might need memory set at 1024 or higher and CPU at 1000.
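
As a hedged sketch of that guidance, reusing the example image URI from above (the values are illustrative, not prescriptive):

# Illustrative sizing for a TEXT-input NLP model, per the guidance above.
nlp_deployment_params = fdl.DeploymentParams(
    image_uri="md-base/python/machine-learning:1.1.0",  # example image from above
    cpu=1000,     # higher CPU for NLP workloads
    memory=1024,  # 1024 or higher, per the example above
    replicas=1,
)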

📘 Usage Reference

See the Model Deployment feature set for more details on where these parameters are used.



fdl.ComparePeriod

Required when compare_to = CompareTo.TIME_PERIOD. This field sets how far back to look when comparing against the same bin from a previous time period.

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'model-a',
    alert_type = fdl.AlertType.PERFORMANCE,
    metric = fdl.Metric.PRECISION,
    bin_size = fdl.BinSize.ONE_HOUR,
    compare_to = fdl.CompareTo.TIME_PERIOD,
    compare_period = fdl.ComparePeriod.ONE_DAY,  # <----
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,
    priority = fdl.Priority.HIGH,
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,
           metric=Metric.PRECISION,
           priority=Priority.HIGH,
           compare_to=CompareTo.TIME_PERIOD,
           compare_period=ComparePeriod.ONE_DAY,  # <----
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,
           bin_size=BinSize.ONE_HOUR)]


fdl.AlertCondition

If condition = fdl.AlertCondition.GREATER or fdl.AlertCondition.LESSER is specified, an alert is triggered every time the metric value is greater or lesser than the specified threshold.

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'model-a',
    alert_type = fdl.AlertType.PERFORMANCE,
    metric = fdl.Metric.PRECISION,
    bin_size = fdl.BinSize.ONE_HOUR,
    compare_to = fdl.CompareTo.TIME_PERIOD,
    compare_period = fdl.ComparePeriod.ONE_DAY,
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,  # <-----
    priority = fdl.Priority.HIGH,
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,  # <---
           metric=Metric.PRECISION,
           priority=Priority.HIGH,
           compare_to=CompareTo.TIME_PERIOD,
           compare_period=ComparePeriod.ONE_DAY,
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,  # <-----
           bin_size=BinSize.ONE_HOUR)]


fdl.CompareTo

Whether the metric value is to be compared against a static value or against the same time bin from a previous time period (set using compare_period [ComparePeriod]).

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'binary_classification_model-a',
    alert_type = fdl.AlertType.PERFORMANCE,
    metric = fdl.Metric.PRECISION,
    bin_size = fdl.BinSize.ONE_HOUR,
    compare_to = fdl.CompareTo.TIME_PERIOD,  # <----
    compare_period = fdl.ComparePeriod.ONE_DAY,
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,
    priority = fdl.Priority.HIGH,
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='binary_classification_model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,
           metric=Metric.PRECISION,
           priority=Priority.HIGH,
           compare_to=CompareTo.TIME_PERIOD,  # <---
           compare_period=ComparePeriod.ONE_DAY,
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,
           bin_size=BinSize.ONE_HOUR)]


fdl.BinSize

**This field signifies the duration for which Fiddler monitoring calculates the metric values.**

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'model-a',
    alert_type = fdl.AlertType.PERFORMANCE,
    metric = fdl.Metric.PRECISION,
    bin_size = fdl.BinSize.ONE_HOUR,  # <----
    compare_to = fdl.CompareTo.TIME_PERIOD,
    compare_period = fdl.ComparePeriod.ONE_DAY,
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,
    priority = fdl.Priority.HIGH,
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,
           metric=Metric.PRECISION,
           priority=Priority.HIGH,
           compare_to=CompareTo.TIME_PERIOD,
           compare_period=ComparePeriod.ONE_DAY,
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,
           bin_size=BinSize.ONE_HOUR)]  # <-----


fdl.Priority

This field can be used to prioritize alert rules by adding an identifier (low, medium, or high) that helps users categorize them by importance.

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'model-a',
    alert_type = fdl.AlertType.PERFORMANCE,
    metric = fdl.Metric.PRECISION,
    bin_size = fdl.BinSize.ONE_HOUR,
    compare_to = fdl.CompareTo.TIME_PERIOD,
    compare_period = fdl.ComparePeriod.ONE_DAY,
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,
    priority = fdl.Priority.HIGH,  # <---
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,
           metric=Metric.PRECISION,
           priority=Priority.HIGH,  # <----
           compare_to=CompareTo.TIME_PERIOD,
           compare_period=ComparePeriod.ONE_DAY,
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,
           bin_size=BinSize.ONE_HOUR)]


fdl.Metric

An alert rule can be created for any of the supported metrics, each of which applies to a particular alert type and model task.

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'binary_classification_model-a',
    alert_type = fdl.AlertType.PERFORMANCE,
    metric = fdl.Metric.PRECISION,  # <----
    bin_size = fdl.BinSize.ONE_HOUR,
    compare_to = fdl.CompareTo.TIME_PERIOD,
    compare_period = fdl.ComparePeriod.ONE_DAY,
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,
    priority = fdl.Priority.HIGH,
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='binary_classification_model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,
           metric=Metric.PRECISION,  # <---
           priority=Priority.HIGH,
           compare_to=CompareTo.TIME_PERIOD,
           compare_period=ComparePeriod.ONE_DAY,
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,
           bin_size=BinSize.ONE_HOUR)]


fdl.AlertType

Represents supported alert types.

import fiddler as fdl

client.add_alert_rule(
    name = "perf-gt-5prec-1hr-1d-ago",
    project_name = 'project-a',
    model_name = 'model-a',
    alert_type = fdl.AlertType.PERFORMANCE,  # <---
    metric = fdl.Metric.PRECISION,
    bin_size = fdl.BinSize.ONE_HOUR,
    compare_to = fdl.CompareTo.TIME_PERIOD,
    compare_period = fdl.ComparePeriod.ONE_DAY,
    warning_threshold = 0.05,
    critical_threshold = 0.1,
    condition = fdl.AlertCondition.GREATER,
    priority = fdl.Priority.HIGH,
    notifications_config = notifications_config
)
[AlertRule(alert_rule_uuid='9b8711fa-735e-4a72-977c-c4c8b16543ae',
           organization_name='some_org_name',
           project_id='project-a',
           model_id='model-a',
           name='perf-gt-5prec-1hr-1d-ago',
           alert_type=AlertType.PERFORMANCE,  # <---
           metric=Metric.PRECISION,
           priority=Priority.HIGH,
           compare_to=CompareTo.TIME_PERIOD,
           compare_period=ComparePeriod.ONE_DAY,
           compare_threshold=None,
           raw_threshold=None,
           warning_threshold=0.05,
           critical_threshold=0.1,
           condition=AlertCondition.GREATER,
           bin_size=BinSize.ONE_HOUR)]


fdl.WindowSize

from fiddler import BaselineType, WindowSize

PROJECT_NAME = 'example_project'
BASELINE_NAME = 'example_rolling'
DATASET_NAME = 'example_validation'
MODEL_NAME = 'example_model'

client.add_baseline(
  project_id=PROJECT_NAME,
  model_id=MODEL_NAME,
  baseline_id=BASELINE_NAME,
  type=BaselineType.ROLLING_PRODUCTION,
  offset=WindowSize.ONE_MONTH, # How far back to set our window
  window_size=WindowSize.ONE_WEEK, # Size of the sliding window
)
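
The DATASET_NAME defined above would instead be used when creating a static baseline from a pre-production dataset; a hedged sketch, assuming BaselineType.PRE_PRODUCTION and the dataset_id argument exist in your client version:

client.add_baseline(
  project_id=PROJECT_NAME,
  model_id=MODEL_NAME,
  baseline_id='example_preproduction',  # hypothetical baseline name
  type=BaselineType.PRE_PRODUCTION,     # assumed enum member; verify in your client
  dataset_id=DATASET_NAME,              # the validation dataset defined above
)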


fdl.CustomFeatureType
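
Each derived feature class below maps to a member of this enum; a minimal sketch (the member name is an assumption; verify against your client version):

# Hypothetical: FROM_COLUMNS is an assumed member name; check your fiddler-client version.
feature_type = fdl.CustomFeatureType.FROM_COLUMNS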



fdl.CustomFeature

This is the base class that all other custom features inherit from. It is flexible enough to accommodate different types of derived features. Note: all of the derived feature classes (e.g., Multivariate, VectorFeature) inherit from CustomFeature and thus have its properties, in addition to their own specific ones.

# use from_columns helper function to generate a custom feature combining multiple numeric columns

feature = fdl.CustomFeature.from_columns(
    name='my_feature',
    columns=['column_1', 'column_2'],
    n_clusters=5
)

fdl.Multivariate

Represents custom features derived from multiple columns.

multivariate_feature = fdl.Multivariate(
    name='multi_feature',
    columns=['column_1', 'column_2']
)

fdl.VectorFeature

Represents custom features derived from a single vector column.

vector_feature = fdl.VectorFeature(
    name='vector_feature',
    column='vector_column'
)

fdl.TextEmbedding

Represents custom features derived from text embeddings.

text_embedding_feature = fdl.TextEmbedding(
    name='text_custom_feature',
    source_column='text_column',
    column='text_embedding',
    n_tags=10
)

fdl.ImageEmbedding

Represents custom features derived from image embeddings.

image_embedding_feature = fdl.ImageEmbedding(
    name='image_feature',
    source_column='image_url',
    column='image_embedding',
)


fdl.Enrichment (private preview)

  • Enrichments are custom features designed to augment data provided in events.

  • They automatically add new computed columns to your published data whenever they are defined.

  • The new columns generated are available for querying in analyze, charting, and alerting, similar to any other column.

# Automatically generating an embedding for a column named "question"

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='question_embedding',
            enrichment='embedding',
            columns=['question'],
        ),
        fdl.TextEmbedding(
            name='question_cf',
            source_column='question',
            column='question_embedding',
        ),
    ]
)

Note

Enrichments are disabled by default. To enable them, contact your administrator. Failing to do so will result in an error during the add_model call.


Embedding (private preview)

  • Create an embedding for a string column using an embedding model.

  • Supports sentence transformers and encoder/decoder NLP transformers from Hugging Face.

  • To enable, set the enrichment parameter to embedding.

  • For each embedding enrichment, if you want to monitor the embedding vector in Fiddler, you MUST create a corresponding TextEmbedding using the enrichment's output column.

Requirements:

  • Access to the Hugging Face inference endpoint - https://api-inference.huggingface.co

  • A Hugging Face API token

Supported Models:

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
          name='Question Embedding', # name of the enrichment, will be the vector col
          enrichment='embedding',
          columns=['question'], # only one allowed per embedding enrichment, must be a text column in dataframe
          config={  # optional
            'model_name': ...,  # default: 'thenlper/gte-base'
            'pooling_method': ...,  # choose from '{first/last/mean}_token'; only required if NOT using a sentence transformer
          }
      ),
      fdl.TextEmbedding(
        name='question_cf', # name of the text embedding custom feature
        source_column='question', # source - raw text
        column='Question Embedding',  # the name of the vector - output of the embedding enrichment
      ),
    ]
)

The above example will lead to the generation of a new column:

  • FDL Question Embedding (vector): embeddings corresponding to the string column question

Note

In the context of Hugging Face models, particularly transformer-based models used for generating embeddings, the pooling_method determines how the model processes the output of its layers to produce a single vector representation for input sequences (like sentences or documents). This is crucial when using these models for tasks like sentence or document embedding, where you need a fixed-size vector representation regardless of the input length.
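
For instance, when embedding with an encoder model that is not a sentence transformer, both config fields from the example above would be spelled out; a minimal sketch (the model name is illustrative):

fdl.Enrichment(
    name='Question Embedding',
    enrichment='embedding',
    columns=['question'],
    config={
        'model_name': 'bert-base-uncased',  # illustrative Hugging Face encoder
        'pooling_method': 'mean_token',     # required here since this is not a sentence transformer
    },
)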


Centroid Distance (private preview)

  • Fiddler uses a K-Means-based system to determine which cluster a particular CustomFeature belongs to.

  • The Centroid Distance enrichment calculates the distance from the closest centroid calculated by model monitoring.

  • A new numeric column with distances to the closest centroid is added to the events table.

  • To enable, set the enrichment parameter to centroid_distance.

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
        name='question_embedding',
        enrichment='embedding',
        columns=['question'],
      ),
      fdl.TextEmbedding(
          name='question_cf',
          source_column='question',
          column='question_embedding',
      ),
      fdl.Enrichment(
        name='Centroid Distance',
        enrichment='centroid_distance',
        columns=['question_cf'],
      ),
    ]
)

The above example will lead to the generation of a new column:

  • FDL Centroid Distance (question_embedding) (float): distance from the nearest K-Means centroid present in question_embedding

Note

This enrichment does not calculate membership for pre-production data, so you cannot calculate drift.


Personally Identifiable Information (private preview)

The PII (Personally Identifiable Information) enrichment is a critical tool designed to detect and flag the presence of sensitive information within textual data. Whether user-entered or system-generated, this enrichment aims to identify instances where PII might be exposed, helping to prevent privacy breaches and the potential misuse of personal data. In an era where digital privacy concerns are paramount, mishandling or unintentionally leaking PII can have serious repercussions, including privacy violations, identity theft, and significant legal and reputational damage.

Regulatory frameworks such as the General Data Protection Regulation (GDPR) in the European Union and the Health Insurance Portability and Accountability Act (HIPAA) in the United States underscore the necessity of safeguarding PII. These laws enforce strict guidelines on the collection, storage, and processing of personal data, emphasizing the need for robust measures to protect sensitive information.

The inadvertent inclusion of PII in datasets used for training or interacting with large language models (LLMs) can exacerbate the risks associated with data privacy. Once exposed to an LLM, sensitive information can be inadvertently learned by the model, potentially leading to wider dissemination of this data beyond intended confines. This scenario underscores the importance of preemptively identifying and removing PII from data before it is processed or shared, particularly in contexts involving AI and machine learning.

To mitigate the risks associated with PII exposure, organizations and developers can integrate the PII enrichment into their data processing workflows. This enrichment operates by scanning text for patterns and indicators of personal information, flagging potentially sensitive data for review or anonymization. By proactively identifying PII, stakeholders can take necessary actions to comply with privacy laws, safeguard individuals' data, and prevent the unintended spread of personal information through AI models and other digital platforms. Implementing PII detection and management practices is not just a legal obligation but a critical component of responsible data stewardship in the digital age.

  • To enable, set the enrichment parameter to pii.

Requirements

  • Reachability to https://github.com/explosion/spacy-models/releases/download/ to download spaCy models as required

List of PII entities

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
        name='Rag PII',
        enrichment='pii',
        columns=['question'], # one or more columns
        allow_list=['fiddler'], # Optional: list of strings that are white listed
        score_threshold=0.85, # Optional: float value for minimum possible confidence
      ),
    ]
)

The above example will lead to the generation of new columns:

  1. FDL Rag PII (question) (bool): whether any PII was detected

  2. FDL Rag PII (question) Matches (str): the matches in the raw text that were flagged as potential PII (e.g., 'Douglas MacArthur,Korean')

  3. FDL Rag PII (question) Entities (str): the entities these matches were tagged as (e.g., 'PERSON')

Note

The PII enrichment is integrated with Presidio.


Evaluate (private preview)

  • Calculates classic metrics for evaluating QA results, such as BLEU, ROUGE, and METEOR.

  • To enable, set the enrichment parameter to evaluate.

  • Make sure the reference_col and prediction_col are set in the config of the Enrichment.

Here is a summary of the three evaluation metrics for natural language generation:

  • BLEU: precision-oriented n-gram overlap between the generated and reference text.

  • ROUGE: recall-oriented overlap based on n-grams and longest common subsequences.

  • METEOR: unigram matching that also accounts for stemming and synonyms.

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
        name='QA Evaluate',
        enrichment='evaluate',
        columns=['correct_answer', 'generated_answer'],
        config={
            'reference_col': 'correct_answer', # required
            'prediction_col': 'generated_answer', # required
            'metrics': ..., # optional, default - ['bleu', 'rouge' , 'meteor']
        }
      ),
    ]
)

The above example generates six new columns:

  • FDL QA Evaluate (bleu) (float)

  • FDL QA Evaluate (rouge1) (float)

  • FDL QA Evaluate (rouge2) (float)

  • FDL QA Evaluate (rougel) (float)

  • FDL QA Evaluate (rougelsum) (float)

  • FDL QA Evaluate (meteor) (float)


Textstat (private preview)

Generates statistics on string columns.

  • To enable, set the enrichment parameter to textstat.

**Supported Statistics**

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
          name='Text Statistics',
          enrichment='textstat',
          columns=['question'],
          config={
            'statistics': [
              'char_count',
              'dale_chall_readability_score',
            ]
          },
      ),
    ]
)

The above example leads to the creation of two additional columns:

  • FDL Text Statistics (question) char_count (int): character count of the string in the question column

  • FDL Text Statistics (question) dale_chall_readability_score (float): readability score of the string in the question column


Sentiment (private preview)

Sentiment Analysis enrichment employs advanced natural language processing (NLP) techniques to gauge the emotional tone behind a body of text. This enrichment is designed to determine whether the sentiment of textual content is positive, negative, or neutral, providing valuable insights into the emotions and opinions expressed within. By analyzing the sentiment, this tool offers a powerful means to understand user feedback, market research responses, social media commentary, and any textual data where opinion and mood are significant.

Implementing Sentiment Analysis into your data processing allows for a nuanced understanding of how your audience feels about a product, service, or topic, enabling informed decision-making and strategy development. It's particularly useful in customer service and brand management, where gauging customer sentiment is crucial for addressing concerns, improving user experience, and building brand reputation.

The Sentiment enrichment uses NLTK's VADER lexicon to generate a score and corresponding sentiment for all specified columns. For each string column on which the sentiment enrichment is enabled, two additional columns are added. To enable, set the enrichment parameter to sentiment.

Requirements

  • Reachability to www.nltk.org/nltk_data to download the latest VADER lexicon

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
          name='Question Sentiment',
          enrichment='sentiment',
          columns=['question'],
      ),
    ]
)

The above example leads to the creation of two columns:

  • FDL Question Sentiment (question) compound (float): raw sentiment score

  • FDL Question Sentiment (question) sentiment (string): one of positive, negative, or neutral


Profanity (private preview)

The Profanity enrichment is designed to detect and flag the use of offensive or inappropriate language within textual content. This enrichment is essential for maintaining the integrity and professionalism of digital platforms, forums, social media, and any user-generated content areas. It helps ensure that conversations and interactions remain respectful and free from language that could be considered harmful or offensive to users.

In the digital space, where diverse audiences come together, the presence of profanity can lead to negative user experiences, damage brand reputation, and create an unwelcoming environment. Implementing a profanity filter is a proactive measure to prevent such outcomes, promoting a positive and inclusive online community.

Beyond maintaining community standards, the Profanity enrichment has practical implications for compliance with platform guidelines and legal regulations concerning hate speech and online conduct. Many digital platforms have strict policies against the use of profane or offensive language, making it crucial for content creators and moderators to actively monitor and manage such language.

By integrating the Profanity enrichment into their content moderation workflow, businesses and content managers can automate the detection of inappropriate language, significantly reducing manual review efforts. This enrichment not only helps in upholding community guidelines and legal standards but also supports the creation of safer and more respectful online spaces for all users.

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
            name='Profanity',
            enrichment='profanity',
            columns=['prompt', 'response'],
            config={'output_column_name': 'contains_profanity'},
        ),
    ]
)

The above example leads to the creation of two columns:

  • FDL Profanity (prompt) contains_profanity (bool): indicates whether the value of the prompt column contains profanity

  • FDL Profanity (response) contains_profanity (bool): indicates whether the value of the response column contains profanity


Answer Relevance (private preview)

The Answer Relevance is a specialized enrichment designed to evaluate the pertinence of AI-generated responses to their corresponding prompts. This enrichment operates by assessing whether the content of a response accurately addresses the question or topic posed by the initial prompt, providing a simple yet effective binary outcome: relevant or not relevant. Its primary function is to ensure that the output of AI systems, such as chatbots, virtual assistants, and content generation models, remains aligned with the user's informational needs and intentions.

In the context of AI-generated content, ensuring relevance is crucial for maintaining user engagement and trust. Irrelevant or tangentially related responses can lead to user frustration, decreased satisfaction, and diminished trust in the AI's capabilities. The Answer Relevance metric serves as a critical checkpoint, verifying that interactions and content deliveries meet the expected standards of accuracy and pertinence.

This enrichment finds its application across a wide range of AI-driven platforms and services where the quality of the response directly impacts the user experience. From customer service bots answering inquiries to educational tools providing study assistance, the ability to automatically gauge the relevance of responses enhances the effectiveness and reliability of these services.

Incorporating the Answer Relevance enrichment into the development and refinement of AI models enables creators to iteratively improve their systems based on relevant feedback. By identifying instances where the model generates non-relevant responses, developers can adjust and fine-tune their models to better meet user expectations. This continuous improvement cycle is essential for advancing the quality and utility of AI-generated content, ensuring that it remains focused, accurate, and highly relevant to users' needs.

  • To enable, set the enrichment parameter to answer_relevance.

Requirements:

  • This enrichment requires access to the OpenAI API, which may introduce latency due to network communication and processing time. Learn more about LLM-based enrichments.

  • An OpenAI API access token must be provided by the user.

answer_relevance_config = {
  'prompt' : 'prompt_col',
  'response' : 'response_col',
}

fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features = [
      fdl.Enrichment(
          name = 'Answer Relevance',
          enrichment = 'answer_relevance',
          columns = ['prompt_col', 'response_col'],
          config = answer_relevance_config,
      ),
    ]
)