- Create an embedding for a string column using an embedding model.
- Supports Sentence transformers and Encoder/Decoder NLP transformers from Hugging Face.
- To enable, set the `enrichment` parameter to `embedding`.
- For each embedding enrichment, if you want to monitor the embedding vector in Fiddler, you MUST create a corresponding `TextEmbedding` using the enrichment’s output column.
Requirements:
- Access to the Hugging Face inference endpoint: https://api-inference.huggingface.co
- A Hugging Face API token
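As a rough illustration of how the two requirements fit together, the sketch below builds (but does not send) an authenticated request against the endpoint. The `/models/<model_name>` path, the `{"inputs": ...}` payload, and the `Bearer` header follow Hugging Face's public Inference API conventions and are assumptions here; this page does not describe how Fiddler itself calls the endpoint. `build_inference_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Endpoint taken from the requirements list above.
HF_INFERENCE_URL = "https://api-inference.huggingface.co"


def build_inference_request(model_name, api_token, text):
    """Build an unsent POST request for a Hugging Face hosted model.

    Assumption: the model is addressed as <endpoint>/models/<model_name>
    and the token is passed as a Bearer token.
    """
    url = f"{HF_INFERENCE_URL}/models/{model_name}"
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call, which is one way to confirm that both the endpoint and the token are reachable from your environment.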
Supported Models:
model_name | size | Type | pooling_method | Notes |
---|---|---|---|---|
BAAI/bge-small-en-v1.5 | small | Sentence Transformer | | |
sentence-transformers/all-MiniLM-L6-v2 | med | Sentence Transformer | | |
thenlper/gte-base | med | Sentence Transformer | | Default model |
gpt2 | med | Decoder NLP Transformer | last_token | |
distilgpt2 | small | Decoder NLP Transformer | last_token | |
EleutherAI/gpt-neo-125m | med | Decoder NLP Transformer | last_token | |
google/bert_uncased_L-4_H-256_A-4 | small | Encoder NLP Transformer | first_token | Smallest BERT |
bert-base-cased | med | Encoder NLP Transformer | first_token | |
distilroberta-base | med | Encoder NLP Transformer | first_token | |
xlm-roberta-large | large | Encoder NLP Transformer | first_token | Multilingual |
roberta-large | large | Encoder NLP Transformer | first_token | |
import fiddler as fdl

# dataset_info is assumed to have been created earlier for the uploaded dataset
fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='Question Embedding',  # name of the enrichment; also names the vector column
            enrichment='embedding',
            columns=['question'],  # only one allowed per embedding enrichment; must be a text column in the dataframe
            config={  # optional
                'model_name': ...,  # default: 'thenlper/gte-base'
                'pooling_method': ...,  # choose from '{first/last/mean}_token'; only required if NOT using a sentence transformer
            },
        ),
        fdl.TextEmbedding(
            name='question_cf',  # name of the text embedding custom feature
            source_column='question',  # source: raw text
            column='Question Embedding',  # name of the vector column: the output of the embedding enrichment
        ),
    ],
)
The above example generates a new column:
- `FDL Question Embedding` (vector): embeddings corresponding to the string column `question`.
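The optional `config` dictionary can be filled in from the supported models table. The following is a minimal sketch of that lookup; `POOLING_BY_MODEL` and `build_embedding_config` are hypothetical names introduced for illustration and are not part of the Fiddler client.

```python
# Pooling methods copied from the supported models table. Sentence
# transformers do not need a pooling_method, so they map to None.
POOLING_BY_MODEL = {
    "BAAI/bge-small-en-v1.5": None,
    "sentence-transformers/all-MiniLM-L6-v2": None,
    "thenlper/gte-base": None,
    "gpt2": "last_token",
    "distilgpt2": "last_token",
    "EleutherAI/gpt-neo-125m": "last_token",
    "google/bert_uncased_L-4_H-256_A-4": "first_token",
    "bert-base-cased": "first_token",
    "distilroberta-base": "first_token",
    "xlm-roberta-large": "first_token",
    "roberta-large": "first_token",
}


def build_embedding_config(model_name="thenlper/gte-base"):
    """Return a config dict for an embedding enrichment (hypothetical helper)."""
    if model_name not in POOLING_BY_MODEL:
        raise ValueError(f"unsupported model: {model_name}")
    config = {"model_name": model_name}
    pooling = POOLING_BY_MODEL[model_name]
    if pooling is not None:
        # Only non-sentence-transformer models need an explicit pooling method.
        config["pooling_method"] = pooling
    return config
```

For instance, `build_embedding_config("gpt2")` yields a dictionary with both `model_name` and `pooling_method` set, while the default call omits `pooling_method` because the default model is a sentence transformer.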
Note
In the context of Hugging Face models, particularly transformer-based models used for generating embeddings, the pooling_method determines how the model processes the output of its layers to produce a single vector representation for input sequences (like sentences or documents). This is crucial when using these models for tasks like sentence or document embedding, where you need a fixed-size vector representation regardless of the input length.
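To make the three pooling options concrete, here is a conceptual sketch in plain Python. It is not Fiddler's or Hugging Face's implementation: the function name `pool`, the list-of-lists representation, and the attention-mask argument are assumptions chosen for illustration. BERT-style models typically summarize the sequence in their first token, while GPT-style models, which read left to right, do so in their last token.

```python
def pool(token_vectors, attention_mask, pooling_method):
    """Collapse per-token vectors into one fixed-size vector.

    token_vectors: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags (1 = real token, 0 = padding).
    pooling_method: 'first_token', 'last_token', or 'mean_token'.
    """
    # Drop padding positions so they cannot influence the result.
    real = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not real:
        raise ValueError("input contains no real tokens")
    if pooling_method == "first_token":
        return list(real[0])
    if pooling_method == "last_token":
        return list(real[-1])
    if pooling_method == "mean_token":
        dim = len(real[0])
        return [sum(vec[i] for vec in real) / len(real) for i in range(dim)]
    raise ValueError(f"unknown pooling_method: {pooling_method}")


# Three real tokens followed by one padding token, each of dimension 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

first = pool(tokens, mask, "first_token")   # [1.0, 2.0]
last = pool(tokens, mask, "last_token")     # [5.0, 6.0]
mean = pool(tokens, mask, "mean_token")     # [3.0, 4.0]
```

Whatever the input length, each method returns a vector with the same dimension as a single token vector, which is the fixed-size property the note describes.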