Embedding (beta)

  • Create an embedding for a string column using an embedding model.
  • Supports Sentence transformers and Encoder/Decoder NLP transformers from Hugging Face.
  • To enable, set the enrichment parameter to embedding.
  • For each embedding enrichment, if you want to monitor the embedding vector in Fiddler, you MUST create a corresponding TextEmbedding that uses the enrichment's output column.

Requirements:

  • Access to the Hugging Face inference endpoint: https://api-inference.huggingface.co
  • A Hugging Face API token

Supported Models:

model_name                             | size  | Type                    | pooling_method | Notes
BAAI/bge-small-en-v1.5                 | small | Sentence Transformer    |                |
sentence-transformers/all-MiniLM-L6-v2 | med   | Sentence Transformer    |                |
thenlper/gte-base                      | med   | Sentence Transformer    |                | (default)
gpt2                                   | med   | Decoder NLP Transformer | last_token     |
distilgpt2                             | small | Decoder NLP Transformer | last_token     |
EleutherAI/gpt-neo-125m                | med   | Decoder NLP Transformer | last_token     |
google/bert_uncased_L-4_H-256_A-4      | small | Encoder NLP Transformer | first_token    | Smallest BERT
bert-base-cased                        | med   | Encoder NLP Transformer | first_token    |
distilroberta-base                     | med   | Encoder NLP Transformer | first_token    |
xlm-roberta-large                      | large | Encoder NLP Transformer | first_token    | Multilingual
roberta-large                          | large | Encoder NLP Transformer | first_token    |

import fiddler as fdl

# dataset_info is assumed to be an existing DatasetInfo object for your dataset.
fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='Question Embedding',  # name of the enrichment; this becomes the vector column
            enrichment='embedding',
            columns=['question'],  # only one column allowed per embedding enrichment; must be a text column in the dataframe
            config={  # optional
                'model_name': ...,  # default: 'thenlper/gte-base'
                'pooling_method': ...,  # one of 'first_token', 'last_token', 'mean_token'; only required if NOT using a sentence transformer
            },
        ),
        fdl.TextEmbedding(
            name='question_cf',  # name of the text embedding custom feature
            source_column='question',  # source: the raw text
            column='Question Embedding',  # name of the vector: the output of the embedding enrichment
        ),
    ],
)

The example above generates one new column:

  • FDL Question Embedding (vector): embeddings corresponding to the string column question
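
The rule that pooling_method is only required for models that are not sentence transformers can be checked before the enrichment is defined. The sketch below is a hypothetical helper, not part of the Fiddler client: the function name build_embedding_config and the constants are illustrative assumptions, and the model names and pooling values are copied from this page. It ignores any pooling_method passed for a sentence transformer, which is a simplifying choice rather than documented Fiddler behavior.

```python
# Hypothetical helper (not a Fiddler API): builds the optional `config` dict
# for an embedding enrichment and validates the pooling method locally.

# Sentence transformer models listed under "Supported Models" on this page.
SENTENCE_TRANSFORMERS = {
    "BAAI/bge-small-en-v1.5",
    "sentence-transformers/all-MiniLM-L6-v2",
    "thenlper/gte-base",
}

# Pooling values named in the example's comment: '{first/last/mean}_token'.
POOLING_METHODS = {"first_token", "last_token", "mean_token"}

DEFAULT_MODEL = "thenlper/gte-base"


def build_embedding_config(model_name=DEFAULT_MODEL, pooling_method=None):
    """Return a config dict for the given model, or raise ValueError."""
    config = {"model_name": model_name}
    if model_name in SENTENCE_TRANSFORMERS:
        # Assumption: pooling is not needed here, so any value is dropped.
        return config
    if pooling_method not in POOLING_METHODS:
        raise ValueError(
            f"{model_name!r} is not a sentence transformer; pooling_method "
            f"must be one of {sorted(POOLING_METHODS)}"
        )
    config["pooling_method"] = pooling_method
    return config
```

The returned dictionary could then be passed as the config argument of fdl.Enrichment, for example config=build_embedding_config('gpt2', 'last_token').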

Note

In the context of Hugging Face models, particularly transformer-based models used for generating embeddings, the pooling_method determines how the model processes the output of its layers to produce a single vector representation for input sequences (like sentences or documents). This is crucial when using these models for tasks like sentence or document embedding, where you need a fixed-size vector representation regardless of the input length.
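
To make the three pooling options concrete, here is a minimal pure-Python sketch of how a sequence of per-token vectors might be reduced to one fixed-size vector. It illustrates the general idea only and is not Fiddler's or Hugging Face's implementation: the function pool and the toy vectors are assumptions, and real pipelines typically also handle padding and attention masks, which this sketch does not.

```python
# Illustrative sketch of pooling; not the actual Fiddler or Hugging Face code.

def pool(token_vectors, pooling_method):
    """Reduce a list of per-token vectors to a single vector.

    token_vectors: non-empty list of equal-length lists of floats,
                   one inner list per token of the input sequence.
    pooling_method: 'first_token', 'last_token', or 'mean_token'.
    """
    if not token_vectors:
        raise ValueError("token_vectors must contain at least one token")
    if pooling_method == "first_token":
        # Use the first token's vector (for BERT-style models, the [CLS] token).
        return list(token_vectors[0])
    if pooling_method == "last_token":
        # Use the last token's vector (common for left-to-right models such as GPT-2).
        return list(token_vectors[-1])
    if pooling_method == "mean_token":
        # Average each dimension across all tokens.
        count = len(token_vectors)
        return [sum(dimension) / count for dimension in zip(*token_vectors)]
    raise ValueError(f"unknown pooling_method: {pooling_method!r}")


# Toy input: three tokens, each with a 2-dimensional vector.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
first = pool(tokens, "first_token")  # [1.0, 2.0]
last = pool(tokens, "last_token")    # [5.0, 9.0]
mean = pool(tokens, "mean_token")    # [3.0, 5.0]
```

Whatever the number of tokens, each method returns a vector with the same dimensionality as a single token vector, which is why the resulting embedding has a fixed size regardless of input length.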