- Calculates classic metrics for evaluating QA results, such as BLEU, ROUGE, and METEOR.
- To enable, set the enrichment parameter to `evaluate`.
- Make sure `reference_col` and `prediction_col` are set in the `config` of the Enrichment.
Here is a summary of the three evaluation metrics for natural language generation:
| Metric | Description | Strengths | Limitations |
|---|---|---|---|
| `bleu` | Measures precision of word n-grams between generated and reference texts | Simple, fast, widely used | Ignores recall, meaning, and word order |
| `rouge` | Measures recall of word n-grams and longest common subsequences | Captures more information than BLEU | Still relies on word matching, not semantic similarity |
| `meteor` | Incorporates recall, precision, and additional semantic matching based on stems and paraphrasing | More robust and flexible than BLEU and ROUGE | Requires linguistic resources and alignment algorithms |
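To make the precision/recall distinction above concrete, here is a minimal plain-Python sketch (not the library's actual implementation) of the clipped n-gram precision at the heart of BLEU and the n-gram recall at the heart of ROUGE-N. METEOR additionally aligns stems, synonyms, and paraphrases, which requires external linguistic resources and is omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(prediction, reference, n=1):
    """Clipped n-gram precision, the core of BLEU (brevity penalty omitted)."""
    pred, ref = ngrams(prediction.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    total = sum(pred.values())
    return overlap / total if total else 0.0

def ngram_recall(prediction, reference, n=1):
    """N-gram recall, the core of ROUGE-N."""
    pred, ref = ngrams(prediction.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, pred[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Note how a short prediction that copies part of the reference scores perfect precision but low recall, which is exactly why BLEU alone can be misleading for QA outputs.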
```python
fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='QA Evaluate',
            enrichment='evaluate',
            columns=['correct_answer', 'generated_answer'],
            config={
                'reference_col': 'correct_answer',     # required
                'prediction_col': 'generated_answer',  # required
                'metrics': ...,  # optional, default - ['bleu', 'rouge', 'meteor']
            },
        ),
    ],
)
```
The above example generates 6 new columns:

- `FDL QA Evaluate (bleu)` (float)
- `FDL QA Evaluate (rouge1)` (float)
- `FDL QA Evaluate (rouge2)` (float)
- `FDL QA Evaluate (rougel)` (float)
- `FDL QA Evaluate (rougelsum)` (float)
- `FDL QA Evaluate (meteor)` (float)