- Calculates classic metrics for evaluating QA results, such as BLEU, ROUGE, and METEOR.
- To enable, set the enrichment parameter to `evaluate`.
- Make sure `reference_col` and `prediction_col` are set in the `config` of the Enrichment.
Here is a summary of the three evaluation metrics for natural language generation:
| Metric | Description | Strengths | Limitations |
|---|---|---|---|
| `bleu` | Measures precision of word n-grams between generated and reference texts | Simple, fast, widely used | Ignores recall, meaning, and word order |
| `rouge` | Measures recall of word n-grams and longest common subsequences | Captures more information than BLEU | Still relies on word matching, not semantic similarity |
| `meteor` | Incorporates recall, precision, and additional semantic matching based on stems and paraphrasing | More robust and flexible than BLEU and ROUGE | Requires linguistic resources and alignment algorithms |
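To make the precision/recall distinction above concrete, here is a minimal plain-Python sketch (not the library's actual implementation) of the clipped n-gram precision at the heart of BLEU and the n-gram recall at the heart of ROUGE-N. METEOR additionally aligns stems, synonyms, and paraphrases, which requires external linguistic resources and is omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(prediction, reference, n=1):
    """Clipped n-gram precision, the core of BLEU (brevity penalty omitted)."""
    pred, ref = ngrams(prediction.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    total = sum(pred.values())
    return overlap / total if total else 0.0

def ngram_recall(prediction, reference, n=1):
    """N-gram recall, the core of ROUGE-N."""
    pred, ref = ngrams(prediction.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, pred[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Note how a short prediction that copies part of the reference scores perfect precision but low recall, which is exactly why BLEU alone can be misleading for QA outputs.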
```python
fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='QA Evaluate',
            enrichment='evaluate',
            columns=['correct_answer', 'generated_answer'],
            config={
                'reference_col': 'correct_answer',     # required
                'prediction_col': 'generated_answer',  # required
                'metrics': ...,  # optional, default - ['bleu', 'rouge', 'meteor']
            },
        ),
    ],
)
```
The above example generates 6 new columns:

- `FDL QA Evaluate (bleu)` (float)
- `FDL QA Evaluate (rouge1)` (float)
- `FDL QA Evaluate (rouge2)` (float)
- `FDL QA Evaluate (rougel)` (float)
- `FDL QA Evaluate (rougelsum)` (float)
- `FDL QA Evaluate (meteor)` (float)