Evaluate (beta)

  • Calculates classic metrics for evaluating QA results, such as BLEU, ROUGE, and METEOR.
  • To enable, set the enrichment parameter to evaluate.
  • Make sure reference_col and prediction_col are set in the config of the Enrichment, as shown in the example below.

Here is a summary of the three evaluation metrics for natural language generation:

| Metric | Description | Strengths | Limitations |
| --- | --- | --- | --- |
| BLEU | Measures precision of word n-grams between generated and reference texts | Simple, fast, widely used | Ignores recall, meaning, and word order |
| ROUGE | Measures recall of word n-grams and longest common subsequences | Captures more information than BLEU | Still relies on word matching, not semantic similarity |
| METEOR | Incorporates recall, precision, and additional semantic matching based on stems and paraphrasing | More robust and flexible than BLEU and ROUGE | Requires linguistic resources and alignment algorithms |
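
These scores can also be reproduced outside the enrichment for spot checks. A minimal sketch, assuming the Hugging Face evaluate package (plus its rouge_score and nltk dependencies) is installed; the example strings are illustrative only:

import evaluate  # pip install evaluate rouge_score nltk

predictions = ["Paris is the capital of France."]  # generated answers
references = ["The capital of France is Paris."]   # correct answers

# BLEU: n-gram precision; references are passed as one list per prediction.
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))

# ROUGE: returns rouge1, rouge2, rougeL, and rougeLsum scores.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# METEOR: combines precision, recall, stemming, and synonym matching.
meteor = evaluate.load("meteor")
print(meteor.compute(predictions=predictions, references=references))

The ROUGE output keys correspond to the rouge1, rouge2, rougel, and rougelsum columns produced by the enrichment, listed further below.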
import fiddler as fdl

# dataset_info is assumed to have been built beforehand, e.g. from a baseline dataset.
fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
      fdl.Enrichment(
        name='QA Evaluate',
        enrichment='evaluate',
        columns=['correct_answer', 'generated_answer'],
        config={
            'reference_col': 'correct_answer',    # required
            'prediction_col': 'generated_answer', # required
            'metrics': ...,  # optional, default - ['bleu', 'rouge', 'meteor']
        }
      ),
    ]
)
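
If only a subset of scores is needed, the optional metrics key can list just the desired metrics, presumably producing only the corresponding columns. A minimal sketch of that config, reusing the column names from the example above:

fdl.Enrichment(
    name='QA Evaluate',
    enrichment='evaluate',
    columns=['correct_answer', 'generated_answer'],
    config={
        'reference_col': 'correct_answer',
        'prediction_col': 'generated_answer',
        'metrics': ['bleu', 'meteor'],  # restrict scoring to BLEU and METEOR
    },
)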

With the default metric list, the full example above generates six new columns:

  • FDL QA Evaluate (bleu) (float)
  • FDL QA Evaluate (rouge1) (float)
  • FDL QA Evaluate (rouge2) (float)
  • FDL QA Evaluate (rougel) (float)
  • FDL QA Evaluate (rougelsum) (float)
  • FDL QA Evaluate (meteor) (float)
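
Once events carrying these columns are available as a DataFrame (for example, after exporting them for offline analysis), the scores can be filtered like any other numeric column. A hypothetical sketch; the rows, score values, and threshold below are made up purely for illustration:

import pandas as pd

# Placeholder rows with made-up scores, for illustration only.
events = pd.DataFrame({
    "generated_answer": ["Paris is the capital of France.", "Lyon is the capital of France."],
    "FDL QA Evaluate (bleu)": [0.41, 0.12],
    "FDL QA Evaluate (meteor)": [0.90, 0.35],
})

# Flag generated answers whose METEOR score falls below a chosen threshold.
low_quality = events[events["FDL QA Evaluate (meteor)"] < 0.5]
print(low_quality[["generated_answer", "FDL QA Evaluate (meteor)"]])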