Textstat (beta)

  • Generates statistics on string columns.
  • To enable, set the enrichment parameter to textstat.

Supported Statistics

Statistic | Description | Usage
--- | --- | ---
char_count | Total number of characters in the text, including everything. | Assessing text length; useful for platforms with character limits.
letter_count | Total number of letters only, excluding numbers, punctuation, and spaces. | Gauging text complexity; used in readability formulas.
miniword_count | Count of small words (usually 1-3 letters). | Specific readability analyses, especially for simplistic texts.
words_per_sentence | Average number of words in each sentence. | Understanding sentence complexity and structure.
polysyllabcount | Number of words with three or more syllables. | Analyzing text complexity; used in some readability scores.
lexicon_count | Total number of words in the text. | General text analysis; assessing overall word count.
syllable_count | Total number of syllables in the text. | Used in readability formulas; measures text complexity.
sentence_count | Total number of sentences in the text. | Analyzing text structure; used in readability scores.
flesch_reading_ease | Readability score indicating how easy a text is to read (higher scores = easier). | Assessing readability for a general audience.
smog_index | Measures the years of education needed to understand a text. | Evaluating text complexity, especially for higher education texts.
flesch_kincaid_grade | Grade level associated with the complexity of the text. | Educational settings; determining the appropriate grade level for texts.
coleman_liau_index | Grade level needed to understand the text, based on sentence length and letter count. | Assessing readability for educational purposes.
automated_readability_index | Estimates the grade level needed to comprehend the text. | Evaluating text difficulty for educational materials.
dale_chall_readability_score | Assesses text difficulty based on a list of words familiar to average American readers. | Determining text suitability for average readers.
difficult_words | Number of words not on a list of commonly understood words. | Analyzing text difficulty, especially for non-native speakers.
linsear_write_formula | Readability formula estimating the grade level of a text based on sentence length and easy word count. | Simplifying texts, especially for lower reading levels.
gunning_fog | Estimates the years of formal education needed to understand the text. | Assessing text complexity, often for business or professional texts.
long_word_count | Number of words longer than a certain length (often 6 or 7 letters). | Evaluating the complexity and sophistication of the language used.
monosyllabcount | Count of words with only one syllable. | Readability assessments, particularly for simpler texts.
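
To make a few of the simpler counts concrete, the sketch below approximates them in plain Python. It follows the descriptions in the table rather than the enrichment's actual implementation, which may tokenize words and sentences differently. The function bodies and the sample string are assumptions for illustration only.

```python
import re


def char_count(text: str) -> int:
    """Count every character, including spaces and punctuation."""
    return len(text)


def letter_count(text: str) -> int:
    """Count alphabetic characters only."""
    return sum(1 for ch in text if ch.isalpha())


def lexicon_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def sentence_count(text: str) -> int:
    """Treat each run of '.', '!' or '?' as one sentence end (a simplification)."""
    return max(1, len(re.findall(r"[.!?]+", text)))


def words_per_sentence(text: str) -> float:
    """Average number of words per sentence."""
    return lexicon_count(text) / sentence_count(text)


# Hypothetical sample value for a string column such as 'question'.
sample = "What is Fiddler? It monitors ML models."

stats = {
    "char_count": char_count(sample),
    "letter_count": letter_count(sample),
    "lexicon_count": lexicon_count(sample),
    "sentence_count": sentence_count(sample),
    "words_per_sentence": words_per_sentence(sample),
}
print(stats)
```

The readability scores in the table (for example flesch_reading_ease) combine counts like these, so differences in tokenization can shift the scores slightly.
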
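
import fiddler as fdl

# dataset_info is assumed to be an existing DatasetInfo object for the dataset.
model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='Text Statistics',
            enrichment='textstat',
            columns=['question'],  # string column(s) to analyze
            config={
                # statistics to compute, chosen from the supported list
                'statistics': [
                    'char_count',
                    'dale_chall_readability_score',
                ]
            },
        ),
    ],
)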

The above example creates two additional columns:

  • FDL Text Statistics (question) char_count (int): character count of the string in the
    question column
  • FDL Text Statistics (question) dale_chall_readability_score (float): readability score of the
    string in the question column
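
The two column names above appear to follow the pattern "FDL <enrichment name> (<input column>) <statistic>". As a minimal sketch, assuming that pattern holds for every requested statistic, the expected output column names can be derived ahead of time. The helper enrichment_column_names is hypothetical and not part of the Fiddler client.

```python
from typing import List


def enrichment_column_names(name: str, column: str, statistics: List[str]) -> List[str]:
    """Build expected output column names.

    Assumes the 'FDL <name> (<column>) <statistic>' pattern inferred
    from the example; this is not a documented guarantee.
    """
    return [f"FDL {name} ({column}) {stat}" for stat in statistics]


expected_columns = enrichment_column_names(
    name="Text Statistics",
    column="question",
    statistics=["char_count", "dale_chall_readability_score"],
)

for col in expected_columns:
    print(col)
```

A list like this can be handy when checking that the enriched columns show up after publishing events.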