Textstat (beta)

Generates statistics on string columns.
To enable it, set the enrichment parameter to textstat.

Supported Statistics

Statistic	Description	Usage
char_count	Total number of characters in text, including everything.	Assessing text length, useful for platforms with character limits.
letter_count	Total number of letters only, excluding numbers, punctuation, spaces.	Gauging text complexity, used in readability formulas.
miniword_count	Count of small words (usually 1-3 letters).	Specific readability analyses, especially for simplistic texts.
words_per_sentence	Average number of words in each sentence.	Understanding sentence complexity and structure.
polysyllabcount	Number of words with three or more syllables.	Analyzing text complexity, used in some readability scores.
lexicon_count	Total number of words in the text.	General text analysis, assessing overall word count.
syllable_count	Total number of syllables in the text.	Used in readability formulas, measures text complexity.
sentence_count	Total number of sentences in the text.	Analyzing text structure, used in readability scores.
flesch_reading_ease	Readability score indicating how easy a text is to read (higher scores = easier).	Assessing readability for a general audience.
smog_index	Measures years of education needed to understand a text.	Evaluating text complexity, especially for higher education texts.
flesch_kincaid_grade	Grade level associated with the complexity of the text.	Educational settings, determining appropriate grade level for texts.
coleman_liau_index	Grade level needed to understand the text based on sentence length and letter count.	Assessing readability for educational purposes.
automated_readability_index	Estimates the grade level needed to comprehend the text.	Evaluating text difficulty for educational materials.
dale_chall_readability_score	Assesses text difficulty based on a list of familiar words for average American readers.	Determining text suitability for average readers.
difficult_words	Number of words not on a list of commonly understood words.	Analyzing text difficulty, especially for non-native speakers.
linsear_write_formula	Readability formula estimating grade level of text based on sentence length and easy word count.	Simplifying texts, especially for lower reading levels.
gunning_fog	Estimates the years of formal education needed to understand the text.	Assessing text complexity, often for business or professional texts.
long_word_count	Number of words longer than a certain length (often 6 or 7 letters).	Evaluating complexity and sophistication of language used.
monosyllabcount	Count of words with only one syllable.	Readability assessments, particularly for simpler texts.
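
To make the simpler count-based statistics concrete, here is a minimal sketch in plain Python that follows the descriptions in the table above. It is an illustration only and not the enrichment's actual implementation: the function name basic_text_stats is hypothetical, words are split on whitespace, and sentences are split on runs of '.', '!' or '?', so the values Fiddler reports may differ.

```python
import re


def basic_text_stats(text: str) -> dict:
    """Approximate a few count-based statistics (hypothetical helper).

    Assumptions: words are whitespace-separated tokens and sentences end
    at runs of '.', '!' or '?'. The real enrichment may tokenize differently.
    """
    words = text.split()
    # Drop empty fragments left over after the final terminator.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Avoid division by zero for empty or unterminated input.
    sentence_count = max(len(sentences), 1)
    return {
        "char_count": len(text),  # every character, per the table's description
        "letter_count": sum(ch.isalpha() for ch in text),
        "lexicon_count": len(words),
        "sentence_count": sentence_count,
        "words_per_sentence": len(words) / sentence_count,
    }


stats = basic_text_stats("What is AI? It is a field of study.")
print(stats["lexicon_count"], stats["sentence_count"], stats["words_per_sentence"])
# prints: 9 2 4.5
```

The readability scores in the table (for example flesch_reading_ease or gunning_fog) build on counts like these, plus syllable counts and word lists that this sketch does not attempt to reproduce.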

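import fiddler as fdl

# dataset_info is assumed to have been created earlier for the dataset
# that contains the 'question' column.
fdl.ModelInfo.from_dataset_info(
    dataset_info=dataset_info,
    display_name='llm_model',
    model_task=fdl.core_objects.ModelTask.LLM,
    custom_features=[
        fdl.Enrichment(
            name='Text Statistics',
            enrichment='textstat',
            columns=['question'],
            config={
                # Statistics to compute for each listed column
                'statistics': [
                    'char_count',
                    'dale_chall_readability_score',
                ]
            },
        ),
    ]
)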

The above example creates two additional columns:

FDL Text Statistics (question) char_count (int): character count of the string in the question column
FDL Text Statistics (question) dale_chall_readability_score (float): readability score of the string in the question column
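
The two generated column names suggest the pattern "FDL <enrichment name> (<source column>) <statistic>". The sketch below builds names that way so they can be predicted when selecting columns downstream. The helper enrichment_column_name is hypothetical, and the pattern is inferred only from the example above, so check it against the columns actually produced.

```python
def enrichment_column_name(enrichment_name: str, source_column: str, statistic: str) -> str:
    """Build an output column name following the pattern seen in the example.

    Hypothetical helper: the pattern is inferred from the two column names
    shown above, not taken from a documented naming rule.
    """
    return f"FDL {enrichment_name} ({source_column}) {statistic}"


# One generated column per requested statistic and source column.
for stat in ["char_count", "dale_chall_readability_score"]:
    print(enrichment_column_name("Text Statistics", "question", stat))
# prints:
# FDL Text Statistics (question) char_count
# FDL Text Statistics (question) dale_chall_readability_score
```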