- Generates statistics on string columns.
- To enable, set the `enrichment` parameter to `textstat`.

Supported Statistics

Statistic | Description | Usage |
---|---|---|
char_count | Total number of characters in text, including everything. | Assessing text length, useful for platforms with character limits. |
letter_count | Total number of letters only, excluding numbers, punctuation, spaces. | Gauging text complexity, used in readability formulas. |
miniword_count | Count of short words (usually 1-3 letters). | Specific readability analyses, especially for simple texts. |
words_per_sentence | Average number of words in each sentence. | Understanding sentence complexity and structure. |
polysyllabcount | Number of words with three or more syllables. | Analyzing text complexity, used in some readability scores. |
lexicon_count | Total number of words in the text. | General text analysis, assessing overall word count. |
syllable_count | Total number of syllables in the text. | Used in readability formulas, measures text complexity. |
sentence_count | Total number of sentences in the text. | Analyzing text structure, used in readability scores. |
flesch_reading_ease | Readability score indicating how easy a text is to read (higher scores = easier). | Assessing readability for a general audience. |
smog_index | Measures years of education needed to understand a text. | Evaluating text complexity, especially for higher education texts. |
flesch_kincaid_grade | Grade level associated with the complexity of the text. | Educational settings, determining appropriate grade level for texts. |
coleman_liau_index | Grade level needed to understand the text based on sentence length and letter count. | Assessing readability for educational purposes. |
automated_readability_index | Estimates the grade level needed to comprehend the text. | Evaluating text difficulty for educational materials. |
dale_chall_readability_score | Assesses text difficulty based on a list of familiar words for average American readers. | Determining text suitability for average readers. |
difficult_words | Number of words not on a list of commonly understood words. | Analyzing text difficulty, especially for non-native speakers. |
linsear_write_formula | Readability formula estimating grade level of text based on sentence length and easy word count. | Simplifying texts, especially for lower reading levels. |
gunning_fog | Estimates the years of formal education needed to understand the text. | Assessing text complexity, often for business or professional texts. |
long_word_count | Number of words longer than a certain length (often 6 or 7 letters). | Evaluating complexity and sophistication of language used. |
monosyllabcount | Count of words with only one syllable. | Readability assessments, particularly for simpler texts. |
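
To make the simpler counts in the table concrete, here is a minimal Python sketch that approximates a handful of them using only the standard library. It is an illustration under stated assumptions, not the implementation behind the enrichment: the function name `simple_text_stats`, the regex-based word and sentence splitting, and the `miniword_max_len` parameter are all hypothetical, and the values the enrichment actually reports may differ (for example in how whitespace or punctuation is counted).

```python
import re


def simple_text_stats(text: str, miniword_max_len: int = 3) -> dict:
    """Approximate a few of the simpler statistics from the table above.

    Hypothetical helper for illustration only; tokenization rules here
    are assumptions and may not match the enrichment's own rules.
    """
    # Words: runs of letters, digits, or apostrophes (a simplification).
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Sentences: split on ., !, or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    # Avoid division by zero for empty input.
    n_sentences = max(len(sentences), 1)
    return {
        "char_count": len(text),  # every character, spaces included
        "letter_count": sum(ch.isalpha() for ch in text),
        "lexicon_count": n_words,
        "miniword_count": sum(len(w) <= miniword_max_len for w in words),
        "sentence_count": len(sentences),
        "words_per_sentence": n_words / n_sentences,
    }


stats = simple_text_stats("What is the capital of France? It is Paris.")
for name, value in stats.items():
    print(f"{name}: {value}")
```

The readability scores in the table (such as `flesch_reading_ease` or `gunning_fog`) build on counts like these plus syllable counts, which need a more careful heuristic than this sketch attempts.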
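
    import fiddler as fdl

    # `dataset_info` is assumed to be an existing DatasetInfo object
    # describing a dataset that contains the string column 'question'.
    model_info = fdl.ModelInfo.from_dataset_info(
        dataset_info=dataset_info,
        display_name='llm_model',
        model_task=fdl.core_objects.ModelTask.LLM,
        custom_features=[
            fdl.Enrichment(
                name='Text Statistics',
                enrichment='textstat',  # enables the text statistics enrichment
                columns=['question'],   # string column(s) to analyze
                config={
                    # statistics to compute, chosen from the table above
                    'statistics': [
                        'char_count',
                        'dale_chall_readability_score',
                    ]
                },
            ),
        ],
    )
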
The above example creates two additional columns:
- `FDL Text Statistics (question) char_count` (int): character count of the string in the `question` column
- `FDL Text Statistics (question) dale_chall_readability_score` (float): readability score of the string in the `question` column
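
The two generated column names appear to follow the pattern `FDL <enrichment name> (<column>) <statistic>`. The sketch below reproduces that pattern so you can predict the names of the columns to expect; the helper `enrichment_column_name` is hypothetical, and the pattern is inferred only from the two names shown above rather than from a documented rule.

```python
def enrichment_column_name(enrichment_name: str, column: str, statistic: str) -> str:
    """Build the expected output column name for one statistic.

    Assumption: the naming pattern is inferred from the example above.
    """
    return f"FDL {enrichment_name} ({column}) {statistic}"


# Mirror the configuration from the example: one input column, two statistics.
names = [
    enrichment_column_name("Text Statistics", "question", statistic)
    for statistic in ["char_count", "dale_chall_readability_score"]
]
for name in names:
    print(name)
```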