Inflections and the Dale-Chall-Formula #150

Open
LKirst opened this issue Aug 11, 2021 · 5 comments

@LKirst
Contributor

LKirst commented Aug 11, 2021

The textstat implementation of the Dale-Chall-Formula classifies several words as difficult words that the original Dale-Chall-Formula would not. For example, Scotland, returned, giants, giant's, strongest are returned as part of textstat.difficult_words_list(text), even though the base forms return, giant, strong are all part of the easy words list.
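To illustrate the effect, here is a minimal sketch of plain exact-match lookup. It is not textstat's actual code; `EASY_WORDS` and `difficult_by_exact_match` are hypothetical names, and the word set is a tiny made-up excerpt of the easy-word list.

```python
# Hypothetical excerpt of an easy-word list; the real list is far longer.
EASY_WORDS = {"return", "giant", "strong"}


def difficult_by_exact_match(words, easy_words=EASY_WORDS):
    """Return the words whose lowercase form is not literally on the easy list."""
    return [w for w in words if w.lower() not in easy_words]


words = ["Scotland", "returned", "giants", "giant's", "strongest", "giant"]
flagged = difficult_by_exact_match(words)
print(flagged)
```

Only the bare form `giant` passes; every inflected form is flagged even though its base word is on the list.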

Dale and Chall (1948, pp. 38-49) suggest that the following word forms should be considered familiar:

  • names of persons and places
  • regular plurals and possessives of words on the list
  • the third-person, singular forms (s or ies from y), present-participle forms (ing), past-participle forms (n), and past-tense forms (ed or ied from y), when these are added to verbs appearing on the list
  • comparatives and superlatives of adjectives appearing on the list
  • adverbs formed by adding ly to a word on the list

The complete list of rules can be found in Dale & Chall (1948).
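A rough sketch of how the regular-inflection rules listed above might be approximated with suffix rules. Everything here is an assumption for illustration: `EASY_WORDS` is a made-up excerpt, `SUFFIX_RULES`, `base_candidates` and `is_familiar` are hypothetical names, and the rule about names of persons and places is not covered, since it cannot be decided from spelling alone.

```python
# Hypothetical excerpt; the real easy-word list is much longer.
EASY_WORDS = {"return", "giant", "strong", "carry", "make",
              "run", "quick", "happy", "know"}

# (suffix, replacement) pairs for the regular inflections listed above.
SUFFIX_RULES = [
    ("ies", "y"), ("es", ""), ("s", ""),   # plurals, third-person singular
    ("ied", "y"), ("ed", ""),              # past tense
    ("ing", ""), ("n", ""),                # present and past participles
    ("ier", "y"), ("er", ""),              # comparatives
    ("iest", "y"), ("est", ""),            # superlatives
    ("ly", ""),                            # adverbs
]


def base_candidates(word):
    """Yield possible base forms of `word` under the suffix rules."""
    yield word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)] + replacement
            yield stem                      # return+ed -> return
            yield stem + "e"                # mak+ing -> make
            if len(stem) > 2 and stem[-1] == stem[-2]:
                yield stem[:-1]             # runn+ing -> run


def is_familiar(word, easy_words=EASY_WORDS):
    """True if the word or one of its candidate base forms is on the list."""
    word = word.lower()
    # Possessives: giant's -> giant, giants' -> giants
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    return any(c in easy_words for c in base_candidates(word))


for w in ["returned", "giants", "giant's", "strongest", "Scotland"]:
    print(w, is_familiar(w))
```

Suffix rules like these will also accept some words they should not, so the result would still need checking against the full rules in the paper.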

I understand that most of these rules are not easy to implement in the textstat package. To avoid confusion, and perhaps to prompt users to check the list returned by textstat.difficult_words_list(text), could the README point out the deviation from the original Dale & Chall formula?

Source: Dale, E., & Chall, J. (1948). A Formula for Predicting Readability: Instructions. Educational Research Bulletin, 27(2), 37-54. Retrieved August 11, 2021, from http://www.jstor.org/stable/1473669

@alxwrd
Member

alxwrd commented Aug 11, 2021

Hi @LKirst, thank you for raising this!

We currently have an open issue (#73) touching on difficult word usage. We currently have 4 methods/metrics that use difficult_words:

  • dale_chall_readability_score
  • gunning_fog
  • spache_readability
  • dale_chall_readability_score_v2

Maybe this area could do with a revisit; it may not make sense to use the same difficult_words method for everything.

@dogweather

dogweather commented Jul 19, 2022

I believe this is a problem that stemming solves. E.g.:

  1. The Dale and Chall wordlist is converted to a set of the stems of the words.
  2. An input text's words are each mapped to their stem.
  3. Each word is then judged to be simple if its stem is in the Dale and Chall stem list (as opposed to the word itself being present in the Dale and Chall word list).
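The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `naive_stem` is a toy suffix stripper standing in for a real stemmer, `difficult_words` is a hypothetical helper rather than textstat's method, and the word lists are made up.

```python
def naive_stem(word):
    """Toy stemmer that strips one common suffix; a stand-in for a real one."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in ("est", "ing", "ed", "ly", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def difficult_words(words, easy_words, stem=naive_stem):
    # Step 1: convert the easy-word list to a set of stems.
    easy_stems = {stem(w) for w in easy_words}
    # Steps 2 and 3: stem each input word and look the stem up.
    return [w for w in words if stem(w) not in easy_stems]


easy = ["return", "giant", "strong"]
text_words = ["returned", "giants", "giant's", "strongest", "castle"]
result = difficult_words(text_words, easy)
print(result)
```

Because the stemmer is passed in as a parameter, a real one such as NLTK's `PorterStemmer().stem` could be swapped in for `naive_stem`. Note that a general-purpose stemmer may also conflate forms that the original rules would not count as familiar.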

@LKirst
Contributor Author

LKirst commented Jul 19, 2022

Great idea. Could we separate regular inflection from irregular word formation using an NLTK stemmer?
Could you implement your solution?

@dogweather

> Great idea. Could we separate regular inflection from irregular word formation using an NLTK stemmer?
> Could you implement your solution?

Totally — I'll start a PR. I'll look into what NLTK supports. I can imagine providing options for the kinds of inflections accepted.

@dogweather

dogweather commented Jul 19, 2022

I found a good conversation of a similar idea implemented in Javascript:
