Logic bug in difficult_words_list() #192

dogweather · 2022-07-20T05:27:16Z

I noticed this bug when I saw that scores change if I simply duplicate an input text to make it twice as long.

Each difficult word is counted only once, no matter how many times it occurs in the text. This is wrong because the algorithms need ratios such as difficult word count / total word count, so every occurrence has to be counted. This bug is causing many of the scores to be off. This test reproduces the problem:

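import textstat

# long_test is the sample passage, assumed to be defined elsewhere in the test module.
def test_difficult_words_counts_duplicates():
    textstat.set_lang("en_US")
    twice_as_long = " ".join([long_test, long_test])
    result = textstat.difficult_words(twice_as_long)

    # Doubling the text should double the difficult word count.
    assert result == 2 * 55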

The bug is in difficult_words_list(), where a set is used to collect the words, so repeated words are dropped. Changing this to a tuple fixes it.
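To illustrate why the set matters, here is a minimal, self-contained sketch. It is not textstat's actual code: `tokenize`, `is_difficult` (a made-up rule that treats any word of 8 or more letters as difficult), and the two counting functions are hypothetical stand-ins chosen only to show the effect of deduplication.

```python
import re


def tokenize(text):
    # Simplified tokenizer (an assumption, not the library's regex).
    return re.findall(r"[\w']+", text.lower())


def is_difficult(word):
    # Hypothetical rule for this sketch: 8 or more letters is "difficult".
    return len(word) >= 8


def difficult_words_deduplicated(text):
    # Mirrors the reported bug: a set keeps each distinct word only once.
    return [w for w in set(tokenize(text)) if is_difficult(w)]


def difficult_words_with_repeats(text):
    # Keeps every occurrence, so the count scales with text length.
    return [w for w in tokenize(text) if is_difficult(w)]


def difficult_ratio(text, counter):
    # Difficult word count / total word count, as used by the scores.
    return len(counter(text)) / len(tokenize(text))


sample = "The committee considered the complicated proposal."
doubled = " ".join([sample, sample])

buggy_once = len(difficult_words_deduplicated(sample))
buggy_twice = len(difficult_words_deduplicated(doubled))
fixed_once = len(difficult_words_with_repeats(sample))
fixed_twice = len(difficult_words_with_repeats(doubled))

print(buggy_once, buggy_twice)  # same count even though the text doubled
print(fixed_once, fixed_twice)  # count doubles with the text
```

With the deduplicated version, the ratio returned by `difficult_ratio` halves when the text is duplicated, which matches the symptom of scores changing for a doubled input. With repeats kept, the ratio stays the same.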

I wrote a PR: #193

This was referenced Jul 20, 2022:

Bug fix: difficult_words_list #193 (Closed)

Investigate if all uses of difficult_words are correct #73 (Open)
