Logic bug in difficult_words_list() #192

Open
dogweather opened this issue Jul 20, 2022 · 0 comments

dogweather commented Jul 20, 2022

I noticed this bug when I saw that scores change if I simply duplicate an input text to make it twice as long.

Each difficult word is only counted once, no matter how many times it occurs in the text. This is wrong, because the algorithms need to compute ratios such as difficult word count / total word count, and the total does include repeats. This bug is causing many of the scores to be off. This test exposes the problem:

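import textstat

# long_test is assumed to be the sample passage defined in the test module;
# 55 is taken to be the difficult-word count for a single copy of it.
def test_difficult_words_counts_duplicates():
    textstat.set_lang("en_US")
    twice_as_long = " ".join([long_test, long_test])
    result = textstat.difficult_words(twice_as_long)

    # Doubling the text should double the count.
    assert result == 2 * 55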

The bug is in difficult_words_list(), where the words are collected into a set, so repeated occurrences are discarded before counting. Changing this to a tuple fixes it.
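To illustrate the difference, here is a minimal standalone sketch. It is not textstat's actual code: `is_difficult`, `count_with_set`, and `count_with_list` are hypothetical names, and the "8 or more letters" rule is only a stand-in for the library's real difficulty check.

```python
import re

# Hypothetical stand-in for the library's difficulty check:
# a word counts as "difficult" if it has 8 or more letters.
def is_difficult(word):
    return len(word) >= 8

def count_with_set(text):
    # Collecting words into a set drops repeated occurrences,
    # so each difficult word is counted at most once.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sum(1 for w in words if is_difficult(w))

def count_with_list(text):
    # Keeping every occurrence lets the count grow with the text.
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if is_difficult(w))

sample = "The committee postponed the extraordinary meeting."
doubled = " ".join([sample, sample])

print(count_with_set(sample), count_with_set(doubled))
print(count_with_list(sample), count_with_list(doubled))
```

With the set-based version the count stays the same when the text is doubled, while the total word count doubles, so any ratio built from the two is halved. The list-based version scales both together.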

I wrote a PR: #193
