Support for others languages #94

Open
lmaczulajtys opened this issue Jun 24, 2019 · 11 comments

Comments

@lmaczulajtys
Contributor

The FOG index is also applicable to the Polish language. The main difference is that in Polish, difficult words are usually those with 4 or more syllables.

I suggest adding a lang parameter to the gunning_fog function. It would be passed to syllable_count and also used to select the value of syllable_threshold.

Source:
https://pl.wikipedia.org/wiki/Indeks_czytelno%C5%9Bci_FOG (pl)
https://translate.google.com/translate?sl=pl&tl=en&u=https%3A%2F%2Fpl.wikipedia.org%2Fwiki%2FIndeks_czytelno%25C5%259Bci_FOG (en)
Unfortunately, all the sources on this are in Polish.
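A minimal sketch of what the proposed lang parameter could look like. This is not textstat's actual implementation: the vowel-group syllable counter is a naive stand-in, and the SYLLABLE_THRESHOLDS table and function bodies are assumptions made for illustration. Only the names gunning_fog, syllable_count, lang, and syllable_threshold come from the proposal above.

```python
import re

# Assumed thresholds: 3 for English (standard Gunning fog), 4 for Polish (per this issue).
SYLLABLE_THRESHOLDS = {"en_US": 3, "pl_PL": 4}
VOWELS = "aeiouyąęó"


def syllable_count(word, lang="en_US"):
    """Naive stand-in: count groups of consecutive vowels.

    `lang` is accepted to mirror the proposal but is not used by this toy counter.
    """
    return max(1, len(re.findall(f"[{VOWELS}]+", word.lower())))


def gunning_fog(text, lang="en_US"):
    """Gunning fog: 0.4 * (words per sentence + 100 * share of difficult words)."""
    threshold = SYLLABLE_THRESHOLDS.get(lang, 3)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return 0.0
    difficult = sum(1 for w in words if syllable_count(w, lang) >= threshold)
    return 0.4 * (len(words) / len(sentences) + 100 * difficult / len(words))
```

With the higher Polish threshold, fewer words count as difficult, so the same text scores lower under lang="pl_PL" than under lang="en_US".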

@alxwrd alxwrd changed the title Gunning FOG support for Polish language Support for others languages Jun 25, 2019
@alxwrd
Member

alxwrd commented Jun 25, 2019

I've done a bit of research and it appears there are language variants to the formulas for a few languages.

Because of that, I think the:

syllable_threshold = 4 if lang == 'pl_PL' else 3

might not be a long-term solution.

I will have a think about how textstat could handle other languages going forward. Something like:

import textstat
textstat.lang = "pl_PL"

All the current 'hardcoded' values for the formulas would need to be extracted and kept in a dict that could have new languages with their values added at a later stage.

langs = {
    "en_US": {
        "syllable_threshold": 3,
        # etc...
    },
    "pl_PL": {
        "syllable_threshold": 4,  # Polish difficult words have 4+ syllables
        # etc...
    },
}
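As a rough sketch of how such a dict might be consulted, the lookup below falls back to the English value when a language or key is missing. The names LANG_CONFIGS, DEFAULT_LANG, and get_lang_value are hypothetical, and the Polish threshold of 4 follows the opening comment of this issue.

```python
# Hypothetical per-language table; only syllable_threshold is filled in here.
LANG_CONFIGS = {
    "en_US": {"syllable_threshold": 3},
    "pl_PL": {"syllable_threshold": 4},
}
DEFAULT_LANG = "en_US"


def get_lang_value(lang, key):
    """Return the value for `key` in `lang`, falling back to the default language."""
    config = LANG_CONFIGS.get(lang, {})
    if key in config:
        return config[key]
    return LANG_CONFIGS[DEFAULT_LANG][key]
```

The fallback means a new language can be added with only the values that differ from English.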

@alxwrd
Member

alxwrd commented Jun 25, 2019

Based on #93, current language would also need to be passed to Pyphen.

@lmaczulajtys
Contributor Author

lmaczulajtys commented Jun 26, 2019

Because method results are cached by repoze.lru, I think we should do something like this:

import textstat
textstat.set_lang("en_US")

We should clear caches in set_lang().
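To illustrate why the caches need clearing, here is a small sketch. It uses functools.lru_cache from the standard library in place of repoze.lru, and the is_difficult helper and _THRESHOLDS table are hypothetical. Only set_lang comes from the suggestion above.

```python
from functools import lru_cache

_lang = "en_US"
# Assumed thresholds, as discussed earlier in this issue.
_THRESHOLDS = {"en_US": 3, "pl_PL": 4}


@lru_cache(maxsize=128)
def is_difficult(syllables):
    """Cached check that depends on the module-level language setting."""
    return syllables >= _THRESHOLDS.get(_lang, 3)


def set_lang(lang):
    """Switch language and drop cached results computed for the old one."""
    global _lang
    _lang = lang
    is_difficult.cache_clear()
```

Without the cache_clear() call, a result cached under "en_US" would still be returned after switching to "pl_PL", because the language is not part of the cache key.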

@lmaczulajtys
Contributor Author

A nice source of knowledge for flesch_reading_ease: Yoast/YoastSEO.js#267

@GuillemGSubies
Contributor

Any updates on #97? I would really appreciate it if it got merged.

@GuillemGSubies
Contributor

I'm interested in adding this list of frequencies (easy words) for the Spanish language (it comes from the Spanish Language Academy). However, I don't know how many of them I should add. From what I have seen, the English easy-words list used here has about 3k words.

Any thoughts?
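One possible way to pick the size is to keep the top N entries by frequency, with N matching the roughly 3k words of the English list. The sketch below assumes a hypothetical input format of one "word<TAB>count" pair per line. The real frequency file may be laid out differently and would need its own parser. The function name load_easy_words is also made up for illustration.

```python
def load_easy_words(lines, limit=3000):
    """Return the `limit` most frequent words from 'word<TAB>count' lines."""
    counts = {}
    for line in lines:
        parts = line.strip().split("\t")
        # Skip headers or malformed rows instead of failing.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word = parts[0].lower()
        counts[word] = counts.get(word, 0) + int(parts[1])
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:limit])
```

For example, load_easy_words(open(path, encoding="utf-8")) would read from a file, where path is whichever local copy of the list is used.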

@alxwrd
Member

alxwrd commented Jan 4, 2020

Hi @GuillemGSubies, sorry I forgot to respond to this!

I'm happy for Spanish words to be added for Spanish language support. I'm not sure how many should be included though as I'm not sure of the original source of the English word list used in textstat. @shivam5992, I'm not sure if you remember?

I'm not sure if any of the papers that introduce the formulas that use "easy" or "difficult" words reference the source of easy words.

@GuillemGSubies
Contributor

@alxwrd I created a PR to discuss my implementation: #120. Should I add the source of the easy_words file? If so, how?

@alxwrd
Member

alxwrd commented Jan 8, 2020

@GuillemGSubies I think if you just add the source here, for now, that would be good. I'm thinking over how to manage multiple languages going forward, including testing.

@GuillemGSubies
Contributor

http://corpus.rae.es/lfrecuencias.html (it is from the Spanish language academy)

@alxwrd
Member

alxwrd commented Aug 20, 2021

Just to tie this in: with #167 (Announcement: Textstat organisation), other-language support should get a bit better.
