Support for other languages #94
Comments
I've done a bit of research, and it appears there are language-specific variants of the formulas for a few languages. Because of that, I think the line
syllable_threshold = 4 if lang == 'pl_PL' else 3
might not be a long-term solution. I will have a think about how textstat could handle other languages going forward. Something like:
import textstat
textstat.lang = "pl_PL"
All the current 'hardcoded' values for the formulas would need to be extracted and kept in a dict, so that new languages and their values can be added at a later stage:
langs = {
"en_US": {
"syllable_threshold": 3,
etc...
},
"pl_PL": {
"syllable_threshold": 4,
etc...
},
} |
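A minimal sketch of how such a table could be queried, assuming a hypothetical helper named get_lang_value and a hypothetical table name LANG_CONFIGS (neither is part of textstat). Only the thresholds mentioned in this thread are filled in (3 for English, 4 for Polish); unknown languages fall back to the English defaults.

```python
# Hypothetical per-language settings table; only the thresholds
# discussed in this thread are filled in.
LANG_CONFIGS = {
    "en_US": {"syllable_threshold": 3},
    "pl_PL": {"syllable_threshold": 4},
}

DEFAULT_LANG = "en_US"


def get_lang_value(lang, key):
    """Return the setting `key` for `lang`, falling back to the default language."""
    config = LANG_CONFIGS.get(lang, LANG_CONFIGS[DEFAULT_LANG])
    # A language entry may omit a key; use the default language's value then.
    return config.get(key, LANG_CONFIGS[DEFAULT_LANG][key])
```

With this shape, supporting a new language would only mean adding one more entry to the table, and formulas would read their constants through the helper instead of branching on the language code.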
Based on #93, the current language would also need to be passed to Pyphen. |
Because method results are cached by repoze.lru, I think we should do something like this:
import textstat
textstat.set_lang("en_US")
We should clear caches in |
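To illustrate why the cache matters, here is a rough sketch of a language setter that invalidates cached results. It uses functools.lru_cache from the standard library as a stand-in for repoze.lru, and the names _state and difficult_word_threshold are hypothetical, not textstat's actual internals.

```python
from functools import lru_cache

# Hypothetical module-level state; a dict avoids needing `global`.
_state = {"lang": "en_US"}


@lru_cache(maxsize=128)
def difficult_word_threshold():
    """Cached result that depends on the current language."""
    return 4 if _state["lang"] == "pl_PL" else 3


def set_lang(lang):
    """Switch the language and discard results cached for the previous one."""
    _state["lang"] = lang
    difficult_word_threshold.cache_clear()
```

Without the cache_clear() call, a value computed under the old language would keep being returned after the switch, because the cache key does not include the language.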
Nice source of knowledge for |
Any updates on #97? I would really appreciate it if it got merged. |
I'm interested in adding this list of word frequencies (easy words) for Spanish (it comes from the Spanish Language Academy). However, I don't know how many of them I should add. From what I have seen, the English easy-word list used here contains about 3k words. Any thoughts? |
Hi @GuillemGSubies, sorry I forgot to respond to this! I'm happy for Spanish words to be added for Spanish language support. I'm not sure how many should be included, though, as I don't know the original source of the English word list used in textstat. @shivam5992, do you happen to remember? I'm also not sure whether any of the papers that introduce the formulas using "easy" or "difficult" words reference the source of those words. |
@GuillemGSubies I think if you just add the source here, for now, that would be good. I'm thinking over how to manage multiple languages going forward, including testing. |
http://corpus.rae.es/lfrecuencias.html It comes from the Spanish Language Academy. |
Just to tie this in: with #167 (Announcement: Textstat organisation), support for other languages should get a bit better. |
The FOG index is also applicable to the Polish language. The main difference is that in Polish, difficult words are usually those with 4 or more syllables.
I suggest adding a lang parameter to the gunning_fog function. It would be passed to syllable_count and also used to select the value of syllable_threshold.
Source:
https://pl.wikipedia.org/wiki/Indeks_czytelno%C5%9Bci_FOG (pl)
https://translate.google.com/translate?sl=pl&tl=en&u=https%3A%2F%2Fpl.wikipedia.org%2Fwiki%2FIndeks_czytelno%25C5%259Bci_FOG (en)
Unfortunately, all sources on this are in Polish.
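The proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not textstat's implementation: it uses the standard Gunning fog formula, 0.4 * (words / sentences + 100 * difficult_words / words), and syllable_count here is a naive vowel-group counter standing in for a real hyphenation-based one, so it accepts lang but does not use it. The thresholds table is hypothetical.

```python
import re

# Hypothetical thresholds table; values are the ones from this issue.
SYLLABLE_THRESHOLDS = {"en_US": 3, "pl_PL": 4}
VOWEL_GROUPS = re.compile(r"[aeiouyąęó]+", re.IGNORECASE)


def syllable_count(word, lang="en_US"):
    """Rough stand-in: counts vowel groups; `lang` is accepted but unused here."""
    return max(1, len(VOWEL_GROUPS.findall(word)))


def gunning_fog(text, lang="en_US"):
    """Standard Gunning fog with a language-dependent syllable threshold."""
    threshold = SYLLABLE_THRESHOLDS.get(lang, 3)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return 0.0
    # A word is "difficult" when it has at least `threshold` syllables.
    difficult = sum(1 for w in words if syllable_count(w, lang) >= threshold)
    return 0.4 * (len(words) / len(sentences) + 100 * difficult / len(words))
```

For example, a three-syllable word counts as difficult with lang="en_US" but not with lang="pl_PL", so the same text scores lower under the Polish threshold.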