Adding inflections #14

Open
Vuizur opened this issue Aug 30, 2022 · 2 comments

Comments

Vuizur commented Aug 30, 2022

Hello,

Thank you very much for developing this cool project! I have been working on something similar, based on the Wiktextract project rather than on DBnary. Unlike your project, I only have [Language]-English dictionaries, but it occurred to me that you could improve your dictionaries with very little code by adding the inflection data from kaikki.org. My project does some post-processing that is still very much WIP, so in theory you could also take my inflection data from the published TSVs (in some cases, like Spanish, they are a clear improvement; in others they are likely still a bit buggy).

Have a great day!

Owner

karlb commented Aug 30, 2022

That's interesting! I hadn't come across Wiktextract before. I wonder what the Wiktextract and DBnary guys think of each other's work, since it overlaps a lot.

WikDict does have inflection data, but only for those languages where DBnary extracts it (English, German, French and Swedish, IIRC). Obviously, I would prefer to get all data from a single source rather than merging different sources, which usually causes problems when joining, as well as other inconsistencies.

I won't do anything with this right now, but I will keep an eye on Wiktextract/kaikki.org, as well as your project.

Author

Vuizur commented Aug 31, 2022

The Wiktextract author wrote a paper where he details the differences. I am no expert either, but I think the difference is that Wiktextract only processes the English Wiktionary, but in turn extracts more detailed information. I think its secret is that it expands the Lua code in the Wiktionary XML dump using the original Wiktionary template code (so that it gets the original inflection tables). He suspected that DBnary only reimplemented some of the Lua code, leading to some bugs even in the English inflections.

(I don't know how hard it would be to integrate Wiktextract stuff into DBnary; pretty interesting question.)

I would maybe add the data at the last step, when creating the dictionary, and not bother inserting it into the RDF database. The easiest way might be to put the raw data file from https://kaikki.org/dictionary/rawdata.html into something like an SQLite database with an index on "word", if 13 GB is too large to load into RAM, and then simply look up the inflections on demand when generating the dictionary.
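A rough sketch of that idea, in case it helps. It assumes the raw data is JSON Lines with one entry per line, and that each entry has "word", "pos" and a "forms" list of objects with a "form" key; the table name `inflections` and the functions `build_inflection_db` and `lookup_forms` are made up for illustration and are not part of Wiktextract or WikDict.

```python
import json
import sqlite3
from typing import Iterable, List


def build_inflection_db(lines: Iterable[str], db_path: str = ":memory:") -> sqlite3.Connection:
    """Stream JSON Lines entries into an SQLite table indexed on "word".

    Assumed entry shape (hypothetical, check against the real dump):
    {"word": ..., "pos": ..., "forms": [{"form": ...}, ...]}
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inflections (word TEXT, pos TEXT, form TEXT)"
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue
        pos = entry.get("pos", "")
        # Insert one entry at a time so the whole file never sits in RAM.
        rows = [
            (word, pos, f["form"])
            for f in entry.get("forms", [])
            if f.get("form") and f["form"] != word
        ]
        conn.executemany("INSERT INTO inflections VALUES (?, ?, ?)", rows)
    # Build the index after the bulk insert; this is usually faster.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_inflections_word ON inflections (word)")
    conn.commit()
    return conn


def lookup_forms(conn: sqlite3.Connection, word: str) -> List[str]:
    """Return the distinct inflected forms stored for a headword."""
    cur = conn.execute(
        "SELECT DISTINCT form FROM inflections WHERE word = ? ORDER BY form",
        (word,),
    )
    return [row[0] for row in cur]
```

For the real file you would pass an open file handle as `lines` and a path on disk as `db_path`, then call `lookup_forms` for each headword while generating the dictionary.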

I think the only bug with this approach is that if inflections apply to only one part of speech, you might add them unnecessarily to all words with the same string. Another problem is removing things like pronouns from the inflections: sometimes you have strings like "erholte sich" among the inflections of "erholen", for example, where you have to remove "sich" so that ebook reader lookups can find the form. This is something I am currently looking into as well.

Maybe in the future I will also feel motivated to start a pull request 😁.
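Both issues could be handled roughly like this. This is only a sketch under assumptions: the entries are dicts with "word", "pos" and "forms" keys, the pronoun list is a hand-picked German example rather than anything Wiktextract provides, and `clean_form` and `forms_for` are hypothetical helper names.

```python
from typing import Dict, Iterable, List

# Hypothetical, German-only list of reflexive pronouns to strip.
REFLEXIVE_PRONOUNS = frozenset({"mich", "dich", "sich", "uns", "euch"})


def clean_form(form: str, strip_tokens: frozenset = REFLEXIVE_PRONOUNS) -> str:
    """Drop pronoun tokens so that "erholte sich" becomes "erholte"."""
    kept = [tok for tok in form.split() if tok.lower() not in strip_tokens]
    return " ".join(kept)


def forms_for(entries: Iterable[Dict], word: str, pos: str) -> List[str]:
    """Collect cleaned forms only from entries matching both word and part of speech."""
    result: List[str] = []
    for entry in entries:
        if entry.get("word") != word or entry.get("pos") != pos:
            continue  # same string but a different part of speech: skip it
        for f in entry.get("forms", []):
            cleaned = clean_form(f.get("form", ""))
            if cleaned and cleaned != word and cleaned not in result:
                result.append(cleaned)
    return result
```

Matching on the part of speech only works if the dictionary entry being generated carries a comparable POS label, so some mapping between the two sources' labels would probably be needed.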
