tokeniser-gramcheck-gt-desc.pmhfst is 211M #52
Comments
@flammie did look into memory consumption for pmhfst files a while ago. Maybe he has some ideas.
Mm, there are some things that legitimately multiplied the automaton size, e.g. upcase in 5e0bdaf. There aren't too many other commits in the history, but many of them fill up the alphabet, and alphabet size can easily be a multiplier of the tokeniser size. I was hoping list arcs would fix it a bit, but they weren't very effective. I think there might be a way to automate this with git bisect; in particular, keeping the analyser_relabelled-blah size constant might reveal something more...
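[Editor's note: as a rough way to test the "alphabet as multiplier" hypothesis, something like the following could dump the alphabet size of each transducer in the file. This is only a sketch using the hfst Python bindings; whether HfstInputStream can open a .pmhfst pmatch archive directly is an assumption worth verifying.]

```python
# Sketch: print the alphabet size of each transducer in an HFST file.
# Assumes the hfst Python bindings (pip install hfst) and that the
# .pmhfst archive is readable via HfstInputStream -- unverified.
import hfst

istr = hfst.HfstInputStream("tokeniser-gramcheck-gt-desc.pmhfst")
while not istr.is_eof():
    tr = istr.read()
    # get_alphabet() returns the symbols known to this transducer;
    # a bloated alphabet here would support the size-multiplier theory.
    print(f"{tr.get_name() or '<unnamed>'}: {len(tr.get_alphabet())} symbols")
istr.close()
```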
Is this issue something we want to keep open? @flammie's use of list arcs didn't help much, and my understanding is that the only thing left to do is a rewrite of parts of the Hfst code. To me this indicates that although the Hfst implementation is true to the original in linguistic features, it is not when it comes to implementation details that impact memory consumption. And I believe this is a rather big omission on the Hfst part. At the same time, it is a major effort to rewrite the code, so I suggest that we for the time being just accept the situation as it is and close this issue. Any thoughts?
Well, it would be interesting to try to bisect and find out which commits were responsible for the jumps in size: are they all necessary, or could there be some low-hanging fruit? OTOH, if there aren't currently plans to run it locally on phones or to combine it with other FSTs, then it's probably not a problem in practice, just an annoyance, so closing makes sense.
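[Editor's note: the bisect suggested above could be driven by `git bisect run` with a small test script along these lines. This is a sketch only: the `make` target name and the 150M threshold are assumptions about this repository's build setup, not confirmed details.]

```python
#!/usr/bin/env python3
# Hypothetical bisect driver. With git bisect, exit status 0 means
# "good" (file still small), 1 means "bad" (size has blown up),
# and 125 tells git bisect to skip a commit that fails to build.
# Usage: git bisect start <bad> <good> && git bisect run ./bisect_size.py
import os
import subprocess
import sys

TARGET = "tokeniser-gramcheck-gt-desc.pmhfst"   # the file from this issue
THRESHOLD = 150 * 1024 * 1024                   # assumed cut-off: 150M

# Rebuild the tokeniser at the checked-out commit; the make target
# name is an assumption about the build system.
build = subprocess.run(["make", TARGET])
if build.returncode != 0:
    sys.exit(125)  # unbuildable commit: skip rather than mark bad

size = os.path.getsize(TARGET)
print(f"{TARGET}: {size / (1024 * 1024):.1f}M")
sys.exit(0 if size < THRESHOLD else 1)
```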
Where did we go wrong?