Feature Request: Add support for 16-bit quantized LSTM models #4331

Open
lackner-codefy opened this issue Oct 22, 2024 · 6 comments

@lackner-codefy

Your Feature Request

For the LSTM engine, there are currently the fast 8-bit integer models, as well as the "best" models, which probably use 32-bit floating point values.

While the fast models are indeed fast, they make a lot of errors in my specific use case (with tesseract 5.3.0 and 5.4.1, mostly German language). I tested with the best models and they don't have this problem. However, they are also much slower, increasing the processing time considerably.

I'd like to have a better compromise between performance and accuracy: something like a 16-bit integer model, which would (hopefully) still be pretty fast, but wouldn't suffer from these random quality issues.

Would it be possible to implement support for 16-bit integer models? I'm aware that it's not a trivial task, since int_mode() is checked all over the place, and it's also not trivial to write architecture-specific code to handle vector/matrix operations efficiently.

If it's not within the scope of this project, what other tricks could be used to speed up the "best" models?

@amitdo (Collaborator) commented Oct 22, 2024

> While the fast models are indeed fast, they make a lot of errors in my specific use case

The 'fast' models are not based on the 'best' models; they were trained with a smaller network and then converted to int8.

There is an option to convert a 'best' model to an int8 model. This will give you better accuracy than the 'fast' model.
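A minimal sketch of that conversion, assuming the -c (compact) option of combine_tessdata and the tessdata_best download URL; language and file names are placeholders:

```sh
# Fetch a float 'best' model and quantize its LSTM component to int8.
# combine_tessdata -c modifies the file in place, so work on a copy.
wget https://github.com/tesseract-ocr/tessdata_best/raw/main/deu.traineddata
cp deu.traineddata deu_int8.traineddata
combine_tessdata -c deu_int8.traineddata
```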

@amitdo (Collaborator) commented Oct 22, 2024

https://github.com/tesseract-ocr/tesseract/blob/main/doc/lstmtraining.1.asc
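For reference, the doc linked above also describes converting a float training checkpoint into an int8 model with lstmtraining; a sketch with placeholder file names:

```sh
# Finalize a float checkpoint and quantize it to an int8 traineddata file.
# --stop_training finishes the checkpoint; --convert_to_int quantizes it.
lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from deu.lstm-checkpoint \
  --traineddata deu.traineddata \
  --model_output deu_int8.traineddata
```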

@stweil (Member) commented Oct 22, 2024

@lackner-codefy, did you also test with models from tessdata? Do they produce results similar to the "best" models?

And can you say more about your specific use case? For some use cases (especially for historic German prints) my models might be better than the official models: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/.

@stweil (Member) commented Oct 22, 2024

> probably using 32-bit floating point values

Tesseract 4 used double precision (64-bit) values. The current "best" models therefore still contain 64-bit values, which are converted to float (32-bit) by Tesseract 5 (unless it was built to use double).
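If I remember the build options correctly, this is governed by the FAST_FLOAT CMake option (ON by default in Tesseract 5); a sketch of a double-precision build:

```sh
# Build Tesseract with double-precision LSTM inference instead of float.
cmake -B build -DFAST_FLOAT=OFF
cmake --build build
```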

@amitdo (Collaborator) commented Oct 22, 2024

About the tessdata repo stweil mentioned: the models there are a combination of two models, a model for the legacy OCR engine and an LSTM model based on the 'best' model that was converted to int8.

With such a model you can use the command line option --oem 1, which tells tesseract to use only the LSTM model.
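For example (input file and language here are placeholders):

```sh
# OEM 1 selects the LSTM engine only; -l deu loads the German model.
tesseract scan.png out --oem 1 -l deu
```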

@lackner-codefy (Author)

@amitdo @stweil Thanks for all of your suggestions. Really appreciated! 🙏

  • I'll do some experiments to see if converting a best model to int8 gives some improvements.
  • I remember we were using tessdata before and that it performed worse. But for good measure, I'll verify that with another experiment.
  • Most of the documents in the collection are scans of printed documents. However, the quality of the scans can sometimes be quite poor. Some documents have a lot of JPEG artefacts and/or contain handwritten notes.
  • I'll check https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/. Since there seem to be multiple models, is there a particular one you had in mind? I'll make sure to use --oem ... to select the correct model.
