ocrd-tesserocr-crop: 22.5h processing time #206
Comments
Thanks @jbarth-ubhd for the detailed report! Well, this is an extreme case to begin with: a huge image (65 MP) with lots of fine strokes. Tesseract itself has no cropping; we only emulate that (as the processor says) by trying to find text regions. And Tesseract is quite prone to hallucinating text in such line drawings (since it was written for contemporary documents). Since it also likes to draw a full-sized image region all over the canvas as soon as there is a visible page frame, one needs to use sparse text mode, which is usually faster, but extremely slow in this case. The OCR-D wrapper cannot do much about that, I'm afraid (notice that in your log the huge time delay happens between calling the Tesseract API and starting to process its results).

Here is what the sparse text layout analysis result looks like on the raw image:

In my case, this took nearly 40 h to compute. Clearly, most of these text regions are false positives.

Perhaps what one can do is downsample the image to a reasonable resolution (say 200 DPI). But then all follow-up calculations (coordinates, derived images) have to compensate for that. (I have done this in ocrd_detectron2 once before.)
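For illustration, a minimal sketch of that downsample-then-rescale idea (not the actual ocrd_tesserocr code; the 600 DPI original and the `analyse_layout` callable are assumptions/placeholders):

```python
# Sketch only: run the expensive layout analysis on a downsampled copy,
# then map the resulting polygons back into the original coordinate system.
from PIL import Image

TARGET_DPI = 200

def crop_regions_downsampled(path, analyse_layout, original_dpi=600):
    img = Image.open(path)
    scale = TARGET_DPI / original_dpi                      # e.g. 1/3 for a 600 DPI scan
    small = img.resize((round(img.width * scale), round(img.height * scale)),
                       resample=Image.LANCZOS)
    regions = analyse_layout(small)                        # placeholder: any region detector returning polygons
    # compensate in all follow-up calculations: rescale polygon points to original pixels
    return [[(x / scale, y / scale) for x, y in polygon] for polygon in regions]
```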
BTW, running on the binarized image (as in your workflow), it takes even longer (77 h), because the Wolf binarization cannot cope with the black border (which it inverts), so even more false positives are found:

So, as a rule, when doing binarization where you might still have black borders, do not use Wolf. Downsampling by 4 (…) … so don't use …

Also, don't run with too huge images. Downsample before importing, as we cannot expect processors to do that themselves for now (see the sketch below). Perhaps we should open an issue in core for the general scenario of early downsampling (as a derived image) and then re-using that image instead of the original (with an adapted coordinate system), which will in turn depend on PAGE being extended with AlternativeImage scale attributes, though.
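If you want to do that pre-import downsampling yourself, something along these lines should work (a rough sketch with Pillow; the directory names and the 600 DPI fallback are assumptions):

```python
# Sketch: downsample source images to ~200 DPI before importing them into the workspace,
# so every processor works on a reasonable resolution from the start.
from pathlib import Path
from PIL import Image

TARGET_DPI = 200

def downsample_dir(src="IMG-ORIG", dst="OCR-D-IMG"):
    Path(dst).mkdir(exist_ok=True)
    for path in sorted(Path(src).glob("*.tif")):
        img = Image.open(path)
        xdpi, _ = img.info.get("dpi", (600, 600))   # assume 600 DPI if there is no resolution tag
        factor = TARGET_DPI / float(xdpi)
        if factor < 1.0:
            img = img.resize((round(img.width * factor), round(img.height * factor)),
                             resample=Image.LANCZOS)
        # store the new resolution so later steps can still compute physical sizes
        img.save(Path(dst) / path.name, dpi=(TARGET_DPI, TARGET_DPI))

if __name__ == "__main__":
    downsample_dir()
```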
@kba what's your opinion on this?
Processing the image in OCR-D-IMG in
https://digi.ub.uni-heidelberg.de/diglitData/v/valentini1714bd2_-_0000036v_aqv_Tabula_ROEM_0007.zip
took about 22.5 h on a Core i7-4790 CPU @ 3.60 GHz (for the workflow, see below and run-docker.sh).