ocrd-tesserocr-crop: 22.5h processing time #206

Open
jbarth-ubhd opened this issue Apr 6, 2024 · 3 comments

jbarth-ubhd commented Apr 6, 2024

Processing the image in OCR-D-IMG from
https://digi.ub.uni-heidelberg.de/diglitData/v/valentini1714bd2_-_0000036v_aqv_Tabula_ROEM_0007.zip
took about 22.5h on a Core i7-4790 CPU @ 3.60GHz. For the workflow, see below and run-docker.sh:

docker-ocrd ocrd workspace init
docker-ocrd ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif

docker-ocrd ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-IMG -O OCR-D-001
docker-ocrd ocrd-tesserocr-crop -I OCR-D-001 -O OCR-D-002
docker-ocrd ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-002 -O OCR-D-003
docker-ocrd ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004
docker-ocrd ocrd-tesserocr-recognize -P find_tables true -P segmentation_level region -P textequiv_level word -P model frak2021 -I OCR-D-004 -O OCR-D-OCR
jb@xxx:~/valentini1714bd2/0000036v_aqv_Tabula_ROEM_0007> ls -1rtd OCR-D* | awk '{printf "echo %s:\nls -l %s\n",$1,$1}'|bash |grep -v insg
OCR-D-IMG:
-rwxrwx--- 1 jb jb 196697460 Apr  5 12:28 00001.tif
OCR-D-001:
-rw-r--r-- 1 root jb 3972601 Apr  5 12:56 OCR-D-001_00001-BIN_wolf.png
-rw-r--r-- 1 root jb    1113 Apr  5 12:56 OCR-D-001_00001.xml
OCR-D-002:
-rw-r--r-- 1 root jb 3876965 Apr  6 11:25 OCR-D-002_00001.IMG-CROP.png
-rw-r--r-- 1 root jb    2009 Apr  6 11:25 OCR-D-002_00001.xml
OCR-D-003:
-rw-r--r-- 1 root jb    2309 Apr  6 11:25 OCR-D-003_00001.xml
-rw-r--r-- 1 root jb 3972601 Apr  6 11:25 OCR-D-IMG_00001-BIN_wolf.png
OCR-D-004:
-rw-r--r-- 1 root jb 3876965 Apr  6 11:27 OCR-D-004_00001.IMG-DESKEW.png
-rw-r--r-- 1 root jb    3260 Apr  6 11:27 OCR-D-004_00001.xml
OCR-D-OCR:
-rw-r--r-- 1 root jb 6028333 Apr  6 11:36 OCR-D-OCR_00001.IMG-BIN.png
-rw-r--r-- 1 root jb  929386 Apr  6 11:36 OCR-D-OCR_00001.xml
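
As a rough cross-check of the 22.5h figure, the time between the OCR-D-001 output (Apr 5 12:56) and the OCR-D-002 output (Apr 6 11:25) in the listing above can be computed directly. This is only a sketch: it assumes both timestamps are from 2024 and the same time zone, and that ocrd-tesserocr-crop started right after the first binarization finished.

```python
from datetime import datetime

# Modification times taken from the directory listing (year 2024 is an assumption).
binarize_done = datetime(2024, 4, 5, 12, 56)  # OCR-D-001 written
crop_done = datetime(2024, 4, 6, 11, 25)      # OCR-D-002 written

# Wall-clock time attributed to the cropping step.
elapsed = crop_done - binarize_done
elapsed_hours = elapsed.total_seconds() / 3600

print(f"ocrd-tesserocr-crop took about {elapsed_hours:.1f} h")
```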

bertsky (Collaborator) commented Apr 10, 2024

Thanks @jbarth-ubhd for the detailed report!

Well, this is an extreme case to begin with: a huge image (65 MP) with lots of fine strokes. Tesseract itself has no cropping; we only emulate that (as the processor description says) by trying to find text regions. Tesseract is also quite prone to hallucinating text in such line drawings (since it was written for contemporary documents). And since it likes to draw a full-sized image region all over the canvas as soon as there is a visible page frame, one needs to use sparse text mode, which is usually faster but extremely slow in this case. The OCR-D wrapper cannot do much about that, I'm afraid (notice that in your log the huge time delay happens between calling the Tesseract API and starting to process its results).
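
For orientation, the 65 MP figure is roughly consistent with the 196697460-byte 00001.tif in the listing. The sketch below assumes the TIFF is stored uncompressed as 8-bit RGB (3 bytes per pixel), which the thread does not state, and it ignores header overhead.

```python
# File size of OCR-D-IMG/00001.tif as shown in the directory listing.
TIFF_BYTES = 196_697_460

# Assumption: uncompressed 8-bit RGB, i.e. 3 bytes per pixel.
BYTES_PER_PIXEL = 3

approx_megapixels = TIFF_BYTES / BYTES_PER_PIXEL / 1e6
print(f"approx. {approx_megapixels:.1f} MP")
```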

Here is what the sparse text layout analysis result looks like on the raw image:
[image: tesscli_sparsetext]

In my case, this took nearly 40h to compute. Clearly, most of these text regions are false positives.

Perhaps what one can do is downsample the image to a reasonable resolution (say 200 DPI). But then all follow-up calculations (coordinates, derived images) have to compensate for that. (I have done this in ocrd_detectron2 once before.)
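
To illustrate what "compensating" means for coordinates: regions detected on the downsampled image have to be scaled back up before they are written against the original image. The following is a minimal sketch, not the ocrd_detectron2 implementation; the function name `upscale_polygon` and the example values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def upscale_polygon(points: List[Point], factor: float,
                    width: int, height: int) -> List[Point]:
    """Map polygon points found on a downsampled image back to the
    pixel grid of the original image (hypothetical helper).

    factor is the downsampling factor (4 for a 25% scale); width and
    height are the original image dimensions, used to clamp rounding
    overshoot at the image border.
    """
    result = []
    for x, y in points:
        orig_x = min(max(0, round(x * factor)), width)
        orig_y = min(max(0, round(y * factor)), height)
        result.append((orig_x, orig_y))
    return result


# A region detected on the 25% image, mapped back to an 8000x8000 original.
small_region = [(100, 50), (400, 50), (400, 120), (100, 120)]
print(upscale_polygon(small_region, 4, 8000, 8000))
# prints [(400, 200), (1600, 200), (1600, 480), (400, 480)]
```

Derived images (crops, binarized versions) would need the same treatment in the other direction, which is why this cannot simply be bolted onto a single processor.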

bertsky (Collaborator) commented Apr 16, 2024

BTW, on the binarized image (as in your workflow) it takes even longer (77h), because the wolf binarization cannot cope with the black border (which it inverts), so even more false positives (FP) are found:
[image: tesscli_bin_sparsetext]

So, as a rule: if you binarize and might still have black borders, do not use wolf, but sauvola or sbb.

Downsampling by a factor of 4 (convert -scale 25%) to 195 DPI does help: processing time is cut to just a few seconds each, and the results are equally (un)usable:
[image: tesscli25%_sparsetext]

[image: tesscli25%_bin_sparsetext]

So, don't use ocrd-tesserocr-crop on material which has next to no text (use ocrd-anybaseocr-crop or eynollah instead).

Also, don't run on overly large images. Downsample before importing, as we cannot expect processors to do that themselves for now.
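
A pre-import downsampling step could look roughly like the sketch below. It is an illustration only: the helper names are hypothetical, the 780 DPI source resolution is inferred from "downsampling by 4 ... to 195 DPI" rather than stated, and the resampling uses Pillow (a third-party library assumed to be installed) instead of the ImageMagick call used above.

```python
def integer_scale_factor(source_dpi: float, target_dpi: float = 200.0) -> int:
    """Pick the integer downsampling factor that brings source_dpi
    closest to target_dpi, never upsampling (hypothetical helper)."""
    return max(1, round(source_dpi / target_dpi))


def downsample_image(src_path: str, dst_path: str, factor: int) -> tuple:
    """Write a copy of src_path reduced by an integer factor and
    return its (width, height). Requires Pillow."""
    from PIL import Image  # imported lazily; only needed for actual resampling

    with Image.open(src_path) as img:
        new_size = (max(1, img.width // factor), max(1, img.height // factor))
        img.resize(new_size, Image.LANCZOS).save(dst_path)
    return new_size


# Inferred source resolution: 195 DPI * 4 = 780 DPI (assumption).
factor = integer_scale_factor(780, 200)
print(factor, 780 / factor)  # prints: 4 195.0
```

The downsampled file, not the original, would then be the one passed to `ocrd workspace add`, so all later coordinates refer to the smaller image.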

Perhaps we should open an issue in core for the general scenario of early downsampling (as a derived image) and then re-using that image instead of the original (with adapted coordinate system), which will in turn depend on PAGE being extended with AlternativeImage scale attributes, though.

bertsky (Collaborator) commented Apr 16, 2024

Perhaps we should open an issue in core for the general scenario of early downsampling (as a derived image) and then re-using that image instead of the original (with adapted coordinate system), which will in turn depend on PAGE being extended with AlternativeImage scale attributes, though.

@kba what's your opinion on this?
