Problem with table recognition #134
Comments
That's because there are no good table processors in OCR-D yet. But you'd also have to include the existing ones in your workflow in the first place! Here's my take on this example:
|
This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help. |
@jbarth-ubhd, by Back to the issue: the core problem is still making Tesseract (currently the only table detector in OCR-D) actually detect a table region for that page. As explained above, this only works if input is not binarized (normalized or not). Now, with your dewarped JPEG, I cannot get a table at all anymore. Probably because of the corners clipped to white. But if apply In summary, we have to
|
No, I mean correcting a photo that was not taken orthogonally to the plane of the paper (i.e. perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with "perspective distortion", I wrote a tool to correct it, but without correcting the verticals (I didn't know how to correct those reliably). |
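For readers unfamiliar with the term, a perspective correction can be sketched as a homography estimated from four corner correspondences. This is only an illustration under stated assumptions, not the method used by the tool mentioned in this thread: the function names `solve_linear`, `homography` and `apply_h` and the corner coordinates are hypothetical, and a real tool would also have to detect the corners and resample the image.

```python
def solve_linear(a, b):
    """Solve the square system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 8 homography parameters mapping 4 src points onto 4 dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (a*x + b*y + c) / (g*x + k*y + 1), likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)


def apply_h(h, x, y):
    """Map one point (x, y) through the homography h."""
    a, b, c, d, e, f, g, k = h
    w = g * x + k * y + 1.0
    return (a * x + b * y + c) / w, (d * x + e * y + f) / w


# Hypothetical page corners as seen in a skewed photo (a trapezoid) ...
src = [(10, 0), (90, 5), (100, 100), (0, 95)]
# ... and the upright rectangle they should be mapped onto.
dst = [(0, 0), (100, 0), (100, 100), (0, 100)]
h = homography(src, dst)
```

Because a homography maps straight lines to straight lines, any point on the slanted left page edge ends up on the vertical line x = 0 after the mapping. Non-parallel column separators inside the page are only straightened to the extent that they share the page's perspective.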
That sounds interesting. I had that use-case, too. See my report on probing various unperspective and dewarp tools for suitability in OCR-D. Back then you said you were using mzucker's tool. Is that still the case, or did you write your own? |
this one: https://github.com/jbarth-ubhd/blitzDrt |
Jochen, great that you published that oldy now on GitHub. Do you want to add a license file, too? |
Done: MIT.
On 01.02.21 at 17:06, Stefan Weil wrote:
Jochen, great that you published that oldy now on GitHub. Do you want to add a license file, too?
|
For tables without horizontal lines, the workflow produces a wrong reading order: only the columns are recognized, not the rows.
See the following image as an example:
The result is as follows:
OCR-D-TXT_catalog46muse_0564.txt
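To make the reading-order problem concrete, here is a minimal sketch of how text lines found column by column could be regrouped into rows by their vertical position. It does not use any OCR-D API: the tuple layout `(text, x, top, bottom)`, the function name `group_into_rows`, the `tolerance` parameter and the sample coordinates are all assumptions made for illustration.

```python
def group_into_rows(lines, tolerance=0.5):
    """Group text lines into table rows by vertical centre.

    lines: iterable of (text, x, top, bottom) tuples in page coordinates.
    A line joins the current row if its vertical centre lies within
    tolerance * line height of the centre of the line that opened the row.
    """
    rows = []
    for line in sorted(lines, key=lambda l: (l[2] + l[3]) / 2):
        _, _, top, bottom = line
        centre = (top + bottom) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance * (bottom - top):
            rows[-1]["items"].append(line)
        else:
            rows.append({"centre": centre, "items": [line]})
    # Within each row, order the cells from left to right.
    return [[text for text, *_ in sorted(r["items"], key=lambda l: l[1])]
            for r in rows]


# Column-wise order, as a column-only segmentation would deliver it.
column_order = [
    ("Item A", 10, 100, 120),
    ("Item B", 10, 130, 150),
    ("12", 200, 101, 121),
    ("34", 200, 131, 151),
]
row_order = group_into_rows(column_order)
```

With the sample input, the two cells of each table row end up together instead of all left-column entries preceding all right-column entries. Such a heuristic breaks down on skewed or warped scans, where cells of the same row no longer share a vertical position.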
This is the workflow used: