Support for detecting special symbols like bullet points, check boxes, etc. #1570

Open
parthpatel002 opened this issue Apr 29, 2024 · 1 comment
@parthpatel002

🚀 The feature

Currently, none of the text detection models in docTR seem to detect special symbols such as bullet points and check boxes, likely because the training data does not contain them. Sample outputs are attached below (clipped screenshots of document pages). We should be able to detect these symbols so that detection covers all text present on a page.
Bullet points:
image
Checkboxes:
image

Motivation, pitch

Symbols like bullet points and checkboxes are an integral part of the text content of many types of documents. OCR should be able to detect as well as recognize these symbols so that coverage extends to all text present on the page.

Alternatives

No response

Additional context

No response

@parthpatel002 parthpatel002 added the type: enhancement Improvement label Apr 29, 2024
@felixdittrich92
Contributor

Hi @parthpatel002 👋,

I agree about bullet points (but, as you already mentioned, there are no such samples in the pretraining dataset).
As for check boxes, I think this is more a topic for document layout parsing :)

But pretraining on https://huggingface.co/datasets/pixparse/pdfa-eng-wds is on our roadmap.


2 participants