Support for detecting special symbols like bullet points, check boxes, etc. #1570

parthpatel002 · 2024-04-29T12:47:12Z

🚀 The feature

Currently, none of the text detection models in docTR seem to detect special symbols such as bullet points and checkboxes, likely because the training data contains no such samples. I am attaching sample outputs (clipped screenshots of document pages). These symbols should be detected so that detection covers all text present on a page.
Bullet points:

Checkboxes:
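To put a rough number on the gap, one option is to count how many special symbols actually show up in the recognized words. Below is a minimal sketch, assuming the OCR result has already been exported to a nested dict of pages, blocks, lines and words (the layout I understand docTR's `export()` to produce; treat the key names as assumptions). `SPECIAL_SYMBOLS` and `count_special_symbols` are hypothetical names used only for illustration, and the sample dict is hand-written, not real model output.

```python
from collections import Counter

# Hypothetical symbol set for illustration; extend it for your documents.
SPECIAL_SYMBOLS = {"•", "◦", "▪", "☐", "☑", "☒"}


def count_special_symbols(exported: dict) -> Counter:
    """Count special symbols found in the recognized words of an OCR export.

    Assumes a nested layout: pages -> blocks -> lines -> words, where each
    word is a dict with a "value" string. Missing keys are treated as empty.
    """
    counts = Counter()
    for page in exported.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    for ch in word.get("value", ""):
                        if ch in SPECIAL_SYMBOLS:
                            counts[ch] += 1
    return counts


# Hand-written sample mimicking the assumed export layout (not real output).
sample = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {"words": [{"value": "•"}, {"value": "First"}, {"value": "item"}]},
                        {"words": [{"value": "☐"}, {"value": "Agree"}]},
                    ]
                }
            ]
        }
    ]
}

print(count_special_symbols(sample))
```

Running this on the export of a page that visibly contains bullets or checkboxes and getting an empty counter would confirm that the symbols are being dropped somewhere in the pipeline, although it cannot tell whether detection or recognition is responsible.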

Motivation, pitch

Symbols such as bullet points and checkboxes are an integral part of the text content of many types of documents. OCR should be able to both detect and recognize them so that coverage extends to all text present on the page.

Alternatives

No response

Additional context

No response

felixdittrich92 · 2024-05-03T13:57:35Z

Hi @parthpatel002 👋,

I agree regarding bullet points (but, as you already mentioned, there are no such samples in the pretraining dataset).
As for checkboxes, I think this is more of a topic for document layout parsing :)

But pretraining on https://huggingface.co/datasets/pixparse/pdfa-eng-wds is on our roadmap.

parthpatel002 added the type: enhancement Improvement label Apr 29, 2024
