Currently, none of the text detection models in docTR seem to detect special characters such as bullet points and checkboxes (likely because the training data contains no such samples). I am attaching sample outputs (clipped screenshots of document pages). We should be able to detect these symbols so that detection covers all text present on a page.
Bullet points:
Checkboxes:
Motivation, pitch
Symbols like bullet points and checkboxes form an integral part of the text content of many types of documents. OCR should be able to both detect and recognize these symbols so that coverage extends to all text present on the page.
Alternatives
No response
Additional context
No response
I agree about the bullet points (but, as you already mentioned, there are no such samples in the pretraining dataset).
As for the checkboxes, I think this is more a topic for document layout parsing :)