Training Tesseract OCR for a specific document #360
Comments
Could you please elaborate on what you are trying to achieve by training on a specific document (type)? What do you expect to change compared to using the existing models?
Thank you for your response, @stefan6419846. I ran the default Tesseract English model on this image and the output is very bad. I therefore want to train Tesseract specifically for this document to improve the output, but I don't know how to generate the training dataset (line images, *.gt.txt and box files) from these images. If you could suggest some tools for creating the dataset from these images, that would be wonderful.
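For reference, the dataset layout asked about here can be sketched in a few lines of Python. This is a minimal sketch, not an official tool: it assumes the page has already been cut into single-line PNG or TIFF images by some other means, and that tesstrain is used in its default setup, where each line image sits next to a `<name>.gt.txt` file holding its transcription inside `data/MODEL_NAME-ground-truth`. The helper names (`write_ground_truth`, `images_missing_ground_truth`), the model name `mydoc`, and the file names are hypothetical.

```python
from pathlib import Path

# Plain line-image formats this sketch looks for (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".tif"}


def write_ground_truth(gt_dir, transcriptions):
    """Write one <stem>.gt.txt file per line image and return the paths written.

    `transcriptions` maps an image file name to the text visible in that image.
    """
    gt_dir = Path(gt_dir)
    gt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, text in transcriptions.items():
        gt_path = gt_dir / (Path(image_name).stem + ".gt.txt")
        # One line of text per file, matching the single text line in the image.
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written


def images_missing_ground_truth(gt_dir):
    """Return the names of line images that have no matching .gt.txt file yet."""
    gt_dir = Path(gt_dir)
    return sorted(
        p.name
        for p in gt_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
        and not (gt_dir / (p.stem + ".gt.txt")).exists()
    )
```

A call such as `write_ground_truth("data/mydoc-ground-truth", {"line_001.png": "Invoice No. 42"})` would create `line_001.gt.txt` beside the image, and `images_missing_ground_truth` lists the lines that still need to be transcribed by hand.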
I have not tried it, but I would argue that better preprocessing on your side (feeding Tesseract with specific ROIs with appropriate preprocessing per ROI instead of the whole page, ...) might be easier and sufficient.
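The per-ROI idea could look roughly like the following. This is only a sketch under assumptions: the regions of interest have already been cropped and cleaned up into separate image files by some other tool, the `tesseract` command-line program is on the PATH, and each region holds either a single text line (`--psm 7`) or a uniform block of text (`--psm 6`). The function names and the example file path are hypothetical.

```python
import subprocess

# Page segmentation modes of the tesseract CLI that are used in this sketch.
PSM_SINGLE_BLOCK = 6  # assume a single uniform block of text
PSM_SINGLE_LINE = 7   # treat the image as a single text line


def build_tesseract_command(roi_image, psm=PSM_SINGLE_LINE, lang="eng"):
    """Build the argument list for running tesseract on one cropped region."""
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", str(roi_image), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_roi(roi_image, psm=PSM_SINGLE_LINE, lang="eng"):
    """Run tesseract on one cropped region and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(roi_image, psm, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout.strip()
```

For example, `ocr_roi("rois/total.png")` would treat that crop as one text line, while `ocr_roi("rois/address.png", psm=PSM_SINGLE_BLOCK)` would treat it as a small paragraph. Choosing the mode per region is the point of the approach: the whole-page layout analysis is skipped for content whose shape is already known.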
Hello, maybe you can use jTessBoxEditor, but it involves a heavy workload.
I have recently started learning and experimenting with Tesseract OCR. I have trained a model for a new font using tesstrain.
My use case now is to train Tesseract 5 for the specific document attached below.
I have found some articles and tutorials about training for a new font or a new language, but I couldn't find anything about training for a custom document.
Is it possible to train Tesseract 5 for my document? If yes, please give me some guidelines on how to proceed and tell me whether I need any tools besides Tesseract itself to prepare the training data.
I have Tesseract 5 installed on Ubuntu 22.04.