Training Tesseract OCR for a specific document #360

mumarsyal · 2023-11-07T11:17:45Z

I have recently started learning and experimenting with Tesseract OCR. I have already trained a model for a new font using tesstrain.

My use case now is to train Tesseract 5 for the specific document attached below.

I have found some articles and tutorials about training for a new font or a new language, but I couldn't find anything about training for a custom document.

Is it possible to train Tesseract 5 for my document? If yes, please give me some guidelines on how to proceed, and let me know whether I need any tools other than Tesseract itself to prepare the training data.

I have Tesseract 5 installed on Ubuntu 22.04.

stefan6419846 · 2023-11-07T11:21:10Z

Could you please elaborate on what you are trying to achieve by training for a specific document (type)? What do you expect to change compared to using the existing models?

mumarsyal · 2023-11-07T11:28:04Z

Thank you for your response, @stefan6419846.

I ran Tesseract's default English model on this image and the output is very bad. So I want to train Tesseract specifically for this document to improve the output, but I don't know how to generate the training dataset (line images, *.gt.txt and box files) from these images. If you could suggest some tools for creating the dataset from these images, that would be wonderful.

stefan6419846 · 2023-11-08T07:36:53Z

I have not tried it, but I would argue that better preprocessing on your side might be easier and sufficient: for example, feeding Tesseract specific regions of interest (ROIs) instead of the whole page, with appropriate preprocessing for each ROI.
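
A rough, untested sketch of what per-ROI preprocessing could look like, assuming Pillow is installed and the `tesseract` binary is on the PATH. The ROI names, pixel boxes, threshold, and the file name `document.png` are made-up placeholders, not values taken from the attached document; `--psm 7` asks Tesseract to treat each crop as a single text line.

```python
import subprocess
from pathlib import Path

# Hypothetical regions of interest as (left, top, right, bottom) pixel boxes.
# Real coordinates would have to be measured on the actual document.
ROIS = {
    "title": (40, 30, 900, 110),
    "amount": (600, 400, 900, 460),
}


def build_tesseract_command(image_path, psm=7, lang="eng"):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def preprocess_roi(page, box, threshold=160):
    """Crop one ROI, convert it to grayscale, upscale 2x, and binarize it."""
    roi = page.crop(box).convert("L")
    roi = roi.resize((roi.width * 2, roi.height * 2))
    # The threshold is a guess; it usually needs tuning per ROI.
    return roi.point(lambda p: 255 if p > threshold else 0)


def ocr_rois(page_path, rois, workdir="rois"):
    """Run Tesseract on each preprocessed ROI and return {name: text}."""
    from PIL import Image  # Pillow, imported lazily so the helpers above work without it

    out_dir = Path(workdir)
    out_dir.mkdir(exist_ok=True)
    results = {}
    with Image.open(page_path) as page:
        for name, box in rois.items():
            roi_path = out_dir / f"{name}.png"
            preprocess_roi(page, box).save(roi_path)
            proc = subprocess.run(
                build_tesseract_command(roi_path),
                capture_output=True, text=True, check=True,
            )
            results[name] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # "document.png" is a placeholder for the scanned page.
    for name, text in ocr_rois("document.png", ROIS).items():
        print(f"{name}: {text}")
```

Keeping the preprocessing in its own function makes it easy to try a different threshold or scale factor for each ROI and compare the output before deciding whether training is needed at all.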

linxyu1 · 2023-11-15T02:44:44Z

Hello, maybe you can use jTessBoxEditor to create the box files, but it is a heavy manual workload.
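
For the line images and *.gt.txt files, here is a minimal sketch of how the pairs could be laid out, assuming the tesstrain convention of a `data/MODEL_NAME-ground-truth` folder in which each line image sits next to a single-line transcription of the same name with the extension `.gt.txt`. As far as I know, tesstrain then derives the box files from these pairs itself. The helper `add_ground_truth`, the model name `mydoc`, and the file names are hypothetical; the line images still have to be cut out and transcribed by hand or with another tool.

```python
import shutil
from pathlib import Path


def add_ground_truth(line_image, transcription, gt_dir):
    """Copy a line image into gt_dir and write the matching .gt.txt next to it."""
    line_image = Path(line_image)
    text = transcription.strip()
    if "\n" in text:
        raise ValueError("each transcription must be a single line of text")

    gt_dir = Path(gt_dir)
    gt_dir.mkdir(parents=True, exist_ok=True)

    # Skip the copy when the image already lives in the ground-truth folder.
    target = gt_dir / line_image.name
    if line_image.resolve() != target.resolve():
        shutil.copy(line_image, target)

    # line_001.png -> line_001.gt.txt
    gt_file = gt_dir / (line_image.stem + ".gt.txt")
    gt_file.write_text(text + "\n", encoding="utf-8")
    return gt_file


if __name__ == "__main__":
    # Placeholder pairs: (path to a cropped line image, its exact text).
    pairs = [
        ("lines/line_001.png", "Invoice No. 42"),
        ("lines/line_002.png", "Total amount due"),
    ]
    for image, text in pairs:
        print(add_ground_truth(image, text, "data/mydoc-ground-truth"))
```

With the pairs in place, training would be started from the tesstrain checkout with something like `make training MODEL_NAME=mydoc START_MODEL=eng`; please check the tesstrain README for the exact variables your setup needs.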
