Poor performance of fine-Tuned OCR recognition model #1782

stevemanavalan · 2024-11-18T11:13:12Z

stevemanavalan
Nov 18, 2024

I am trying to fine-tune the OCR recognition model (crnn_vgg16_bn) and i have already looked at discussions from #1677 and #1366. I have tried the following:

create synthetic training dataset (~1M samples) using Synthtiger using a collection of real words and synthetically generated text.
Trained and experimented with multiple checkpoints ( epochs : 20, 50 till 200)
In order to reduce bias towards punctuation marks, i also tried creating a dataset without augmentation and samples with few punctuations in the synthtiger corpus.

However the model performs poorly when i set scale as 1 in DocumentFile.from_pdf(bytes_pdf, scale=1) during inference, when compared with https://huggingface.co/tilman-rassy/doctr-crnn-vgg16-bn-fascan-v1 trained on ~100K samples. However when the scale is set to 2 during inference, both the models performs in a comparable manner. I have also tried reducing the quality of images and also adding noise to the dataset generated using synthtiger with no visible improvements to the recognition model.
Can someone please assist me in improving my recognition model performance?

Training args:
python references/recognition/train_pytorch.py crnn_vgg16_bn --train_path "train_set" --val_path "val_set" --epochs 100 --vocab german --name doctr_crnn_vgg16_bn --pretrained --b 400 --wb --font "1942.ttf,FreeSans.ttf,LiberationMono-BoldItalic.ttf,LiberationMono-Italic.ttf,rm_typerighter.ttf,FreeMono.ttf,FreeSerif.ttf,LiberationMono-Bold.ttf,LiberationMono-Regular.ttf"

felixdittrich92 · 2024-11-21T12:29:38Z

felixdittrich92
Nov 21, 2024
Maintainer

Hey @stevemanavalan 👋,

You used the main SynthTiger repo to generate the dataset ?
It's nothing official yet, was only for my own experiments but i modified SynthTiger a bit to generate a multilingual dataset.

Maybe give it a try:
https://github.com/felixdittrich92/synthtiger/tree/doctr-modified

branch: doctr-modified

Generated dataset can be downloaded here

And the model i fine tuned with it: https://huggingface.co/Felix92/doctr-torch-parseq-multilingual-v1

Best,
Felix

PS: the font arg has no effect if you provide a train and val path :) font is only required if you use the integrated WordGenerator

3 replies

stevemanavalan Nov 27, 2024
Author

Hi @felixdittrich92,

Thanks a lot for the dataset provided.

I tried using both the original synthtiger and also your modified fork with varying training data sizes for various epoch ranges.

Finally I trained a crnn_vgg16_bn model with the dataset that you provided with ~1M samples for 50 epochs. When using fast_base as the detection model, the final output seems to contain additional characters and punctuations which are not present in the original document. I tried setting the model.det_predictor.model.postprocessor.bin_thresh parameter and although it makes minor differences , the issue with punctuations still seems to persist.

To reduce the bias towards punctuations i tried :

creating a new synthetically generated dataset with words from dictionary and number combinations just like Problems with training recognition model for vocab "German" #1366
check the character_count from labels.json and manually removed some punctuations and balanced the frequency of characters
create a new synthetic dataset:
a. using labels.json generated in step 1 as corpus,
b. without augmentation
c. without character addition step in https://github.com/felixdittrich92/synthtiger/blob/doctr-modified/examples/synthtiger/template.py (lines 237-247)
Train the model with this balanced synthetically generated set.

Although the model improves over https://huggingface.co/tilman-rassy/doctr-crnn-vgg16-bn-fascan-v1 for normal words, it still add punctuations and characters not present in original document.

Thanks,
Steve

Is there anything else that i could do during synthetic data generation to improve the model?

felixT2K Nov 27, 2024

Hey @stevemanavalan 👋🏼,

Yeah you are not the first one reporting this .. I think we should create an external repo to work on this (synth reco data gen) unfortunately i don't have the time actually 😅

Have you tried to reduce the samples containing punctuations in the training set to a minimum ? Does it change anything ?

Best would be to train on synth data end eval on real samples (also if it's not as much).

I think for fine tuning 50 epochs are to much the model will forget what it learned from the pretraining (if we don't freeze) - so 1-3 epochs should be enough - also if the loss / metrics looks worse - that's the reason why real validation data would be better

Best,
Felix

stevemanavalan Nov 28, 2024
Author

Hi @felixT2K,

Thanks a lot for your quick response.

I tried the approach with ~800k train samples provided by @felixdittrich92 and ~20k real samples for validation by using https://github.com/doc-analysis/XFUND/releases/tag/v1.0 german dataset. However i did not observe any improvements in OCR, other than metrics improving as below:

Epoch 2/5 - Validation loss: 0.212828 (Exact: 89.12% | Partial: 89.92%)
Training loss: 0.507874:
Validation loss decreased 0.212828 --> 0.210114: saving state...

I tried with a new training set with minimal number of punctuations and the character count is
char_count_labels.json. The dataset consists of ~100k samples for train and ~20k for validation and the model was trained for 100 epochs. The additional puncutations has been replaced mostly by character "i" and sometimes by other random characters. The metrics are as follows

Epoch 98/100 - Validation loss: 0.173564 (Exact: 64.72% | Partial: 69.53%)
Training loss: 0.301038
Validation loss decreased 0.173482 --> 0.173318: saving state...

@felixdittrich92 The synthtiger fork modified for doctr removes "transform", "style", "shape" and "noise" properties for the synthetically generated samples. Is this supposed to improve OCR performance ?

In general can you please describe the expected best practices in fine tuning the model:

ideal train and validation set sample size
number of epochs for fine-tuning base on the sample size
whether freeze-backbone makes sense for fine-tuning

Thanks a lot for your time.
Steve

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor performance of fine-Tuned OCR recognition model #1782

{{title}}

Replies: 1 comment 3 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Poor performance of fine-Tuned OCR recognition model #1782

stevemanavalan Nov 18, 2024

Replies: 1 comment · 3 replies

felixdittrich92 Nov 21, 2024 Maintainer

stevemanavalan Nov 27, 2024 Author

felixT2K Nov 27, 2024

stevemanavalan Nov 28, 2024 Author

stevemanavalan
Nov 18, 2024

Replies: 1 comment 3 replies

felixdittrich92
Nov 21, 2024
Maintainer

stevemanavalan Nov 27, 2024
Author

stevemanavalan Nov 28, 2024
Author