How do we add new OCR models, and what do we use to train them? #1217

kyrieb-ekat · 2024-10-11T16:18:41Z

This is a two-pronged but interrelated issue, stemming from some of the problems discussed in #1215. In short, OCR does badly on the exceptional pages in MS73, and we might be able to address this with a new OCR model for MS73. In the text alignment job there is an option to select which model you would like to use; the output of the OCR model we currently use falls below the acceptable range for correction (i.e. it's coming up with such bizarre things that it's faster to transcribe by hand at this point).

Looking into this, it appears that once we migrated from the separate text alignment repo to update to Python 3, and then moved everything for text alignment into rodan main, we also changed what we used for OCR. Previously we used OCRopus; now it looks like we're using some combination of Calamari and OCRopus. OCRopus is no longer supported, and the version of Calamari we are using has some components that are out of date and no longer supported. Fortunately, Calamari did release a new version and update some things, so I think we can still use it; we just need to use the most up-to-date version.

Which brings us back to the "MS73 OCR is terrible" issue: once we train the new model, how do we add it as an option for users to select in the text alignment job? It looks like we might just be able to add it to the models folder, but I'm not sure how to reflect this as an option in the actual job. Paging @homework36! (This is the trajectory of the problem and a recap of the issue we discussed; I've found all the relevant packages etc. we can use, we just need to find a way to get them on staging.) Also looping in @JoyfulGen, for any institutional knowledge you might bring re: the Salzinnes training, if you remember any of those discussions.

In short:

1. I think we need a new/better MS73-specific OCR model.
2. Once we've trained this model, how do we make it available as an option in the text alignment job?
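To make the second question more concrete, here is a minimal sketch of one way the job's model option could be derived from the models folder, so that dropping in a new model makes it selectable. This is not how the text alignment job currently works, as far as I know. It assumes each model lives in its own subdirectory of a models folder (a hypothetical layout), and `discover_models`, `build_settings`, and the "OCR Model" setting name are made up for illustration. The settings dict is only JSON-Schema-style and is not tied to any particular Rodan API.

```python
from pathlib import Path


def discover_models(models_dir):
    """Return the sorted names of model subdirectories in models_dir.

    Assumption: one subdirectory per trained model (hypothetical layout).
    Returns an empty list if the folder does not exist.
    """
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def build_settings(model_names, default=None):
    """Build a JSON-Schema-style settings dict offering the models as an enum.

    The "OCR Model" key is a hypothetical setting name, not the job's real one.
    """
    if not model_names:
        raise ValueError("no OCR models found")
    # Fall back to the first model if the requested default is missing.
    if default not in model_names:
        default = model_names[0]
    return {
        "title": "Text alignment settings",
        "type": "object",
        "properties": {
            "OCR Model": {
                "type": "string",
                "enum": list(model_names),
                "default": default,
            }
        },
    }


if __name__ == "__main__":
    # "models" is a placeholder path; point it at the real models folder.
    names = discover_models("models")
    if names:
        print(build_settings(names))
    else:
        print("No models found in ./models")
```

If the job instead keeps a hard-coded list of model names, the equivalent change would be adding the new model's name to that list alongside copying the files into the models folder.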


kyrieb-ekat added Type: MAINTENANCE priority: Medium help-wanted Help :^ labels Oct 11, 2024

kyrieb-ekat assigned homework36 and kyrieb-ekat Oct 11, 2024
