How do we add new OCR models, and what do we use to train them? #1217

Open
kyrieb-ekat opened this issue Oct 11, 2024 · 0 comments

@kyrieb-ekat

This is a two-pronged but interrelated issue, stemming from some of the problems discussed in #1215. In short, OCR does badly on the exceptional pages in MS73, and we might be able to address this with a new OCR model for MS73. The text alignment job has an option to select which model you would like to use; the OCR model we are currently using is below the acceptable range for correction (i.e. its output is so bizarre that it's faster to transcribe by hand at this point).

Looking into this, it appears that when we migrated from the separate text alignment repo to update to Python 3, and then moved everything for text alignment into Rodan main, we also changed what we use for OCR. Previously we used OCRopus; now it looks like we're using some combination of Calamari and OCRopus. OCRopus is no longer supported, and the version of Calamari we are using has some components that are out of date and no longer supported. Fortunately, Calamari has released a new version with updates, so I think we can still use it; we just need to move to the most up-to-date version.

Which brings us back to the "MS73 OCR is terrible" issue: once we train the new model, how do we add it as an option for users to select in the text alignment job? It looks like we might just be able to add it to the models folder, but I'm not sure how to reflect this as an option in the actual job. Paging @homework36! (This is the trajectory of the problem and a recap of the issue we discussed; I've found all the relevant packages etc. that we can use, we just need to find a way to get them on staging.) Also looping in @JoyfulGen, for any institutional knowledge you might bring re: the Salzinnes training, if you remember any of those discussions.

In short:

  1. I think we need a new/better MS73-specific OCR model
  2. Once we've trained this model, how do we put it up?
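
For the second point, here is a rough sketch of one way the job could pick up whatever is in the models folder, rather than a description of how the text alignment job works today (I haven't confirmed how it currently builds its model list). It assumes each model lives in its own sub-folder and that the job's model option can be expressed as a JSON-schema-style `enum`; the function names `list_ocr_models` and `build_model_setting` and the folder names `salzinnes` and `ms73` are all hypothetical.

```python
import tempfile
from pathlib import Path


def list_ocr_models(models_dir):
    """Return the sorted names of the sub-folders in models_dir.

    Assumption: one sub-folder per trained OCR model; loose files are ignored.
    """
    root = Path(models_dir)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def build_model_setting(models_dir, default=None):
    """Build a JSON-schema-style setting listing every model found on disk.

    Hypothetical shape: the real job may describe its options differently.
    """
    names = list_ocr_models(models_dir)
    if not names:
        raise ValueError(f"no OCR models found in {models_dir}")
    if default not in names:
        # Fall back to the first model if the requested default is missing.
        default = names[0]
    return {"type": "string", "enum": names, "default": default}


# Demo against a throwaway folder standing in for the real models folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("salzinnes", "ms73"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "README.txt").write_text("not a model")
    setting = build_model_setting(tmp, default="salzinnes")

print(setting)
```

If something like this works, adding a new model would just mean dropping its folder into the models folder, with no change to the job code.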