OCR on PDF Files - Some images are not Processed #2360

hpcnetworks · 2024-11-11T22:22:19Z

Hi there, good evening!

I have processed a PDF file with hundreds of pages, containing many scanned documents. OCR worked fine on most of these scanned documents (images). Unfortunately, some images have not been processed by the OCR module.

Analysing the log files, I found this:
2024-11-11 18:20:36 [WARN] [parsers.misc.PDFTextParser] Plugin JPEG2000 not found, JPX images will not be decoded from PDFs. You can download it from https://mvnrepository.com/artifact/com.github.jai-imageio/jai-imageio-jpeg2000/1.3.0 and put it in plugins folder. Warn: that plugin is worse to decode JPX outside of PDFs!
ModuleNotFoundError: No module named 'numpy'
ModuleNotFoundError: No module named 'numpy'

I downloaded version 1.4.0 of the plugin and copied it into the plugins folder, but without success.

Can you help me understand what is happening?

Thanks in advance,

Bruno Costa

wladimirleite · 2024-11-11T23:49:19Z

Collaborator

Hi Bruno!
Can you share a sample file to reproduce the issue?

8 replies

lfcnassif Nov 12, 2024
Maintainer

Yes, it can be related, but without a sample file to investigate, it's not possible to be sure.

wladimirleite Nov 12, 2024
Collaborator

Bruno, I'm not sure I understood correctly: is it a single PDF file with multiple pages, in which some images were OCR'ed but others weren't?

You may try to process other PDFs with multiple pages/images, without sensitive content, and check if you find the same behavior in sample files.

hpcnetworks Nov 12, 2024
Author

Yes, a PDF file with more than 12000 pages and lots of images in it. Most of the images have been OCR'ed normally; so far I have found only this one image that was not OCR'ed...

I will try other PDF files in an attempt to reproduce this behavior.

hpcnetworks Nov 12, 2024
Author

Hi Nassif!

You are right! The processImagesInPDFs parameter was set to false. After setting it to true, the images were extracted and OCR'ed.

Can you let me know the difference between an embedded image and a rendered image in a PDF file? Knowing this would help me decide when to activate this parameter, since it increases the processing time a lot.

Thanks again

Bruno

lfcnassif Nov 13, 2024
Maintainer

Actually I meant the opposite, to disable processImagesInPDFs if it was enabled...

For purely scanned documents, keeping that option disabled is generally better, because each page is sometimes composed of several concatenated images, and extracting each image individually may cut words or entire text lines that fall on image borders. On the other hand, for textual PDFs with some embedded images in the middle of the text, turning that option on is needed if you want to OCR those small images.
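
As a rough illustration of that rule of thumb, here is a minimal Python sketch that suggests a value for processImagesInPDFs from per-page statistics. Everything in it is an assumption made for illustration: the `PageSummary` type, the `recommend_process_images_in_pdfs` function, and both thresholds are hypothetical, and none of this is IPED code or IPED's actual decision logic. The page statistics would have to be collected with whatever PDF tool you prefer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSummary:
    """Hypothetical per-page statistics, collected with any PDF tool."""
    text_chars: int   # characters of native (non-OCR) text on the page
    image_count: int  # number of images placed on the page


def recommend_process_images_in_pdfs(pages: List[PageSummary],
                                     min_text_chars: int = 200,
                                     scanned_ratio: float = 0.8) -> bool:
    """Suggest a value for processImagesInPDFs (illustrative heuristic only).

    A page with images and little native text is counted as "scanned".
    Both thresholds are arbitrary assumptions, not values used by IPED.
    """
    if not pages:
        return False

    scanned = sum(
        1 for p in pages
        if p.image_count > 0 and p.text_chars < min_text_chars
    )

    # Mostly scanned pages: keep the option off, so that pages made of
    # several concatenated images are not split at image borders.
    if scanned / len(pages) >= scanned_ratio:
        return False

    # Mostly textual document: turn the option on only if some text page
    # also carries embedded images worth OCR'ing.
    return any(
        p.image_count > 0 and p.text_chars >= min_text_chars
        for p in pages
    )
```

For example, a document whose pages all contain images and no native text would get `False`, while a report with mostly text pages and a few embedded figures would get `True`. For mixed documents the answer depends on the thresholds, so treat the result as a starting point for testing, not a definitive setting.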

For IPED, embedded and rendered images are the same, i.e., those that show up on each page. But PDFs can also have attachments, and those can also be images.
