python-OCR

In accounting, working with thousands of vendors makes it challenging to search scanned documents for an invoice by its invoice number.

Invoices contain a variety of information, such as product names and prices, VAT and other tax information, vendor or customer names, and the date of the transaction. The process of reading text from images is called Optical Character Recognition (OCR): the characters in an image are detected and converted to machine-readable text.

In this repository, I go through some ways to convert PDF files to images using Python. Then we can read the text from these images. Further content extraction is not provided here.
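The two steps described above (PDF pages to images, then images to text) can be sketched as follows. This is a minimal sketch and not the code of this repository: it assumes Ghostscript and Tesseract are installed and reachable on the PATH as `gs` and `tesseract`, and the function names `build_render_command`, `build_ocr_command`, and `pdf_to_text` are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def build_render_command(pdf_path, out_pattern, dpi=300):
    """Ghostscript command that renders every PDF page to a PNG file."""
    return [
        "gs",                # assumption: the executable is "gswin64c" on Windows
        "-q",                # quiet: suppress the startup banner
        "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=png16m",   # 24-bit colour PNG output
        f"-r{dpi}",          # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def build_ocr_command(image_path, lang="eng"):
    """Tesseract command that writes the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def pdf_to_text(pdf_path, dpi=300, lang="eng"):
    """Return the OCR text of each page of a PDF, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        # %03d is expanded by Ghostscript to the zero-padded page number
        pattern = str(Path(tmp) / "page-%03d.png")
        subprocess.run(build_render_command(pdf_path, pattern, dpi), check=True)
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                build_ocr_command(image, lang),
                check=True, capture_output=True, text=True,
            )
            pages.append(result.stdout)
        return pages
```

Calling `pdf_to_text("invoice.pdf")` (a placeholder file name) would return one string per page. A resolution of 300 DPI is an assumed default here; scans with small print may need a higher value.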

#Prerequisites

Tesseract: https://github.com/tesseract-ocr/tesseract
ImageMagick: https://github.com/ImageMagick/ImageMagick
Ghostscript: https://www.ghostscript.com/download/gsdnld.html
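Before running any conversion, it can help to confirm that these three tools are actually installed. The sketch below is one possible check using only the standard library; the executable names in `CANDIDATES` are assumptions that vary by platform and version (for example, ImageMagick 7 ships `magick`, older versions ship `convert`, and on Windows `convert` may resolve to an unrelated system utility).

```python
import shutil

# Assumed executable names for each prerequisite; adjust for your platform.
CANDIDATES = {
    "Tesseract": ["tesseract"],
    "ImageMagick": ["magick", "convert"],
    "Ghostscript": ["gs", "gswin64c", "gswin32c"],
}


def find_tools(candidates=CANDIDATES):
    """Map each tool to the path of the first executable found on PATH, or None."""
    found = {}
    for tool, names in candidates.items():
        paths = (shutil.which(name) for name in names)
        found[tool] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found'}")
```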

#Bibliography

https://hypi.io/2019/10/29/reading-text-from-invoice-images-with-python/
invoice2data, a Python library: https://medium.com/version-1/my-experience-extracting-invoice-data-using-invoice2data-in-python-1c6450fa001f
https://datascience.stackexchange.com/questions/33231/using-python-and-machine-learning-to-extract-information-from-an-invoice-inital
using pdf2image and easyocr: https://www.youtube.com/watch?v=bcmEMcEzV9M
some insight into my solution: https://datascience.stackexchange.com/questions/33231/using-python-and-machine-learning-to-extract-information-from-an-invoice-inital
create rectangles in the PDF file using ocrmypdf: https://www.youtube.com/watch?app=desktop&v=glJi3LBgn9U
a similar project in C#: https://github.com/robela/OCR-Invoice
veryfi, an API to get information from an invoice or receipt: https://www.linkedin.com/pulse/extract-data-from-receipt-invoice-5-lines-ofcode-dmitry-birulia/?articleId=6676262454793764864

#More resources

projects: https://aihubprojects.com/handwriting-recognition-using-cnn-ai-projects/
https://nanonets.com/blog/handwritten-character-recognition/

#More on Tesseract

https://learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
https://learnopencv.com/category/text-recognition/

#Datasets

https://www.kaggle.com/dromosys/handwriting-recognition-cnn
https://www.kaggle.com/tejasreddy/iam-handwriting-top50
