A collection of about 12k Marathi word images with corresponding labels, useful for Devanagari Optical Character Recognition.
Publicly available word- and line-level datasets for Devanagari character recognition are scarce. We created this dataset of ~12k Marathi word images and their corresponding text labels, encoded in UTF-8. The words are segmented from Marathi books in PDF and .epub format, available at http://www.esahity.com/ . We used 12 books from different genres to diversify the vocabulary and the fonts. This also removed the dependency on domain-specific words and redundant Marathi numerals. The dataset is based on the IAM Handwriting Dataset.
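Because the labels are UTF-8 Devanagari text, the "length" of a word depends on what is counted: bytes, Unicode code points, or the visual units a reader sees. The sketch below illustrates the difference using only the Python standard library. The helper `split_clusters` is a hypothetical name introduced here, and its rule (attach combining marks, and attach any character that follows a virama) is a simplification rather than full Unicode grapheme segmentation; it ignores joiner characters such as ZWJ and ZWNJ, for example.

```python
import unicodedata
from typing import List

VIRAMA = "\u094d"  # Devanagari sign virama (halant); joins consonants into conjuncts


def split_clusters(word: str) -> List[str]:
    """Approximately split a Devanagari word into visual units.

    A character is attached to the previous unit if it is a combining mark
    (vowel sign, anusvara, virama, ...) or if the previous unit ends in a virama.
    """
    clusters: List[str] = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        follows_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or follows_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u092e\u0930\u093e\u0920\u0940"  # the word "Marathi" written in Devanagari
print(len(word.encode("utf-8")))  # 15 bytes (3 bytes per Devanagari code point)
print(len(word))                  # 5 code points
print(len(split_clusters(word)))  # 3 visual units
```

When sizing the output sequence of a recognition model, the code point count is usually the relevant one, since that is what the label strings contain.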
We created this dataset using pytesseract, a Python wrapper for Google's Tesseract-OCR engine. The challenge with using Tesseract-OCR for Indic languages is that it is trained using the same approach as for European languages. It fails to recognize compound characters (common in Devanagari script), in which a consonant-vowel sequence is represented as a single unit. Devanagari script also contains various diacritics written with the characters. To eliminate the resulting errors and inconsistencies in the predicted output, we manually corrected the text labels. The images are resized and stored in JPEG format at a resolution of 96 dpi both horizontally and vertically, so that they can be fed directly to neural networks. Word length varies from 2 characters up to 15-20 characters. We also applied pre-processing techniques such as binarization and image thresholding, using OpenCV and PIL, for cleaner images.
The file images.zip contains the Marathi word images, and labels.txt contains the corresponding text.
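A minimal sketch of pairing the two files is shown below. The layout of labels.txt is not specified in this description, so the code assumes one entry per line consisting of an image file name, whitespace, and then the label; adjust the parsing if the actual file differs. The function names `parse_labels` and `pair_images_with_labels` are hypothetical.

```python
import io
import zipfile
from typing import Dict


def parse_labels(text: str) -> Dict[str, str]:
    """Map image file name -> label.

    Assumes each non-empty line looks like '<file name> <label>'.
    This format is an assumption, not documented by the dataset.
    """
    labels: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            labels[parts[0]] = parts[1]
    return labels


def pair_images_with_labels(zip_file, labels: Dict[str, str]) -> Dict[str, str]:
    """Keep only the labels whose image is actually present in the archive."""
    with zipfile.ZipFile(zip_file) as archive:
        # Compare on the base name so that a folder prefix inside the zip is ignored.
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return {name: label for name, label in labels.items() if name in present}
```

To use it, read labels.txt with `open("labels.txt", encoding="utf-8")`, pass the contents to `parse_labels`, and then call `pair_images_with_labels("images.zip", labels)`. Opening the file explicitly as UTF-8 matters on platforms whose default encoding is different.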
This dataset can be used to improve existing OCR systems (see Train Tesseract 4.0). More broadly, it can be used to train hybrid CNN-LSTM models from scratch (see Text Recognition System using TensorFlow).