Marathi-OCR-Dataset

A collection of about 12k Marathi word images with corresponding labels, useful for Devanagari Optical Character Recognition.

Description

There is a lack of publicly available word-level and line-level datasets for Devanagari character recognition. We created this dataset of ~12k Marathi word images and their corresponding text labels, encoded in UTF-8. The words are segmented from Marathi books in PDF and .epub format, available at http://www.esahity.com/ . We used 12 books from different genres to add diversity in vocabulary and font variation. This also removed the dependency on domain-specific words and redundant Marathi numerals. The dataset is based on the IAM Handwriting Dataset.

We created this dataset using pytesseract, which is a wrapper for Google's Tesseract-OCR engine. The challenge with using Tesseract-OCR for Indic languages is that it is trained using the same approach as for European languages. It fails to recognize compound words (common in Devanagari script), which are consonant-vowel sequences represented as a single unit. Devanagari script also contains various diacritics written with the characters. To eliminate these errors and inconsistencies in the predicted output, we manually corrected the text labels. The images are resized and stored in JPEG format with a resolution of 96 dpi horizontally and vertically, so that they can be fed directly to neural nets. Word length varies from 2 characters up to 15-20 characters. We also applied pre-processing techniques such as binarization and image thresholding, using OpenCV and PIL, to obtain cleaner images.
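To illustrate why compound units and diacritics make Devanagari text harder to handle than Latin text, here is a minimal sketch that groups the Unicode code points of a label into approximate visual units. It is not part of the dataset tooling: the function name `split_units` is hypothetical, and the grouping rule (combining marks attach to the preceding character, and a virama joins the next consonant) is a simplification that ignores cases such as ZWJ/ZWNJ.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant); joins consonants into one unit


def split_units(word: str) -> list[str]:
    """Approximately split a Devanagari word into visual units.

    Simplified rule (an assumption, not full Unicode grapheme segmentation):
    combining marks (categories Mn/Mc) attach to the current unit, and a
    character following a virama is merged into the current unit.
    """
    word = unicodedata.normalize("NFC", word)
    units: list[str] = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        joins_previous = bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or joins_previous):
            units[-1] += ch
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    # "kshatriya": 8 code points, written with escapes to keep them unambiguous
    word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
    print(len(word), split_units(word))
```

For this example word the code point count is 8 while the sketch yields 3 units, so a "word length" measured in code points can differ a lot from the number of shapes an OCR model actually sees.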

Usage

The file images.zip contains the Marathi word images, and labels.txt contains the corresponding text. This dataset can be used to improve existing OCR systems; see Train Tesseract 4.0. More broadly, it can be used to train hybrid CNN-LSTM models from scratch; see Text Recognition System using TensorFlow.
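A minimal sketch of pairing images with their labels, using only the Python standard library. The layout of labels.txt is not documented here, so the sketch assumes one entry per line of the form `<image name> <label>`, separated by whitespace; adjust `parse_labels` if the actual file differs. Both function names are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath


def parse_labels(text: str) -> dict[str, str]:
    """Map image name (without extension) to its text label.

    Assumed line format: "<image name> <label>". Blank lines and lines
    without a label are skipped.
    """
    labels: dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue
        image_name, label = parts
        labels[PurePosixPath(image_name).stem] = label
    return labels


def pair_images_with_labels(archive, labels: dict[str, str]) -> list[tuple[str, str]]:
    """Return (archive member name, label) for every image that has a label."""
    pairs: list[tuple[str, str]] = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            stem = PurePosixPath(name).stem
            if stem in labels:
                pairs.append((name, labels[stem]))
    return pairs


if __name__ == "__main__":
    with open("labels.txt", encoding="utf-8") as f:
        labels = parse_labels(f.read())
    for member, label in pair_images_with_labels("images.zip", labels)[:5]:
        print(member, label)
```

Matching on the file stem rather than the full path keeps the pairing independent of any folder structure inside the archive.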

sayalighodekar/Marathi-OCR-Dataset
