Arshasb

Persian OCR dataset

This repository hosts Arshasb (an ancient Iranian name, اَرشاسب), a Persian OCR dataset.
The dataset contains 33,000 pages of Persian text, of which 7,000 pages are published for free.
Adjacent words are interdependent and belong to one subject.
More precisely, the word order is meaningful, which makes it possible to use NLP models in the OCR process.
The position of each word is precisely labeled. Look at this sample:

Download

Arshasb_samples.tar.gz contains 100 samples of this dataset.
You can download the 7k-page Arshasb dataset from this link (~730 MB).
For the full 33,000-page dataset (not free), contact me at hubare.ra[at]gmail.com.

Detail

After removing numbers and punctuation, the full dataset contains 97,498 unique words. The 7k version contains 40,911 unique words.
The content is drawn from public and news texts.
The dataset uses the Far_ketab font. [website]
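The README does not say how the unique-word counts above were produced. As a rough sketch of one possible approach, the snippet below splits text on whitespace, drops every character whose Unicode category is punctuation or number, and collects the remaining tokens in a set. The function names and the sample string are hypothetical, and this is not guaranteed to reproduce the reported figures.

```python
import unicodedata


def strip_numbers_and_punctuation(token):
    """Drop characters whose Unicode category is punctuation (P*) or number (N*)."""
    return "".join(
        ch for ch in token if unicodedata.category(ch)[0] not in ("P", "N")
    )


def unique_words(text):
    """Return the set of non-empty whitespace-separated tokens after cleaning."""
    cleaned = (strip_numbers_and_punctuation(t) for t in text.split())
    return {w for w in cleaned if w}


# Made-up sample text, not taken from the dataset.
sample_text = "سلام، دنیا! سلام 123 ۴۵۶"
print(len(unique_words(sample_text)))
```

To apply this to a page, the same function could be called on the contents of that page's fulltext file, read with UTF-8 encoding.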
Each page has its own subfolder, named after the page.
Each subfolder contains 4 files. For example, subfolder 00001 contains:
- page_00001.png [ Page image ]
- label_00001.xlsx [ The exact location of each word on the page ]
- fulltext_00001.txt [ Full text of the page ]
- line_00001.xlsx [ The exact location of each line on the page ]

Columns of label_xxxx.xlsx:
- word
- line [ index of the line that contains the word ]
- point1, point2, point3, point4 [ location of each word ]
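The folder layout described above can be turned into file paths with a small helper. This is a minimal sketch that assumes the naming pattern shown for subfolder 00001 holds for every page. The function names `page_files` and `list_pages` and the dictionary keys are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def page_files(root, page_id):
    """Build the four expected file paths for one page subfolder."""
    folder = Path(root) / page_id
    return {
        "image": folder / f"page_{page_id}.png",
        "words": folder / f"label_{page_id}.xlsx",
        "text": folder / f"fulltext_{page_id}.txt",
        "lines": folder / f"line_{page_id}.xlsx",
    }


def list_pages(root):
    """Return the names of all page subfolders under root, sorted."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())
```

For example, `page_files('Arshasb_7k', '00001')['words']` gives the path of the word-level label file for page 00001.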

Sample code for reading label_xxxx.xlsx

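import ast
import pandas as pd

# Word-level labels for page 00001 (pandas needs openpyxl to read .xlsx files)
label = pd.read_excel('Arshasb_7k/00001/label_00001.xlsx')
data = []
for j in range(len(label)):
    # read word
    word = label['word'][j]
    # read the index of the line that contains the word
    index_line = label['line'][j]
    # read the four points; each cell is assumed to hold a Python literal
    # stored as text, so ast.literal_eval parses it without executing code
    point1 = ast.literal_eval(label['point1'][j])
    point2 = ast.literal_eval(label['point2'][j])
    point3 = ast.literal_eval(label['point3'][j])
    point4 = ast.literal_eval(label['point4'][j])
    data.append({'number': j, 'word': word, 'line': index_line,
                 'point1': point1, 'point2': point2,
                 'point3': point3, 'point4': point4})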
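Once the records are in `data`, two common follow-up steps are turning the four points of a word into an axis-aligned bounding box and grouping words by their line index. The sketch below assumes each point is an (x, y) pair, which the README does not state explicitly. The helper names and the sample records are hypothetical.

```python
from collections import defaultdict


def points_to_bbox(points):
    """Return (x_min, y_min, x_max, y_max) for a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def group_words_by_line(records):
    """Map each line index to its words, in the order the records appear."""
    lines = defaultdict(list)
    for item in records:
        lines[item["line"]].append(item["word"])
    return dict(lines)


# Made-up records in the same shape as the entries appended to `data`.
sample = [
    {"word": "w1", "line": 0, "point1": (10, 5), "point2": (40, 5),
     "point3": (40, 20), "point4": (10, 20)},
    {"word": "w2", "line": 0, "point1": (50, 5), "point2": (90, 5),
     "point3": (90, 20), "point4": (50, 20)},
    {"word": "w3", "line": 1, "point1": (10, 30), "point2": (60, 30),
     "point3": (60, 45), "point4": (10, 45)},
]

first = sample[0]
bbox = points_to_bbox([first[f"point{k}"] for k in range(1, 5)])
lines = group_words_by_line(sample)
```

The bounding box does not depend on the order in which the four corners are stored, so it works even if the corner ordering differs from the assumption.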

Donation

I try to publish free Persian datasets on GitHub. Your financial support will encourage me.
Donation links:
https://www.coffeete.ir/persiandataset
https://www.patreon.com/persiandataset
If you are in Iran, contact me at hubare.ra[at]gmail.com to donate.
