Arshasb

Persian OCR dataset

  • This repository contains Arshasb (an ancient Iranian name, اَرشاسب), a Persian OCR dataset.
  • The dataset contains 33,000 pages of Persian text, of which 7,000 pages have been published for free.
  • Words that appear next to each other are related and belong to a single subject.
  • More precisely, the word order is meaningful, which makes it possible to combine NLP models with the OCR process.
  • The position of each word on the page is precisely labeled; see the sample code below for how to read these labels.

Download

  • 100 sample pages from the dataset are provided in Arshasb_samples.tar.gz (a small extraction sketch follows this list).
  • You can download the 7k-page Arshasb dataset from this link (~730 MB).
  • If you want the full 33,000-page dataset, contact me at hubare.ra[at]gmail.com (not free).
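
If you only want to inspect the samples, the archive can be unpacked with Python's standard library. A minimal sketch, assuming Arshasb_samples.tar.gz has already been downloaded into the current directory (the extraction folder name is a placeholder):

import tarfile

# Unpack the 100 sample pages into a local folder
with tarfile.open('Arshasb_samples.tar.gz', 'r:gz') as archive:
    archive.extractall('Arshasb_samples')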

Details

  • After removing numbers and punctuation, the full dataset contains 97,498 unique words; the 7k version contains 40,911 unique words.

  • The content of this dataset includes public and news texts.

  • This dataset uses the Far_ketab font. [website]

  • Each page in the dataset has a subfolder with the same name as the page.

  • Each subfolder contains 4 files; for example, subfolder 00001 contains the following (a short sketch for walking these folders is shown after this list):

    • 1. page_00001.png [ the page image ]

    • 2. label_00001.xlsx [ the exact location of each word on the page ]

    • 3. fulltext_00001.txt [ the full text of the page ]

    • 4. line_00001.xlsx [ the exact location of each line on the page ]

    • Columns of label_xxxx.xlsx:

      • 1. word [ the word itself ]
      • 2. line [ the index of the line that contains the word ]
      • 3. point1, point2, point3, point4 [ the four points giving the location of the word on the page ]
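
The file layout above makes it easy to pair every page image with its ground-truth text, for example to build an end-to-end OCR training set. A minimal sketch, assuming the 7k release is unpacked as Arshasb_7k/ with one zero-padded subfolder per page as described above; the helper name load_pages is hypothetical:

import os

def load_pages(root='Arshasb_7k'):
    # Hypothetical helper: collect (page image path, full text) pairs for every page
    pages = []
    for name in sorted(os.listdir(root)):  # subfolder names such as '00001'
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue
        image_path = os.path.join(folder, f'page_{name}.png')
        text_path = os.path.join(folder, f'fulltext_{name}.txt')
        with open(text_path, encoding='utf-8') as f:
            fulltext = f.read()
        pages.append({'page': name, 'image': image_path, 'text': fulltext})
    return pages

pages = load_pages()
print(len(pages), pages[0]['image'])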

Sample code for reading label_xxxx.xlsx

import ast
import pandas as pd

# Read the word-level labels of page 00001
label = pd.read_excel('Arshasb_7k/00001/label_00001.xlsx')

data = []
for j in range(len(label)):
    # the word itself
    word = label['word'][j]
    # index of the line that contains the word
    index_line = label['line'][j]
    # the four points giving the location of the word, stored as strings
    point1 = ast.literal_eval(label['point1'][j])
    point2 = ast.literal_eval(label['point2'][j])
    point3 = ast.literal_eval(label['point3'][j])
    point4 = ast.literal_eval(label['point4'][j])
    data.append({'number': j, 'word': word, 'line': index_line,
                 'point1': point1, 'point2': point2, 'point3': point3, 'point4': point4})
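
As a quick sanity check of the labels, the points collected in data can be drawn onto the page image. A minimal sketch using Pillow, assuming each point column holds an (x, y) pixel coordinate and that point1 through point4 trace the outline of the word (this interpretation is an assumption, not stated in the files themselves):

from PIL import Image, ImageDraw

# Draw the four labeled points of every word as a polygon on the page image
page = Image.open('Arshasb_7k/00001/page_00001.png').convert('RGB')
draw = ImageDraw.Draw(page)
for item in data:
    # assumed order: point1 -> point2 -> point3 -> point4 around the word
    draw.polygon([item['point1'], item['point2'], item['point3'], item['point4']],
                 outline=(255, 0, 0))
page.save('page_00001_boxes.png')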

Donation

I try to publish free Persian datasets on GitHub. Your financial support encourages me to continue.
Donation links:
  • https://www.coffeete.ir/persiandataset
  • https://www.patreon.com/persiandataset

If you are in Iran, contact me at hubare.ra[at]gmail.com to donate.