Farsi Wiki Dataset

This dataset was extracted from the Farsi Wikipedia dump (fawiki-20181001-corpus.xml.bz2) available at linguatools. The file contains each article's textual content along with the article's categories (if available). Every line of the dataset represents a single Farsi Wikipedia page and has the following format:

text.[categories]
  • Categories is a list of the category names assigned to the article.

  • The text of each article has been partially cleaned (e.g. HTML tags removed), but it should be preprocessed further before use.
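
The line format above can be read with a few lines of Python. This is only a minimal sketch, not part of the dataset tooling: the function name `parse_line` is hypothetical, and it assumes that the categories sit in a trailing `[...]` block, that category names are comma-separated, and that the period before the bracket is a separator rather than part of the text. Check these assumptions against the actual file before relying on it.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one dataset line into (text, categories).

    Assumed layout: "text.[cat1, cat2, ...]". Lines without a
    trailing bracket block are returned with an empty category list.
    """
    line = line.rstrip("\n")
    if line.endswith("]"):
        # Search from the right so that brackets inside the article
        # text are less likely to be mistaken for the category block.
        start = line.rfind(".[")
        if start != -1:
            text = line[:start]
            raw = line[start + 2:-1]
            # Assumption: category names are separated by commas.
            categories = [c.strip() for c in raw.split(",") if c.strip()]
            return text, categories
    return line, []
```

To process the whole file, open it with `encoding="utf-8"` and call `parse_line` on each line; reading line by line avoids loading the full dump into memory.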

License

This file is derived from an XML version of the original Wikipedia and is therefore made available under the same license as Wikipedia itself: Creative Commons Attribution-ShareAlike.

Contributions

If you are interested in contributing, please send me an email at mallahyari@georgiasouthern.edu.