Farsi Wiki Dataset

This dataset was extracted from the Farsi Wikipedia dump (fawiki-20181001-corpus.xml.bz2) available at linguatools. The file contains each article's textual content along with the article's categories (if available). Every line of the dataset represents a single Farsi Wikipedia page and has the following format:

text.[categories]

Categories is a list of the category names assigned to the article.
The text of each article has been somewhat cleaned (e.g. HTML tags removed), but should be further preprocessed.
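
As a rough illustration of reading this format, here is a minimal Python sketch. It rests on assumptions that are not confirmed by this README: that the category list is the final bracketed group on the line, that it is separated from the text by `.[`, and that category names are comma-separated and possibly quoted. The function names `parse_line` and `iter_articles` are hypothetical; check a few lines of the actual file and adjust before relying on it.

```python
def parse_line(line):
    """Split one dataset line into (text, categories).

    Assumed layout (not confirmed by the README): text.[cat1, cat2, ...]
    """
    line = line.rstrip("\n")
    # Assumption: the last ".[" on the line starts the category list.
    start = line.rfind(".[")
    if start == -1 or not line.endswith("]"):
        # No recognizable category list: treat the whole line as text.
        return line, []
    text = line[:start]        # the separating "." is dropped
    raw = line[start + 2:-1]   # contents between "[" and "]"
    # Assumption: names are comma-separated and may carry quotes.
    categories = [c.strip().strip("'\"") for c in raw.split(",") if c.strip()]
    return text, categories


def iter_articles(path):
    """Yield (text, categories) for every non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

A category name that itself contains a comma would be split incorrectly by this sketch, so a stricter parser may be needed depending on how the list is actually serialized.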

License

This file has been derived from an XML version of the original Wikipedia and is therefore made available under the same license as Wikipedia itself: Creative Commons Attribution-ShareAlike.

Contributions

If you are interested in contributing, please send me an email at mallahyari@georgiasouthern.edu.
