ErrCorp

ErrCorp is a tool for automatically generating error-annotated corpora from Wikipedia sites. Such a corpus contains the newest versions of articles, with errors marked on the basis of their editing history.

The script operates in place: no additional files are created during processing, except when the dump is located online and has to be downloaded first. It also has a low memory footprint, because it processes the input page by page.
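To illustrate how errors can be recovered from an editing history, here is a minimal sketch that compares two consecutive revisions of a text and reports the token spans that were replaced. It is not ErrCorp's actual algorithm: the function name `extract_corrections` is hypothetical, and the standard-library `difflib` is used here only to keep the example dependency-free.

```python
import difflib


def extract_corrections(old_text, new_text):
    """Return (erroneous, corrected) token-span pairs between two revisions.

    Hypothetical helper for illustration; not part of ErrCorp.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode means a span in the older revision was
        # rewritten in the newer one: a candidate error and its correction.
        if op == "replace":
            pairs.append((" ".join(old_tokens[i1:i2]),
                          " ".join(new_tokens[j1:j2])))
    return pairs
```

Calling `extract_corrections("Slunce je hvezda", "Slunce je hvězda")` yields a single pair, the misspelled token and its corrected form. A real pipeline would also need to filter out vandalism and content changes, which this sketch does not attempt.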

Install

pip install mwclient
pip install intervaltree
pip install python-Levenshtein

Usage

Download and process pages through the MediaWiki Action API:

-a "Astronomie; Biologie; Fyzika;" -l "cs" -f "se" -r

Process pages from a local dump:

-p ../cswiki.xml.bz2 -l "cs" -f "txt" -r -m
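Page-by-page processing of a compressed dump can be sketched roughly as follows. This is an assumption about the general technique, not ErrCorp's own code: `iter_pages` and `iter_dump` are hypothetical names, and the sketch assumes the usual MediaWiki export layout of `<page>` elements containing `<title>` and per-revision `<text>` elements.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, revision_texts) for each <page>, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        revisions = []
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                revisions.append(node.text or "")
        yield title, revisions
        # Free the finished page's contents before parsing the next one.
        elem.clear()


def iter_dump(path):
    """Stream pages from a .bz2 dump without unpacking it to disk."""
    with bz2.open(path, "rb") as dump:
        yield from iter_pages(dump)
```

With this in place, `for title, revisions in iter_dump("../cswiki.xml.bz2"): ...` would visit one page at a time, so only the revisions of the current page are held in memory.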

For more information, see the project wiki.
