This is work in progress!
This project provides simple tools to obtain (popular) text corpora that are used for benchmarks and tests.
We do not host any of the corpora; we only provide an easy way to download or compute them. Please visit the websites of the corpora for further information.
- The Pizza & Chili Corpus
- Lightweight Corpus
- Random number generation
- Word based alphabet computation
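
As a rough illustration of what a word-based alphabet computation can look like, here is a minimal Python sketch. It is not this project's implementation: it assumes that "word" means a whitespace-separated token and that each distinct word is assigned an integer symbol in order of first appearance. The function name `word_alphabet` is hypothetical.

```python
def word_alphabet(text):
    """Build a word-based alphabet for `text`.

    Assumption: words are whitespace-separated tokens. Each distinct
    word gets the next free integer ID in order of first appearance.
    Returns the word-to-ID mapping and the text encoded as ID list.
    """
    alphabet = {}
    encoded = []
    for word in text.split():
        if word not in alphabet:
            # First occurrence: assign the next unused ID.
            alphabet[word] = len(alphabet)
        encoded.append(alphabet[word])
    return alphabet, encoded


alphabet, encoded = word_alphabet("to be or not to be")
print(alphabet)  # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(encoded)   # [0, 1, 2, 3, 0, 1]
```

The encoded sequence can then be treated as a text over an integer alphabet, which is a common input format for benchmarks on large alphabets.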
Use `make download` to download all files in the download configs, `make random` to generate random strings as defined in the config, and `make processing` to build all preprocessing tools.