Scripts and code for collecting and curating UNESCO data.
Code related to extracting and curating data for the Courier corpus.
Prerequisites:
- Java. The java executable must be on the PATH.
Usage:
extract_text_pdfbox.py [-h] files output-folder
Prerequisites:
- Poppler. Install with:
sudo apt install poppler-utils
- Tesseract OCR. Install with:
sudo apt install tesseract-ocr
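Before running either extraction script it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and the executable names (`java`, `pdftoppm` from poppler-utils, `tesseract`) are the ones these packages normally install.

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Hypothetical check covering the prerequisites of both scripts.
    missing = missing_tools(["java", "pdftoppm", "tesseract"])
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```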
Usage:
extract_text_tesseract.py [-h] [--first-page FIRST_PAGE] [-l LAST_PAGE] [-d DPI] [--fmt FMT] files output-folder
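The internals of extract_text_tesseract.py are not shown here. As a rough sketch of how a Poppler plus Tesseract pipeline is commonly wired together, the code below rasterizes the pages with `pdftoppm` and then runs `tesseract` on each page image. The function names, the default DPI, the fixed PNG format and the `eng` language code are assumptions for illustration, not taken from the script.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf, out_prefix, first_page=None, last_page=None, dpi=300):
    """Build a pdftoppm call that renders PDF pages to PNG files."""
    cmd = ["pdftoppm", "-r", str(dpi), "-png"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    return cmd + [str(pdf), str(out_prefix)]


def tesseract_command(image, out_base, lang="eng"):
    """Build a tesseract call; tesseract appends '.txt' to out_base itself."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pdf(pdf, output_folder, **kwargs):
    """Rasterize one PDF and OCR every page image (needs both tools installed)."""
    out = Path(output_folder)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    subprocess.run(pdftoppm_command(pdf, out / stem, **kwargs), check=True)
    # pdftoppm names its output <prefix>-<page>.png
    for image in sorted(out.glob(f"{stem}-*.png")):
        subprocess.run(tesseract_command(image, image.with_suffix("")), check=True)
```

Calling `ocr_pdf("issue.pdf", "output", first_page=1, last_page=3)` on a hypothetical file would leave one text file per page in the output folder.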
Current corpus stats:
Documents: 671
Articles:
in metadata index 8313
- of type article 7639
- in English 7612
Pages: 27336
Scripts and code for collecting (scraping) the SSI (legal instruments) corpus from the UNESCO website. This is the first of three text corpora of UNESCO documents.
index = GetIndexUrls()
for item in index:
    pageHtml = getHtmlPage(item.url)
    conventionText = ConventionParser().extract(pageHtml)
    storeText(genFilename(item), conventionText)
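The pseudocode above leaves its helpers undefined. As an illustration of the fetch and extract steps only, here is a standard-library sketch. `TextExtractor` and `fetch_html` are hypothetical stand-ins for `ConventionParser` and `getHtmlPage`; the real parser presumably targets the specific markup of the UNESCO pages, which this generic version does not attempt.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def extract(self, html):
        """Parse an HTML string and return its text, one chunk per line."""
        self.feed(html)
        self.close()
        return "\n".join(self._chunks)


def fetch_html(url):
    """Download a page and decode it (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

A fresh `TextExtractor` should be created for each page, since the collected chunks accumulate on the instance.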