unesco_data_collection

Scripts and code for collecting and curating UNESCO data.

Courier

Code related to extracting and curating data for the Courier corpus.

Extracting with PDFBox

Prerequisites:

Java. Must be on the PATH.

Usage:

extract_text_pdfbox.py [-h] files output-folder
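
A minimal sketch of how PDFBox text extraction could be driven from Python, assuming the PDFBox 2.x command-line app jar and its ExtractText tool. The jar location, the helper names, and the one-text-file-per-PDF output layout are assumptions for illustration; extract_text_pdfbox.py may work differently.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical location of the PDFBox 2.x command-line app jar.
PDFBOX_JAR = Path("pdfbox-app.jar")


def build_pdfbox_command(pdf_path: Path, output_folder: Path, jar: Path = PDFBOX_JAR) -> List[str]:
    """Build a `java -jar <jar> ExtractText <pdf> <txt>` command for one PDF."""
    # Assumed layout: one .txt file per PDF, named after the PDF.
    txt_path = output_folder / (pdf_path.stem + ".txt")
    return ["java", "-jar", str(jar), "ExtractText", str(pdf_path), str(txt_path)]


def extract_all(pdf_paths: List[Path], output_folder: Path) -> None:
    """Run PDFBox on each PDF. Requires Java on the PATH."""
    output_folder.mkdir(parents=True, exist_ok=True)
    for pdf_path in pdf_paths:
        # check=True raises CalledProcessError if Java or PDFBox fails.
        subprocess.run(build_pdfbox_command(pdf_path, output_folder), check=True)
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact command without Java installed.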

Extracting with Tesseract

Prerequisites:

Poppler. Install with: sudo apt install poppler-utils
Tesseract OCR. Install with: sudo apt install tesseract-ocr

Usage:

extract_text_tesseract.py [-h] [--first-page FIRST_PAGE] [-l LAST_PAGE] [-d DPI] [--fmt FMT] files output-folder
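
A rough sketch of the two-step pipeline the prerequisites suggest: Poppler's pdftoppm renders PDF pages to images, then Tesseract runs OCR on each image. The parameter names mirror the options in the usage line, but the defaults (300 DPI, PNG), the helper names, and the output file naming are assumptions, not taken from extract_text_tesseract.py.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def build_pdftoppm_command(pdf_path: Path, image_prefix: Path,
                           first_page: Optional[int] = None,
                           last_page: Optional[int] = None,
                           dpi: int = 300, fmt: str = "png") -> List[str]:
    """Build a pdftoppm command; fmt should be 'png', 'jpeg' or 'tiff'."""
    cmd = ["pdftoppm", "-r", str(dpi), f"-{fmt}"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to render
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to render
    return cmd + [str(pdf_path), str(image_prefix)]


def build_tesseract_command(image_path: Path, output_base: Path) -> List[str]:
    """Tesseract writes its result to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path: Path, output_folder: Path, **options) -> None:
    """Render pages to a temporary folder, then OCR each page image."""
    output_folder.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(build_pdftoppm_command(pdf_path, prefix, **options), check=True)
        # pdftoppm names its output <prefix>-<page number>.<extension>.
        for image in sorted(Path(tmp).glob("page-*")):
            out_base = output_folder / f"{pdf_path.stem}_{image.stem}"
            subprocess.run(build_tesseract_command(image, out_base), check=True)
```

Note that `-l` means "last page" for pdftoppm and for the script's usage line, whereas Tesseract's own `-l` option selects the OCR language; the sketch leaves Tesseract's language at its default.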

Current corpus stats:

Documents:                  671
Articles:
  in metadata index        8313
    - of type article      7639
    - in English           7612
Pages:                    27336

Legal instruments

Scripts and code for collecting (scraping) the SSI (legal instruments) corpus data from the UNESCO website. This is the first of three text corpora of UNESCO documents.

Main loop:

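# Fetch the index of legal instruments, then download, parse and store each page.
index = GetIndexUrls()
for item in index:
    pageHtml = getHtmlPage(item.url)  # fetch the current item's page, not the index
    conventionText = ConventionParser().extract(pageHtml)
    storeText(genFilename(item), conventionText)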
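
The main loop above could look roughly like this as runnable Python, using only the standard library. Everything here is a sketch under assumptions: `IndexItem`, `gen_filename`, `scrape`, and the injected `fetch` callable are hypothetical names, and this `ConventionParser` simply collects visible text, whereas the real parser presumably targets the structure of the UNESCO pages.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Iterable, List, NamedTuple


class IndexItem(NamedTuple):
    """Hypothetical index entry: a title and the URL of its page."""
    title: str
    url: str


class ConventionParser(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def extract(self, page_html: str) -> str:
        self.feed(page_html)
        self.close()
        return "\n".join(self._chunks)


def gen_filename(item: IndexItem) -> str:
    """Derive a filesystem-safe name from the item's title."""
    safe = "".join(c if c.isalnum() else "_" for c in item.title.lower())
    return safe + ".txt"


def scrape(index: Iterable[IndexItem], fetch: Callable[[str], str], output_folder: Path) -> None:
    """For each index item: fetch its page, extract the text, store it."""
    output_folder.mkdir(parents=True, exist_ok=True)
    for item in index:
        page_html = fetch(item.url)
        text = ConventionParser().extract(page_html)
        (output_folder / gen_filename(item)).write_text(text, encoding="utf-8")
```

Passing `fetch` in as a parameter keeps the loop testable without network access; in practice it could wrap `urllib.request.urlopen` or a similar HTTP client.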
