This repository contains TEI files of 19th- and 20th-century exhibition catalogs.
These files were produced with the following pipeline:
Segmentation and transcription were done in eScriptorium, using models trained with Kraken on datasets from here and here.
The Python extraction script, which transforms the ALTO4 files exported from eScriptorium into TEI files, is accessible here.
The output is manually corrected after each step of the pipeline.
Because the layout analysis has been corrected for each catalog, the ALTO4 files exported from eScriptorium can be reused to train a more accurate segmentation model.
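As an illustration of the ALTO4-to-TEI step described above, here is a minimal sketch in Python. It is not the project's extraction script: the function names `alto_lines` and `lines_to_tei` and the sample document are hypothetical, and each line is simply wrapped in a generic TEI `<p>` element, whereas the real files follow the project's ODD. It also omits the `teiHeader`, so the output is not a complete TEI document.

```python
import xml.etree.ElementTree as ET

# Namespaces of the ALTO 4 and TEI P5 schemas.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample, for illustration only (not taken from the corpus).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock>
      <TextLine><String CONTENT="Sample entry"/></TextLine>
      <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def alto_lines(alto_xml: str) -> list[str]:
    """Return the text of every TextLine in an ALTO4 document, in file order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # A line may hold one String (whole line) or one String per word.
        words = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(w for w in words if w))
    return lines


def lines_to_tei(lines: list[str]) -> ET.Element:
    """Wrap each non-empty line in a TEI <p> (assumption: the real ODD is richer)."""
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    div = ET.SubElement(body, f"{{{TEI_NS}}}div")
    for line in lines:
        if line:
            paragraph = ET.SubElement(div, f"{{{TEI_NS}}}p")
            paragraph.text = line
    return tei


if __name__ == "__main__":
    tei_root = lines_to_tei(alto_lines(SAMPLE_ALTO))
    print(ET.tostring(tei_root, encoding="unicode"))
```

In practice the conversion would also need to use the region and line types assigned during segmentation to decide which TEI elements to produce; this sketch ignores them.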
The TEI files were built to conform to the ODD written by Caroline Corbières.
For each catalog, this repository provides the images, the ALTO4 files exported from eScriptorium, the TEI file, and a CSV file.
The CSS file àffichage_TEI.css makes the TEI files easier to correct.
The documents were encoded by Juliette Janes, an intern on the Artl@s project, with the help of Simon Gabay and under the supervision of Béatrice Joyeux-Prunel.
Images from catalogs published prior to 1920 and the transcriptions are licensed under CC-BY.
The other images are extracts from catalogs published after 1920 and remain the intellectual property of their producers.
Juliette Janes, Simon Gabay, Béatrice Joyeux-Prunel, TEICatalogs: Corpus of encoded 19th and 20th catalogs, 2021, Paris: ENS Paris https://github.com/Juliettejns/TEIcatalogs/
If you have any questions or remarks, please contact juliette.janes@chartes.psl.eu and simon.gabay@unige.ch.