This repository contains LinkedPipes ETL pipelines that transform the [Czech Business Registry] from its XML representation.
- First, a proper representation in RDF according to the Semantic Government Vocabulary is created.
- Second, the Semantic Government Vocabulary representation is transformed to the European Business Graph vocabulary representation used in the STIRData project.
The source data used to be available in the Czech National Open Data Portal. However, the publisher has deployed a new data catalog, which at the moment does not provide metadata to be harvested by the national catalog. The datasets, available directly from the data publisher, are split along three dimensions: year (historical per-year snapshots plus the current year), region (more precisely, the regional court managing that part of the registry), and full vs. valid records, where the valid records omit some of the deprecated information about companies that no longer exist.
Each dataset has 4 distributions: XML, CSV (ugly, with embedded JSON values), zipped XML and zipped CSV. We use the XML distribution for further transformation.
- The pipeline queries the https://data.gov.cz/sparql SPARQL endpoint of the Czech National Open Data Portal and searches for the business registry datasets - all regions, year 2021, full version - selecting specifically their XML distributions (see the dataset-discovery query sketch after this list).
- The downloaded XML files are converted to an initial RDF representation using XSLT transformations.
- The RDF is further refined based on different aspects of the company records. The blue parts of the pipeline are common to all company types, the yellow-green parts are specific to joint-stock companies, and the pink parts are specific to limited liability companies. The green parts on top generate codelists from the data.
- The full ontological representation is then compacted to form a LOD-style RDF version of the data (see the compaction sketch after this list). This is the version we would like the business registry to publish, if we succeed in convincing them. Therefore, it is the source of the transformation to the STIRData model.
- The whole process takes approx. 20 hours to complete, depending on the hardware used.
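For illustration, the dataset-discovery step can be approximated by a DCAT-AP query like the one below. This is a minimal sketch, not the exact query used in the pipeline: the DCAT properties are standard, but the title keywords, language tag and media-type IRI used for filtering are assumptions.

```sparql
# Illustrative only - the query actually used in the pipeline may differ.
# Assumptions: DCAT-AP metadata, Czech titles mentioning the registry and the year 2021.
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT ?dataset ?title ?downloadURL
WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title ;
           dcat:distribution ?distribution .
  ?distribution dcat:downloadURL ?downloadURL ;
                dcat:mediaType ?mediaType .
  FILTER(LANG(?title) = "cs")                              # assumed language tag
  FILTER(CONTAINS(LCASE(STR(?title)), "rejstřík"))         # assumed title keyword
  FILTER(CONTAINS(STR(?title), "2021"))
  # keep only the (non-zipped) XML distributions - assumed media-type IRI
  FILTER(STR(?mediaType) = "http://www.iana.org/assignments/media-types/application/xml")
}
```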
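The compaction step can be pictured as a SPARQL CONSTRUCT that flattens the structured ontological representation into direct, LOD-style triples. The sketch below is purely illustrative; the `sgov:` and `ex:` prefixes and all property names are hypothetical placeholders, not the actual vocabulary terms.

```sparql
# Purely illustrative sketch of the compaction idea.
# The prefixes and properties are hypothetical placeholders,
# not the actual Semantic Government Vocabulary terms.
PREFIX ex:   <https://example.org/lod/>
PREFIX sgov: <https://example.org/sgov/>

CONSTRUCT {
  ?company ex:legalName ?name ;
           ex:registrationDate ?date .
}
WHERE {
  # structured name resource collapsed into a single literal
  ?company  sgov:hasName ?nameNode .
  ?nameNode sgov:value   ?name .
  OPTIONAL {
    ?company      sgov:hasRegistration   ?registration .
    ?registration sgov:registrationDate  ?date .
  }
}
```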
This pipeline converts the result of the previous one to the STIRData specification, so far using a single SPARQL query. Then, the mapping to NUTS codes via a Czech cadastre dataset is done using a federated SPARQL query to the Charles University RDF version of the Czech cadastre, resulting in a mapping in RDF TriG. Finally, the results of the transformation are dumped to RDF TriG. In addition, an HDT dump is created and the Linked Data Fragments server is pinged to reload the HDT file. The whole process takes approx. 3 hours to complete.
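Conceptually, the NUTS-mapping step is a federated query of the following shape. This is only a sketch: the SERVICE endpoint IRI and the `ex:`/`ruian:` properties linking registered addresses to NUTS codes are assumptions, not the actual query from the pipeline.

```sparql
# Sketch of the federated NUTS-mapping step - illustrative only.
# The SERVICE endpoint IRI and the ex:/ruian: terms are assumptions.
PREFIX ex:    <https://example.org/ebg/>
PREFIX ruian: <https://example.org/ruian/>

CONSTRUCT {
  ?company ex:nutsCode ?nuts .
}
WHERE {
  ?company ex:registeredAddress ?address .
  ?address ex:addressPlaceCode  ?code .
  SERVICE <https://ruian.linked.opendata.cz/sparql> {   # assumed endpoint IRI
    ?place ruian:code          ?code ;
           ruian:nutsLauRegion ?nuts .
  }
}
```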
An additional dataset from the Czech Statistical Office (see data.gov.cz and data.europa.eu), classifying Czech companies according to CZ-NACE, the Czech extension of the NACE codes, is processed by an additional pipeline that maps Czech companies to CZ-NACE codes. The CZ-NACE codelist itself is also published by the Czech Statistical Office (see data.gov.cz and data.europa.eu) and is processed by another pipeline, resulting in an RDF/SKOS version of the classification (see data.gov.cz and data.europa.eu).
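Conceptually, the result links each company to a SKOS concept of the CZ-NACE classification via its notation. The query below is a hypothetical sketch; apart from the standard SKOS terms, the prefixes and properties are placeholders.

```sparql
# Illustrative sketch only - the ex: prefix and its properties are hypothetical placeholders.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex:   <https://example.org/ebg/>

CONSTRUCT {
  ?company ex:economicActivity ?nace .
}
WHERE {
  # source dataset: company identifier + CZ-NACE notation
  ?row ex:companyId ?ico ;
       ex:czNace    ?notation .
  # company resource from the business registry transformation
  ?company ex:identifier ?ico .
  # SKOS version of the CZ-NACE classification
  ?nace a skos:Concept ;
        skos:notation ?notation .
}
```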
Customizable pipelines are available for loading the produced data into Apache Jena Fuseki and OpenLink Virtuoso.
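As a minimal alternative to the provided loading pipelines, both Apache Jena Fuseki and OpenLink Virtuoso support the standard SPARQL 1.1 Update LOAD operation. The sketch below assumes a hypothetical dump URL and target graph; for the full TriG and HDT dumps, the provided pipelines (or the servers' own bulk loaders) remain the intended route.

```sparql
# Minimal loading sketch using the standard SPARQL 1.1 Update LOAD operation.
# The dump URL and graph IRI below are hypothetical placeholders.
LOAD <https://example.org/dumps/business-registry.ttl>
  INTO GRAPH <https://example.org/graph/business-registry>
```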
The dataset is registered in the STIRData data catalog, the Czech National Open Data Catalog and the Official portal for European data. It is distributed as an RDF TriG dump, an HDT dump, a SPARQL endpoint powered by Apache Jena Fuseki, a SPARQL endpoint powered by OpenLink Virtuoso, and a Linked Data Fragments endpoint.