File Structure

docs

Contains diagrams of our pipelines and the keynote for the bachelor's podium presentation.

evaluation

Contains evaluation results created for our bachelor's theses.

implisense_files

Submodule of the implisense repo containing the source code of the ImpliSense import.

lib

Contains local jars that are used as dependencies and added to the fat jar. Currently it only contains CoreNLP with Michael's model.
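
By sbt convention, jars in lib/ are picked up as unmanaged dependencies, so they end up on the classpath and, through the assembly step, in the fat jar. The following build.sbt line is purely illustrative and just spells out that default; it is not necessarily present in our build files:

```scala
// lib/ is sbt's default location for unmanaged jars; this setting only makes
// the default explicit (illustrative, not taken from this repo's build.sbt).
unmanagedBase := baseDirectory.value / "lib"
```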

luigi_pipeline

Contains the Luigi source code.

mapbox

Contains the Mapbox Python component.

project

Contains sbt-related files (added plugins and the sbt version).
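
As a rough sketch (the plugin and the versions shown are assumptions, not taken from this repo), project/build.properties pins the sbt version and project/plugins.sbt declares the added plugins, e.g. sbt-assembly for building the fat jar:

```scala
// project/build.properties (version shown is only an example):
//   sbt.version=0.13.15

// project/plugins.sbt — plugin and version are illustrative assumptions:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
```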

scripts

Contains scripts used to deploy jobs and automate the CI.

src

The data models live in a model sub-package of each package and are not listed below.

  • main
    • resources: files used in jobs
      • configs: config files for jobs and normalizations
    • scala: source code
      • de/hpi/ingestion: main package
        • curation: curation related jobs (e.g. commit job)
        • dataimport: classes used for imports of every data source
          • dbpedia: DBpedia import and transformations
          • kompass: Kompass import and transformations
          • spiegel: import of Spiegel Online articles
          • wikidata: Wikidata import and transformations
          • wikipedia: import of the Wikipedia
        • datalake: Subject-related classes, the import into the data lake, and the CSV export for Neo4j
        • datamerge: jobs for merging new data sources and connecting relations to master nodes
        • deduplication: classes and jobs for the deduplication
          • blockingschemes: blocking schemes used for the blocking
          • similarity: similarity measures used for the deduplication
        • framework: traits for the SparkJob framework
        • graphxplore: jobs working with the business graph
        • implicits: classes containing self-written implicits (e.g., implicits for collections)
        • textmining: jobs for the Named Entity Linking and Relation Extraction
          • kryo: serializers needed to serialize the trie
          • nel: jobs needed to perform NEL on newspaper articles
          • preprocessing: creation of the knowledge base (i.e. transforming Wikipedia)
          • re: jobs for the Relation Extraction
          • tokenizer: trait for and implementations of tokenizers (see the sketch after this list)
        • versioncontrol: jobs for restoring and diffing versions
  • test
    • resources: files used for the tests
    • scala: test source code
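
The directory layout under src/main/scala mirrors the package structure, and src/test/scala mirrors it for the test code. As a purely hypothetical sketch (the trait and class names are assumptions, not the actual contents of the repo), a tokenizer implementation in the textmining.tokenizer package would look roughly like this:

```scala
// Hypothetical file:
// src/main/scala/de/hpi/ingestion/textmining/tokenizer/WhitespaceTokenizer.scala
package de.hpi.ingestion.textmining.tokenizer

// Assumed trait; the real Tokenizer trait in this package may look different.
trait Tokenizer {
	def tokenize(text: String): List[String]
}

// Minimal implementation used only to illustrate the package-to-path mapping.
class WhitespaceTokenizer extends Tokenizer {
	def tokenize(text: String): List[String] = text.split("\\s+").toList
}
```

Its tests would then live under src/test/scala/de/hpi/ingestion/textmining/tokenizer/, matching the same package path.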