KGrEaT is a framework built to evaluate the performance impact of knowledge graphs (KGs) on multiple downstream tasks. To that end, the framework implements various algorithms to solve tasks like classification, regression, or recommendation of entities. The impact of a given KG is measured by using its information as background knowledge for solving the tasks. To compare the performance of different KGs on downstream tasks, a fixed experimental setup with the KG as the only variable is used.
The hardware requirements of the framework are dominated by the embedding generation step (see the DGL-KE framework for details). To compute embeddings for KGs of the size of DBpedia or YAGO, we recommend using a CPU and having at least 100 GB of RAM. As of now, the datasets are moderate in size and the implemented algorithms are quite efficient; hence, the execution of the tasks does not consume a large amount of resources.
- In the project root, create a conda environment with `conda env create -f environment.yaml`.
- Activate the new environment with `conda activate kgreat`.
- Install dependencies with `poetry install`.
- Make sure that the `kgreat` environment is activated whenever you use the framework!
- Create a new folder under `kg` which will contain all data related to the graph (input files, configuration, intermediate representations, results, logs). Note that the name of the folder serves as the identifier for the graph throughout the framework (see the layout sketch below).
- In the folder of your KG:
  - Create a sub-folder `data` and put the RDF files of the KG in it (supported file types are NT, TTL, and TSV). You may want to create a download script similar to those of the existing KGs.
  - Create a file `config.yaml` with the evaluation configuration of your KG. You can find explanations for all configuration parameters in the `example_config.yaml` file in the root directory.
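The resulting layout might look roughly as follows; this is an illustrative sketch assembled from the steps above and below, with `data/` and `config.yaml` being the parts you create yourself:

```
kg/
└── <your-kg-identifier>/
    ├── config.yaml          # evaluation configuration (see example_config.yaml)
    ├── data/                # RDF input files (NT, TTL, or TSV)
    ├── entity_mapping.tsv   # created later by the prepare action
    └── result/              # created by the framework during evaluation runs
```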
In the following, you will prepare and run the three stages `Mapping`, `Preprocessing`, and `Task`. As the later stages depend on the earlier ones, they must be run in this order.
First, pull the Docker images of all stages. Make sure that your `config.yaml` is already configured correctly, as the manager only pulls images for the steps defined in the config. In the root directory of the project, run the following command:
```bash
python . <your-kg-identifier> pull
```
We then run the `prepare` action, which initializes the files required for the actual stages. In particular, it creates an `entity_mapping.tsv` file that contains the URIs and labels of all entities to be mapped.
```bash
python . <your-kg-identifier> prepare
```
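If you want to inspect the generated file, a quick check with pandas is enough; a minimal sketch, assuming the folder layout shown above and using `my-kg` as a placeholder identifier:

```python
import pandas as pd

# Peek at the mapping file created by the prepare action; the exact columns are
# defined by the framework (URIs and labels of the entities, plus the `source`
# column that is filled by the mappers, see below).
mapping = pd.read_csv("kg/my-kg/entity_mapping.tsv", sep="\t")
print(mapping.columns.tolist())
print(mapping.head())
```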
Then we run the actual stages:
```bash
python . <your-kg-identifier> run
```
The results of the evaluation runs are put into a `result` folder within your KG directory. The framework creates one TSV result file and one log file per task.
You can use the `result_analysis.ipynb` notebook to explore and compare the results of one or more KGs.
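You can also load the result files directly, e.g. to combine them with your own analysis; a minimal sketch, assuming the `result` folder layout described above (no assumptions are made about the columns of the result files):

```python
from pathlib import Path

import pandas as pd

# Collect all per-task result files of one KG ("my-kg" is a placeholder identifier).
result_dir = Path("kg/my-kg/result")
frames = []
for tsv_file in sorted(result_dir.glob("**/*.tsv")):
    df = pd.read_csv(tsv_file, sep="\t")
    df["result_file"] = tsv_file.name  # remember which task produced the rows
    frames.append(df)

results = pd.concat(frames, ignore_index=True)
print(results.head())
```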
If you want to trigger individual stages or steps, you can do so by supplying them as optional arguments. You can trigger steps by supplying the ID of the step as defined in the `config.yaml`. Here are some examples:
Running only the preprocessing stage:
```bash
python . <your-kg-identifier> run --stage preprocessing
```
Running the RDF2vec embedding generation step of the preprocessing stage:
```bash
python . <your-kg-identifier> run --stage preprocessing --step embedding-rdf2vec
```
Running two specific classification tasks (i.e., steps of the `Task` stage):

```bash
python . <your-kg-identifier> run --stage task --step dm-aaup_classification dm-cities_classification
```
Contributions to the framework are highly welcome, and we appreciate pull requests for additional datasets, tasks, matchers, preprocessors, etc.! Here's how you can extend the framework:
To add a dataset for an existing task type, create a folder in the `dataset` directory with at least the following data:

- `Dockerfile`: setup of the Docker container including all relevant preparations (import code, install dependencies, etc.)
- `dataset`: the dataset in a format of your choice; have a look at `shared/dm/utils/dataset.py` for already supported dataset formats
- `entities.tsv`: labels and URIs of the dataset entities that have to be mapped to the input KG
- `README.md`: a file describing the dataset as well as any deviations from the general task API
To run a task using the new dataset, you have to add an entry in your `config.yaml` file where you define an identifier as well as the necessary parameters for your task. Don't forget to update the `example_config.yaml` with information about the new dataset/task!
To define a new task type, add the code to a subfolder below `shared`. If your task type uses Python, you can put it below `shared/dm` and reuse the utility functions in `shared/dm/util`.
The only information a task retrieves is the environment variable `KGREAT_STEP`, which it can use to identify its configuration in the `config.yaml` of the KG.
Results should be written to the `result/run_<run_id>` folder of the KG using the existing format.
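A minimal sketch of how such a task could pick up its configuration and place its output; the KG folder path, the layout of the task section in `config.yaml`, and the result columns are assumptions for illustration only (check the existing tasks under `shared/dm` for the actual conventions):

```python
import os
from pathlib import Path

import yaml

# The step identifier is provided via the KGREAT_STEP environment variable.
step_id = os.environ["KGREAT_STEP"]

# Look up the parameters of this step in the KG's config.yaml.
kg_dir = Path(".")  # placeholder for the KG folder as seen by the task container
with open(kg_dir / "config.yaml") as f:
    config = yaml.safe_load(f)
step_config = config["tasks"][step_id]  # hypothetical config layout

# ... solve the task using step_config ...

# Write the results to the result/run_<run_id> folder using the existing TSV format.
run_id = "<run_id>"  # placeholder; the run id is determined by the framework
result_dir = kg_dir / "result" / f"run_{run_id}"
result_dir.mkdir(parents=True, exist_ok=True)
(result_dir / f"{step_id}.tsv").write_text("metric\tscore\n")  # placeholder result content
```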
To define a new mapper, add the code to a subfolder below `shared/mapping`. The mapper should be self-contained and should define its own `Dockerfile` (see existing mappers for examples).
A mapper should fill gaps in the `source` column of the `entity_mapping.tsv` file in the KG folder (i.e., load the file, fill the gaps, and update the file).
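A minimal sketch of this load/fill/update cycle with pandas; the matching logic is a placeholder, and only the `source` column is taken from this README (see the existing mappers for the actual conventions):

```python
from typing import Optional

import pandas as pd

MAPPING_FILE = "entity_mapping.tsv"  # located in the KG folder

def find_kg_uri(row: pd.Series) -> Optional[str]:
    """Placeholder matching logic: return the KG URI for the given entity, or None."""
    return None  # replace with an actual lookup, e.g. by entity label

# Load the mapping file, fill the gaps in the `source` column, and write it back.
mapping = pd.read_csv(MAPPING_FILE, sep="\t")
unmapped = mapping["source"].isna()
mapping.loc[unmapped, "source"] = mapping[unmapped].apply(find_kg_uri, axis=1)
mapping.to_csv(MAPPING_FILE, sep="\t", index=False)
```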
To use the mapper, add a respective entry to the mapping section of your `config.yaml`.
To define a new preprocessing method, add the code to a subfolder below `shared/preprocessing`. The preprocessing method should be self-contained and should define its own `Dockerfile` (see existing preprocessors for examples).
A preprocessing step can use any data contained in the KG folder and persist artifacts in the same folder. These artifacts may then be used by subsequent preprocessing steps or by tasks.
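For illustration, a preprocessing step can be as simple as the following sketch; the file names and the computed statistic are placeholders, and only the contract of reading from and writing to the KG folder is taken from this README:

```python
from collections import Counter
from pathlib import Path

kg_dir = Path(".")  # placeholder for the KG folder as seen by the container

# Read an input file from the KG folder (hypothetical file name).
triples = (kg_dir / "data" / "graph.nt").read_text().splitlines()

# Compute a simple per-entity statistic, e.g. how often each subject occurs.
subject_counts = Counter(line.split()[0] for line in triples if line.strip())

# Persist the artifact in the KG folder so that later steps and tasks can use it.
(kg_dir / "subject_counts.tsv").write_text(
    "\n".join(f"{s}\t{c}" for s, c in subject_counts.items())
)
```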
To use the preprocessing method, add a respective entry to the preprocessing section of your `config.yaml`.