This repository contains the source code for the paper SAP Signavio Academic Models: A Large Process Model Dataset
by Diana Sola, Christian Warmuth, Bernhard Schäfer, Peyman Badakhshan, Jana-Rebecca Rehse, and Timotheus Kampik.
Link to the paper: https://arxiv.org/abs/2208.12223 (pre-print)
Link to the dataset: https://zenodo.org/record/7012043
The example code in this repository is licensed as follows. Note that a different license applies to the dataset itself!
Copyright (c) 2022 by SAP.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The following license applies to the SAP-SAM dataset.
Copyright (c) 2022 by SAP.
SAP grants to Recipient a non-exclusive copyright license to the Model Collection to use the Model Collection for Non-Commercial Research purposes of evaluating Recipient’s algorithms or other academic research artefacts against the Model Collection. Any rights not explicitly granted herein are reserved to SAP. For the avoidance of doubt, no rights to make derivative works of the Model Collection is granted and the license granted hereunder is for Non-Commercial Research purposes only.
"Model Collection" shall mean all files in the archive (which are JSON, XML, or other representation of business process models or other models).
"Recipient" means any natural person receiving the Model Collection.
"Non-Commercial Research" means research solely for the advancement of knowledge whether by a university or other learning institution and does not include any commercial or other sales objectives.
@misc{SAP-SAM-paper,
doi = {10.48550/ARXIV.2208.12223},
url = {https://arxiv.org/abs/2208.12223},
author = {Sola, Diana and Warmuth, Christian and Schäfer, Bernhard and Badakhshan, Peyman and Rehse, Jana-Rebecca and Kampik, Timotheus},
keywords = {Other Computer Science (cs.OH), Software Engineering (cs.SE), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {SAP Signavio Academic Models: A Large Process Model Dataset},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
or
@dataset{SAP-SAM-dataset,
author = {Kampik, Timotheus and Warmuth, Christian and Sola, Diana and Schäfer, Bernhard and Axworthy, Liz and Ivarsson, Erica and
Ouda, Karim and Eickhoff, David},
title = {SAP Signavio Academic Models},
month = aug,
year = 2022,
publisher = {Zenodo},
version = {0.5.1},
doi = {10.5281/zenodo.6964944},
url = {https://doi.org/10.5281/zenodo.6964944}
}
You need to download the dataset and place it into the folder ./data/raw such that the models are in ./data/raw/sap_sam_2022/models.
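A quick way to confirm the layout before opening the notebooks is sketched below. It relies only on the path stated above and on the model files having a .csv extension; the helper name `find_model_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Path taken from the instructions above, relative to the repository root.
MODELS_DIR = Path("data/raw/sap_sam_2022/models")


def find_model_files(models_dir: Path = MODELS_DIR) -> list[Path]:
    """Return the CSV files in models_dir, sorted by path.

    Raises FileNotFoundError if the folder does not exist, which usually
    means the dataset was extracted to a different location.
    """
    if not models_dir.is_dir():
        raise FileNotFoundError(f"Expected the dataset in {models_dir}")
    return sorted(models_dir.glob("*.csv"))
```

Calling `len(find_model_files())` from the repository root then gives the number of CSV files that were found.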
It is also possible to run the analysis on any .sgx files (Signavio workspace exports). Place the files in ./data/raw/sap_sam_2022/models and the conversion will be performed automatically.
To get started on Mac or Windows, we provide a dependency setup with poetry.
Make sure poetry is installed on your system with poetry --version. If not, run pip install poetry.
To install the dependencies, go to the root of the cloned repository, type this line in the terminal, and press enter:
poetry install
It is important to note that you should have the latest stable version of python or python3 installed on your machine, and not a pre-release one (try python --version). The current latest stable version is 3.12.5 (as of August 2024).
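The pre-release check can also be done from inside the interpreter. The following is a minimal sketch; the function name `is_stable_release` is made up for illustration. It only tells a final release apart from an alpha, beta, or release candidate and does not verify that the version is the latest one.

```python
import sys


def is_stable_release(version_info=sys.version_info) -> bool:
    """True for a final release, False for alpha, beta, or release candidate builds."""
    # sys.version_info.releaselevel is one of "alpha", "beta", "candidate", "final".
    return version_info.releaselevel == "final"


status = "stable" if is_stable_release() else "pre-release"
print(sys.version.split()[0], status)
```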
After executing the command above, you should be able to set up the kernel:
poetry run python -m ipykernel install --user --name=sap-sam-kernel
Then, to open the project, simply type:
poetry run jupyter notebook
Alternatively, a conda setup is possible.
We provide two conda environment.yml files that can be used to create a new environment and install the required dependencies:
- environment.yml: contains the abstract dependencies (pandas, numpy, ...).
- environment-lock.yml: contains versions for all dependencies and the transitive dependencies to ensure reproducible results.
You can use the following conda command to create the environment:
conda env create -f environment.yml
or
conda env create -f environment-lock.yml
We provide a tutorial Jupyter Notebook that illustrates the dataset format in more detail and shows how to use the csv parsers developed in ./src.
The properties Jupyter Notebook gives an overview of selected properties of the dataset.
The SAP-SAM dataset consists of 103 csv files of process models with a total size of roughly 38 GB (see the modeling notations of the models below).
- csv columns:
  - Revision ID: Unique identifier for model revision
  - Model ID: Unique identifier for model
  - Organization ID: Unique identifier for organization this model originates from
  - Datetime: Date and time of creation
  - Model JSON: JSON containing model information
  - Description: Description of model (typically empty)
  - Name: Model name
  - Type: Model type (duplicate and less specific than namespace)
  - Namespace: Stencilset/modeling notation (e.g. BPMN, DMN, UML, ...)
- Number of models: 1,021,471
- Number of models by modeling notation:
Modeling notation | Frequency |
---|---|
BPMN 2.0 | 618,807 |
Value Chain | 194,078 |
DMN 1.0 | 98,286 |
EPC | 32,369 |
BPMN 1.0 | 15,643 |
UML 2.2 Class | 14,953 |
Petri Net | 11,207 |
ArchiMate 2.1 | 10,956 |
UML Use Case | 10,228 |
Organigram | 4,568 |
BPMN 2.0 Choreography | 4,096 |
BPMN 2.0 Conversation | 2,788 |
FMC Block Diagram | 1,398 |
CMMN 1.0 | 999 |
CPN | 385 |
Journey Map | 287 |
YAWL 2.2 | 238 |
Process Documentation Template | 86 |
jBPM 4 | 76 |
XForms | 20 |
Chen Notation | 3 |
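As an illustration of the column layout, the sketch below counts rows per Namespace value using only the Python standard library. It is not the parser shipped in ./src. It assumes that the header labels in the files match the column names listed above exactly; `count_by_namespace` and the sample rows are made up for this example.

```python
import csv
import io
from collections import Counter

# "Model JSON" cells can exceed the csv module's default field limit
# (131072 characters), so raise the limit before reading.
csv.field_size_limit(2**31 - 1)


def count_by_namespace(csv_file) -> Counter:
    """Count rows per "Namespace" value in an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Namespace"] for row in reader)


# Made-up sample with a subset of the columns, for demonstration only.
sample = io.StringIO(
    "Model ID,Name,Namespace\n"
    "m1,Order handling,BPMN 2.0\n"
    "m2,Credit check,DMN 1.0\n"
    "m3,Shipping,BPMN 2.0\n"
)
print(count_by_namespace(sample).most_common())
# [('BPMN 2.0', 2), ('DMN 1.0', 1)]
```

For a file on disk, pass a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.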
In order to remove personal first and last names, emails, or in some cases matriculation numbers (which users have added in non-compliance with the T&Cs), we have applied a simple replacement script. In particular, we have replaced, to the extent possible, emails, names, and (matriculation) numbers with the following dummy values:
Context | Dummy |
---|---|
Email Dummy | jane.doe@dummy.com |
Name Dummy | Jane Doe |
Matriculation/Number Dummy | 12345678 |
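The kind of replacement described above could look roughly like the following sketch. It is not the script that was applied to the dataset: the regular expressions, the digit-length range for numbers, the `known_names` parameter, and the function name `replace_personal_data` are all assumptions made for illustration. Only the three dummy values come from the table.

```python
import re

EMAIL_DUMMY = "jane.doe@dummy.com"
NAME_DUMMY = "Jane Doe"
NUMBER_DUMMY = "12345678"

# Simplified patterns, chosen for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\b\d{6,10}\b")  # assumed length range for matriculation numbers


def replace_personal_data(text: str, known_names: list[str]) -> str:
    """Replace emails, the listed names, and long digit runs with dummy values."""
    # Emails go first so that name parts inside an address are not replaced separately.
    text = EMAIL_RE.sub(EMAIL_DUMMY, text)
    for name in known_names:
        text = re.sub(re.escape(name), NAME_DUMMY, text, flags=re.IGNORECASE)
    return NUMBER_RE.sub(NUMBER_DUMMY, text)


print(replace_personal_data(
    "Contact Max Mustermann (max.mustermann@uni.example, 7654321)",
    ["Max Mustermann"],
))
# Contact Jane Doe (jane.doe@dummy.com, 12345678)
```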
├── data
│ ├── interim <- Intermediate data that has been transformed.
│ └── raw <- The raw dataset should be placed in this folder.
├── notebooks <- Jupyter notebooks.
├── reports
│ └── figures <- Generated graphics and figures used in the paper.
├── src
│ └── sapsam <- Source code and dictionaries for use in this project.
├── LICENSE <- License that applies to the example code in this repository.
├── README.md <- The top-level README for developers using this project.
├── environment-lock.yml <- Contains versions for all dependencies and the transitive dependencies to ensure reproducible results.
├── environment.yml <- Contains the abstract dependencies (pandas, numpy, ...).
└── setup.py <- Makes project pip installable (pip install -e .) such that src can be imported.