This repository contains the source code for a partial reproduction of the VLDB 2023 paper "The LDBC Social Network Benchmark: Business Intelligence Workload" by Szárnyas et al. The paper was published in VLDB's Experiment, Analysis & Benchmark track and presents the LDBC SNB BI workload, an analytical workload targeting data processing systems with graph processing capabilities (e.g. path finding). The paper includes two sets of experiments, both executing the full LDBC SNB BI workload: one set uses the Umbra RDBMS on a single-node setup, the other uses the TigerGraph graph DBMS on a multi-node setup.
In the following, we document the details of the reproduction, including background information and instructions on how to run our scripts. For questions, please reach out to Gábor Szárnyas (gabor.szarnyas@ldbcouncil.org).
The scope of the reproduction covers the experiments performed on the Umbra RDBMS using scale factors 30, 100, and 300. Successful execution of this code is expected to reproduce columns 2–4 of Table 4 (page 9).
The experiments for the TigerGraph graph DBMS are out of scope for this reproduction. We decided to omit them due to their distributed infrastructure setup (using 4-48 AWS EC2 instances) and high execution costs (estimated to be $1,849.97 for the SF10,000 experiments). That said, we would like to highlight that the TigerGraph system's implementation of the LDBC SNB BI workload was submitted to LDBC's official auditing process and it passed the audit successfully in April 2023. The full disclosure reports are available:
- Official LDBC benchmark results for TigerGraph on Scale Factor 1,000
- Official LDBC benchmark results for TigerGraph on Scale Factor 10,000
Note: the official LDBC audits used benchmark setups that differ from the ones used in the paper's experiments. The differences stem from the choice of cloud provider: the audited results were obtained on AWS, while the paper's TigerGraph experiments ran on Google Cloud. While similar instance types were selected in both clouds, there are a number of small differences in the CPU, memory, and disk types, which have an impact on performance.
We estimate the required effort for reproduction as follows. The preparation of the infrastructure takes 30-60 minutes. The execution of the experiments requires a total time of 6 hours. During this time the scripts proceed automatically without the need for user interaction. Examining the outputs and tearing down the infrastructure require an additional 30 minutes.
The data sets used in the experiments can be generated with the LDBC Spark Datagen v0.5.0. To facilitate adoption of the benchmark, we generated these data sets for scale factors up to SF10,000. These are linked in the BI repository and available to the public: the data sets can be downloaded for free (i.e. there are no egress charges) and without the need for authentication. The reproduction scripts use these pre-generated data sets.
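For reference, fetching a pre-generated data set manually is a single unauthenticated HTTP download. The sketch below is purely illustrative: the placeholder URL and the target directory are assumptions, and the actual download links are listed in the LDBC SNB BI repository. The reproduction scripts fetch the data sets automatically, so this step is not required for the reproduction itself.

```bash
# Illustrative sketch only: <DATASET_URL> is a placeholder for one of the
# pre-generated data set links published in the LDBC SNB BI repository.
# No cloud account or authentication is needed for the download.
mkdir -p /data/datasets
wget --no-verbose --directory-prefix=/data/datasets "<DATASET_URL>"
```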
- Start an `m6id.32xlarge` instance on AWS EC2 using the Amazon Machine Image (AMI) Ubuntu 22.04 LTS (64-bit (x86)). The default root volume size (8 GB) is sufficient. Set an SSH key under the "Key pair (login)" section. (For a scripted alternative, see the AWS CLI sketch after this list.)
- Log in to the machine using your key pair.
- Open a `tmux` session and run:

  ```bash
  cd ${HOME}
  git clone https://github.com/ldbc/ldbc-snb-bi-vldb2023-reproduction
  cd ldbc-snb-bi-vldb2023-reproduction
  ./prepare-instance.sh
  ```

  This will install the required Ubuntu packages, create a RAID0 disk, and mount it under `/data`. Then, it will clone the LDBC SNB BI repository (v1.0.3.1).
- Log out and log in again (this is required by Docker).
- Open a `tmux` session and run:

  ```bash
  cd ${HOME}/ldbc-snb-bi-vldb2023-reproduction
  ./prepare-benchmark.sh && ./run-benchmark.sh
  ```

  These scripts will download the required artifacts (Umbra Docker image, data sets), generate the query substitution parameters, and run the benchmark.

  Note: if there are no errors, these scripts will download the artifacts and perform three benchmark runs (SF30, SF100, and SF300) without requiring user interaction.
- The results are saved in the `/data/ldbc_snb_bi/umbra/umbra-results.zip` file:
  - `output/`: query outputs
  - `logs/`: execution logs
  - `scoring/`: benchmark scores

  For each scale factor `${SF}`, the `scoring/runtimes-umbra-sf${SF}.csv` file contains the results included in Table 4 of the paper, from line 1 (power@SF score) to line 42 (n_{throughput batches}). The results are expected to match those given in column 2 (SF30), column 3 (SF100), and column 4 (SF300). (A small sketch for unpacking and inspecting these files is given after this list.)
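As an optional, scripted alternative to the console-based instance setup in the first step above, an instance of the same type can be launched with the AWS CLI. This is only a sketch under stated assumptions: the AMI ID, key pair name, and security group below are placeholders that must be replaced with values from your own account and region.

```bash
# Sketch: launch the benchmark machine with the AWS CLI instead of the console.
# Placeholders (assumptions): <UBUNTU_22_04_AMI_ID>, <YOUR_KEY_PAIR>, <YOUR_SECURITY_GROUP>.
# The default root volume of the AMI (8 GB) is sufficient, so no block device
# mapping is specified here.
aws ec2 run-instances \
  --image-id <UBUNTU_22_04_AMI_ID> \
  --instance-type m6id.32xlarge \
  --key-name <YOUR_KEY_PAIR> \
  --security-group-ids <YOUR_SECURITY_GROUP> \
  --count 1
```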
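To compare the outputs against Table 4 of the paper, the scoring files can be extracted from the result archive and printed directly. The commands below are a minimal sketch that only uses the paths named above and assumes the archive's top-level directories are the ones listed (`output/`, `logs/`, `scoring/`); adjust the paths if the layout differs.

```bash
# Minimal sketch: unpack the result archive and print the scoring files
# that correspond to columns 2-4 of Table 4 in the paper.
cd /data/ldbc_snb_bi/umbra
unzip -o umbra-results.zip -d umbra-results
for SF in 30 100 300; do
  echo "=== SF${SF} ==="
  cat "umbra-results/scoring/runtimes-umbra-sf${SF}.csv"
done
```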