Preprint: https://arxiv.org/abs/2309.03044
Postprint: https://ieeexplore.ieee.org/document/10301266
This artifact contains all data (including the data gathering step), code, and scripts required to rerun the paper's experiments and reproduce its results. The structure of the folders and files is as follows:
The experiments folder contains all scripts and code (specific to this paper) required to re-run the training and testing of our models (including classic models, CodeBERT, ConcatInline, and ConcatCLS). The structure of this folder is:
+-- data (contains the paper's full dataset and the preprocessing script)
|   +-- preprocess.sh (splitting the dataset and scaling values)
+-- dataset (contains a small subset of the dataset after preprocessing, for the getting started section)
+-- models
|   +-- code_metrics (contains code for training and testing our classic models)
|   |   +-- train_test.sh (training and testing the models)
|   +-- code_representation
|   |   +-- codebert
|   |   |   +-- CodeBertModel.py (code for the CodeBERT model)
|   |   |   +-- ConcatInline.py (code for the ConcatInline model)
|   |   |   +-- ConcatCLS.py (code for the ConcatCLS model)
|   |   |   +-- train.sh (script for training the models)
|   |   |   +-- inference.sh (script for testing the models)
|   +-- evaluation
|   |   +-- evaluation.py (evaluation metrics)
+-- utils (constant file)
The data folder contains bugs from the Defects4J and Bugs.jar datasets. It also contains a preprocessing script that unifies bug severity values, scales the source code metrics, and creates the train, val, and test splits.
Running this script with the bash preprocessing.sh command generates 6 files containing the train, val, and test splits in jsonl (compatible with the CodeBERT experiments) and csv (compatible with the source code metrics experiments) formats.
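As a rough illustration of what a preprocessing step of this kind involves, the sketch below unifies severity labels, splits the rows, min-max scales one metric column, and writes each split as both jsonl and csv. It is not the artifact's script: the severity mapping, the field names (severity, loc), the 80/10/10 ratio, and the choice of min-max scaling are all assumptions made for the example.

```python
import csv
import json
import random
from pathlib import Path

# Hypothetical mapping from dataset-specific labels to unified classes.
SEVERITY_MAP = {"blocker": 0, "critical": 0, "major": 1, "minor": 2, "trivial": 3}


def unify_severity(rows):
    """Replace each row's raw severity label with a unified integer class."""
    return [{**r, "severity": SEVERITY_MAP[r["severity"].lower()]} for r in rows]


def split(rows, seed=42, train=0.8, val=0.1):
    """Shuffle deterministically and cut the rows into train/val/test lists."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return {
        "train": rows[:n_train],
        "val": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }


def fit_min_max(rows, column):
    """Return the (min, max) of one numeric column, e.g. from the train split."""
    values = [r[column] for r in rows]
    return min(values), max(values)


def apply_min_max(rows, column, lo, hi):
    """Scale one column with a previously fitted range; constant columns become 0.0."""
    span = hi - lo
    return [{**r, column: (r[column] - lo) / span if span else 0.0} for r in rows]


def write_splits(splits, out_dir):
    """Write every split as <name>.jsonl and <name>.csv (6 files for 3 splits)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for r in rows:
                f.write(json.dumps(r) + "\n")
        with open(out_dir / f"{name}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]) if rows else [])
            writer.writeheader()
            writer.writerows(rows)
```

In this sketch the scaling range is fitted on the train split and then applied to val and test, so that no statistics from the held-out splits leak into training. Whether the artifact's script does the same is not stated here.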
The files in the dataset folder are the data for the getting started section (a small subset of the data). To reproduce the paper's results, copy the files generated in the data folder to the dataset folder, which the model training scripts read from.
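The copy step can be scripted. The following is a minimal sketch that assumes the generated files have names starting with train, val, or test and end in .jsonl or .csv; the actual file names are not listed in this document, so adjust the filter to match what the preprocessing script produces.

```python
import shutil
from pathlib import Path


def copy_generated_splits(data_dir, dataset_dir,
                          prefixes=("train", "val", "test"),  # assumed file-name prefixes
                          suffixes=(".jsonl", ".csv")):
    """Copy matching split files from data_dir into dataset_dir and return their names."""
    data_dir, dataset_dir = Path(data_dir), Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(data_dir.iterdir()):
        if src.is_file() and src.suffix in suffixes and src.stem.startswith(prefixes):
            # copy2 overwrites an existing file of the same name in dataset_dir
            shutil.copy2(src, dataset_dir / src.name)
            copied.append(src.name)
    return copied
```

Calling copy_generated_splits("experiments/data", "experiments/dataset") from the repository root would overwrite any getting started files of the same name, so keep a backup if you want to return to the small example.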
The models folder contains the code and scripts for all of the experiments, including the classic models, the CodeBERT model, ConcatInline, and ConcatCLS.
The data_gathering folder contains all code required to gather the data, including issue scraping, method extraction, and metric extraction. Although this step is outside the paper's scope, these instructions cover the steps needed to regenerate the data. The folder contains many directories and files; the following tree shows only the 3 files that need to be run.
+-- issue_scraper
|   +-- main.py
+-- MetricsExtractor
|   +-- method_extractor
|   |   +-- MethodExtractorMain.java
|   +-- metric_extractor
|   |   +-- MetricCalculatorMain.java
For Getting Started:
- Operating System: The provided artifact is tested on Linux (Ubuntu 20.04.6 LTS) and macOS (Ventura 13.5).
- GPU: A GPU is recommended; running the experiments on CPU may take a long time.
- CPU/RAM: There is no strict minimum on these.
- Python: Python 3 is required.
This section only sets up the artifact and validates its general functionality on a small example dataset (the complete dataset for the classic models, but only the first 50 rows for the CodeBERT models).
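If you want a similarly small subset of your own jsonl file for a quick smoke test, a sketch like the one below keeps only the first N lines. How the subset shipped with the artifact was produced is not stated, so this is only one plausible way to do it; the function name and paths are illustrative.

```python
from itertools import islice


def head_jsonl(src_path, dst_path, n=50):
    """Copy the first n lines (records) of a jsonl file and return how many were written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n):
            dst.write(line)
            count += 1
    return count
```

Because jsonl stores one record per line, slicing lines is enough; no JSON parsing is needed.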
- Clone the repository
git@github.com:EhsanMashhadi/ISSRE2023-BugSeverityPrediction.git
- Install dependencies (using the requirements.txt file) or manually:
pip install pandas==1.4.2
pip install jira
pip install beautifulsoup4
pip install lxml
pip install transformers==4.18.0
pip install torch==1.11.0
This should be enough for running on CPU, but install the following for running on GPU:
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install scikit-learn==1.1.1
pip install xgboost==1.6.1
pip install seaborn==0.11.2
- Add the project root folder to the PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:*/rootpath/you/clone/the/project*/experiments
- e.g.,
export PYTHONPATH=$PYTHONPATH:/Users/ehsan/workspace/ISSRE2023-BugSeverityPrediction/experiments
- RQ1:
cd ISSRE2023-BugSeverityPrediction/experiments/models/code_metrics
bash train_test.sh
- Results are generated in the log folder
- RQ2:
cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
- Set CodeBERT as the model_arch parameter's value in the train.sh and inference.sh files.
- bash train.sh for training the model
- bash inference.sh for evaluating the model with the test split
- Results are generated in the log folder
- RQ3:
cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
- Set ConcatInline or ConcatCLS as the model_arch parameter's value in the train.sh and inference.sh files.
- bash train.sh for training the model
- bash inference.sh for evaluating the model with the test split
- Results are generated in the log folder
- Clone the repository
git@github.com:EhsanMashhadi/ISSRE2023-BugSeverityPrediction.git
- Install dependencies (You may need to change the torch version for running on your GPU/CPU)
- Experiments:
- It is best to install these dependencies in a virtual environment (you can also use requirements.txt)
pip install pandas==1.4.2
pip install jira
pip install beautifulsoup4
pip install lxml
pip install transformers==4.18.0
pip install torch==1.11.0
This should be enough for running on CPU, but install the following for running on GPU:
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install scikit-learn==1.1.1
pip install xgboost==1.6.1
pip install seaborn==0.11.2
- Add the project root folder to the PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:*/rootpath/you/clone/the/project*/experiments
- e.g.,
export PYTHONPATH=$PYTHONPATH:/Users/ehsan/workspace/ISSRE2023-BugSeverityPrediction/experiments
- Running data preprocessing
cd ISSRE2023-BugSeverityPrediction/experiments/data
bash preprocessing.sh
- Copy the generated jsonl and csv files into the dataset folder
cd ISSRE2023-BugSeverityPrediction/experiments/models/code_metrics
bash train_test.sh
- Results are generated in the log folder
cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
- Set CodeBERT as the model_arch parameter's value in the train.sh file
- bash train.sh for training the model
- bash inference.sh for evaluating the model with the test split
- Results are generated in the log folder
cd ISSRE2023-BugSeverityPrediction/experiments/models/code_representation/codebert
- Set ConcatInline or ConcatCLS as the model_arch parameter's value in the train.sh file
- bash train.sh for training the model
- bash inference.sh for evaluating the model with the test split
- Results are generated in the log folder
- You can change or add hyperparameters and configs in the train.sh and inference.sh files.
- Check the CUDA and PyTorch compatibility.
- Assign the correct values for CUDA_VISIBLE_DEVICES, gpu_rank, and world_size based on your GPU numbers in all scripts.
- Run on CPU by removing the gpu_rank and world_size options in all scripts.
- Refer to the CodeBERT repo for common issues.
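Before editing the scripts, it can help to see what PyTorch and CUDA report on your machine. The sketch below is a small diagnostic that uses only standard PyTorch calls and the CUDA_VISIBLE_DEVICES environment variable. Using the number of visible GPUs as a starting value for world_size is an assumption based on common convention; how the artifact's scripts interpret gpu_rank and world_size is not described here.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of ids.

    Returns an empty list when the variable is unset or empty. Note that
    an unset variable means CUDA exposes all devices, not none.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


def torch_cuda_summary():
    """Report the installed PyTorch build and what CUDA devices it can see."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return {"torch": None, "cuda_build": None,
                "cuda_available": False, "gpu_count": 0}
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES ids:", visible_gpu_ids())
    print("PyTorch/CUDA summary:", torch_cuda_summary())
```

If cuda_build is None or cuda_available is False, the installed torch wheel cannot use the GPU, which usually points to a CPU-only build or a driver mismatch.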
The tools below must be installed and configured correctly; otherwise, this step won't work. This step may take a long time and can be skipped (skipping is recommended).
- Java: Java 18 is required (only for running the data gathering step).
- Git: (brew, apt, ... based on your OS)
- SVN: (brew, apt, ... based on your OS)
- Defects4J (Follow all the steps in the provided installation guide).
- Bugs.jar (You must install this in the data_gathering directory).
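A quick way to confirm that the command-line tools are reachable before starting is to look them up on the PATH. The sketch below does only that: it assumes the Defects4J installation put a defects4j executable on the PATH, and it checks neither the Java version nor the Bugs.jar checkout.

```python
import shutil

# Executables expected on PATH; "defects4j" assumes its bin directory was added to PATH.
REQUIRED_TOOLS = ("java", "git", "svn", "defects4j")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None when it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(found):
    """Return the sorted names of tools that could not be located."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    if missing_tools(found):
        print("Missing:", ", ".join(missing_tools(found)))
```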
cd ISSRE2023-BugSeverityPrediction/data_gathering/issue_scraper
python main.py
For the steps below, it can be easier to use gradlew or simply open the project in IntelliJ IDEA to run the Java files.
- cd ISSRE2023-BugSeverityPrediction/data_gathering/MetricsExtractor/src/main/java/software/ehsan/severityprediction/method_extractor
- run MethodExtractorMain.java
- cd ISSRE2023-BugSeverityPrediction/data_gathering/MetricsExtractor/src/main/java/software/ehsan/severityprediction/metric_extractor
- run MetricCalculatorMain.java