All of our files have a tag at the beginning, relating to a section of our project:
- DG: Data Generation - code used to generate and process the data used in our models
- ILP: Inductive Logic Programming - code used to generate ILP rules and background knowledge, and code used to run ILP systems to induce rules from them
- ML: Machine Learning - code used to create any of our other machine learning models (stacking model and neural networks)
Each of these three sections is described in detail below.
Three notebooks must be run on Google Colab in order to have enough RAM. These are:
- ML-adj-matrix-conv-neural-network.ipynb
- ML-adj-feature-matrix-conv-neural-network.ipynb
- ML-evaluating-final-models.ipynb
Some notebooks have a "Colab Setup" section, which sets up the Google Colab environment with the correct source files and data. These cells don't need to (and shouldn't) be run on your own machine. The presence of this section indicates that the notebook requires a large amount of RAM; it should run in a Colab VM with 25 GB of RAM.
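For reference, a "Colab Setup" cell typically does something like the sketch below; the repository URL is a placeholder and the real cells may fetch the source files and data differently.

```python
# Illustrative only: roughly what a "Colab Setup" cell does inside a Colab VM.
# The repository URL below is a placeholder, not this project's actual URL.
import subprocess
from google.colab import drive  # only available inside Google Colab

# Make data stored on Google Drive visible to the VM (if the notebook needs it).
drive.mount("/content/drive")

# Fetch the project's source files and install its dependencies.
subprocess.run(["git", "clone", "https://github.com/example/project.git"], check=True)
subprocess.run(["pip", "install", "-r", "project/requirements.txt"], check=True)
```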
We started with two datasets containing labelled examples of vulnerabilities in C/C++:
- Juliet Software Assurance Dataset :: The initial processing of this dataset was done in the "Exploring Juliet" notebook.
- Draper VDISC Dataset :: The initial processing of this dataset was done in the "Exploring VDISC" notebook.
The result of these two notebooks is a Pandas dataframe for each dataset, each normalised to a similar structure. The resulting files are ../data/juliet_split.csv.gz, data/vdisc_train.csv.gz, data/vdisc_test.csv.gz, and data/vdisc_validate.csv.gz.
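As a quick sanity check, the two dataframes can be loaded and compared directly; the sketch below assumes the paths above are relative to your working directory, and pandas infers the gzip compression from the .gz extension.

```python
import pandas as pd

# Load the normalised dataframes produced by the two exploration notebooks and
# confirm they share a similar structure. Paths are taken from the file names
# above and may need adjusting relative to your working directory.
juliet = pd.read_csv("../data/juliet_split.csv.gz")
vdisc_train = pd.read_csv("data/vdisc_train.csv.gz")

print(juliet.columns.tolist())
print(vdisc_train.columns.tolist())
```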
We then focused on buffer overflow examples in the Juliet dataset. This subset is generated by the ./DG-bug-picking-for-ILP.ipynb notebook and saved to ../data/buffer_overflow_data.csv.gz.
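The notebook's actual selection logic is the authoritative version; as a rough illustration only, the subset could be produced with a filter along these lines, where the column name and CWE labels are assumptions rather than the real schema.

```python
import pandas as pd

# Hypothetical filter: the real logic lives in DG-bug-picking-for-ILP.ipynb,
# and the "cwe" column and its label values are assumed here for illustration.
juliet = pd.read_csv("../data/juliet_split.csv.gz")
buffer_overflow = juliet[juliet["cwe"].isin(["CWE121", "CWE122"])]
buffer_overflow.to_csv("../data/buffer_overflow_data.csv.gz", index=False)
```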
From here onwards, the data processing is split into:
- ./preprocess_code.py prepares the data for our machine learning models. It uses clang to generate an abstract syntax tree for each datapoint, then generates graph (graph2vec) and node (node2vec) embeddings (see the sketch after this list).
- ./DG-generating-adjacency-feature-matrix.ipynb prepares the data for the machine learning models which use the adjacency and feature matrix representations.
- ./DG-generate-minimal-ilp-dataset.ipynb prepares the data for ILP using Joern code property graphs. This uses the joern-cfg-to-prolog.scala script to convert our code property graph into a set of Prolog facts.
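For orientation, the clang-based AST extraction step in preprocess_code.py looks roughly like the sketch below; the function name, compiler flag, and the downstream graph2vec/node2vec calls are illustrative only, and running it requires the clang Python bindings plus a libclang installation.

```python
import clang.cindex

def ast_edges(source: str):
    """Parse a C snippet in memory and return parent->child edges of its AST."""
    index = clang.cindex.Index.create()
    tu = index.parse("snippet.c", args=["-std=c99"],
                     unsaved_files=[("snippet.c", source)])
    edges = []

    def walk(node):
        for child in node.get_children():
            edges.append((node.kind.name, child.kind.name))
            walk(child)

    walk(tu.cursor)
    return edges

# Example: a function with an obvious out-of-bounds write.
print(ast_edges("int main() { char buf[8]; buf[9] = 'x'; return 0; }"))
```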
We started by handcrafting rules and background knowledge for a small set of examples. This work was done using Prolog and the Metagol ILP system. The result of this work can be found in ILP-handcrafted-ilp-rules-for-metagol.pl.
Using the Prolog representation of our ILP dataset generated by ./DG-generate-minimal-ilp-dataset.ipynb, we then generate Progol scripts using a variety of different settings and representations, whilst analysing their effectiveness (a simplified sketch of this step follows the list below):
- ILP-joern-ey-into-progol.ipynb
- ILP-joern-ey-into-progol-tree-tag.ipynb
- ILP-progol-tag-alloc-and-write-nodes.ipynb
- ILP-progol-tag-alloc-and-write-nodes-force.ipynb
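At a high level, these notebooks assemble a single Progol input file from the exported Prolog facts plus mode declarations and examples, then run Progol on it. The sketch below is a heavily simplified stand-in: the file names, the placeholder declarations, and the assumption of a `progol` binary on PATH are all illustrative, not the notebooks' real settings.

```python
import subprocess
from pathlib import Path

# Prolog facts exported by DG-generate-minimal-ilp-dataset.ipynb (hypothetical file name).
background = Path("facts.pl").read_text()

# Placeholders: the real notebooks generate mode declarations, settings and
# positive/negative examples programmatically for each experiment.
mode_decls = "% mode declarations and Progol settings go here\n"
examples = "% positive and negative examples go here\n"

Path("run.pl").write_text(mode_decls + background + examples)

# Assumes a Progol binary called `progol` is available on PATH.
subprocess.run(["progol", "run.pl"], check=True)
```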
We use the graph_visualisation.py script to visualise the rules output by Progol (and Aleph) from the above notebooks (this in turn uses the ILP-joern_cfg_to_dot.scala script).
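If you just want to render one of the resulting DOT files yourself, the graphviz Python package is enough; the file name below is hypothetical, and the real pipeline goes through graph_visualisation.py.

```python
from graphviz import Source

# Render a DOT file produced upstream (file name is a placeholder).
dot = Source.from_file("rule_graph.dot")
dot.render("rule_graph", format="png", cleanup=True)  # writes rule_graph.png
```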
We performed further investigation into ILP systems using the Aleph ILP system. This work was done in the ILP-joern-ey-into-aleph.ipynb notebook.
During this time, we took a reverse-engineering approach to find an ideal Progol rule. This allowed us to ensure our background knowledge was sufficiently expressive. This work was done in the ILP-checking-an-ideal-rule-in-prolog.ipynb notebook.
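The idea boils down to loading the background knowledge together with a candidate rule into a Prolog engine and checking which examples it covers. The sketch below uses the pyswip bindings to SWI-Prolog purely as an illustration; the file names, the vulnerable/1 predicate, and the example identifiers are assumptions, and the notebook itself may query Prolog differently.

```python
from pyswip import Prolog

prolog = Prolog()
prolog.consult("background.pl")       # hypothetical file of Prolog facts
prolog.consult("candidate_rule.pl")   # hypothetical file containing the rule under test

# Placeholder positive-example identifiers.
positives = ["example_1", "example_2"]
covered = [e for e in positives if list(prolog.query(f"vulnerable({e})"))]
print(f"rule covers {len(covered)}/{len(positives)} positive examples")
```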
We developed the following models, in chronological order:
- ML-dense-neural-network-graph2vec.ipynb :: construction of a baseline feed-forward neural network using the graph2vec embedding generated via preprocess_code.py (a minimal sketch of such a network appears after this list).
- ML-conv-neural-network-graph2vec.ipynb :: construction of a convolutional neural network using the graph2vec embedding generated via preprocess_code.py.
- ML-ml-model-comparison-and-stacking-binary.ipynb :: construction of our stacking models.
- ML-adj-matrix-conv-neural-network.ipynb :: construction of a convolutional neural network using the adjacency matrix representation (standard and random padding) generated via DG-generating-adjacency-feature-matrix.ipynb.
- ML-adj-feature-matrix-conv-neural-network.ipynb :: construction of a convolutional neural network using the adjacency and feature matrix representation (standard and random padding) generated via DG-generating-adjacency-feature-matrix.ipynb.
- ML-adj-matrix-visualisation.ipynb :: visualising the adjacency matrix representation of source code.
- ML-evaluating-final-models.ipynb :: evaluation of machine learning models.
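As a point of reference for the first model above, a baseline feed-forward network on graph2vec embeddings looks roughly like the sketch below; the embedding dimension, layer sizes, training settings, and the random placeholder data are all illustrative, not the notebook's actual values.

```python
import numpy as np
import tensorflow as tf

embedding_dim = 128  # assumed graph2vec embedding size

# Small binary classifier: vulnerable vs. not vulnerable.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(embedding_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random placeholder data standing in for graph2vec embeddings and labels.
X = np.random.rand(256, embedding_dim).astype("float32")
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```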
The following notebooks contain experiments with other models, evaluations, and visualisations:
- ML-old-baseline-model-comparison_all_data.ipynb, ML-old-baseline-model-final-all-data.ipynb :: construction of a dense feed-forward neural network on graph2vec embeddings of the entire Juliet dataset, covering multiple bug types.
- ML-visualisation-comparing-model-predictions.m :: t-SNE embeddings of machine learning models' predictions (see the sketch after this list).
- ML-node2vec-naive-model.ipynb :: construction of a dense feed-forward neural network on node2vec embeddings. The node2vec data generation can be found in this repo.
- ML-adj-matrix-dense-neural-network :: construction of a dense feed-forward neural network on the adjacency matrix representation.
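The t-SNE visualisation mentioned above amounts to projecting each model's predictions into 2D and colouring the points by label; the sketch below uses scikit-learn and random placeholder predictions, so only the overall shape of the approach matches the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: per-example outputs from four hypothetical models, plus labels.
predictions = np.random.rand(500, 4)
labels = np.random.randint(0, 2, size=500)

# Project the prediction vectors to 2D with t-SNE and plot them.
points = TSNE(n_components=2, random_state=0).fit_transform(predictions)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="coolwarm", s=8)
plt.title("t-SNE of model predictions (illustrative)")
plt.show()
```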
[Incomplete] In the following notebooks, we experimented with out-of-sample performance on the VDISC dataset: