All of our files are prefixed with a tag indicating which section of the project they belong to:

  • DG: Data Generation - code used to generate and process the data used in our models
  • ILP: Inductive Logic Programming - code used to generate ILP background knowledge and examples, and code used to run ILP systems to learn a rule from them
  • ML: Machine Learning - code used to create any of our other machine learning models (stacking model and neural networks)

Each of these three sections is described in detail below.

Three notebooks must be run on Google Colab in order to have enough RAM; these are the notebooks that contain a "Colab Setup" section.

A "Colab Setup" section sets up the Google Colab environment with the correct source files and data, and doesn't need to (and shouldn't) be run on your own machine. If a notebook contains this section, it requires a large amount of RAM, though it should run in a Colab VM with 25 GB of RAM.
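If you want to confirm that a runtime has enough memory before starting one of these notebooks, a quick check like the following works in Colab (a minimal sketch; it assumes psutil is available, which it is in the default Colab image):

```python
import psutil

# Total RAM available to the current runtime, in gigabytes.
total_gb = psutil.virtual_memory().total / 1024 ** 3
print(f"Runtime RAM: {total_gb:.1f} GB")

# The high-RAM notebooks in this repository expect roughly a 25 GB Colab VM.
if total_gb < 24:
    print("Warning: this runtime may be too small for the Colab-only notebooks.")
```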

Data Generation

We started with two datasets containing labelled examples of vulnerabilities in C/C++:

  1. Juliet Software Assurance Dataset :: The initial processing of this dataset was done in the "Exploring Juliet" notebook.
  2. Draper VDISC Dataset :: The initial processing of this dataset was done in the "Exploring VDISC" notebook.

The result of these two notebooks is a Pandas dataframe for each dataset, each normalised to a similar structure. The resulting files are ../data/juliet_split.csv.gz, data/vdisc_train.csv.gz, data/vdisc_test.csv.gz, and data/vdisc_validate.csv.gz.
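These files can be loaded straight back into pandas, which infers the gzip compression from the .csv.gz extension. The paths below mirror the ones above and assume you are running from the relevant notebook directory; the exact columns are best checked by inspecting the dataframes.

```python
import pandas as pd

# Load the normalised dataframes produced by the exploration notebooks.
juliet = pd.read_csv("../data/juliet_split.csv.gz")
vdisc_train = pd.read_csv("data/vdisc_train.csv.gz")

# Both dataframes are normalised to a similar structure; inspect before further processing.
print(juliet.columns.tolist())
print(vdisc_train.columns.tolist())
```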

We then focused on buffer overflow examples in the Juliet dataset. This subset is generated by the ./DG-bug-picking-for-ILP.ipynb notebook and saved to ../data/buffer_overflow_data.csv.gz.
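Conceptually, this subset extraction is a label filter over the Juliet dataframe. The column name and CWE identifiers below are hypothetical placeholders; see DG-bug-picking-for-ILP.ipynb for the actual selection logic.

```python
import pandas as pd

juliet = pd.read_csv("../data/juliet_split.csv.gz")

# Hypothetical label column and CWE identifiers for buffer overflows;
# the notebook defines the real filtering criteria.
buffer_overflow_cwes = {"CWE121", "CWE122"}
subset = juliet[juliet["cwe"].isin(buffer_overflow_cwes)]

subset.to_csv("../data/buffer_overflow_data.csv.gz", index=False, compression="gzip")
```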

From here onwards, the data processing is split into:

  1. ./preprocess_code.py prepares the data for our machine learning models. It uses clang to generate an abstract syntax tree for each datapoint, and then generates graph (graph2vec) and node (node2vec) embeddings (see the sketch after this list).
  2. ./DG-generating-adjacency-feature-matrix.ipynb prepares the data for the machine learning models which use the adjacency and feature matrix representations.
  3. ./DG-generate-minimal-ilp-dataset.ipynb prepares the data for ILP using Joern code property graphs. This uses the joern-cfg-to-prolog.scala script to convert our code property graph into a set of Prolog facts.
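To illustrate steps 1 and 2, the sketch below parses a small C snippet with the libclang Python bindings, flattens the AST into a networkx graph, and derives both an adjacency matrix and a graph2vec embedding via karateclub. This is a minimal stand-in, not the actual preprocess_code.py or adjacency-matrix notebook; the library choices, node features, and parameters are assumptions.

```python
# Minimal sketch only: library choices and parameters are assumptions, not the
# exact pipeline implemented in preprocess_code.py.
import clang.cindex          # libclang Python bindings (may need Config.set_library_file)
import networkx as nx
from karateclub import Graph2Vec

def ast_to_graph(source: str) -> nx.Graph:
    """Parse a C snippet with clang and flatten its AST into a networkx graph."""
    index = clang.cindex.Index.create()
    tu = index.parse("snippet.c", args=["-std=c99"],
                     unsaved_files=[("snippet.c", source)])
    graph = nx.Graph()

    def walk(cursor, parent_id=None):
        node_id = graph.number_of_nodes()      # consecutive integer ids (required by karateclub)
        graph.add_node(node_id, kind=cursor.kind.name)
        if parent_id is not None:
            graph.add_edge(parent_id, node_id)
        for child in cursor.get_children():
            walk(child, node_id)

    walk(tu.cursor)
    return graph

code = "int main(void) { char buf[8]; buf[9] = 'x'; return 0; }"
graphs = [ast_to_graph(code)]

# Adjacency matrix representation (step 2 above).
adjacency = nx.to_numpy_array(graphs[0])

# Whole-graph embedding (graph2vec); node2vec would instead embed individual AST nodes.
model = Graph2Vec(dimensions=64)
model.fit(graphs)
embeddings = model.get_embedding()             # shape: (num_graphs, 64)
print(adjacency.shape, embeddings.shape)
```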

ILP

We started by handcrafting rules and background knowledge for a small set of examples. This work was done using Prolog and the Metagol ILP system. The result of this work can be found in LP-handcrafted-ilp-rules-for-metagol.pl.

Using the Prolog representation of our ILP dataset generated by ./DG-generate-minimal-ilp-dataset.ipynb, we then generate Progol scripts using a variety of different settings and representations, whilst analysing their effectiveness (a simplified sketch of this input format follows the list below):

  1. ILP-joern-ey-into-progol.ipynb
  2. ILP-joern-ey-into-progol-tree-tag.ipynb
  3. ILP-progol-tag-alloc-and-write-nodes.ipynb
  4. ILP-progol-tag-alloc-and-write-nodes-force.ipynb
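For a concrete feel for what these generated inputs look like, the sketch below writes a toy Progol-style file from Python: mode declarations, background edge facts for a tiny control-flow graph, and positive/negative examples. The predicate names, modes, and example graph are all illustrative assumptions (and the directive terminator differs between Progol, which uses '?', and Aleph, which uses '.'); the real encodings are produced by the notebooks above.

```python
# Illustrative only: predicates, modes, and the toy CFG are assumptions, not the
# encoding actually emitted by the notebooks above.
cfg_edges = {
    "f_bad": [("alloc_1", "write_1")],                          # write reachable from an allocation
    "f_good": [("alloc_2", "check_2"), ("check_2", "write_2")],
}

with open("toy_progol_input.pl", "w") as out:
    # Mode declarations tell Progol which predicates may appear in a learned rule.
    out.write(":- modeh(1, vulnerable(+func))?\n")
    out.write(":- modeb(*, edge(+func, -node, -node))?\n\n")

    # Background knowledge: one edge/3 fact per control-flow edge.
    for func, edges in cfg_edges.items():
        for src, dst in edges:
            out.write(f"edge({func}, {src}, {dst}).\n")

    # Positive example, then a negative example written as a headless clause.
    out.write("\nvulnerable(f_bad).\n")
    out.write(":- vulnerable(f_good).\n")
```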

We use the graph_visualisation.py script to visualise the resulting rules output by Progol (and Aleph) from the above notebooks (this in turn uses the ILP-joern_cfg_to_dot.scala script).
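The general shape of that step is: build or load a graph, export it to Graphviz dot format, and render it. The snippet below is a generic networkx-based sketch (it assumes pydot is installed), not the actual graph_visualisation.py logic.

```python
import networkx as nx

# Toy directed graph standing in for a learned rule / code property graph.
g = nx.DiGraph()
g.add_edge("alloc", "write", label="flows_to")

# Export to dot; render with e.g. `dot -Tpng rule.dot -o rule.png`.
nx.nx_pydot.write_dot(g, "rule.dot")
```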

We performed further investigation into ILP systems using the Aleph ILP system. This work was done in the ILP-joern-ey-into-aleph.ipynb notebook.

During this time, we took a reverse-engineering approach to find an ideal Progol rule. This allowed us to ensure our background knowledge was sufficiently expressive. This work was done in the ILP-checking-an-ideal-rule-in-prolog.ipynb notebook.
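One lightweight way to reproduce that kind of check from Python is via pyswip (SWI-Prolog bindings): assert the background facts and a candidate rule, then query whether the rule covers a positive example. This is an assumption about tooling, not necessarily how the notebook drives Prolog, and the facts and rule below are toy placeholders.

```python
from pyswip import Prolog  # assumes SWI-Prolog and pyswip are installed

prolog = Prolog()

# Toy background fact and a hand-written candidate rule (placeholders only).
prolog.assertz("edge(f_bad, alloc_1, write_1)")
prolog.assertz("vulnerable(F) :- edge(F, _, _)")

# The candidate rule should cover the positive example f_bad.
print(bool(list(prolog.query("vulnerable(f_bad)"))))
```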

ML

We developed the following models, in chronological order:

ML Periphery Experiments

These notebooks contain experiments with other models, evaluation and visualisations.

[Uncompleted] In the following notebooks, we experimented with out-of-sample performance on the VDISC dataset: