
file-type-detection-by-byte-blocks

In this project, we detect file types from the raw bytes that constitute a file. We take the first, body (middle), and last blocks of bytes on disk, so the models are trained on every region a block could be sampled from, and use them to train FFNN, CNN, GRU, and LSTM models. We then make predictions and evaluate the performance of each model. The experimental computer uses an SSD with a block size of 4 KB (4096 bytes), which fixes the length of each sample. The three blocks differ in difficulty: the first and last blocks may contain headers and trailers for certain file types, whereas the body block is harder to classify, as it may lack the distinct patterns often found at the beginning and end of a file.
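As an illustration of the block selection just described, the sketch below extracts the three 4096-byte blocks from a file. The function name, the zero-padding of short files, and the example path are assumptions made for this sketch, not code taken from the project's toolkit.

```python
import pathlib

BLOCK_SIZE = 4096  # one 4 KB SSD block, as described above

def extract_blocks(path: pathlib.Path) -> tuple[bytes, bytes, bytes]:
    """Return the first, body (middle) and last BLOCK_SIZE bytes of a file.

    Files shorter than one block are zero-padded to a fixed length --
    a padding policy assumed here, not taken from the repository.
    """
    data = path.read_bytes()
    first = data[:BLOCK_SIZE]
    body_start = max((len(data) - BLOCK_SIZE) // 2, 0)
    body = data[body_start:body_start + BLOCK_SIZE]
    last = data[-BLOCK_SIZE:]

    def pad(block: bytes) -> bytes:
        return block.ljust(BLOCK_SIZE, b"\x00")

    return pad(first), pad(body), pad(last)

# Hypothetical usage on one dataset file:
first, body, last = extract_blocks(pathlib.Path("govdocs1/000123.pdf"))
print(len(first), len(body), len(last))  # -> 4096 4096 4096
```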

Dataset

The dataset used for this project consists of files that can be downloaded here. Alternatively, the web-scraping script in "toolkit/scrape.py" can be used to download the dataset.

Working directory

  • Data visualisation: A dedicated notebook that manages dataset download and sampling, creates and analyses visualisations, performs feature extraction, and helps interpret data trends, validate assumptions, and communicate insights effectively.
  • Models random search: A dedicated notebook for hyperparameter optimisation using random search, enabling us to efficiently explore a range of parameter values and improve model performance (a minimal sketch of the idea follows this list).
  • Venv: A Python virtual environment used to isolate the dependencies installed for the project, useful for avoiding conflicting library versions.
  • Requirements: A text file listing the required dependencies to install in the project’s virtual environment.
  • HPS results: A folder to save the models’ hyperparameter search results for each model addressed.
  • Toolkit: A Python package developed for the project.
  • Govdocs1: A folder containing the dataset to be used in the project, consisting of files of mixed types.
  • Systems 1-6: Separate notebooks that focus on model training and evaluation, providing a structured approach to experimenting with different algorithms and hyperparameters.
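To give a flavour of the random-search notebook mentioned above, here is a minimal sketch of hyperparameter random search. The search space, the trial count, and the evaluate stub are illustrative assumptions; the actual ranges and training loop live in the "Models random search" notebook.

```python
import random

# Hypothetical search space -- the real ranges are defined in the
# "Models random search" notebook, not here.
SEARCH_SPACE = {
    "hidden_units": [64, 128, 256, 512],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout": [0.0, 0.2, 0.4],
    "batch_size": [32, 64, 128],
}

def sample_config(rng: random.Random) -> dict:
    """Draw one configuration by picking a random value per hyperparameter."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def evaluate(config: dict, rng: random.Random) -> float:
    """Placeholder score. In the project this would build one of the
    FFNN/CNN/GRU/LSTM models with `config`, train it on the byte blocks,
    and return its validation accuracy."""
    return rng.random()

rng = random.Random(42)
best_config, best_score = None, float("-inf")
for _ in range(20):  # number of trials -- an arbitrary choice for the sketch
    config = sample_config(rng)
    score = evaluate(config, rng)
    if score > best_score:
        best_config, best_score = config, score
print(f"best config: {best_config} (score {best_score:.3f})")
```

Random search is a sensible choice here because sampling a fixed number of trials scales to large search spaces where an exhaustive grid would be impractical.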


Best accuracy scores

[Figure: best accuracy scores achieved by each model]