Problem:
Identifying the compiler family and optimization level of a binary is a crucial step in malware analysis and reverse engineering. Extracting this provenance information from binary files supports faster detection of malware.
Methodologies:
- Feature engineering was carried out with the strings utility and the Ndisasm disassembler from the Linux command line.
- Feature selection through ANOVA and Chi-squared was implemented.
- Feature pre-processing, including data balancing and standardization, was applied.
- Logistic Regression, Support Vector Machines (SVM), Multi-Layer Perceptron (MLP), Decision Tree, AdaBoost classifier, Random Forest, and ensemble learning were used for the two classification tasks.
- The optimization-level classification task was also tested with a deep learning model.
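As an illustration of the feature-engineering step in the first bullet, the following is a minimal sketch of how raw features might be derived from a binary and from Ndisasm output. It is not the project's actual code: the function names `extract_strings` and `mnemonic_counts` are hypothetical, the string extraction is a pure-Python approximation of the strings utility (printable ASCII runs of at least 4 characters), and the parser assumes the usual Ndisasm line layout of offset, hex bytes, then instruction.

```python
import re
from collections import Counter

# Printable ASCII runs of at least 4 characters, approximating the default
# behaviour of the `strings` utility (assumption: ASCII only).
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Assumed ndisasm line layout: offset, raw bytes in hex, then the instruction.
_NDISASM_LINE = re.compile(r"^[0-9A-Fa-f]+\s+[0-9A-Fa-f]+\s+(.+)$")

# Prefixes that can precede the real mnemonic (e.g. "rep movsb").
_PREFIXES = {"rep", "repe", "repne", "repz", "repnz", "lock"}


def extract_strings(data: bytes) -> list[str]:
    """Return printable ASCII strings found in raw binary data."""
    return [m.group().decode("ascii") for m in _PRINTABLE_RUN.finditer(data)]


def mnemonic_counts(disassembly: str) -> Counter:
    """Count instruction mnemonics in ndisasm-style text output."""
    counts = Counter()
    for line in disassembly.splitlines():
        match = _NDISASM_LINE.match(line)
        if match is None:
            continue  # blank lines and "-hex" continuation lines are skipped
        tokens = match.group(1).split()
        # Drop leading prefixes so "rep movsb" is counted as "movsb".
        while len(tokens) > 1 and tokens[0] in _PREFIXES:
            tokens = tokens[1:]
        counts[tokens[0]] += 1
    return counts
```

Counts like these, together with string-based features, could then be written out as one row per binary in a CSV file for the later selection and classification steps.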
Results:
The stacking model achieved the best test accuracy for compiler-family classification (100%), and the deep learning model achieved the best test accuracy for optimization-level classification (85.9%).
Dataset:
BinComp compiler fingerprinting dataset: https://github.com/BinSigma/BinComp/tree/master/Dataset
The disassembled and strings CSV files are available upon request.
Mohamed Elahl - Hassan Mohamed - Karim Youssef - Doha ElHady