- Malware detection is the process of determining whether malware is present on a system, or whether a given program is malicious or benign, so that the system can be protected from, or recovered after, the effects of malicious code.
- As the number of legitimate users of the Internet increases, so do the opportunities for cybercriminals to gain from manufacturing malware.
- This prompted the authors of the article we investigated to develop a model that predicts whether a PE file is malicious or benign using deep learning and ensemble learning.
- We implemented their approach and tried to improve slightly on the results, which we eventually managed to do.
- We used the same dataset as the research, which is available on Kaggle.
- The data contains 19611 rows and 79 columns.
- Following the research, we used PCA to reduce the number of columns to 55, the number chosen by the researchers.
- Before applying PCA, we examined the data ourselves and found that 4 columns should be set aside: 'Name', 'Machine', and 'TimeDateStamp', which carry general or insignificant information that should not influence the model, and the target label 'Malware', which must not be reduced.
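
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the synthetic data frame stands in for the Kaggle dataset, the helper name `reduce_features` is hypothetical, and standardizing the features before PCA is an assumption that the text does not state.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Columns set aside before PCA: three uninformative columns plus the target.
EXCLUDED = ["Name", "Machine", "TimeDateStamp", "Malware"]


def reduce_features(df, n_components=55):
    """Drop the excluded columns and project the rest onto n_components."""
    y = df["Malware"].to_numpy()
    X = df.drop(columns=EXCLUDED)
    # Assumption: scale features so that no single column dominates the PCA.
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components, random_state=0)
    return pca.fit_transform(X_scaled), y


# Synthetic stand-in with the same column count as the dataset (79 in total).
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame(
    rng.normal(size=(n_rows, 75)),
    columns=[f"feature_{i}" for i in range(75)],
)
demo.insert(0, "Name", [f"file_{i}.exe" for i in range(n_rows)])
demo["Machine"] = 332
demo["TimeDateStamp"] = rng.integers(0, 2**31, size=n_rows)
demo["Malware"] = rng.integers(0, 2, size=n_rows)

X_reduced, y = reduce_features(demo)
print(X_reduced.shape)  # (200, 55)
```

With the real data, `demo` would be replaced by the data frame loaded from the Kaggle CSV file.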
- The first stage covered classical machine learning models:
  - Gaussian Naïve Bayes
  - Decision Tree
  - Random Forest
  - AdaBoost
  - Gradient Boosting
- The next stage was implementing 3 DL models:
  1. MLP with 1 hidden layer
  2. MLP with 2 hidden layers
  3. 1D CNN
- In the last stage they implemented an ensemble learning model: the 3 DL models above form the first stage, and machine learning models (meta-learners) are trained on top of their outputs as the final stage.

Meta-learner | Ours | Theirs |
---|---|---|
Decision Tree | 0.99981 | 0.989 |
Random Forest | 0.99981 | 0.9924 |
Extra Trees | 0.99981 | 1 |
KNN | 0.98266 | 0.979 |
LDA | 0.97705 | 0.98 |
AdaBoost | 0.97673 | 0.982 |
SVM | 0.97654 | 0.982 |
Logistic | 0.97642 | 0.981 |
SGD | 0.97508 | 0.979 |
Passive | 0.97444 | 0.978 |
Gaussian | 0.97291 | 0.972 |
QDA | 0.96577 | 0.973 |
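
The two-stage ensemble described above can be approximated with scikit-learn's `StackingClassifier`. The sketch below rests on several assumptions: the data is synthetic, the layer sizes are arbitrary, the 1D CNN base model is omitted because scikit-learn does not provide one, and Extra Trees is used as an example meta-learner. `StackingClassifier` trains the meta-learner on cross-validated predictions of the base models, which may differ from the procedure used in the research.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 55 PCA components and the binary label.
X, y = make_classification(
    n_samples=600, n_features=55, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# First stage: two MLPs (the 1D CNN from the text is left out of this sketch).
base_models = [
    ("mlp_1_hidden", MLPClassifier(hidden_layer_sizes=(64,),
                                   max_iter=500, random_state=0)),
    ("mlp_2_hidden", MLPClassifier(hidden_layer_sizes=(64, 32),
                                   max_iter=500, random_state=0)),
]

# Final stage: a classical model trained on the base models' probabilities.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=ExtraTreesClassifier(n_estimators=100, random_state=0),
    stack_method="predict_proba",
    cv=3,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"Stacked accuracy on synthetic data: {accuracy:.3f}")
```

Swapping `final_estimator` for any other classifier in the table gives the corresponding meta-learner variant. The score printed here comes from synthetic data and is not comparable to the results above.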