The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can harm both consumers and enterprises in numerous ways.
With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As part of its strategy, Microsoft challenged the data science community to develop techniques to predict whether a machine will soon be hit with malware, providing Kagglers with an unprecedented dataset to encourage progress in predicting malware occurrences.
The aim of this project is to predict the probability of a Windows machine being infected by various families of malware based on different properties of that machine.
- Users can input machine properties and receive a prediction of potential malware infection.
- Users can view performance metrics of different machine learning models.
- Users can visualize data to gain insights into machine behavior.
- The model should maintain high precision, recall, and F1 scores.
- The web application must be responsive and user-friendly.
- The web application should be secure and ensure data protection.
- Built using Flask and deployed using Docker.
- Minimal deployment costs.
- Integration with external applications or data sources.
- Detailed equipment diagnostics.
The dataset provided by Microsoft contains almost 8 million records with 83 features. Due to compute and memory constraints, a sample of 100,000 records was used for this project, as sketched after the list below. Important features include app version, Windows version, antivirus version, operating system, storage, display, and RAM. The dataset is publicly available on Kaggle.
- The dataset is balanced and includes missing values.
- It is a challenging dataset for machine learning tasks such as classification, clustering, and feature selection.
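Since the full Kaggle file is too large for this setup, only a slice of it is read. A minimal sketch of loading the 100,000-record sample (the file name `train.csv` and the use of `nrows` are assumptions for illustration; `HasDetections` is the target column in the Kaggle data):

```python
import pandas as pd

# Read only the first 100,000 rows of the Kaggle training file to stay
# within memory limits; all 83 columns are read as provided.
df = pd.read_csv("train.csv", nrows=100_000, low_memory=False)

print(df.shape)                                            # roughly (100000, 83)
print(df["HasDetections"].value_counts(normalize=True))    # roughly balanced target
```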
To enhance model performance, the following steps were undertaken:
- Removed features with over 90% missing values or zero variance.
- Engineered features based on App version, Windows version, Anti-virus version, and other machine properties.
- Handled missing data and scaled numerical features using MinMax scaling.
- Used frequency encoding and ordinal encoding for categorical features.
- Split the dataset using a 75/25 ratio for training and testing.
For a detailed description of preprocessing, refer to the [data preprocessing notebook](notebook/Malware%20Detection.ipynb).
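As a minimal sketch of the split, missing-value handling, frequency encoding, and MinMax scaling described above (column names are illustrative; the notebook remains the authoritative reference):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 75/25 train/test split, stratified on the target.
X = df.drop(columns=["HasDetections"])
y = df["HasDetections"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Frequency-encode a high-cardinality categorical column using
# frequencies computed on the training split only.
freq = X_train["AvSigVersion"].value_counts(normalize=True)
X_train["AvSigVersion"] = X_train["AvSigVersion"].map(freq)
X_test["AvSigVersion"] = X_test["AvSigVersion"].map(freq).fillna(0)

# Simple missing-value handling: fill numeric gaps with the training median.
numeric_cols = X_train.select_dtypes(include="number").columns
medians = X_train[numeric_cols].median()
X_train[numeric_cols] = X_train[numeric_cols].fillna(medians)
X_test[numeric_cols] = X_test[numeric_cols].fillna(medians)

# MinMax-scale numeric columns, fitting the scaler on the training split.
scaler = MinMaxScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

Fitting the encoder, imputer, and scaler on the training split only is a precaution against the leakage issue mentioned in the conclusion.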
The project's success will be measured by:
- Precision, recall, and F1 scores.
- Ease of use and responsiveness of the web application.
- Ability to accurately identify Windows systems vulnerable to malware.
- Data Ingestion: Load and divide data into training and testing sets.
- Data Validation: Validate the incoming data and check for data drift.
- Data Preprocessing: Clean, encode, and scale data.
- Model Building: Train multiple models and select the best one using performance metrics.
- Model Evaluation: Analyze precision, recall, F1-score, and AUC.
- Model Pusher: Push the best model into production.
- Web Application: Develop a Flask-based app to handle predictions.
- CI/CD Pipeline: Automate deployment using GitHub Actions for continuous integration and deployment.
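A minimal sketch of the model-building and evaluation steps, comparing several of the candidate classifiers on the held-out split (`X_train`/`X_test` come from the preprocessing sketch above; hyperparameters are illustrative defaults, and SVC is omitted here because `predict_proba` requires `probability=True` and is slow on this sample):

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Keep only numeric columns for this sketch (the full project encodes
# every categorical feature before training).
X_train_num = X_train.select_dtypes(include="number")
X_test_num = X_test.select_dtypes(include="number")

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=200, random_state=42),
    "GaussianNB": GaussianNB(),
    "XGBClassifier": XGBClassifier(eval_metric="logloss"),
    "LGBMClassifier": LGBMClassifier(random_state=42),
}

best_name, best_auc, best_model = None, -1.0, None
for name, model in candidates.items():
    model.fit(X_train_num, y_train)
    proba = model.predict_proba(X_test_num)[:, 1]   # P(HasDetections = 1)
    preds = (proba >= 0.5).astype(int)
    auc = roc_auc_score(y_test, proba)
    print(f"{name}: acc={accuracy_score(y_test, preds):.3f} "
          f"f1={f1_score(y_test, preds):.3f} auc={auc:.3f}")
    if auc > best_auc:
        best_name, best_auc, best_model = name, auc, model

print(f"Best model: {best_name} (AUC={best_auc:.3f})")
```

The accuracies obtained for each candidate model are summarised in the table below.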
Model | Accuracy |
---|---|
LogisticRegression | 0.637 |
DecisionTreeClassifier | 0.569 |
RandomForestClassifier | 0.639 |
GaussianNB | 0.600 |
SVC | 0.638 |
XGBClassifier | 0.632 |
LGBMClassifier | 0.647 |
Model | Accuracy | F1-Score | Concordance | AUC |
---|---|---|---|---|
LGBMClassifier | 0.647 | 0.64 | 0.699 | 0.699 |
The key features identified by the model are given by its feature importances; a minimal sketch of extracting them follows.
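Assuming `best_model` from the comparison sketch above is the fitted `LGBMClassifier`, the ranking can be read directly from the estimator:

```python
import pandas as pd

# Rank features by LightGBM's split-based importance (the estimator default).
importances = pd.Series(
    best_model.feature_importances_, index=X_train_num.columns
).sort_values(ascending=False)

print(importances.head(10))  # ten most influential features
```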
Clone the repository:

```bash
git clone https://github.com/Bhardwaj-Saurabh/Microsoft_malware_detection.git
cd Microsoft_malware_detection
```

Create and activate the environment, then install the dependencies:

```bash
conda create -n env python=3.8
conda activate env
pip install -r requirements.txt
```

Run the pipeline, then launch the web application:

```bash
python main.py
python app.py
```
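`app.py` serves the Flask application. As a minimal sketch of a prediction endpoint (the route name, input handling, and model path are illustrative assumptions, not the project's actual API):

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the serialized best model produced by the training pipeline
# (the pickle path is an assumption for this sketch).
with open("saved_models/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object whose keys match the engineered feature columns.
    features = pd.DataFrame([request.get_json()])
    probability = float(model.predict_proba(features)[:, 1][0])
    return jsonify({"malware_probability": probability})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A JSON POST to `/predict` (for example, `curl -X POST -H "Content-Type: application/json" -d @machine.json http://localhost:5000/predict`) would then return the predicted infection probability.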
This project predicts the probability of a Windows machine getting infected by various families of malware. Key steps involved exploratory data analysis (EDA), feature engineering, model selection, and deploying a Flask-based web application.
One key challenge was handling the high cardinality of categorical features, which led to data leakage. Ways to improve test accuracy include:
- Better imputation of missing values for categorical data.
- Focus on features directly impacting model performance.
- Use larger datasets and experiment with neural networks.
License: MIT License
Author: Saurabh Bhardwaj
Contact: LinkedIn
Special thanks to Microsoft, Windows Defender ATP Research, and Kaggle for providing the data and platform to build this project.
Data Source: The dataset is publicly available on Kaggle.