An end-to-end data science project for malware detection, based on the Microsoft dataset provided on Kaggle.

🚨 Microsoft Malware Detection: Data Science Project 🚨

📝 1. Problem Statement

The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can harm both consumers and enterprises in numerous ways.

With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As part of their strategy, Microsoft challenges the data science community to develop techniques to predict if a machine will soon be hit with malware. Kagglers are provided with an unprecedented dataset to encourage progress in predicting malware occurrences.


🎯 2. Goal

The aim of this project is to predict the probability of a Windows machine being infected by various families of malware based on different properties of that machine.


⚙️ 3. Requirements & Constraints

3.1 Functional Requirements

  • Users can input machine properties and receive a prediction of potential malware infection.
  • Users can view performance metrics of different machine learning models.
  • Users can visualize data to gain insights into machine behavior.

3.2 Non-functional Requirements

  • The model should maintain high precision, recall, and F1 scores.
  • The web application must be responsive and user-friendly.
  • The web application should be secure and ensure data protection.

3.3 Constraints

  • Built using Flask and deployed using Docker.
  • Minimal deployment costs.

3.4 Out-of-scope

  • Integration with external applications or data sources.
  • Detailed equipment diagnostics.

📊 4. Data Description

The dataset provided by Microsoft contains almost 8 million records with 83 features. Because of limited system capacity, only 100,000 records were used for this project. Important features include app version, Windows version, antivirus version, operating system, storage, display, and RAM. The dataset is publicly available on Kaggle.

  • The dataset is balanced and includes missing values.
  • It is considered a challenging dataset for classification, clustering, and feature selection.
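One way to draw a fixed-size subset, such as the 100,000 records mentioned above, without loading the whole file into memory is reservoir sampling. The sketch below is an assumption about how such a subset could be taken, not the method documented in this repository; the function name `sample_rows` and the toy CSV are hypothetical.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=0):
    """Return the CSV header and a uniform random sample of up to k data rows.

    Uses reservoir sampling (Algorithm R), so at most k rows are kept in
    memory no matter how many rows the input contains.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Toy input standing in for the real training file (hypothetical columns).
toy_csv = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))
header, rows = sample_rows(toy_csv, k=50)
print(header, len(rows))
```

For a real file, the same function accepts an open file handle, for example `open(path, newline="")`, in place of the `StringIO` object.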

Key Data Features

[Figures: plots of key data features]


🧹 5. Data Preprocessing

To enhance model performance, the following steps were undertaken:

  1. Removed features with over 90% missing values or zero variance.
  2. Engineered features based on App version, Windows version, Anti-virus version, and other machine properties.
  3. Handled missing data and scaled numerical features using MinMax scaling.
  4. Used frequency encoding and ordinal encoding for categorical features.
  5. Split the dataset using a 75/25 ratio for training and testing.
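Steps 1, 3, and 5 above can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the notebook's code: the helper names and the toy columns (`ram_mb`, `constant`, `mostly_missing`) are hypothetical, and missing values are represented as `None`. In practice, scikit-learn's `MinMaxScaler` and `train_test_split` cover the scaling and splitting steps.

```python
import random


def drop_uninformative(rows, max_missing=0.9):
    """Drop columns with more than max_missing missing values or zero variance."""
    keep = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        # A column with a single distinct value has zero variance.
        if missing_ratio > max_missing or len(set(present)) <= 1:
            continue
        keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]


def min_max_scale(values):
    """Scale numbers to [0, 1]; missing values (None) are left untouched."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    return [None if v is None else ((v - lo) / span if span else 0.0) for v in values]


def train_test_split_rows(rows, test_size=0.25, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records with hypothetical column names.
rows = [
    {"ram_mb": 2048 * (i + 1), "constant": 1, "mostly_missing": None}
    for i in range(20)
]
cleaned = drop_uninformative(rows)
scaled_ram = min_max_scale([r["ram_mb"] for r in cleaned])
train, test = train_test_split_rows(cleaned)
print(list(cleaned[0].keys()), len(train), len(test))
```

When scaling real data, computing the minimum and maximum on the training split only keeps information from the test split out of the features.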

For a detailed description of preprocessing, refer to the [data preprocessing notebook](notebook/Malware%20Detection.ipynb).


📏 6. Performance Metrics

The project's success will be measured by:

  • Precision, recall, and F1 scores.
  • Ease of use and responsiveness of the web application.
  • Ability to accurately identify Windows systems vulnerable to malware.
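For reference, precision, recall, and F1 can be computed directly from true and predicted labels. The sketch below is a minimal illustration with made-up labels, assuming `1` marks an infected machine; scikit-learn's `precision_recall_fscore_support` provides the same metrics for real evaluations.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```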

🛠️ 7. Methodology

The following steps were followed in the MLOps pipeline:

  1. Data Ingestion: Load and divide data into training and testing sets.
  2. Data Validation: Validate the incoming data and check for data drift.
  3. Data Preprocessing: Clean, encode, and scale data.
  4. Model Building: Train multiple models and select the best one using performance metrics.
  5. Model Evaluation: Analyze precision, recall, F1-score, and AUC.
  6. Model Pusher: Push the best model into production.
  7. Web Application: Develop a Flask-based app to handle predictions.
  8. CI/CD Pipeline: Automate deployment using GitHub Actions for continuous integration and deployment.
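As an illustration of the drift check in step 2, one common measure is the population stability index (PSI), which compares how a feature is distributed in the training data and in newly ingested data. This is a sketch under assumptions: the README does not say which drift test the pipeline uses, and the function name and sample values are hypothetical.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two samples of a categorical feature.

    PSI = sum over categories of (a - e) * ln(a / e), where e and a are the
    category shares in the reference and the new sample. Shares are floored
    at eps so that unseen categories do not cause a division by zero.
    """
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for cat in set(expected) | set(actual):
        e = max(exp_counts[cat] / len(expected), eps)
        a = max(act_counts[cat] / len(actual), eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical operating-system labels in the training and incoming data.
reference = ["win10"] * 80 + ["win8"] * 20
incoming = ["win10"] * 50 + ["win8"] * 50
print(round(population_stability_index(reference, incoming), 3))
```

A PSI of zero means the two distributions match; values above about 0.25 are often treated as a significant shift, although the threshold is a rule of thumb rather than a fixed standard.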

🏆 8. Results

| Model | Accuracy |
|---|---|
| LogisticRegression | 0.637 |
| DecisionTreeClassifier | 0.569 |
| RandomForestClassifier | 0.639 |
| GaussianNB | 0.600 |
| SVC | 0.638 |
| XGBClassifier | 0.632 |
| LGBMClassifier | 0.647 |
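The model-selection step can be illustrated with the accuracies listed above. The snippet is a minimal sketch that ranks the models by accuracy only; the actual pipeline may weigh other metrics as well.

```python
# Accuracies copied from the results listed above.
accuracy = {
    "LogisticRegression": 0.637,
    "DecisionTreeClassifier": 0.569,
    "RandomForestClassifier": 0.639,
    "GaussianNB": 0.600,
    "SVC": 0.638,
    "XGBClassifier": 0.632,
    "LGBMClassifier": 0.647,
}

# Rank models from best to worst and pick the top one.
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranking[0]
print(best_model, best_score)  # LGBMClassifier 0.647
```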

8.1 Model Comparison

[Figure: model comparison chart]

8.2 Final Metrics

| Model | Accuracy | F1-Score | Concordance | AUC |
|---|---|---|---|---|
| LGBMClassifier | 0.647 | 0.64 | 0.699 | 0.699 |
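The Concordance and AUC values above are identical because, for a binary classifier, the ROC AUC equals the share of (infected, clean) pairs in which the infected machine receives the higher score, with ties counted as half. The sketch below shows that pairwise definition on made-up scores; it is quadratic in the number of samples, so scikit-learn's `roc_auc_score` is the better choice on real data.

```python
def concordance_auc(y_true, scores):
    """ROC AUC computed as the fraction of concordant positive/negative pairs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A pair is concordant when the positive scores higher; ties count half.
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(concordance_auc(y_true, scores))  # 0.75
```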

8.3 Feature Importance

Key features identified by the model:

[Figure: feature importance plot]


⚙️ 9. Dependencies and Requirements

9.1 Create a Conda environment

conda create -n env python=3.8

9.2 Activate the environment

conda activate env

9.3 Install dependencies

pip install -r requirements.txt

🚀 10. Usage Instructions

10.1 Clone the repository

git clone https://github.com/Bhardwaj-Saurabh/Microsoft_malware_detection.git

10.2 Change the directory

cd Microsoft_malware_detection

10.3 Run the training pipeline

python main.py

10.4 Run the web app

python app.py

🏁 11. Conclusion

This project predicts the probability of a Windows machine getting infected by various families of malware. Key steps involved exploratory data analysis (EDA), feature engineering, model selection, and deploying a Flask-based web application.

One key challenge was handling high-cardinality categorical features, which led to data leakage. Possible ways to improve test accuracy include:

  • Better imputation of missing values in categorical data.
  • Focusing on the features that directly affect model performance.
  • Using larger datasets and experimenting with neural networks.
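One way to limit leakage when encoding high-cardinality categorical features is to learn the encoding from the training split only and then apply it unchanged to the test split. The sketch below shows this for frequency encoding; it is an assumption about a possible fix, not a description of the repository's code, and the function names and version strings are hypothetical.

```python
from collections import Counter


def fit_frequency_encoder(train_values):
    """Learn each category's relative frequency from the training split only."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: n / total for category, n in counts.items()}


def apply_frequency_encoder(mapping, values, default=0.0):
    """Encode values with a fitted mapping; unseen categories get the default."""
    return [mapping.get(v, default) for v in values]


# Hypothetical version strings for the training and test splits.
train_versions = ["1.1.15100.1", "1.1.15100.1", "1.1.15200.1", "1.1.15000.2"]
test_versions = ["1.1.15100.1", "1.1.99999.9"]

encoder = fit_frequency_encoder(train_versions)
print(apply_frequency_encoder(encoder, test_versions))  # [0.5, 0.0]
```

Because the mapping never sees the test rows, categories that appear only in the test split fall back to the default value instead of carrying information from the test data into the features.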

💼 12. License & Author

License: MIT License

Author: Saurabh Bhardwaj

Contact: LinkedIn

🛠️ 13. Acknowledgments

Special thanks to Microsoft, Windows Defender ATP Research, and Kaggle for providing the data and platform to build this project.

Data Source: The dataset is publicly available on Kaggle.
