The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can harm both consumers and enterprises in numerous ways.
With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security. As part of its strategy, Microsoft challenged the data science community to develop techniques to predict whether a machine will soon be hit with malware, providing Kagglers with an unprecedented dataset to encourage progress in predicting malware occurrences.
The aim of this project is to predict the probability of a Windows machine being infected by various families of malware based on different properties of that machine.
- Users can input machine properties and receive a prediction of potential malware infection.
- Users can view performance metrics of different machine learning models.
- Users can visualize data to gain insights into machine behavior.
- The model should maintain high precision, recall, and F1 scores.
- The web application must be responsive and user-friendly.
- The web application should be secure and ensure data protection.
- Built using Flask and deployed using Docker.
- Minimal deployment costs.
- Integration with external applications or data sources.
- Detailed equipment diagnostics.
The dataset provided by Microsoft contains almost 8 million records with 83 features. Due to compute and memory constraints, a sample of 100,000 records was used for this project, as sketched after the list below. Important features include app version, Windows version, antivirus version, operating system, storage, display, and RAM. The dataset is publicly available on Kaggle.
- The dataset is balanced and includes missing values.
- It is a challenging dataset for machine learning tasks such as classification, clustering, and feature selection.
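Since the full Kaggle file is too large for this setup, only a slice of it is read. A minimal sketch of loading the 100,000-record sample (the file name `train.csv` and the use of `nrows` are assumptions for illustration; `HasDetections` is the target column in the Kaggle data):

```python
import pandas as pd

# Read only the first 100,000 rows of the Kaggle training file to stay
# within memory limits; all 83 columns are read as provided.
df = pd.read_csv("train.csv", nrows=100_000, low_memory=False)

print(df.shape)                                            # roughly (100000, 83)
print(df["HasDetections"].value_counts(normalize=True))    # roughly balanced target
```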
To enhance model performance, the following steps were undertaken:
- Removed features with over 90% missing values or zero variance.
- Engineered features based on App version, Windows version, Anti-virus version, and other machine properties.
- Handled missing data and scaled numerical features using MinMax scaling.
- Used frequency encoding and ordinal encoding for categorical features.
- Split the dataset using a 75/25 ratio for training and testing.
For a detailed description of preprocessing, refer to the [data preprocessing notebook](notebook/Malware%20Detection.ipynb).
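As a minimal sketch of the split, missing-value handling, frequency encoding, and MinMax scaling described above (column names are illustrative; the notebook remains the authoritative reference):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 75/25 train/test split, stratified on the target.
X = df.drop(columns=["HasDetections"])
y = df["HasDetections"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Frequency-encode a high-cardinality categorical column using
# frequencies computed on the training split only.
freq = X_train["AvSigVersion"].value_counts(normalize=True)
X_train["AvSigVersion"] = X_train["AvSigVersion"].map(freq)
X_test["AvSigVersion"] = X_test["AvSigVersion"].map(freq).fillna(0)

# Simple missing-value handling: fill numeric gaps with the training median.
numeric_cols = X_train.select_dtypes(include="number").columns
medians = X_train[numeric_cols].median()
X_train[numeric_cols] = X_train[numeric_cols].fillna(medians)
X_test[numeric_cols] = X_test[numeric_cols].fillna(medians)

# MinMax-scale numeric columns, fitting the scaler on the training split.
scaler = MinMaxScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

Fitting the encoder, imputer, and scaler on the training split only is a precaution against the leakage issue mentioned in the conclusion.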
The project's success will be measured by:
- Precision, recall, and F1 scores.
- Ease of use and responsiveness of the web application.
- Ability to accurately identify Windows systems vulnerable to malware.
- Data Ingestion: Load and divide data into training and testing sets.
- Data Validation: Validate the incoming data and check for data drift.
- Data Preprocessing: Clean, encode, and scale data.
- Model Building: Train multiple models and select the best one using performance metrics.
- Model Evaluation: Analyze precision, recall, F1-score, and AUC.
- Model Pusher: Push the best model into production.
- Web Application: Develop a Flask-based app to handle predictions.
- CI/CD Pipeline: Automate deployment using GitHub Actions for continuous integration and deployment.
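A minimal sketch of the model-building and evaluation steps, comparing several of the candidate classifiers on the held-out split (`X_train`/`X_test` come from the preprocessing sketch above; hyperparameters are illustrative defaults, and SVC is omitted here because `predict_proba` requires `probability=True` and is slow on this sample):

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Keep only numeric columns for this sketch (the full project encodes
# every categorical feature before training).
X_train_num = X_train.select_dtypes(include="number")
X_test_num = X_test.select_dtypes(include="number")

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=200, random_state=42),
    "GaussianNB": GaussianNB(),
    "XGBClassifier": XGBClassifier(eval_metric="logloss"),
    "LGBMClassifier": LGBMClassifier(random_state=42),
}

best_name, best_auc, best_model = None, -1.0, None
for name, model in candidates.items():
    model.fit(X_train_num, y_train)
    proba = model.predict_proba(X_test_num)[:, 1]   # P(HasDetections = 1)
    preds = (proba >= 0.5).astype(int)
    auc = roc_auc_score(y_test, proba)
    print(f"{name}: acc={accuracy_score(y_test, preds):.3f} "
          f"f1={f1_score(y_test, preds):.3f} auc={auc:.3f}")
    if auc > best_auc:
        best_name, best_auc, best_model = name, auc, model

print(f"Best model: {best_name} (AUC={best_auc:.3f})")
```

The accuracies obtained for each candidate model are summarised in the table below.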
Model | Accuracy |
---|---|
LogisticRegression | 0.637 |
DecisionTreeClassifier | 0.569 |
RandomForestClassifier | 0.639 |
GaussianNB | 0.600 |
SVC | 0.638 |
XGBClassifier | 0.632 |
LGBMClassifier | 0.647 |
Model | Accuracy | F1-Score | Concordance | AUC |
---|---|---|---|---|
LGBMClassifier | 0.647 | 0.64 | 0.699 | 0.699 |
The key features identified by the model are given by its feature importances; a minimal sketch of extracting them follows.
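Assuming `best_model` from the comparison sketch above is the fitted `LGBMClassifier`, the ranking can be read directly from the estimator:

```python
import pandas as pd

# Rank features by LightGBM's split-based importance (the estimator default).
importances = pd.Series(
    best_model.feature_importances_, index=X_train_num.columns
).sort_values(ascending=False)

print(importances.head(10))  # ten most influential features
```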
Clone the repository:

```bash
git clone https://github.com/Bhardwaj-Saurabh/Microsoft_malware_detection.git
cd Microsoft_malware_detection
```

Create and activate the environment, then install the dependencies:

```bash
conda create -n env python=3.8
conda activate env
pip install -r requirements.txt
```

Run the pipeline, then launch the web application:

```bash
python main.py
python app.py
```
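`app.py` serves the Flask application. As a minimal sketch of a prediction endpoint (the route name, input handling, and model path are illustrative assumptions, not the project's actual API):

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the serialized best model produced by the training pipeline
# (the pickle path is an assumption for this sketch).
with open("saved_models/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object whose keys match the engineered feature columns.
    features = pd.DataFrame([request.get_json()])
    probability = float(model.predict_proba(features)[:, 1][0])
    return jsonify({"malware_probability": probability})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A JSON POST to `/predict` (for example, `curl -X POST -H "Content-Type: application/json" -d @machine.json http://localhost:5000/predict`) would then return the predicted infection probability.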
This project predicts the probability of a Windows machine getting infected by various families of malware. Key steps involved exploratory data analysis (EDA), feature engineering, model selection, and deploying a Flask-based web application.
One key challenge was handling the high cardinality of categorical features, which led to data leakage. Ways to improve test accuracy include:
- Better imputation of missing values for categorical data.
- Focus on features directly impacting model performance.
- Use larger datasets and experiment with neural networks.
License: MIT License
Author: Saurabh Bhardwaj
Contact: LinkedIn
Special thanks to Microsoft, Windows Defender ATP Research, and Kaggle for providing the data and platform to build this project.
Data Source: The dataset is publicly available on Kaggle.