Disease Prediction Project
This machine learning project comes from the Applied Machine Learning course I took in Fall 2020.
The goal is to predict whether or not a patient has a certain unspecified disease. This is a binary classification problem.
The training dataset, provided by the course professor, has 49,000 rows and 12 columns.
Methodology:
I identified potential data quality issues in the dataset, applied data preprocessing techniques to address them, and performed Exploratory Data Analysis (EDA). Where appropriate, I supported the EDA with data visualizations.
I applied a range of machine learning algorithms covered in the course to the training data to construct disease diagnosis models. I also ran extensive model experiments with hyperparameter tuning.
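As an illustration of what a tuning experiment of this kind can look like, here is a minimal sketch using scikit-learn's `GridSearchCV` on a KNN pipeline. It is not the code from the notebooks: the data is synthetic, and the feature count, parameter grid, and split sizes are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the course data (assumption: 11 numeric features, binary label).
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid; the values tried in the actual notebooks may differ.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
val_accuracy = search.score(X_val, y_val)  # accuracy of the refit best model
print(best_params, round(val_accuracy, 3))
```

The same pattern applies to the other algorithms by swapping the final pipeline step and its parameter grid.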
The first Jupyter notebook covers Naive Bayes Classifier (NBC), K-Nearest Neighbors (KNN), linear SVM, non-linear SVM, Random Forest, and Gradient Boosting Machine. The second Jupyter notebook covers Logistic Regression, Artificial Neural Network/Deep Learning, and Decision Tree.
After building the classification models, I applied them to the provided test dataset (Disease Prediction Testing.csv) to predict whether each person in it has the disease.
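The prediction step can be sketched roughly as follows. This is an assumed outline rather than the notebooks' code: the column names (`feature_0` ... `feature_4`, `Disease`) are hypothetical, and small synthetic DataFrames stand in for the real training and testing files so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical column names

# Synthetic stand-in for the training data, with a made-up binary label.
train_df = pd.DataFrame(rng.normal(size=(200, 5)), columns=feature_names)
train_df["Disease"] = (train_df["feature_0"] + train_df["feature_1"] > 0).astype(int)

# Synthetic stand-in for the test data, which has features but no label.
# With the real file this would be: pd.read_csv("Disease Prediction Testing.csv")
test_df = pd.DataFrame(rng.normal(size=(50, 5)), columns=feature_names)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_df[feature_names], train_df["Disease"])

# Attach one predicted label per test row, leaving the original frame untouched.
predictions = test_df.copy()
predictions["Disease"] = model.predict(test_df[feature_names])
print(predictions["Disease"].value_counts())
```

Any preprocessing fitted on the training data (scaling, encoding, imputation) has to be applied to the test data in the same way before calling `predict`.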