In this project, our goal is to predict the onset of diabetes based on diagnostic measures.
Pima Indians Diabetes Database
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
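A minimal sketch of loading the data follows; the file name diabetes.csv and the DataFrame name diabetes_data are assumptions for illustration, not confirmed by this project.

import pandas as pd

# Load the Pima Indians Diabetes dataset (file name is an assumption)
diabetes_data = pd.read_csv("diabetes.csv")
print(diabetes_data.shape)             # (rows, columns)
print(diabetes_data.columns.tolist())  # predictor columns plus 'Outcome'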
Libraries:
- scikit-learn (sklearn)
- Matplotlib
- pandas
- seaborn
- NumPy
- SciPy
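The imputation below assumes the physiologically impossible zeros in these columns were first replaced with NaN on a copy of the DataFrame; a minimal sketch of that step, assuming the diabetes_data name from the loading sketch above:

import numpy as np

# Zeros in these columns are not valid measurements; treat them as missing
diabetes_data_copy = diabetes_data.copy(deep=True)
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data_copy[cols] = diabetes_data_copy[cols].replace(0, np.nan)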
# Impute missing values: mean for roughly symmetric features,
# median for the more skewed ones
diabetes_data_copy['Glucose'].fillna(diabetes_data_copy['Glucose'].mean(), inplace=True)
diabetes_data_copy['BloodPressure'].fillna(diabetes_data_copy['BloodPressure'].mean(), inplace=True)
diabetes_data_copy['SkinThickness'].fillna(diabetes_data_copy['SkinThickness'].median(), inplace=True)
diabetes_data_copy['Insulin'].fillna(diabetes_data_copy['Insulin'].median(), inplace=True)
diabetes_data_copy['BMI'].fillna(diabetes_data_copy['BMI'].median(), inplace=True)
from sklearn.neighbors import KNeighborsClassifier

test_scores = []
for i in range(1, 15):                                # try k = 1..14
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    test_scores.append(knn.score(X_test, y_test))     # accuracy on the test set
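The reported maximum can be recovered from the collected scores; a minimal sketch:

best = max(test_scores)
best_k = [i + 1 for i, s in enumerate(test_scores) if s == best]
print("Max test score {} % and k = {}".format(best * 100, best_k))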
Max test score 76.5625 % at k = [11]
The confusion matrix is a technique for summarizing the performance of a classification algorithm: it tabulates predicted labels against actual labels, so for a binary classifier it is a 2x2 table of true positives, false positives, true negatives, and false negatives.
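A minimal sketch of computing it with scikit-learn, refitting at the best k reported above:

from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors=11)   # best k found above
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))      # rows: actual, columns: predicted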
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the patients the model labeled as diabetic, how many actually are? High precision corresponds to a low false-positive rate. We got a precision of 0.788, which is fairly good.
Precision = TP / (TP + FP)
Recall (sensitivity) is the ratio of correctly predicted positive observations to all observations that are actually positive. The question recall answers is: of all the patients who truly have diabetes, how many did we label as such? As a rough rule of thumb, a recall above 0.5 is considered acceptable.
Recall = TP / (TP + FN)
F1 Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost; if their costs are very different, it is better to look at both Precision and Recall.
F1 Score = 2 × (Recall × Precision) / (Recall + Precision)
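All three metrics can be computed directly, or read off a classification report; a minimal sketch using the y_test and y_pred from above:

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # all of the above, per class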
The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two classes (e.g. whether a patient has the disease or not). A better model distinguishes the two classes accurately, whereas a poor model has difficulty telling them apart.
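A minimal sketch of plotting the ROC curve for the KNN model, assuming the fitted knn and the test split from above:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_proba = knn.predict_proba(X_test)[:, 1]    # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
print("AUC:", roc_auc_score(y_test, y_proba))

plt.plot(fpr, tpr, label='KNN')
plt.plot([0, 1], [0, 1], '--', label='chance')  # diagonal = random guessing
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()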
- Scaling: It is always advisable to bring all the features to the same scale before applying distance-based algorithms like KNN. A feature with a greater range would otherwise overshadow or diminish the smaller features completely, hurting the performance of any distance-based model because variables of higher magnitude receive more weight. (See the sketch after the grid-search output below.)
- Cross Validation: When the data is split into training and testing sets, it is possible that a specific type of data point goes entirely into either the training or the testing portion, which would lead the model to perform poorly. Over-fitting and under-fitting problems can be well avoided with cross-validation techniques. The stratify parameter makes a split so that the proportion of values in the samples produced is the same as the proportion of values in the array passed to stratify. For example, if variable y is a binary categorical variable with 25% zeros and 75% ones, stratify=y will make sure that your random split also has 25% zeros and 75% ones. (The same sketch below shows a stratified split.)
- Hyperparameter Tuning:
Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

parameters_grid = {"n_neighbors": np.arange(1, 50)}  # n_neighbors must be >= 1
knn = KNeighborsClassifier()
knn_GSV = GridSearchCV(knn, param_grid=parameters_grid, cv=5)
knn_GSV.fit(X, y)
print("Best Params:", knn_GSV.best_params_)
print("Best score:", knn_GSV.best_score_)
Best Params {'n_neighbors': 25}
Best score 0.7721840251252015
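To make the scaling and stratification points above concrete, here is a minimal sketch; the test_size and random_state values are illustrative assumptions, not values from this project:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stratified split: class proportions in y are preserved in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

# Fit the scaler on the training data only, then reuse its statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)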
Data Imputation
Handling Outliers
Feature Engineering
Classification Models
Parameter Optimization
Skewness
Scaling / Rescaling the data in ML using scikit-learn
Confusion Matrix
Classification Report
If you have any feedback, please reach out at pradnyapatil671@gmail.com
I am an AI enthusiast and a data science & ML practitioner.