The script starts by importing the necessary libraries (pandas, numpy, seaborn, matplotlib.pyplot) and reading a CSV file into a DataFrame (df).
The dataset is then explored with head() and describe(), and missing values are counted per column with isnull().sum().
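The loading and exploration steps can be sketched roughly as follows. The CSV content and the column names (age, bmi, gender, stroke) are hypothetical stand-ins, since the summary does not name the file or its columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the real file and columns are not given.
csv_text = """age,bmi,gender,stroke
67,36.6,Male,1
61,,Female,0
80,32.5,Male,0
49,34.4,Female,0
"""

# Read the CSV into a DataFrame, as the script does with its own file.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())       # first rows
print(df.describe())   # summary statistics for numeric columns

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```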
One-hot encoding is performed on categorical variables using pd.get_dummies().
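As a minimal illustration of the encoding step, the sketch below applies pd.get_dummies() to a small made-up frame. The gender column is an assumption; the script's actual categorical columns are not listed in this summary.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "age": [67, 61, 80],
        "gender": ["Male", "Female", "Male"],
        "stroke": [1, 0, 0],
    }
)

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["gender"])
print(encoded.columns.tolist())
```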
Missing values are imputed using the k-nearest neighbors algorithm (KNNImputer from sklearn.impute).
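A small sketch of KNN imputation, assuming two numeric columns (age and bmi, both hypothetical) and n_neighbors=2. The script's actual columns and neighbor count are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with one missing bmi value.
df = pd.DataFrame(
    {
        "age": [30.0, 32.0, 60.0, 62.0],
        "bmi": [22.0, np.nan, 30.0, 31.0],
    }
)

# Each missing entry is filled with the mean of that feature over the
# n_neighbors closest rows (distance is computed on the observed features).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

KNNImputer works on numeric input only, which is one reason to run it after the one-hot encoding step.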
Features are scaled using MinMaxScaler, and the dataset is split into training and testing sets.
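The scaling and splitting step might look like the sketch below, on synthetic data. Note one deliberate difference from the order described above: this sketch splits first and fits the scaler on the training portion only, a common precaution against leaking test-set statistics. The test size, stratification, and random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and an imbalanced target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Stratify so both splits keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data, then apply it to both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```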
Several classification models are chosen (KNeighborsClassifier, GaussianNB, DecisionTreeClassifier, and RandomForestClassifier) for initial testing.
Classification reports are generated for each model to evaluate their performance on the imbalanced dataset.
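The baseline evaluation can be sketched as a loop over the four models. The data here is synthetic (make_classification with a roughly 95/5 class ratio), and the default hyperparameters are an assumption; the script's settings are not given in this summary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# Fit each model and collect its per-class precision, recall, and F1.
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    reports[name] = classification_report(y_test, y_pred, zero_division=0)
    print(name)
    print(reports[name])
```

On imbalanced data, the minority-class recall row of each report is usually more informative than overall accuracy.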
The script uses the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority class.
The same models are re-trained and evaluated on the oversampled dataset.
Random under-sampling is performed to balance the class distribution.
The models are re-trained and evaluated on the undersampled dataset.
The SMOTEENN technique, which combines SMOTE and Edited Nearest Neighbours (ENN), is applied.
The models are re-trained and evaluated on the combined dataset.
- The script prints a classification report for each model after each resampling technique.
- It highlights that resampling, particularly SMOTEENN, improves the models' ability to identify stroke-positive cases.