This repository contains implementations of popular machine learning algorithms in Python, written without using built-in libraries.
- Perceptron Learning Algorithm (perceptron.py)
- Adaptive Boosting Algorithm (adaBoost.py)
- K-Means Clustering (kmeans.py)
- K Nearest Neighbour Classifier (knn.py)
- Support Vector Machine Classifier (svm.py)
- Breast_cancer_data.csv
- linearly-separable-dataset.csv
- Dataset: linearly-separable-dataset.csv / Breast_cancer_data.csv
- Mode: erm / kfold (for cross-validation, the number of folds is 10 by default)
- Epochs: Integer variable
- For the linearly-separable dataset, the algorithm terminated on its own after a few epochs. The algorithm is considered converged when the training error (empirical risk) reaches 0; it terminated here because the perceptron only converges on linearly separable datasets.
- The heuristic I used for termination compares the training error over four consecutive epochs: when the error stays constant, training stops (see the sketch after this list).
- For Breast_cancer_data.csv, the algorithm did not converge, i.e. it did not terminate as intended.
- The training error never reached 0, so the experiments indicate that this dataset is not linearly separable, which is why the perceptron does not converge on it.
- The runtime on the linearly-separable dataset is shorter than on Breast_cancer_data.csv: since the latter is not linearly separable, training never terminates early with zero error.
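Below is a minimal sketch of a perceptron trainer with the stopping rules described above (zero training error, or training error constant over four consecutive epochs). The function and variable names are illustrative and are not necessarily those used in perceptron.py.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron with the 'error constant over 4 consecutive epochs' stopping heuristic.

    X: (n_samples, n_features) array, y: labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    error_history = []

    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Misclassified point: move the hyperplane towards it.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi

        # Empirical risk: fraction of misclassified training points.
        err = np.mean(np.sign(X @ w + b) != y)
        error_history.append(err)

        # Converged: zero training error (possible only if the data is linearly separable).
        if err == 0:
            break
        # Heuristic: stop once the error has stayed constant for four consecutive epochs.
        if len(error_history) >= 4 and len(set(error_history[-4:])) == 1:
            break

    return w, b, error_history
```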
- Mode: erm / kfold (for cross-validation, the number of folds is 10 by default) / plot (plots the graph of ERM error and validation error vs. rounds)
- Rounds: Integer variable
- The plot (Figure 1) below clearly shows that the validation loss is at all times higher than the empirical loss.
- This behaviour shows that the training folds yield lower loss than the test folds. Based on the figure, I would recommend 15-20 rounds (epochs) to clearly show the difference between the empirical and validation loss.
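As a rough illustration, the sketch below implements AdaBoost with decision stumps and records the empirical (training) error after each round; evaluating the same ensemble on a held-out fold from a 10-fold split gives the validation curve of Figure 1. The names are illustrative and this is not a copy of adaBoost.py.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump: threshold a single feature and predict +1 / -1."""
    preds = np.ones(X.shape[0])
    preds[polarity * X[:, feature] < polarity * threshold] = -1
    return preds

def ensemble_predict(X, ensemble):
    """Weighted vote of all stumps picked so far."""
    scores = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.sign(scores)

def adaboost(X, y, rounds=20):
    """AdaBoost with stumps; y in {-1, +1}. Returns the ensemble and the per-round ERM error."""
    n, d = X.shape
    weights = np.full(n, 1.0 / n)
    ensemble, erm_errors = [], []

    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best, best_err = None, np.inf
        for feature in range(d):
            for threshold in np.unique(X[:, feature]):
                for polarity in (1, -1):
                    preds = stump_predict(X, feature, threshold, polarity)
                    err = np.sum(weights[preds != y])
                    if err < best_err:
                        best_err, best = err, (feature, threshold, polarity)

        best_err = max(best_err, 1e-10)            # guard against division by zero
        alpha = 0.5 * np.log((1 - best_err) / best_err)
        preds = stump_predict(X, *best)
        weights *= np.exp(-alpha * y * preds)       # boost the misclassified points
        weights /= weights.sum()
        ensemble.append((alpha, best))

        # Empirical risk of the ensemble after this round.
        erm_errors.append(np.mean(ensemble_predict(X, ensemble) != y))

    return ensemble, erm_errors
```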
- dataset: Breast_cancer_data.csv
- distance: 'euclidean' (default), 'cosine', or 'manhattan'
- My k-means algorithm converges when every data point keeps the same cluster assignment for two consecutive epochs. I have also plotted a graph of cluster changes vs. epochs, which shows the number of data points whose cluster assignment differs between two consecutive epochs (see the sketch after this list).
(Plots of cluster changes vs. epochs for the Euclidean, Manhattan, and Cosine distance metrics.)
- I observed that for k=2, i.e. when the dataset is divided into two clusters, the clusters align clearly with the diagnosis label. This is shown in the table below: I calculated the percentage of data points with diagnosis 0 in each cluster, and likewise for diagnosis 1.
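A minimal sketch of the convergence check described above (stop when no point changes its cluster between two consecutive epochs), together with the three supported distance functions, is given below; the names are illustrative rather than taken verbatim from kmeans.py.

```python
import numpy as np

# The three metrics accepted by the `distance` argument.
def euclidean(a, b):
    return np.linalg.norm(a - b)

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def cosine(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

DISTANCES = {'euclidean': euclidean, 'manhattan': manhattan, 'cosine': cosine}

def kmeans(X, k, distance='euclidean', max_epochs=100, seed=0):
    X = np.asarray(X, dtype=float)
    dist = DISTANCES[distance]
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    prev_labels = None
    changes_per_epoch = []   # data for the "cluster changes vs. epochs" plot

    for _ in range(max_epochs):
        # Assign every point to its closest centroid under the chosen metric.
        labels = np.array([np.argmin([dist(x, c) for c in centroids]) for x in X])

        if prev_labels is not None:
            changed = int(np.sum(labels != prev_labels))
            changes_per_epoch.append(changed)
            # Converged: no point changed cluster between two consecutive epochs.
            if changed == 0:
                break
        prev_labels = labels

        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)

    return labels, centroids, changes_per_epoch
```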
Required arguments:
- dataset: Breast_cancer_data.csv
- k: number of neighbours
- epochs: Integer variable
- I normalized the dataset to prevent any feature from biasing the distance computation, and split it into 80% training and 20% testing.
- The dataset is evaluated by predicting the label of each test instance.
- Each test instance is predicted by computing the Euclidean distance between it and every data point in the training set.
- The distances are then sorted in ascending order and the labels of the first k data points are extracted.
- The final label of a test instance is the majority label among the k nearest neighbours extracted in the previous step.
- This is how all the testing instances are evaluated.
- Accuracy is calculated by comparing the predicted label with the actual label of each testing instance.
- I obtained an accuracy of around 89%-93%.
- To generalize the accuracy, I repeated this for a number of epochs and then calculated the average accuracy over those epochs; a sketch of the whole procedure follows this list.
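The sketch below follows the steps just listed: min-max normalization, a random 80/20 split, Euclidean distances to all training points, a majority vote over the k nearest labels, and accuracy averaged over several runs. Names are illustrative, not necessarily those in knn.py.

```python
import numpy as np
from collections import Counter

def knn_accuracy(X, y, k=5, epochs=10, seed=0):
    """Average test accuracy of a k-NN classifier over several random 80/20 splits."""
    # Min-max normalization so that no single feature dominates the distances.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    rng = np.random.default_rng(seed)
    accuracies = []

    for _ in range(epochs):
        # 80% training / 20% testing split.
        idx = rng.permutation(len(X))
        split = int(0.8 * len(X))
        X_train, y_train = X[idx[:split]], y[idx[:split]]
        X_test, y_test = X[idx[split:]], y[idx[split:]]

        correct = 0
        for xi, yi in zip(X_test, y_test):
            # Euclidean distance from the test instance to every training point.
            dists = np.linalg.norm(X_train - xi, axis=1)
            # Labels of the k closest training points, decided by majority vote.
            nearest = y_train[np.argsort(dists)[:k]]
            predicted = Counter(nearest).most_common(1)[0][0]
            correct += int(predicted == yi)

        accuracies.append(correct / len(X_test))

    return float(np.mean(accuracies))
```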
- A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane.
- In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
- In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.
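As a rough illustration of such a classifier implemented without library solvers, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss; this is one common approach and is not necessarily the exact method used in svm.py.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.001, reg=0.01, epochs=1000):
    """Linear SVM via sub-gradient descent on the regularized hinge loss.

    X: (n_samples, n_features), y: labels in {-1, +1}.
    Returns the weights and bias of the separating hyperplane w.x + b = 0.
    """
    w = np.zeros(X.shape[1])
    b = 0.0

    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Points inside the margin (or misclassified) contribute a hinge-loss gradient.
            if yi * (np.dot(w, xi) + b) < 1:
                w -= lr * (2 * reg * w - yi * xi)
                b += lr * yi
            else:
                w -= lr * (2 * reg * w)

    return w, b

def predict(X, w, b):
    """Classify a point by which side of the hyperplane it falls on."""
    return np.sign(X @ w + b)
```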