This repo contains data manipulation and modelling projects that I worked on over the last few months. All analysis was performed in Python 3 (Jupyter Notebook). Below is a brief introduction to each project.
For more information on the individual projects, including some interesting finds from the exploratory analysis, please see the sub-folders. I am also looking to improve the existing code and extend its functionality, so if you have ideas or suggestions for future work, please do let me know!
-
Analysis of the United Kingdom's road safety and traffic demographics dataset, obtained from the UK Traffic Dataset on Kaggle, with the following key goals:
- Identify common factors responsible for higher accident rates through various feature engineering techniques
- Carry out a retrospective study of the historical dataset and perform descriptive analysis (Tableau, Power BI and Excel Power Pivot)
- Attempt to correct an imbalanced target class (SMOTE, Cluster Centroids, Tomek Links)
- Perform hyperparameter tuning using GridSearchCV (scikit-learn) to enhance the predictive power of several supervised learning models (KNN, SVM, Naive Bayes, Logistic Regression, Random Forest, Gradient Boosting)
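
To give a feel for the class-balancing step, here is a minimal sketch of the idea behind SMOTE, written in plain Python so it runs without extra packages. It is not the code used in the notebooks: the function name `smote_oversample`, its parameters and the toy data are all made up for illustration, and it assumes the minority class has at least two points.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, minority_label, k=3, seed=0):
    """Top up the minority class to the majority count with interpolated points.

    Hypothetical helper for illustration; assumes >= 2 minority points.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    counts = Counter(y)
    n_needed = max(counts.values()) - counts[minority_label]

    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        X_new.append(synthetic)
        y_new.append(minority_label)
    return X_new, y_new


# Toy data (invented): 8 majority points (label 0), 3 minority points (label 1).
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
     (0.4, 0.1), (0.2, 0.4), (0.0, 0.3), (0.3, 0.0),
     (2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]
y = [0] * 8 + [1] * 3

X_bal, y_bal = smote_oversample(X, y, minority_label=1, k=2)
print(Counter(y_bal))
```

Each synthetic point sits on the line segment between a minority sample and one of its nearest minority neighbours, so the new points stay inside the region the minority class already occupies. Resampling like this should only be applied to the training split, otherwise synthetic points leak into the evaluation data.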
-
Analyze several thousand tweets, collected in JSON format using Twitter's Streaming API, to perform sentiment analysis and classify them into sub-categories for a more general consensus. The topic for this NLP project was the 106th Grey Cup (#Greycup/#greycup), held in Edmonton in November 2018. Key analytic goals:
- Perform a clean data pull from Twitter and transform the data for analysis in Python (Tweepy)
- Run descriptive and time series analyses for insights (matplotlib with Basemap, Mapboxgl)
- Build predictive models to classify the sentiment of a tweet (Naive Bayes, SVM with linear/polynomial kernels)
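
As a rough illustration of the Naive Bayes approach to tweet sentiment, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing in plain Python. It is a simplified stand-in, not the model from the notebook: the function names `train_naive_bayes` and `predict` and the example tweets are hypothetical, and real tweets would need proper cleaning and tokenization first.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        # log prior + sum of log likelihoods with add-one (Laplace) smoothing
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled tweets, invented for illustration only.
training_tweets = [
    ("what a great game tonight", "positive"),
    ("love the halftime show", "positive"),
    ("great atmosphere in edmonton", "positive"),
    ("terrible call by the referee", "negative"),
    ("so cold and the game was boring", "negative"),
    ("worst halftime show ever", "negative"),
]

model = train_naive_bayes(training_tweets)
print(predict("great show", model))
print(predict("boring and terrible", model))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together, and the smoothing term keeps a single unseen word from driving a class probability to zero.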