This is my capstone project for the Data Science Immersive at General Assembly.
In this project, my goals are:
- Set up a real-time data collection process and data infrastructure
- Examine different natural language processing (NLP) tools on the collected tweets
- Create an A/B testing model for similarity comparison
- Use time series modeling to catch trends
- Tune hyperparameters to improve the models
To test my framework, I:
- Collected and cleaned over 1.5 million tweets using the Twitter Streaming API
  (`/lib/get_tweets.py`)
- Created scheduled and on-demand LSA processing for text vectorization
  (`/ipynb/01_Fit_pipeline_TfiDf_SVD.ipynb`)
- Performed event and trend detection using cosine similarity and ARIMA modeling:
  - Event extraction using TF-IDF and SVD (`/ipynb/03_Tweets_Modeling_CosineSim_AB_Test_SVD.ipynb`)
  - Hashtag time series modeling (`/ipynb/05_Hashtags_Modeling_WhatsTrending.ipynb`)
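Before vectorization, raw tweet text has to be normalized. The sketch below shows the kind of cleaning step involved; the function name and exact rules are illustrative assumptions, not the actual logic in `/lib/get_tweets.py`:

```python
import re

# Hypothetical cleaning rules: lowercase, strip URLs and @mentions,
# keep hashtags, drop other punctuation. The real pipeline may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9#\s]")

def clean_tweet(text: str) -> str:
    """Normalize one tweet's text for downstream vectorization."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())

print(clean_tweet("Check this out @user https://t.co/abc #DataScience!!"))
# check this out #datascience
```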
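The LSA step (TF-IDF followed by SVD) can be sketched as a scikit-learn pipeline. The component count and vectorizer settings here are placeholders, not the values used in `01_Fit_pipeline_TfiDf_SVD.ipynb`:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Minimal LSA sketch: TF-IDF term weighting, then truncated SVD to
# project tweets into a low-dimensional topic space.
lsa = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

docs = [
    "nlp on streaming tweets",
    "tweets about machine learning",
    "time series of hashtag counts",
]
vectors = lsa.fit_transform(docs)
print(vectors.shape)  # (3, 2): one 2-d vector per document
```

Fitting the pipeline once on a reference corpus lets both scheduled and on-demand jobs transform new batches of tweets into the same vector space.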
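The similarity comparison behind the A/B test can be illustrated as follows: score a new batch of tweet vectors against a baseline batch, where a drop in similarity hints at a new event. The vectors below are toy LSA outputs, not real data:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy example: two baseline tweet vectors and two new ones.
baseline = np.array([[1.0, 0.1], [0.9, 0.2]])
new_batch = np.array([[0.1, 1.0], [0.95, 0.15]])

# Pairwise cosine similarities: rows = new vectors, cols = baseline.
sims = cosine_similarity(new_batch, baseline)

# Best baseline match per new vector; a low mean suggests novel content.
mean_sim = sims.max(axis=1).mean()
print(round(mean_sim, 3))
```

Here the first new vector is nearly orthogonal to the baseline (a candidate "event"), while the second closely matches it, so the mean best-match similarity lands in between.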
Tools and techniques used:
- Python
- Twitter Streaming API
- PostgreSQL
- Redis
- NLP (LSA): spaCy | TF-IDF | SVD | CountVectorizer
- Cosine similarity
- ARIMA modeling