Project 2 by Edward Reed, Danny Lee, and Nahshon Williams for the Big Data October 2020 branch.
Requirements:
- Create a Spark Application that processes Twitter data
- Your Project 2 pitch should involve some analysis of Twitter data. This can be the central feature. Your final application should work, to some extent, with both streaming and historical data.
- Send me a link to a git repo, have someone in the group manage git (but they can ask for help from me)
- Produce one or more .jar files for your analysis. Multiple smaller .jar files are preferred.
- Slide deck available at Google Slides
- Bring a simple slide deck providing an overview of your results. You should present your results, a high level overview of the process used to achieve those results, and any assumptions and simplifications you made on the way to those results.
- I may ask you to run an analysis on the day of the presentation, so be prepared to do so.
- We'll have 20 minutes per group, so make sure your presentation can be covered in that time, focusing on the parts of your analysis you find most interesting.
- Include a link to your GitHub repository at the end of your slides
- Apache Spark
- Spark SQL
- YARN
- HDFS/Google Cloud Storage
- Google Dataproc
- Scala 2.12.10
- sbt
- Git + GitHub
- Presentations will take place on Monday, 11/23
Q1: Can we identify patterns in trending topics being popular in one location and then moving to others?
Q2: If so, can we identify separate regions for these patterns (e.g., North American patterns, East Asian patterns)?
Q3: What about global-scale patterns for specific events? For example, the burning of Notre Dame: how did this topic travel according to Twitter usage?
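One way to approach Q1 is to order the locations of a single trend by the first time it appeared in each. The sketch below uses plain Scala collections so it runs without a cluster; the same grouping could be expressed in Spark SQL. The `TrendObject` fields and the `TrendSpread` helper names are hypothetical and chosen for illustration, so the real "trend object" columns may differ.

```scala
// Hypothetical record shape: the real "trend object" columns may differ.
final case class TrendObject(name: String, location: String, epochSeconds: Long)

object TrendSpread {
  /** For one trend name, the first time it was seen in each location, earliest first. */
  def firstSeenByLocation(trends: Seq[TrendObject], name: String): Seq[(String, Long)] =
    trends
      .filter(_.name == name)
      .groupBy(_.location)
      .map { case (location, rows) => location -> rows.map(_.epochSeconds).min }
      .toSeq
      // Sort by time first; break ties by location name so the order is deterministic.
      .sortBy { case (location, firstSeen) => (firstSeen, location) }

  /** Consecutive (from, to, lagSeconds) hops along an ordering produced above. */
  def hops(ordered: Seq[(String, Long)]): Seq[(String, String, Long)] =
    ordered.zip(ordered.drop(1)).map {
      case ((from, t0), (to, t1)) => (from, to, t1 - t0)
    }
}
```

Calling `TrendSpread.hops(TrendSpread.firstSeenByLocation(trends, "#someTrend"))` gives a rough path for one topic. Comparing such paths across many topics is one possible starting point for the regional grouping asked about in Q2.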
- To stream Twitter trend data using Spark Streaming, clone the Project-2 @ 5748ae0 repository.
- To convert Twitter trend API responses to "trend objects" for analysis, clone the transformer-pj2 @ 9359f89 repository.
- To do analysis on the "trend objects", clone this repository and cd into the Simplified-trending directory.
There is a rather large (409K) list of trend objects available as processed_trends_full_dataset.csv if you would like to move straight to analysis. Copy the file into the Simplified-trending directory and designate it as the input file.
Links to sample (unprocessed) data are below.
- Sample data (gDrive): sample-trend-data-11_13-19_20202.tar.gz (823 KB gzipped, 13 MB unzipped)
- Larger sample data set (gDrive): sample-trend-data_larger_response_set.tar.gz (2 MB gzipped, 54 MB unzipped)
- All data consolidated into one file (gDrive): sample_trend_data_whole_set.tar.gz (3.2 MB gzipped, 67 MB unzipped)