Skip to content

Latest commit

 

History

History
50 lines (39 loc) · 3.07 KB

README.md

File metadata and controls

50 lines (39 loc) · 3.07 KB

Revature_project_2

Edward Reed, Danny Lee, and Nahshon Williams project 2 for the Big Data October 2020 branch

Project 2

Requirements:

  • Create a Spark Application that processes Twitter data
  • Your project 2 pitch should involve some analysis of twitter data. This can be the central feature. Your final application should work, to some extent, with both streaming and historical data.
  • Send me a link to a git repo, have someone in the group manage git (but they can ask for help from me)
  • Produce a one or more .jar files for your analysis. Multiple smaller jar files are preferred.

Presentations

  • Slidedeck available at Google Slides
  • Bring a simple slide deck providing an overview of your results. You should present your results, a high level overview of the process used to achieve those results, and any assumptions and simplifications you made on the way to those results.
  • I may ask you to run an analysis on the day of the presentation, so be prepared to do so.
  • We'll have 20 minutes per group, so make sure your presentation can be covered in that time, focusing on the parts of your analysis you find most interesting.
  • Include a link to your github repository at the end of your slides

Technologies

  • Apache Spark
  • Spark SQL
  • YARN
  • HDFS/Google Cloud Storage
  • Google Dataproc
  • Scala 2.12.10
  • Sbt
  • Git + GitHub

Due Date

  • Presentations will take place on Monday, 11/23

Questions

Q1: Can we identify patterns in trending topics being popular in one location and then moving to others?
Q2: If so, can we identify separate regions for these patterns? ex.: North American patterns, East Asian patterns, etc.
Q3: What about global scale patterns for specific events? ex.: Burning of Notre Dame-how did this topic travel according to twitter usage?

Instructions

  • To stream Twitter trend data using Spark streaming, clone he Project-2 @ 5748ae0 repository.
  • To convert Twitter trend API response to "trend objects" for analysis, clone the transformer-pj2 @ 9359f89 repository.
  • To do analysis on the "trend objects", clone this repository and cd into Simplified-trending repository.

There is a rather large (409K) list of trend objects available as processed_trends_full_dataset.csv if you would like to move straight to analaysis. Just copy the file into the Simplified-trending directory and designate it as the input file.

There are links to sample (unprocessed data) below.

Raw data