A trove of reviews, businesses, users, tips, and check-in data!
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the dataset you'll find information about businesses across 11 metropolitan areas in four countries.
Highlights from each part of the exercise are shared below with relevant code snippets and visualizations. For more details, please see the 'Analysis' notebook under the root folder, here.
In this portion, we import the necessary dependencies and load our dataset as a PySpark DataFrame.
Notebook Location
https://e-8n5tqmthd2908buqkudp1rlh2.emrnotebooks-prod.us-east-2.amazonaws.com/e-8N5TQMTHD2908BUQKUDP1RLH2/notebooks/Analysis.ipynb
Initializing the PySpark kernel
%%info
Loading data from S3 bucket
df = spark.read.json('s3://yelp-dataset-tm/*.json')
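Note that the wildcard above reads every JSON file into a single DataFrame. Later modules also reference separate DataFrames for the business, review, and user files (df_2, df_rev, user); a minimal sketch of loading them individually, assuming the standard Yelp dataset file names in the same bucket:

# File names below assume the standard Yelp dataset layout; adjust to match the bucket.
df_2 = spark.read.json('s3://yelp-dataset-tm/yelp_academic_dataset_business.json')
df_rev = spark.read.json('s3://yelp-dataset-tm/yelp_academic_dataset_review.json')
user = spark.read.json('s3://yelp-dataset-tm/yelp_academic_dataset_user.json')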
Listing the installed packages to verify dependencies
sc.list_packages()
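If a package needed later is missing (e.g. matplotlib for the visualization in the final module), EMR Notebooks can install notebook-scoped PyPI packages; a sketch, assuming EMR 5.26 or later:

# Installs a PyPI package into this notebook session's environment (EMR Notebooks feature).
sc.install_pypi_package("matplotlib")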
Overview of the dataset
df.printSchema()
- This module is complete
For this part, we will take a stab at denormalizing the categories associated with each business (there may be more than one, presented as a string of comma-separated identifiers) and then run some basic analysis on the result.
Splitting the comma-separated 'categories' column into one row per category
from pyspark.sql.functions import explode, split

# df_2 holds the business data; emit one row per (business, category) pair.
res = df_2.select(df_2.business_id, explode(split(df_2.categories, ', ')).alias('category'))
Total unique categories
unique = res.select("category").distinct()
unique.count()
Number of businesses per category
res.groupBy("category").count().show()
Top 20 business categories
category = df.select('categories')
individual_category = category.select(explode(split('categories', ', ')).alias('category'))
grouped_category = individual_category.groupBy('category').count()
top_category = grouped_category.sort('count', ascending=False)
top_category.show(20,truncate=False)
- This module is complete
For this next part, we will attempt to answer the question:
- Are the (written) reviews generally more pessimistic or more optimistic compared to the overall business rating?
Calculate skewness
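The join below consumes avg_stars, the per-business mean of the individual review stars. A minimal sketch of computing it, assuming df_rev holds the review data (the aggregation produces the 'avg(stars)' column used in the skewness formula):

from pyspark.sql.functions import avg

# Average of the individual review stars for each business.
avg_stars = df_rev.groupBy('business_id').agg(avg('stars'))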
from pyspark.sql.functions import col

temp1 = df.select('business_id', 'name', 'city', 'state', 'stars')
join = temp1.join(avg_stars, on=['business_id'], how='inner')
# Drop the join key from the result.
cols = [c for c in join.columns if c != 'business_id']
join_res = join[cols]
# Relative difference between the average review stars and the overall business rating.
join_res = join_res.withColumn("skewness", (col("avg(stars)") - col("stars")) / col("stars"))
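Under this definition, a positive skewness means the written reviews average higher than the business's overall star rating (more optimistic), and a negative value means they run more pessimistic.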
- This module is complete
For this final part, we may choose either to answer the question posed or to explore the data in some other manner of our own choosing. The only requirements are:
- We must leverage the users dataset provided
- We must have at least one data visualization as part of our analysis (a plotting sketch follows the elite/non-elite split below)
Joining the Business, User, and Review datasets
# df_rev and user are the review and user DataFrames loaded earlier.
business = df.select('business_id', 'city', 'state', 'stars').withColumnRenamed('stars', 'business_stars')
review = df_rev.select('business_id', 'date', 'review_id', 'user_id', 'stars').withColumnRenamed('stars', 'review_stars')
join_b_r = business.join(review, on=['business_id'], how='inner')
join_b_r_u = join_b_r.join(user, on=['user_id'], how='inner')
Cleaning the data
from pyspark.sql.functions import year, to_date

# Keep the columns of interest and reduce both timestamps to just the year.
final = join_b_r_u.select('user_id', 'business_id', 'review_id', 'name', 'city', 'state', 'business_stars',
                          'review_stars', 'average_stars', 'elite', 'fans', 'review_count',
                          year(to_date(join_b_r_u.date, 'yyyy-MM-dd HH:mm:ss')).alias('review_year'),
                          year(to_date(join_b_r_u.yelping_since, 'yyyy-MM-dd HH:mm:ss')).alias('yelping_since'))
Categorizing elite and non-elite users
non_elite = final.filter(final.elite == '')
elite = final.filter(final.elite != '')
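To cover the visualization requirement, one option is to aggregate in Spark and plot on the driver with matplotlib; a sketch, assuming we compare the two groups' average review stars by year (the column names come from the select above):

import matplotlib.pyplot as plt

# Average review stars per year for each group, collected to pandas for plotting.
elite_pd = elite.groupBy('review_year').avg('review_stars').orderBy('review_year').toPandas()
non_elite_pd = non_elite.groupBy('review_year').avg('review_stars').orderBy('review_year').toPandas()

plt.plot(elite_pd['review_year'], elite_pd['avg(review_stars)'], label='Elite')
plt.plot(non_elite_pd['review_year'], non_elite_pd['avg(review_stars)'], label='Non-elite')
plt.xlabel('Review year')
plt.ylabel('Average review stars')
plt.legend()

In an EMR notebook, the %matplot plt magic may be needed to render the figure on the client side.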
- This module is complete
The goal is to write a Python or Bash script that leverages the Kaggle API module to download the dataset and the AWS boto3 module to upload it to S3. The script must run in a Docker container and should work with anyone's AWS and Kaggle accounts, for any dataset.
All the necessary info is available here.
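A minimal sketch of such a script, assuming credentials come from the standard KAGGLE_USERNAME/KAGGLE_KEY and AWS environment variables (so it works for any account), with the dataset slug and target bucket passed via hypothetical KAGGLE_DATASET and S3_BUCKET variables:

import os

import boto3
from kaggle.api.kaggle_api_extended import KaggleApi

def kaggle_to_s3(dataset, bucket, download_dir='/tmp/data'):
    """Download a Kaggle dataset and upload every file in it to an S3 bucket."""
    api = KaggleApi()
    api.authenticate()  # reads KAGGLE_USERNAME / KAGGLE_KEY (or ~/.kaggle/kaggle.json)
    api.dataset_download_files(dataset, path=download_dir, unzip=True)

    s3 = boto3.client('s3')  # picks up AWS credentials from env vars or an IAM role
    for root, _, files in os.walk(download_dir):
        for name in files:
            path = os.path.join(root, name)
            s3.upload_file(path, bucket, os.path.relpath(path, download_dir))

if __name__ == '__main__':
    kaggle_to_s3(os.environ['KAGGLE_DATASET'], os.environ['S3_BUCKET'])

Packaged in a Docker image with the kaggle and boto3 packages installed, the same script runs unchanged for any pair of accounts.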
All the additional info about the project (the tools used, the servers required, system configuration, references, etc.) is included in this section.
System Specification:
Operating System: Windows 10
RAM: 16 GB
Storage: 500 GB
Tools Used:
Programming Language: Python (Version 3.7)
Editor: Jupyter Notebook
Platform: Amazon EMR (Elastic MapReduce)
Services Commissioned:
Cloud Platform: Amazon Web Services (AWS)
Framework: Apache Spark (PySpark)
Version Control System: Git
Offline:
Classroom: Room 10-155
Timing: Friday, 1800 to 2100 hours
Address: Baruch Vertical Campus,
55 Lexington Avenue,
New York, NY, USA
Online:
Slack: STA9760
Members: 53
Prof. Taqqui Karim
Subject: 9760 - Big Data Technologies
Session: Spring 2020