Spark program for processing data from the TMDB dataset in Scala.
The idea of this project is to experiment with join operations on DataFrames in Spark and to try a different method of loading CSV files. The previous project loaded the data through an RDD; this one reads the CSV files directly with Spark's DataFrame reader.
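The contrast between the two loading styles can be sketched as follows. This is a minimal illustration, not the project's actual code: the object name `CsvLoadSketch` is hypothetical, and the file path is taken from the example command further down.

```scala
import org.apache.spark.sql.SparkSession

object CsvLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-load-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    // Previous project's style: load raw lines as an RDD and parse by hand.
    // val lines = spark.sparkContext.textFile("/tmp/csv/tmdb_5000_movies.csv")

    // This project's style: let the DataFrame reader parse the CSV directly.
    val moviesDf = spark.read
      .format("csv")
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark guess the column types
      .load("/tmp/csv/tmdb_5000_movies.csv")

    moviesDf.printSchema()
    spark.stop()
  }
}
```

The DataFrame reader gives typed, named columns for free, which is what makes the joins below straightforward.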
This program loads two CSV files obtained from Kaggle, tmdb_5000_credits.csv and tmdb_5000_movies.csv, and performs a join between the two data sets.
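A hedged sketch of that join: in the Kaggle files, the movies CSV carries an `id` column and the credits CSV carries a matching `movie_id` column, so the join condition pairs those two. The object name is hypothetical and the paths come from the example command below.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-join-sketch")
      .master("local[*]")
      .getOrCreate()

    val movies  = spark.read.option("header", "true").csv("/tmp/csv/tmdb_5000_movies.csv")
    val credits = spark.read.option("header", "true").csv("/tmp/csv/tmdb_5000_credits.csv")

    // Join the two data sets on the movie identifier.
    val joined = movies.join(credits, movies("id") === credits("movie_id"))

    // Both files have a "title" column, so qualify it to avoid ambiguity.
    joined.select(movies("title"), credits("cast")).show(5, truncate = false)
    spark.stop()
  }
}
```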
Another challenge with this data set is that it mixes CSV and JSON formats: several CSV columns contain embedded JSON, which requires some special functions to load and handle.
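For example, the `cast` column of the credits file is a JSON array of objects. One common way to handle this in Spark is `from_json` with an explicit schema, then `explode` to get one row per cast member. This is a sketch of that technique, not necessarily the exact functions the project uses; only the fields named in the schema are extracted, the rest of each JSON object is ignored.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object JsonColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-json-sketch")
      .master("local[*]")
      .getOrCreate()

    val credits = spark.read.option("header", "true").csv("/tmp/csv/tmdb_5000_credits.csv")

    // The "cast" column holds a JSON array of objects such as
    // [{"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington", ...}, ...]
    val castSchema = ArrayType(new StructType()
      .add("name", StringType)
      .add("character", StringType))

    val casting = credits
      .withColumn("cast_parsed", from_json(col("cast"), castSchema))
      .withColumn("member", explode(col("cast_parsed")))
      .select(col("movie_id"), col("member.name"), col("member.character"))

    casting.show(10, truncate = false)
    spark.stop()
  }
}
```

Depending on how the quotes inside the JSON are escaped, the CSV reader may also need extra options (for example `multiLine` or `escape`); that detail is dataset-specific.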
To build and run it, download the dataset from the Kaggle web site (see the References section below), extract both CSV files into the project's root folder, and then issue the command below:
$ sbt "run --s3-source-bucket s3-bucket-here --s3-source-key prefix/tmdb-5000-movie-dataset.zip --source /tmp/csv --destination /temp"
where:
- --source is the folder where the CSV files are located.
- --destination is the folder where the generated files will be persisted.
- --s3-source-bucket is the bucket where your file is located.
- --s3-source-key is the zip file key name.
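Options of this `--name value` form can be collected into a map with a few lines of plain Scala. This is a hypothetical sketch, not the project's actual parser:

```scala
// Parse "--name value" pairs into a Map("name" -> "value").
object ArgParser {
  def parse(args: Array[String]): Map[String, String] =
    args.sliding(2, 2).collect {
      case Array(key, value) if key.startsWith("--") =>
        key.stripPrefix("--") -> value
    }.toMap

  def main(args: Array[String]): Unit = {
    val opts = parse(Array(
      "--s3-source-bucket", "s3-bucket-here",
      "--source", "/tmp/csv",
      "--destination", "/temp"))
    println(opts("source"))      // prints /tmp/csv
    println(opts("destination")) // prints /temp
  }
}
```

`sliding(2, 2)` walks the argument array in non-overlapping pairs, and the `collect` silently drops anything that is not a `--key value` pair; a real parser would likely report such leftovers as errors.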
After the program execution, the following folders are created:
- single_value_df: this contains a CSV file with single values extracted from the movies data set.
- sorted_movies_budget: contains a CSV file with movies sorted by budget.
- sorted_movies_revenue: contains a CSV file with movies sorted by revenue.
- sorted_movies_vote_avg: contains a CSV file with movies sorted by average vote.
- top10_casting_movie_revenue: contains a JSON file with casting names from top 10 movies by revenue ("most profitable casting").
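Each of those folders is what Spark produces when a DataFrame is written out to a path. A minimal sketch of two of the writes, assuming the movies DataFrame has already been loaded (folder names are taken from the list above; the casting extraction for the top-10 output is omitted here for brevity):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

object OutputSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-output-sketch")
      .master("local[*]")
      .getOrCreate()

    val movies = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/csv/tmdb_5000_movies.csv")

    val dest = "/temp" // the --destination folder

    // Spark writes each result as a folder containing part files.
    movies.orderBy(desc("budget"))
      .write.option("header", "true").csv(s"$dest/sorted_movies_budget")

    // Top 10 movies by revenue, persisted as JSON.
    movies.orderBy(desc("revenue")).limit(10)
      .select(col("title"), col("revenue"))
      .write.json(s"$dest/top10_casting_movie_revenue")

    spark.stop()
  }
}
```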
All modifications developed for this project are listed in CHANGELOG.md.