Spark program for processing data from the TMDB dataset in Scala

andersonkmi/kaggle-tmdb-movie-dataset-spark

Kaggle TMDB Movie data exploration in Spark

Spark program for processing data from the TMDB dataset in Scala.

Introduction

The idea of this project is to experiment with join operations on data frames in Spark and to try a different way of loading a CSV file. The previous project loaded the data through an RDD; this one reads the CSV format directly into data frames.
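As a rough illustration of reading the CSV format directly, the sketch below uses Spark's DataFrame reader. The object and method names (`CsvLoading`, `readCsv`) are hypothetical, and the options are assumptions about what these files need rather than the project's actual settings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not taken from the project sources.
object CsvLoading {
  // Reads a CSV file through the DataFrame reader instead of parsing lines from an RDD.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("csv")
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark guess the column types
      .option("escape", "\"")        // assumption: quotes inside quoted fields are doubled
      .load(path)
}
```

With a `SparkSession` named `spark`, `CsvLoading.readCsv(spark, "tmdb_5000_movies.csv")` would return a data frame whose columns come from the header row.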

Description

This program loads two CSV files obtained from Kaggle, tmdb_5000_credits.csv and tmdb_5000_movies.csv, and joins the two data sets.
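A minimal sketch of such a join follows. It assumes the movies file exposes an `id` column, the credits file a `movie_id` column, and that both carry a `title` column; these names are assumptions about the Kaggle files, and `MovieJoin` is a hypothetical helper, so the project's own join may differ.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not taken from the project sources.
object MovieJoin {
  // Inner join of movies and credits on the movie identifier.
  def joinMoviesWithCredits(movies: DataFrame, credits: DataFrame): DataFrame = {
    // Drop the credits title so the result does not end up with two `title` columns.
    val creditsNoTitle = credits.drop("title")
    movies.join(creditsNoTitle, movies("id") === creditsNoTitle("movie_id"), "inner")
  }
}
```

An inner join keeps only movies that have a matching credits row; a `"left"` join type could be used instead if movies without credits should be kept.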

Another challenge with this data set is that it mixes CSV and JSON formats: some CSV columns hold JSON, so loading and handling them required some special functions.
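One way to handle a JSON-valued column is Spark's `from_json` function with an explicit schema. The sketch below assumes a `cast` column holding a JSON array of objects that each have a `name` field, plus a `movie_id` column; these column names and the `CastJson` helper are assumptions for illustration, not the project's code.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical helper, not taken from the project sources.
object CastJson {
  // Only the field needed here; other keys in each JSON object are ignored.
  val castSchema: ArrayType = ArrayType(StructType(Seq(
    StructField("name", StringType)
  )))

  // Parses the JSON text in `cast` and emits one row per cast member.
  def castNames(credits: DataFrame): DataFrame =
    credits
      .withColumn("cast_member", explode(from_json(col("cast"), castSchema)))
      .select(col("movie_id"), col("cast_member.name").as("cast_name"))
}
```

Rows whose `cast` value is an empty array or cannot be parsed produce no output rows, because `explode` skips empty and null arrays.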

Build it and run it

To build and run it, extract both CSV files from the Kaggle website (see the References section below), place them in the project's root folder, and then issue the command below:

$ sbt run "--s3-source-bucket s3-bucket-here --s3-source-key prefix/tmdb-5000-movie-dataset.zip --source /tmp/csv --destination /temp"

where:

  • --source is the folder where the CSV files are located.
  • --destination is the folder where the generated files will be persisted.
  • --s3-source-bucket is the bucket where your file is located.
  • --s3-source-key is the zip file key name.
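The flags above follow a plain `--name value` pattern. How the project parses them is not shown here, so the following is only a sketch of one possible approach; `CliArgs` and `parse` are hypothetical names.

```scala
// Hypothetical helper, not taken from the project sources.
object CliArgs {
  // Turns "--source /tmp/csv --destination /temp" style arguments into a Map.
  // Arguments are re-split on whitespace so that a single quoted string works too.
  def parse(args: Seq[String]): Map[String, String] = {
    val tokens = args.flatMap(_.split("\\s+").toSeq).filter(_.nonEmpty)
    tokens.grouped(2).collect {
      case Seq(flag, value) if flag.startsWith("--") => flag.stripPrefix("--") -> value
    }.toMap
  }
}
```

For example, `CliArgs.parse(Seq("--source /tmp/csv --destination /temp"))` yields a map with the keys `source` and `destination`. This simple splitting does not support values that contain spaces.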

Exported results

After the program execution, the following folders are created:

  • single_value_df: this contains a CSV file with single values extracted from the movies data set.
  • sorted_movies_budget: contains a CSV file with movies sorted by budget.
  • sorted_movies_revenue: contains a CSV file with movies sorted by revenue.
  • sorted_movies_vote_avg: contains a CSV file with movies sorted by vote average.
  • top10_casting_movie_revenue: contains a JSON file with casting names from top 10 movies by revenue ("most profitable casting").
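A sorted export such as sorted_movies_revenue could be produced roughly as sketched below. The `title` and `revenue` column names, the descending order, and the `Exports` helper are assumptions for illustration; the project may select other columns or write with different options.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Hypothetical helper, not taken from the project sources.
object Exports {
  // Sorts movies by revenue (highest first) and writes them as CSV under the destination folder.
  def writeSortedByRevenue(movies: DataFrame, destination: String): Unit =
    movies
      .select("title", "revenue")
      .orderBy(col("revenue").desc)
      .write
      .mode(SaveMode.Overwrite) // replace the output of a previous run
      .option("header", "true")
      .csv(s"$destination/sorted_movies_revenue")
}
```

Spark writes the result as a folder containing one or more part files rather than a single named file, which matches the folder-per-result layout listed above.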

Changelog

All modifications made to this project are listed in CHANGELOG.md.

References
