Spark program for processing data from the TMDB dataset in Scala.
The idea of this project is to experiment with join operations on DataFrames in Spark and to try a different method of loading CSV files. The previous project loaded the data through an RDD; this one reads the CSV files directly with Spark's DataFrame reader.
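The contrast between the two loading styles can be sketched as follows. This is a minimal illustration, not the project's actual code: the object name `CsvLoadSketch` is hypothetical, and the file path is taken from the example command further down.

```scala
import org.apache.spark.sql.SparkSession

object CsvLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-load-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    // Previous project's style: load raw lines as an RDD and parse by hand.
    // val lines = spark.sparkContext.textFile("/tmp/csv/tmdb_5000_movies.csv")

    // This project's style: let the DataFrame reader parse the CSV directly.
    val moviesDf = spark.read
      .format("csv")
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark guess the column types
      .load("/tmp/csv/tmdb_5000_movies.csv")

    moviesDf.printSchema()
    spark.stop()
  }
}
```

The DataFrame reader gives typed, named columns for free, which is what makes the joins below straightforward.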
This program loads two CSV files obtained from Kaggle, tmdb_5000_credits.csv and tmdb_5000_movies.csv, and performs a join between the two data sets.
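A hedged sketch of that join: in the Kaggle files, the movies CSV carries an `id` column and the credits CSV carries a matching `movie_id` column, so the join condition pairs those two. The object name is hypothetical and the paths come from the example command below.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-join-sketch")
      .master("local[*]")
      .getOrCreate()

    val movies  = spark.read.option("header", "true").csv("/tmp/csv/tmdb_5000_movies.csv")
    val credits = spark.read.option("header", "true").csv("/tmp/csv/tmdb_5000_credits.csv")

    // Join the two data sets on the movie identifier.
    val joined = movies.join(credits, movies("id") === credits("movie_id"))

    // Both files have a "title" column, so qualify it to avoid ambiguity.
    joined.select(movies("title"), credits("cast")).show(5, truncate = false)
    spark.stop()
  }
}
```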
Another challenge with this data set is that it mixes CSV and JSON formats: several CSV columns contain embedded JSON, which requires some special functions to load and handle.
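For example, the `cast` column of the credits file is a JSON array of objects. One common way to handle this in Spark is `from_json` with an explicit schema, then `explode` to get one row per cast member. This is a sketch of that technique, not necessarily the exact functions the project uses; only the fields named in the schema are extracted, the rest of each JSON object is ignored.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object JsonColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-json-sketch")
      .master("local[*]")
      .getOrCreate()

    val credits = spark.read.option("header", "true").csv("/tmp/csv/tmdb_5000_credits.csv")

    // The "cast" column holds a JSON array of objects such as
    // [{"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington", ...}, ...]
    val castSchema = ArrayType(new StructType()
      .add("name", StringType)
      .add("character", StringType))

    val casting = credits
      .withColumn("cast_parsed", from_json(col("cast"), castSchema))
      .withColumn("member", explode(col("cast_parsed")))
      .select(col("movie_id"), col("member.name"), col("member.character"))

    casting.show(10, truncate = false)
    spark.stop()
  }
}
```

Depending on how the quotes inside the JSON are escaped, the CSV reader may also need extra options (for example `multiLine` or `escape`); that detail is dataset-specific.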
To build and run it, download the dataset from the Kaggle web site (see the References section below), extract both CSV files into the project's root folder, and then issue the command below:
$ sbt "run --s3-source-bucket s3-bucket-here --s3-source-key prefix/tmdb-5000-movie-dataset.zip --source /tmp/csv --destination /temp"
where:
- --source is the folder where the CSV files are located.
- --destination is the folder where the generated files will be persisted.
- --s3-source-bucket is the bucket where your file is located.
- --s3-source-key is the zip file key name.
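Options of this `--name value` form can be collected into a map with a few lines of plain Scala. This is a hypothetical sketch, not the project's actual parser:

```scala
// Parse "--name value" pairs into a Map("name" -> "value").
object ArgParser {
  def parse(args: Array[String]): Map[String, String] =
    args.sliding(2, 2).collect {
      case Array(key, value) if key.startsWith("--") =>
        key.stripPrefix("--") -> value
    }.toMap

  def main(args: Array[String]): Unit = {
    val opts = parse(Array(
      "--s3-source-bucket", "s3-bucket-here",
      "--source", "/tmp/csv",
      "--destination", "/temp"))
    println(opts("source"))      // prints /tmp/csv
    println(opts("destination")) // prints /temp
  }
}
```

`sliding(2, 2)` walks the argument array in non-overlapping pairs, and the `collect` silently drops anything that is not a `--key value` pair; a real parser would likely report such leftovers as errors.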
After the program execution, the following folders are created:
- single_value_df: this contains a CSV file with single values extracted from the movies data set.
- sorted_movies_budget: contains a CSV file with movies sorted by budget.
- sorted_movies_revenue: contains a CSV file with movies sorted by revenue.
- sorted_movies_vote_avg: contains a CSV file with movies sorted by average vote.
- top10_casting_movie_revenue: contains a JSON file with casting names from top 10 movies by revenue ("most profitable casting").
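Each of those folders is what Spark produces when a DataFrame is written out to a path. A minimal sketch of two of the writes, assuming the movies DataFrame has already been loaded (folder names are taken from the list above; the casting extraction for the top-10 output is omitted here for brevity):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

object OutputSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tmdb-output-sketch")
      .master("local[*]")
      .getOrCreate()

    val movies = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/csv/tmdb_5000_movies.csv")

    val dest = "/temp" // the --destination folder

    // Spark writes each result as a folder containing part files.
    movies.orderBy(desc("budget"))
      .write.option("header", "true").csv(s"$dest/sorted_movies_budget")

    // Top 10 movies by revenue, persisted as JSON.
    movies.orderBy(desc("revenue")).limit(10)
      .select(col("title"), col("revenue"))
      .write.json(s"$dest/top10_casting_movie_revenue")

    spark.stop()
  }
}
```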
All modifications developed for this project are listed in CHANGELOG.md.