Please download the report for a thorough explanation of this project. Below you will find the analysis steps and links to the files for each step.
It is interesting to see novels being adapted into films. Our question is whether the ratings of science fiction novels correlate with the ratings of their film adaptations. We also ask whether book ratings or film ratings correlate with the revenue a film earned.
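As a rough illustration of what "correlate" can mean here, the sketch below computes a Pearson correlation coefficient in plain Python. The report may use a different measure. The function name `pearson` and the sample values are assumptions for illustration only; they are not results from this project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up placeholder values, NOT ratings from this project.
book_ratings = [3.9, 4.1, 3.5, 4.3]
film_ratings = [7.1, 7.8, 6.0, 8.2]
r = pearson(book_ratings, film_ratings)
```

A value of `r` near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship.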
Here are the steps that were taken and some of the problems we found:
-
Get the list of science fiction books adapted into films
-
Obtain book ratings
- Purnima scraped GoodReads, using the list from step 1 as search queries, to get the user ratings of these adapted science fiction books.
- Code: web_scrapping/goodread_scrape-Copy1.ipynb
- Input: input_csv/newmovielist.csv
- Output: input_csv/merged_list.csv
- Tigran attempted to obtain book ratings from Amazon Books, but was blocked after a few queries regardless of which IP address or location he tried from.
- Naim attempted to obtain ratings from Chapters Indigo, only to find that its book ratings are not authorized for scraping.
-
Obtain film ratings
- Callan queried the OMDb API with the titles from step 1 to extract movie ratings and revenue.
- Code: API_manipulation/OMDB_API.ipynb
- Input: input_csv/newmovielist.csv
- Output: Transformed_data/movieListDB.csv
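A minimal sketch of a by-title OMDb lookup follows. It is not the code in API_manipulation/OMDB_API.ipynb. It assumes OMDb's `t` (title) and `apikey` query parameters and the `imdbRating`, `BoxOffice`, and `Response` fields of the JSON reply; the notebook may read other fields. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "https://www.omdbapi.com/"


def build_omdb_url(title, api_key):
    """Build a by-title lookup URL using the `t` and `apikey` parameters."""
    return OMDB_URL + "?" + urlencode({"t": title, "apikey": api_key})


def parse_box_office(value):
    """Turn a string like '$1,234,567' into an int; 'N/A' or missing becomes None."""
    if not value or value == "N/A":
        return None
    return int(value.replace("$", "").replace(",", ""))


def extract_fields(payload):
    """Pick rating and revenue out of one decoded OMDb JSON response."""
    if payload.get("Response") != "True":
        return None  # OMDb marks failed lookups with Response == "False"
    rating = payload.get("imdbRating")
    return {
        "title": payload.get("Title"),
        "imdb_rating": float(rating) if rating not in (None, "N/A") else None,
        "box_office": parse_box_office(payload.get("BoxOffice")),
    }


def fetch_movie(title, api_key):
    """Query OMDb for one title (needs network access and a valid API key)."""
    with urlopen(build_omdb_url(title, api_key)) as resp:
        return extract_fields(json.load(resp))
```

Keeping `extract_fields` separate from the network call means the parsing can be checked against a saved response without spending API quota.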
-
Callan merged the book ratings with the OMDb query results to produce a combined dataset of book and movie titles, their corresponding ratings, and movie revenues.
- Code: API_manipulation/OMDB_API.ipynb
- Inputs: input_csv/merged_list.csv and Transformed_data/movieListDB.csv
- Output: Transformed_data/CombinedDF.csv and a cleaner Transformed_data/bookListDB.csv
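The merge step can be pictured as an inner join on the movie title. The sketch below uses plain Python so it has no dependencies; in pandas, `DataFrame.merge(..., how="inner")` plays the same role. The column names (`movie_title`, `title`, `book_rating`, `imdb_rating`) are assumptions and may not match the project's CSV headers.

```python
def inner_join(book_rows, movie_rows, book_key="movie_title", movie_key="title"):
    """Inner-join two lists of dict rows on a normalized movie title."""

    def norm(text):
        # Ignore surrounding whitespace and letter case when comparing titles
        return text.strip().casefold()

    movies_by_title = {norm(m[movie_key]): m for m in movie_rows}
    combined = []
    for book in book_rows:
        movie = movies_by_title.get(norm(book[book_key]))
        if movie is not None:  # rows without a partner are dropped
            combined.append({**book, **movie})
    return combined


# Hypothetical rows for illustration only
books = [
    {"book_title": "A", "movie_title": "Film A ", "book_rating": 4.0},
    {"book_title": "B", "movie_title": "Film B", "book_rating": 3.5},
]
movies = [{"title": "film a", "imdb_rating": 7.0}]
combined = inner_join(books, movies)
```

Only "Film A" appears in both inputs, so `combined` holds a single row carrying both the book rating and the film rating.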
-
We did not have input_csv/merged_list.csv at the beginning, so Naim merged book and movie titles by similarity, starting from an older version of Transformed_data/CombinedDF.csv. This worked well, but we will not use these results, as it is best to use the queried title strings merged with their corresponding movie titles.
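The source does not say which similarity measure was used. As one possible approach, the sketch below matches titles with `difflib.get_close_matches` from the standard library. The function name `match_titles`, the 0.6 cutoff, and the example titles are assumptions for illustration, not taken from the project's list.

```python
from difflib import get_close_matches


def match_titles(book_titles, movie_titles, cutoff=0.6):
    """Map each book title to its most similar movie title, or None."""
    # Compare case-insensitively, but return the original movie spelling
    lowered = {m.casefold(): m for m in movie_titles}
    matches = {}
    for book in book_titles:
        best = get_close_matches(book.casefold(), list(lowered), n=1, cutoff=cutoff)
        matches[book] = lowered[best[0]] if best else None
    return matches


result = match_titles(
    ["Jurassic Park: A Novel", "Do Androids Dream of Electric Sheep?"],
    ["Jurassic Park", "Blade Runner"],
)
```

Here the first title finds its film, while the second maps to `None` because its adaptation was released under an unrelated title. That is one reason a merge keyed on the original query strings can be preferable to similarity matching.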
-
Tigran loaded the Transformed_data/CombinedDF.csv file to plot the relationships below:
-
Naim wrote a script (Loading_into_MongoDB/MongoDump.ipynb) that dumps the extracted book and film information into a MongoDB database called adapted_scifi_films_db, creating a books collection and a movies collection.
-
Once the database is created, Loading_into_MongoDB/MongoLoad.ipynb loads the collections into pandas dataframes, and a quick inner join between movies and books creates the CombinedDF dataframe.
- Code: Loading_into_MongoDB/MongoDump.ipynb and Loading_into_MongoDB/MongoLoad.ipynb
- Input: Transformed_data/bookListDB.csv and Transformed_data/movieListDB.csv
- Output: adapted_scifi_films_db MongoDB database with books and movies collections
-
We checked whether the GoodReads ratings we scraped were similar to the GoodReads ratings in a Kaggle dataset processed about a year ago. They were: the ratings did not change much (see additional/Kaggle_merge_with_Adapted_MoviesList.ipynb).
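One simple way to put a number on "did not change much" is the mean and maximum absolute difference over the titles both sources share. This is a sketch under that assumption and not the method used in the notebook. The function name `rating_drift` and the sample ratings are hypothetical.

```python
def rating_drift(scraped, reference):
    """Compare two {title: rating} mappings on the titles they share."""
    shared = sorted(set(scraped) & set(reference))
    if not shared:
        raise ValueError("no titles in common")
    diffs = [abs(scraped[t] - reference[t]) for t in shared]
    return {
        "n_shared": len(shared),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }


# Made-up placeholder ratings, NOT values from this project.
drift = rating_drift(
    {"A": 4.0, "B": 3.5, "C": 4.2},
    {"A": 4.5, "B": 3.5, "D": 3.0},
)
```

Titles present in only one source ("C" and "D" here) are ignored, so the summary reflects only books that can actually be compared.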