Course Project
CS685A: Data Mining
Ayush Anurag
Dheeraj Athrey
Pawan Agarwal
Pawan Mishra
Rohit Gupta
Shivam Pal
- Clone the repository and enter it:

  ```
  git clone https://github.com/pawan47/box-office-knowledge-discovery.git
  cd path/to/cloned-directory
  ```
- Team members should make changes and test their code in their own branch.
- Each branch should be named with the user's initials followed by `-working-branch`. For example, for initials XYZabc, typing `git checkout -b xyzabc-working-branch` will create and check out the branch `xyzabc-working-branch`. Omit the `-b` flag if the branch already exists.
- Do not make a new branch every time code is to be added or modified.
- Access and upload data here.
- Prepare a consolidated database of movie details (cast, director, production house, description, critical analysis, Wikipedia) by web scraping.
- Prepare an ML model to predict movie ratings and box-office performance of the movies based on these factors.
All code is written in Python 3 (>3.5). To install all dependencies, run the following commands:

```
pip install scrapy
pip install sklearn
pip install xgboost
pip install pandas
```
All the data used for recommendation and prediction has been collected from the IMDB web pages of the movies. An automatic web-scraping program has been implemented using the Scrapy framework to perform the data collection. To run the program, follow the instructions below:
- Install Scrapy:

  ```
  pip install scrapy
  ```

  or, if you use Anaconda, `conda install scrapy`.
- Clone the repository and enter its data-collection folder:

  ```
  git clone https://github.com/pawan47/box-office-knowledge-discovery.git
  cd path/to/cloned-directory/data-collection
  ```
- Enter `scrapy list` in the terminal to obtain the list of spider programs and related instructions. There are two spider programs: imdbLinks and movieCrawler.
- movieCrawler is the main spider program and can only be run after the seed URLs have been collected and saved. To obtain the seed URLs, run the imdbLinks spider by entering `scrapy crawl imdbLinks` in the terminal. This program saves a JSON file (imdbLinks.json) to the links folder; the file contains a dictionary of links that the main spider program will use to crawl through the IMDB website.
- To run the main spider program, enter `scrapy crawl movieScraper` in the terminal. This program scrapes the necessary information from IMDB movie pages and then follows the links on each page to other movie pages. All the data of a particular movie is saved in a JSON file in the movies folder. The program has no provision to stop automatically; stop it manually once enough data has been collected.
The collected raw data of about 50,000 movies can be found in the data folder on the drive.
The data has been collected by recursively following links on the movie pages of the IMDB website, starting from the links to the IMDB Top 250 movies. Patterns in the movie web pages were identified by element analysis, and the necessary information is picked out using XPath. The following information is collected for each movie (most of the fields are self-explanatory):
- Id: unique identification string for each movie
- Title
- Film_rating: who can watch the movie (PG-13, R, and so on)
- Duration
- Description
- IMDB_Rating
- IMDB_rating_count: number of people who have rated the movie
- Genre
- release_date
- Storyline
- Cast
- Taglines
- Director
- Writers
- Budget
- Revenue
- Country
- Language
- url
Each file is saved with Id as filename.
This system recommends the top 10 movies similar to a user query. It is a knowledge-based recommendation system: it takes a movie as input and outputs the 10 most relevant/similar movies, using cosine similarity to find them.
To run the script:

```
python Movie_recommendation_system.py
```

It will output the 10 movies most similar to the given movie.
Algorithm:
- Clean all the data (remove stop words, lemmatize and stem the sentences, expand common English short-hands).
- Combine each movie's description and storyline into a new sentence (named the "soup").
- Train a tf-idf vectorizer on the soup sentences.
- Convert all the combined sentences into a tf-idf matrix.
- When a query comes in, clean it and convert it into a tf-idf vector using the previously trained vectorizer.
- Find cosine similarities by multiplying this tf-idf vector with the matrix.
- Display the top 10 movies in decreasing order of similarity score.
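The steps above can be sketched with scikit-learn; the tiny corpus and query are illustrative placeholders. Because tf-idf vectors from `TfidfVectorizer` are L2-normalised, the dot product (linear kernel) of the query vector with the matrix equals cosine similarity.

```python
# Sketch of the recommendation pipeline: tf-idf matrix over "soup" texts,
# then rank movies by cosine similarity to a query. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

soups = {
    "tt0111161": "two imprisoned men bond decades finding redemption",
    "tt0068646": "aging patriarch organized crime dynasty transfers control son",
    "tt0071562": "early life career vito corleone crime family son michael",
}
ids = list(soups)

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(soups.values())  # movies x terms

query = "crime family patriarch son"
query_vec = vectorizer.transform([query])

# Dot product of the normalised query vector with every movie row
# gives the cosine similarity of the query to each movie.
scores = linear_kernel(query_vec, tfidf_matrix).ravel()
ranking = sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)
for movie_id, score in ranking:
    print(movie_id, round(score, 3))
```

In the real system the same ranking is truncated to the top 10 entries.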
This system predicts the success of a movie. For training the model, we choose film rating, IMDB rating, IMDB rating count, country, revenue, budget, duration, and release date. We encode the alphabetical data as numeric data and then train our model.
Algorithm:
- Clean the data and remove junk values from the columns.
- Convert the film rating to whether the film is adult (1) or not (0).
- Using budget and revenue, calculate the success of the movie.
- Normalise the data.
- Encode the country data as numbers.
- Calculate the score of each movie using the IMDB rating and rating count.
- Use an XGBoost model for training.
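The feature-engineering steps above can be sketched on a toy frame with pandas. The column names, the success definition (revenue exceeding budget), and the score formula (rating times rating count) are assumptions for illustration, not the project's exact formulas.

```python
# Sketch of the preprocessing pipeline on placeholder data.
import pandas as pd

df = pd.DataFrame({
    "film_rating": ["R", "PG-13", "R"],
    "imdb_rating": [9.3, 8.7, 7.9],
    "imdb_rating_count": [2500000, 1100000, 600000],
    "country": ["USA", "USA", "UK"],
    "budget": [25e6, 160e6, 19e6],
    "revenue": [58e6, 700e6, 25e6],
})

# Film rating -> adult flag (1 for R, 0 otherwise)
df["adult"] = (df["film_rating"] == "R").astype(int)

# Success label from budget and revenue (here: did revenue exceed budget?)
df["success"] = (df["revenue"] > df["budget"]).astype(int)

# Encode country names as numeric category codes
df["country_code"] = df["country"].astype("category").cat.codes

# Movie score from rating and rating count (one plausible combination)
df["score"] = df["imdb_rating"] * df["imdb_rating_count"]

# Min-max normalise the numeric feature columns
num_cols = ["imdb_rating", "imdb_rating_count", "budget", "revenue", "score"]
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

# The encoded features would then be fed to the XGBoost model, e.g.:
#   xgboost.XGBClassifier().fit(df[feature_cols], df["success"])
print(df[["adult", "success", "country_code"]])
```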
It displays the top 20 movies based on average movie rating and the number of votes on each movie, calculating a score for every movie from these two quantities.
Algorithm:
- Calculate the mean of all movie ratings; name it c.
- Calculate the 90th percentile of the vote counts; name it m.
- Keep only the movies whose vote count is above the 90th percentile.
- Calculate the score of every movie by the following formula:

  Score = ((2 * Vote_count * rating) / (Vote_count + m)) + ((m * c) / (Vote_count + m))

- Display all the movies in decreasing order of Score.
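The scoring steps above can be implemented directly; the movie tuples below are placeholder data, and with so few movies the 90th-percentile cutoff keeps only one of them.

```python
# Direct implementation of the weighted-score ranking on toy data.
import numpy as np

def weighted_score(v, r, m, c):
    # Score = ((2 * v * r) / (v + m)) + ((m * c) / (v + m))
    return (2 * v * r) / (v + m) + (m * c) / (v + m)

# (title, vote_count, rating) -- placeholder data
movies = [
    ("A", 12000, 8.9),
    ("B", 9000, 8.4),
    ("C", 400, 9.8),    # highly rated but too few votes to qualify
    ("D", 15000, 7.9),
    ("E", 300, 9.9),
]

c = float(np.mean([r for _, _, r in movies]))          # mean rating
m = float(np.percentile([v for _, v, _ in movies], 90))  # 90th percentile of votes

# Keep only movies above the vote-count cutoff, then rank by score.
qualified = [(t, v, r) for t, v, r in movies if v > m]
ranked = sorted(
    ((t, weighted_score(v, r, m, c)) for t, v, r in qualified),
    key=lambda p: p[1], reverse=True,
)
for title, score in ranked:
    print(title, round(score, 3))
```

The same ranking, truncated to 20 entries, yields the displayed list.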
We use OneVsRest classification to predict the genre of a movie given its description, storyline, and taglines from IMDB. Since the number of movies in some genres was small, we merged some labels into one; movies are thus finally labelled from a set of 11 labels.
Method: genre_prediction.py does the following -
- Clean all the data (remove stop words, lemmatize and stem the sentences, expand common English short-hands).
- Combine each movie's description, taglines, and storyline into a new sentence (the "soup").
- Train a tf-idf vectorizer on the soup sentences.
- Convert all the data into a tf-idf matrix.
- Binary-encode all the labels.
- Split the data into train and test sets.
- Train n classifiers (n = number of labels), each predicting whether a movie is of that genre or not.
- Output the accuracy of each classifier type for each genre.
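The pipeline above can be sketched with scikit-learn: tf-idf features over the combined text, binary-encoded multi-labels, and a OneVsRest wrapper that trains one classifier per genre. The corpus, genre set, and the choice of logistic regression as the base classifier are toy placeholders.

```python
# Sketch of multi-label genre prediction with OneVsRest on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

soups = [
    "space battle alien fleet laser rescue",
    "detective murder investigation dark city",
    "space crew alien ship survival horror",
    "murder trial lawyer courtroom drama",
]
genres = [["sci-fi"], ["crime"], ["sci-fi", "horror"], ["crime", "drama"]]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(soups)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)          # one binary column per genre

# OneVsRest fits one binary classifier per genre column of Y.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["alien ship space battle"]))
print(mlb.inverse_transform(pred))
```

Per-genre accuracy then comes from comparing each column of the predicted label matrix against the test labels.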
Note: The CSV file name must be set in the code before running, and the file must be present in the same directory.