Web app that uses web scraping and an SVD collaborative filtering model to give film recommendations for any Letterboxd user, or for two users at once via Blend mode. Recommendations can be filtered by film popularity, film genre, and whether a film is in the user's watchlist. Check out the deployed website here: https://letterboxdrecs.netlify.app/
- Scraped ratings data from the 5009 most popular members on Letterboxd. Most of these users had rated thousands of films, and the full dataset came out to roughly 13 million ratings.
- Scraped every film's popularity ranking on Letterboxd, which is used when a user applies popularity filters to their recommendations. This dataset contains ~950k films.
- To scrape the data in a reasonable amount of time, asynchronous requests written with the asyncio and aiohttp libraries proved fastest (a sketch of this pattern follows the list below).
- Decided to use SVD matrix factorization in a collaborative filtering model (popularized for recommendation algorithms by Simon Funk during the Netflix Prize competition).
- The Surprise (scikit-surprise) package made building this model straightforward, though due to memory constraints I used random samples of 200 ratings from each user, leaving a trainset of ~1,000,000 ratings (see the model-building sketch after this list).
- Given a valid Letterboxd username, the application scrapes that user's ratings dynamically, includes them in the testset alongside the other users' ratings, and runs the SVD algorithm. It then returns 50 recommendations for films the user hasn't watched yet, based on which films the algorithm predicts they will rate highest (see the recommendation sketch after this list).
- In Blend mode, the application scrapes data for both users and averages their ratings for films both have watched. For films that one user has watched and the other hasn't, the SVD algorithm predicts the missing rating and that prediction is averaged with the actual one. These blended ratings are then used with the testset just as they would be for a single user (see the blend-mode sketch after this list).
- When filters are applied, the program must scrape info about films dynamically. The algorithm produces its top 1000 recommendations, then keeps the 50 highest-ranked ones that also meet the filter conditions (see the filtering sketch after this list).
- This is why applying genre filters can cause the program to take a long time to run.
- For popularity filters, most of the work is already done since the popularity rankings live in films.csv, but some films that weren't listed on Letterboxd's films page can slip through the cracks, which is why popularity info sometimes has to be scraped dynamically as well.
- There is also an option to exclude films on the user's watchlist. When checked, the application scrapes the user's watchlist and removes those films from the recommendations. The watchlist must be public for this to work.
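
To illustrate the asynchronous scraping approach mentioned above, here is a minimal sketch of the asyncio/aiohttp pattern. The URL, concurrency limit, and parsing selectors are placeholders, not the repository's actual code:

```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # Fetch a single page; many of these coroutines run concurrently.
    async with session.get(url) as resp:
        return await resp.text()

async def scrape_pages(urls: list[str]) -> list[str]:
    # Cap concurrent connections so Letterboxd isn't flooded with requests.
    connector = aiohttp.TCPConnector(limit=20)
    async with aiohttp.ClientSession(connector=connector) as session:
        return list(await asyncio.gather(*(fetch(session, u) for u in urls)))

def parse_film_slugs(html: str) -> list[str]:
    # Example parse step with BeautifulSoup; the real selectors differ.
    soup = BeautifulSoup(html, "html.parser")
    return [div["data-film-slug"] for div in soup.select("div[data-film-slug]")]

if __name__ == "__main__":
    urls = [f"https://letterboxd.com/films/popular/page/{i}/" for i in range(1, 4)]
    pages = asyncio.run(scrape_pages(urls))
    print(sum(len(parse_film_slugs(p)) for p in pages), "film slugs parsed")
```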
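
The model-building step with Surprise could look roughly like this sketch. The file path and column names (`username`, `film_id`, `rating`) are assumptions about the scraped data, not necessarily what the repository uses:

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

df = pd.read_csv("data/ratings.csv")  # hypothetical path/columns

# Keep at most 200 ratings per user so the trainset fits in memory (~1M rows total).
sampled = df.groupby("username", group_keys=False).apply(
    lambda g: g.sample(n=min(len(g), 200), random_state=42)
)

# Letterboxd ratings run from 0.5 to 5 stars in half-star steps.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(sampled[["username", "film_id", "rating"]], reader)

algo = SVD()
algo.fit(data.build_full_trainset())
```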
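
Once the model is fit, producing the 50 recommendations comes down to predicting a rating for every film the user hasn't logged and keeping the highest predictions. A sketch, with illustrative function and variable names:

```python
def top_recommendations(algo, username, rated_film_ids, all_film_ids, n=50):
    # Score only films the user hasn't rated. For the predictions to be
    # personalized, the user's scraped ratings must have been included in
    # the data the SVD model was fit on; otherwise Surprise falls back to
    # baseline estimates for an unknown user.
    unseen = (fid for fid in all_film_ids if fid not in rated_film_ids)
    preds = [(fid, algo.predict(username, fid).est) for fid in unseen]
    preds.sort(key=lambda pair: pair[1], reverse=True)
    return preds[:n]

# e.g. top_recommendations(algo, "some_user", set(user_ratings), film_ids, n=50)
```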
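
Blend mode merges the two users' ratings before that same recommendation step runs. A sketch of the merge described above, with hypothetical names:

```python
def blend_ratings(algo, user_a, ratings_a, user_b, ratings_b):
    # ratings_a / ratings_b map film id -> star rating for each user.
    blended = {}
    for film_id in set(ratings_a) | set(ratings_b):
        ra = ratings_a.get(film_id)
        rb = ratings_b.get(film_id)
        # If only one user has rated the film, fill the gap with the model's
        # predicted rating for the other user, then average the two.
        if ra is None:
            ra = algo.predict(user_a, film_id).est
        if rb is None:
            rb = algo.predict(user_b, film_id).est
        blended[film_id] = (ra + rb) / 2
    return blended
```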
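
Filtering works as a post-processing pass over a larger candidate list: take the ~1000 highest-predicted films and keep the first 50 that satisfy every active filter. A sketch with hypothetical helper names (`passes_genre`, `popularity_lookup`):

```python
def apply_filters(candidates, passes_genre, popularity_lookup, max_rank, watchlist, n=50):
    # candidates: ~1000 (film_id, predicted_rating) pairs, sorted best-first.
    picked = []
    for film_id, predicted in candidates:
        if film_id in watchlist:
            continue  # optional watchlist exclusion
        rank = popularity_lookup.get(film_id)  # from films.csv; scraped dynamically if missing
        if rank is None or rank > max_rank:
            continue
        if not passes_genre(film_id):  # genre check requires scraping the film's page
            continue
        picked.append((film_id, predicted))
        if len(picked) == n:
            break
    return picked
```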
- Frontend:
  - React.js
  - Vite
  - CSS
  - Netlify for hosting
- Backend API:
  - Python
  - FastAPI
  - DockerHub/Railway for hosting the API
- Data scraping and model building:
  - Python
  - asyncio/aiohttp for asynchronous requests
  - BeautifulSoup4 for parsing HTML
  - pandas to handle data
  - Surprise (scikit-surprise) for building the model
To use the command line version of the application locally, follow these steps:
- Clone the repository:

      git clone https://github.com/jjoej15/letterboxd-recs.git
      cd letterboxd-recs
- Obtain the pickle files and films.csv
  - Either scrape your own data (see the instructions below),
  - or download the necessary files here (~213 MB).
- Make sure your data-processing directory looks like this:

      data-processing
      ├── data
      │   └── films.csv
      ├── pickles
      │   ├── model_df.pkl
      │   └── rec_model.pkl
      └── (rest of the .py files that are in the cloned repository)
- Install dependencies:

      cd data-processing
      pip install -r requirements.txt
- Run the app:

      py get_recs.py
If you want to scrape your own data from Letterboxd and build your own model, follow these steps:
- Navigate to the data-processing directory:

      cd data-processing
- Install dependencies:

      pip install -r requirements.txt
- Run the script:

      py scrape_all_data.py
This process takes several hours to complete (it scrapes over 200,000 web pages), and the resulting data adds up to more than 1 GB (ratings.csv will likely contain over 13 million ratings).