Spring has sprung in NYC and it's time for gardening! But how to buy the right hose when you're a total gardening newbie?
This is a short step-by-step tutorial on collaborative-filtering-based recommendation systems using Amazon product data. Detailed instructions are included in the Jupyter Notebook (GardeningToolsRecommender.ipynb), so feel free to check it out. Below, I've included materials I found super useful for learning about recommendation systems & Apache Spark (which is used to parallelize alternating least squares at the end of the notebook).
For instructional purposes, the code is not optimized for speed.
Image source: "Recommendation systems: Principles, methods and evaluation" by Isinkaye, Folajimi, and Ojokoh, http://doi.org/10.1016/j.eij.2015.06.005
Amazon's dataset contains reviews and metadata spanning 1996 to 2014 and is an excellent source if you want to practice recommendation algorithms. As you might suspect, it's huge, but Julian McAuley from UCSD also shared smaller subsets. I decided to use one of the 5-core datasets, which contain only users who reviewed at least 5 products and products that were reviewed at least 5 times. This drastically reduces the size, allowing costly algorithms (such as ALS) to run on a personal laptop in a reasonable time (it took me a few minutes).
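To give you a sense of what working with one of these files looks like, here's a minimal loading sketch. It assumes the gzipped, one-JSON-object-per-line format of the 5-core files; the filename is illustrative, so point it at whichever category subset you downloaded:

import gzip
import json
import pandas as pd

# Illustrative filename -- use whichever 5-core category file you downloaded.
PATH = 'reviews_Patio_Lawn_and_Garden_5.json.gz'

def load_reviews(path):
    # Each line of the gzipped file is one JSON review record.
    # (Some releases use Python-literal syntax rather than strict JSON;
    # if json.loads fails, try ast.literal_eval instead.)
    with gzip.open(path, 'rt') as f:
        records = [json.loads(line) for line in f]
    df = pd.DataFrame(records)
    # Keep just the columns collaborative filtering needs:
    # who rated what, and how highly.
    return df[['reviewerID', 'asin', 'overall']]

ratings = load_reviews(PATH)
print(ratings.head())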
Blog posts about collaborative filtering and Alternating Least Squares:
Andrew Ng's awesome intro to recommender systems (part of his ML Coursera series, so pretty basic)
Ethan Rosenthal's excellent blog post about collaborative filtering and matrix factorization (a bit more advanced).
Alex Abate on collaborative filtering - I heavily borrowed from her rating-prediction code, which demonstrates step by step how it works (see the sketch after this list).
bugra on Alternating Least Squares.
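To make the core idea in those posts concrete, here is a minimal sketch of memory-based collaborative filtering: predict a user's rating as a similarity-weighted average of every user's ratings. It's a bare-bones illustration (no mean-centering, no top-k neighbors), not code from any of the posts above:

import numpy as np

def predict_ratings(R):
    # R is an (n_users, n_items) matrix; 0 marks a missing rating.
    # Cosine similarity between users (rows of R).
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1e-9                  # avoid division by zero
    unit = R / norms
    sim = unit @ unit.T                       # (n_users, n_users)
    # Each prediction is a similarity-weighted average of users' ratings.
    return sim @ R / np.abs(sim).sum(axis=1, keepdims=True)

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
print(predict_ratings(R).round(2))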
Using Apache Spark makes a lot of sense when we're running iterative algorithms (such as gradient descent or alternating least squares), as it leverages caching / persisting: the data is kept in memory after the first pass, so each iteration repeats only the work that is actually unique to it instead of re-reading and re-parsing the input.
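A tiny sketch of what that looks like in practice (in a notebook launched with pyspark, sc is predefined; the toy triples below are purely illustrative):

from pyspark import SparkContext

sc = SparkContext('local[*]', 'cache-demo')  # skip this line in a pyspark notebook

# Toy (user, item, rating) triples standing in for the real review data.
ratings = sc.parallelize([('u1', 'i1', 5.0), ('u1', 'i2', 3.0), ('u2', 'i1', 4.0)])
ratings.cache()   # mark the RDD to be kept in memory
ratings.count()   # the first action materializes (and caches) it

# Later actions -- like the repeated passes of an iterative algorithm --
# reuse the in-memory copy instead of recomputing the whole lineage.
for _ in range(5):
    ratings.map(lambda r: r[2]).mean()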
It's not hard at all, actually! Make sure SSH is enabled on your machine and your Java is up to date, then download and install Spark. In my case, in order to run it I execute in the Terminal:
$ export PATH=$PATH:/usr/local/spark/bin:/usr/local/spark/sbin
followed by:
$ start-all.sh
and to launch Jupyter Notebook with Spark:
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip='*'" pyspark --master local[*]
Then Jupyter Notebook will run on localhost:8888, your Spark cluster UI on localhost:8080, and Spark jobs on localhost:4040.
Those tips come from Austin Ouyang, who wrote a great step-by-step intro and gave a two-day workshop at Insight Labs that I attended (and you can sign up for it, too!).
Spark MLlib guide: http://spark.apache.org/docs/latest/mllib-guide.html
MLlib collaborative filtering documentation: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
A great tutorial on recommendations systems and Spark with MovieLens data: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
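Putting the MLlib docs above to work, here is a minimal ALS sketch with the RDD-based pyspark.mllib API; the toy ratings and hyperparameters are illustrative, not the values used in the notebook:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext('local[*]', 'als-demo')  # predefined as sc in a pyspark notebook

# Toy ratings: user and product IDs must be integers for mllib's ALS.
ratings = sc.parallelize([
    Rating(user=0, product=0, rating=5.0),
    Rating(user=0, product=1, rating=3.0),
    Rating(user=1, product=0, rating=4.0),
    Rating(user=1, product=2, rating=1.0),
])

# rank = number of latent factors, iterations = ALS sweeps,
# lambda_ = regularization strength.
model = ALS.train(ratings, rank=8, iterations=10, lambda_=0.1)

print(model.predict(0, 2))            # predicted rating for user 0, product 2
print(model.recommendProducts(0, 2))  # top-2 recommendations for user 0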
To install all of them (except Python) using pip, run:
$ pip install -r requirements.txt