This project creates a 2D visualization of StackOverflow tags using the t-SNE algorithm. See a more thorough problem description.
Everything is written in Python 3.5 and C++.
Live demonstration: https://tag-map.github.io/
Our repository follows the Cookiecutter Data Science template.
The data folder contains data on all stages of processing: raw for unprocessed data, interim for intermediate results, processed for results of heavy computations (these do not require further processing). The example folder contains information about a small subset of tags (~400 of them). You can read more about our data on a wiki page.
The src folder contains scripts that transform data or infer information from it. The data folder holds scripts for initial data transformation (from raw to interim). models/use_bhtsne contains scripts that prepare data for using it in t-SNE, as well as a slightly rewritten version of bhtsne. The visualization folder holds the last part of the equation: the frontend.
We use Python 3 as our main tool, so you need a Python interpreter (e.g. CPython).
Make sure you install every required Python package from requirements.txt, e.g. via
pip3 install -r requirements.txt
Aside from Python, C++11 is employed in time-critical places. Make sure you have a suitable compiler (e.g. gcc).
The analysis is run via a Makefile, so you need to have make installed. Our Makefile was tested on Ubuntu 16.04.
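Before running anything, it can help to confirm that the prerequisites above are actually on your PATH. The following is a minimal sketch, not part of the repository: the function names and the executable names in REQUIRED_TOOLS are assumptions (your C++ compiler may be clang++ rather than g++, for instance).

```python
import shutil
import sys

# Executable names are assumptions; adjust them to your system
# (for example, replace "g++" with "clang++").
REQUIRED_TOOLS = ("make", "g++", "pip3")


def check_prerequisites(tools=REQUIRED_TOOLS):
    """Return a dict mapping each prerequisite to True (found) or False."""
    # The README states that the code is written for Python 3.5.
    results = {"python>=3.5": sys.version_info >= (3, 5)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


def report(results):
    """Print one line per prerequisite."""
    for name, ok in results.items():
        print("{:<12} {}".format(name, "ok" if ok else "MISSING"))


if __name__ == "__main__":
    report(check_prerequisites())
```

This only checks that the tools exist. It does not verify that the compiler supports C++11 or that the packages from requirements.txt are installed.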
After installing all prerequisites, you may want to run our example dataset (consisting of 376 tags) to be sure everything is all right. First, clone the repository with the command
git clone https://github.com/ItsLastDay/StackOverflow_Map.git
Then type (from the root of the repository)
make visualize_example
It should complete in a matter of minutes. Then go to the src/visualization folder and start a server:
cd ./src/visualization
python3 run_server.py
As a final step, open http://localhost:8000/ in your favourite web browser. You should see something like this:
You can navigate the map using the mouse buttons and zoom with the scroll wheel.
Now you are ready for the main part: running a visualization on the whole set of 50k+ tags. Our script allows you to specify a date, POST_DATE. All posts earlier than this date are filtered out before the visualization is made. This affects how the similarity between two tags is measured: since we count the number of questions that have both tags, filtering out old questions makes the similarity more current.
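To make the effect of POST_DATE concrete, here is a minimal sketch of counting tag co-occurrences with a date filter. It is an illustration only, not the repository's implementation: the function name and the (creation_date, tags) input shape are assumptions, and the real pipeline may normalize or transform these counts further.

```python
from collections import Counter
from datetime import date
from itertools import combinations


def tag_cooccurrence(questions, post_date):
    """Count questions that carry both tags of each unordered tag pair.

    questions: iterable of (creation_date, tags) tuples (hypothetical shape).
    post_date: questions created earlier than this date are ignored.
    """
    counts = Counter()
    for created, tags in questions:
        if created < post_date:
            continue  # drop posts earlier than POST_DATE
        # Sort so that each pair has one canonical key, e.g. ("numpy", "python").
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


# Made-up sample data for illustration.
questions = [
    (date(2011, 1, 5), ["python", "numpy"]),
    (date(2013, 3, 1), ["python", "numpy"]),
    (date(2014, 7, 9), ["python", "pandas", "numpy"]),
]
counts = tag_cooccurrence(questions, date(2012, 8, 25))
print(counts[("numpy", "python")])  # 2: the 2011 question is filtered out
```

With an earlier POST_DATE such as 2010-01-01, the same pair would be counted three times, so older questions would weigh into the similarity.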
From the root of the repository, type
make visualize POST_DATE=2012-08-25
(Of course, you can specify any other date, but it must follow the YYYY-MM-DD format.)
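Since the full run takes a long time, you may want to check the date string before starting it. The sketch below shows one way to validate the YYYY-MM-DD format in Python. The helper name parse_post_date is hypothetical, and this is not necessarily how the Makefile or its scripts check the value.

```python
from datetime import datetime


def parse_post_date(value):
    """Return a datetime.date for a strict YYYY-MM-DD string.

    Raises ValueError for any other input.
    """
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    # strptime also accepts non-padded input such as "2012-8-25";
    # comparing with the ISO form rejects it.
    if parsed.isoformat() != value:
        raise ValueError("expected YYYY-MM-DD, got {!r}".format(value))
    return parsed


print(parse_post_date("2012-08-25"))  # 2012-08-25
```

Inputs such as 25-08-2012 or 2012-13-01 raise ValueError.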
This command requires several hours to complete. It writes the results to a separate folder whose name contains the POST_DATE value, e.g. tiles_2012-08-25. Don't hesitate to try different POST_DATE values: they do not overwrite each other! Then perform the steps described above:
cd ./src/visualization
python3 run_server.py
Open http://localhost:8000/ in your web browser. You will see a drop-down list on the left, where you can choose which visualization to show. Choose the one that corresponds to the POST_DATE you specified. Hooray, you now see the full set of tags!
Check out a video demonstration (click to play):
Mikhail Koltsov (ItsLastDay)
Arkady Kalakutsky (testlnord)