Materials

Mikhail Koltsov edited this page Dec 13, 2016 · 9 revisions

Articles

  1. VP trees: A data structure for finding stuff fast
    The article from which bhtsne's implementation of vantage-point trees originates;
  2. How to use t-SNE effectively
    The article covers common pitfalls in interpreting t-SNE results:
  • perplexity really matters;
  • cluster sizes on a plot mean nothing;
  • distance between clusters means nothing;
  • random noise doesn't always look random;
  • observed shapes are not reliable.
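
To make the perplexity pitfall concrete: in t-SNE, perplexity is 2 raised to the Shannon entropy of a point's conditional neighbour distribution, and each point's Gaussian bandwidth is tuned until that value matches the user's setting. The sketch below does this for a single point in plain Python. The function names and the toy distances are illustrative assumptions, not code from bhtsne or the article.

```python
import math

def conditional_probs(sq_dists, sigma):
    """P(j|i) for one point i, given squared distances to the other points."""
    d_min = min(sq_dists)  # shift for numerical stability; probabilities are unchanged
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Binary-search the bandwidth whose perplexity matches `target`."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid  # distribution too peaked: widen the bandwidth
    return (lo + hi) / 2.0

# Toy squared distances from one point to five others (made-up numbers).
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
for target in (1.5, 3.0, 4.5):
    sigma = sigma_for_perplexity(sq_dists, target)
    print(f"perplexity {target}: sigma = {sigma:.3f}")
```

A higher perplexity forces a wider bandwidth, so each point treats more of its neighbours as "close". This is why the same data can look very different at different settings.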

Videos

  1. Design at Large - Laurens van der Maaten, Visualizing Data Using Embeddings.
    Interesting points:
  • PCA preserves global structure, while t-SNE aims to preserve local structure (nearest neighbours);
  • the Student-t distribution lets us place dissimilar points farther apart on the map;
  • we can use t-SNE to evaluate our machine learning feature design (i.e. features for similar objects are similar);
  • we can use t-SNE to observe data weaknesses (e.g. denormalization);
  • matrix factorization is used in machine learning because it gives a compact representation of the data, and we can also use the matrix rows as points;
  • to plot co-authorship or synonym data we can use multiple maps t-SNE. The number of maps can be chosen by looking at the value of the KL divergence as a function of the number of maps;
  • larger datasets can have perplexity higher than 50.
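
The heavy-tail point from the talk can be checked numerically. The sketch below compares an unnormalized Gaussian kernel (as used for input similarities) with the Student-t kernel with one degree of freedom (as used in the map). It is a simplified illustration: normalization over all pairs is ignored, unit variance is assumed, and the function names are made up for this example.

```python
import math

def gaussian_kernel(d):
    """Unnormalized Gaussian similarity at distance d (unit variance assumed)."""
    return math.exp(-d * d / 2.0)

def student_t_kernel(d):
    """Unnormalized Student-t similarity with one degree of freedom."""
    return 1.0 / (1.0 + d * d)

def student_t_distance(similarity):
    """Map distance at which the Student-t kernel equals `similarity`."""
    return math.sqrt(1.0 / similarity - 1.0)

for d in (0.5, 1.0, 2.0, 4.0):
    s = gaussian_kernel(d)
    print(f"input distance {d}: similarity {s:.4g}, "
          f"matching map distance {student_t_distance(s):.3g}")
```

For small similarity values the matching map distance grows much faster than the input distance, which is the room the Student-t tail gives to dissimilar points.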

Research Papers

Studied

  1. Maaten L., Hinton G. Visualizing data using t-SNE //Journal of Machine Learning Research. – 2008.
  2. Van Der Maaten L. Accelerating t-SNE using tree-based algorithms //Journal of machine learning research. – 2014.
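
The 2014 paper uses vantage-point trees for nearest-neighbour search, the same structure described in the VP trees article listed under Articles. The following is a minimal Python sketch of the idea, not bhtsne's actual C++ code: class and function names are hypothetical, and the vantage point is picked at random.

```python
import math
import random

def euclidean(a, b):
    return math.dist(a, b)

class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius

def build(points, rng):
    """Recursively split the points around a randomly chosen vantage point."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = sorted(euclidean(vantage, p) for p in points)
    radius = dists[len(dists) // 2]
    inside = [p for p in points if euclidean(vantage, p) <= radius]
    outside = [p for p in points if euclidean(vantage, p) > radius]
    return VPNode(vantage, radius, build(inside, rng), build(outside, rng))

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = euclidean(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Visit the side containing the query first; visit the other side only
    # if the triangle inequality cannot rule out a closer point there.
    if d <= node.radius:
        best = nearest(node.inside, query, best)
        if d + best[0] > node.radius:
            best = nearest(node.outside, query, best)
    else:
        best = nearest(node.outside, query, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, best)
    return best

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(points, rng)
print(nearest(tree, (0.5, 0.5)))
```

The tree needs only a distance function that satisfies the triangle inequality, so the same sketch works for any metric, not just Euclidean distance.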

Viewed

  1. Hinton G. E., Roweis S. T. Stochastic neighbor embedding //Advances in neural information processing systems. – 2002.
  2. Yang Z., Peltonen J., Kaski S. Optimization Equivalence of Divergences Improves Neighbor Embedding //ICML. – 2014.
    They prove an optimization equivalence between divergences that ties together the graph-layout and point-embedding approaches to visualization, and they compare t-SNE with their ws-SNE method on graph visualization (arguing that ws-SNE performs better).
  3. Biuk-Aghai R. P. Visualizing co-authorship networks in online Wikipedia //2006 International Symposium on Communications and Information Technologies. – IEEE, 2006.

To read

  1. Venna J. et al. Information retrieval perspective to nonlinear dimensionality reduction for data visualization //Journal of Machine Learning Research. – 2010.
  2. Vladymyrov M., Carreira-Perpinan M. Partial-Hessian strategies for fast learning of nonlinear embeddings //arXiv preprint arXiv:1206.4646. – 2012.

On visualizing clustered overlapping data

  1. Vihrovs J. et al. An inverse distance-based potential field function for overlapping point set visualization //Information Visualization Theory and Applications (IVAPP), 2014 International Conference on. – IEEE, 2014.
  2. Santamaría R., Therón R. Overlapping clustered graphs: co-authorship networks visualization //International Symposium on Smart Graphics. – Springer Berlin Heidelberg, 2008.
  3. Vehlow C., Beck F., Weiskopf D. The state of the art in visualizing group structures in graphs //Eurographics Conference on Visualization (EuroVis)-STARs. – 2015.

Related to project structure and Python

  1. Sharing Your Side Projects Online and Good Enough Practices for Scientific Computing
  • include runnable examples (and walk-throughs). Jupyter notebooks even allow JavaScript code inside;
  • README should provide: context for a project, build instructions, limitations, example output;
  • you should provide a small test data set for users to work on, so that they can check that their environment is set up correctly;
  • you should provide explicit dependencies (requirements.txt);
  • import click: with this package you can build a command-line interface;
  • make is ok even for non-C++ commands;
  • always engineer: nice variable names, separated functions, etc.
  2. Generators: The Final Frontier
  3. Stop Writing Classes
  4. Cookiecutter Data Science project structure.
  • src and data folders;
  • visualization folder inside src;
  • analysis is a DAG, so make is a good choice;
  • data is immutable, always include raw data (or at least give a script to obtain it).
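
The "analysis is a DAG" point means that a valid execution order is a topological sort of the step dependencies, which is what make derives from its rules. A small sketch with the standard-library graphlib module (Python 3.9+) follows. The step names are hypothetical and only loosely follow the src and data layout above.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "data/raw": set(),
    "data/processed": {"data/raw"},
    "features": {"data/processed"},
    "model": {"features"},
    "figures": {"model", "data/processed"},
}

def run_order(dependencies):
    """Return one valid execution order; raises CycleError if there is a cycle."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(pipeline)
print(order)
```

Raw data has no dependencies, so it always comes first and is never rewritten by later steps, which matches the "data is immutable" rule.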

Related to deploying and performance of the server

  1. Understanding resource timing: how to interpret timings in the Chrome developer panel. It notes that under HTTP/1.1 the browser opens at most 6 concurrent connections to a single origin, so only 6 images can be downloaded from one web server at a time. We therefore need HTTP/2;
  2. Guide on how to set up NGINX with HTTP/2 support.
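
A rough back-of-the-envelope model shows why the 6-connection limit matters when a page loads many images. It assumes every request takes the same time and that bandwidth is not the bottleneck, which real servers will not satisfy exactly. The function names and the 50 ms figure are made up for illustration.

```python
import math

def http1_fetch_time_ms(n_requests, per_request_ms, max_connections=6):
    """Total time when requests queue behind a per-origin connection limit.

    Simplified model: requests complete in waves of `max_connections`.
    """
    waves = math.ceil(n_requests / max_connections)
    return waves * per_request_ms

def multiplexed_fetch_time_ms(n_requests, per_request_ms):
    """Same simplified model with all requests in flight at once."""
    return per_request_ms if n_requests > 0 else 0

for n in (6, 30, 100):
    print(n, http1_fetch_time_ms(n, 50), multiplexed_fetch_time_ms(n, 50))
```

Under this model the HTTP/1.1 time grows with the number of waves, while multiplexing over a single connection removes the queueing. Measured gains will be smaller once bandwidth becomes the limiting factor.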