Materials
Mikhail Koltsov edited this page Dec 13, 2016
- VP trees: A data structure for finding stuff fast
  An article from which `bhtsne`'s implementation of vantage-point trees originates;
- How to use t-SNE effectively
  The article covers common pitfalls when interpreting t-SNE results:
- perplexity really matters;
- cluster sizes on a plot mean nothing;
- distance between clusters means nothing;
- random noise doesn't always look random;
- observed shapes are not reliable.
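
The first pitfall is easier to reason about with the definition in hand: in the original t-SNE paper, the perplexity of a point's neighbour distribution is 2 raised to its Shannon entropy, and the Gaussian bandwidth of each point is found by binary search so that this value matches the user's setting. The following is a minimal sketch of that relationship using only the standard library; the function names and the search bounds are assumptions made for illustration, not the API of any t-SNE library.

```python
import math


def conditional_probabilities(distances_sq, sigma):
    """Gaussian neighbour probabilities p(j|i) for one point i.

    distances_sq holds squared distances from point i to every other point.
    The smallest distance is subtracted first so the weights cannot all
    underflow to zero for a very small sigma.
    """
    nearest = min(distances_sq)
    weights = [math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in distances_sq]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy


def sigma_for_perplexity(distances_sq, target, lo=1e-3, hi=1e3, iterations=60):
    """Bisection (in log space) for the bandwidth that gives the target perplexity.

    Perplexity grows with sigma, so the search interval can be halved each step.
    The bounds lo and hi are illustrative assumptions.
    """
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_probabilities(distances_sq, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Squared distances from one point to eight others (made-up numbers).
example = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
narrow = sigma_for_perplexity(example, target=2.0)
wide = sigma_for_perplexity(example, target=5.0)
print(narrow, wide)
```

A higher perplexity forces a wider bandwidth, so each point "sees" more neighbours; this is why the same data can look very different at different perplexity values.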
- Design at Large - Laurens van der Maaten, Visualizing Data Using Embeddings.
  Interesting points:
- PCA preserves global structure, while t-SNE aims to preserve local structure (nearest neighbours);
- the heavy-tailed Student-t distribution lets us place dissimilar points farther apart on the map;
- we can use t-SNE to evaluate our machine learning feature design (i.e. to check that similar objects get similar features);
- we can use t-SNE to spot weaknesses in the data (e.g. denormalization);
- matrix factorization is used in machine learning because it gives a compact representation of the data, and the matrix rows can be used as points;
- to plot co-authorship or synonym data we can use multiple maps t-SNE. The number of maps can be chosen from the value of the KL divergence as a function of the number of maps;
- larger datasets can have perplexity higher than 50.
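
The point about the Student-t distribution can be checked with a few lines of arithmetic. The sketch below compares two unnormalised kernels, a Gaussian `exp(-d^2)` and a Student-t with one degree of freedom `1 / (1 + d^2)`, and asks how far apart two map points must be for the kernel to drop to a given similarity. It ignores the normalisation over all pairs that t-SNE applies, and the function names are assumptions for illustration only.

```python
import math


def gaussian_kernel(distance):
    """Unnormalised Gaussian similarity for a map distance."""
    return math.exp(-distance ** 2)


def student_t_kernel(distance):
    """Unnormalised Student-t similarity (one degree of freedom)."""
    return 1.0 / (1.0 + distance ** 2)


def gaussian_distance(similarity):
    """Map distance at which the Gaussian kernel equals `similarity`."""
    return math.sqrt(-math.log(similarity))


def student_t_distance(similarity):
    """Map distance at which the Student-t kernel equals `similarity`."""
    return math.sqrt(1.0 / similarity - 1.0)


for similarity in (0.9, 0.5, 0.1, 0.01):
    print(
        f"similarity {similarity}: "
        f"gaussian d = {gaussian_distance(similarity):.3f}, "
        f"student-t d = {student_t_distance(similarity):.3f}"
    )
```

For high similarities the two distances are nearly the same, while for low similarities the Student-t kernel needs a much larger distance, which is the extra room that dissimilar points get on the map.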
- Van der Maaten L., Hinton G. Visualizing data using t-SNE //Journal of Machine Learning Research. – 2008.
- Van der Maaten L. Accelerating t-SNE using tree-based algorithms //Journal of Machine Learning Research. – 2014.
- Hinton G. E., Roweis S. T. Stochastic neighbor embedding //Advances in neural information processing systems. – 2002.
- Yang Z., Peltonen J., Kaski S. Optimization Equivalence of Divergences Improves Neighbor Embedding //ICML. – 2014.
  The authors prove a kind of equivalence between graph-visualization and point-visualization approaches, and give examples of how t-SNE performs on graph visualization (in the context of arguing that their ws-SNE approach is superior).
- Biuk-Aghai R. P. Visualizing co-authorship networks in online Wikipedia //2006 International Symposium on Communications and Information Technologies. – IEEE, 2006.
- Venna J. et al. Information retrieval perspective to nonlinear dimensionality reduction for data visualization //Journal of Machine Learning Research. – 2010.
- Vladymyrov M., Carreira-Perpinan M. Partial-Hessian strategies for fast learning of nonlinear embeddings //arXiv preprint arXiv:1206.4646. – 2012.
- Vihrovs J. et al. An inverse distance-based potential field function for overlapping point set visualization //Information Visualization Theory and Applications (IVAPP), 2014 International Conference on. – IEEE, 2014.
- Santamaría R., Therón R. Overlapping clustered graphs: co-authorship networks visualization //International Symposium on Smart Graphics. – Springer Berlin Heidelberg, 2008.
- Vehlow C., Beck F., Weiskopf D. The state of the art in visualizing group structures in graphs //Eurographics Conference on Visualization (EuroVis)-STARs. – 2015.
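
For readers who have not met the vantage-point trees mentioned at the top of this list, here is a minimal sketch of the idea: pick a vantage point, split the remaining points by their distance to it at the median, and use the triangle inequality to skip one side during a nearest-neighbour search. This is an illustration written for this page under those assumptions, not code taken from `bhtsne`; all names are hypothetical.

```python
import math
import random


class VPNode:
    """One node: a vantage point, a split radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside    # points with distance < radius
        self.outside = outside  # points with distance >= radius


def build_vp_tree(points, distance, rng):
    """Recursively build a tree, splitting at the median distance."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [distance(vantage, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d < radius]
    outside = [p for p, d in zip(points, dists) if d >= radius]
    return VPNode(
        vantage,
        radius,
        build_vp_tree(inside, distance, rng),
        build_vp_tree(outside, distance, rng),
    )


def nearest_neighbour(node, query, distance, best=None):
    """Return (distance, point) of the stored point closest to the query."""
    if node is None:
        return best
    d = distance(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the side that contains the query first.
    if d < node.radius:
        near, far = node.inside, node.outside
    else:
        near, far = node.outside, node.inside
    best = nearest_neighbour(near, query, distance, best)
    # By the triangle inequality the far side can only hold a closer point
    # if the current best ball around the query crosses the split radius.
    if abs(d - node.radius) <= best[0]:
        best = nearest_neighbour(far, query, distance, best)
    return best


rng = random.Random(0)
cloud = [(rng.random(), rng.random()) for _ in range(100)]
tree = build_vp_tree(cloud, math.dist, rng)
print(nearest_neighbour(tree, (0.5, 0.5), math.dist))
```

The tree only needs a distance function that satisfies the triangle inequality, so the same structure works for any metric, not just Euclidean distance.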
- include runnable examples (and walk-throughs). Jupyter notebooks even allow .js code inside;
- the README should provide context for the project, build instructions, limitations, and example output;
- provide a small test data set for users to work on, so that they can be sure their environment is ok;
- provide explicit dependencies (`requirements.txt`);
- `import click`: with this you can make a CLI;
- `make` is ok even for non-C++ commands;
- always engineer: nice variable names, separated functions, etc.
- Generators: The Final Frontier
- Stop Writing Classes
- Cookiecutter Data Science project structure.
- `src` and `data` folders;
- `visualization` folder inside `src`;
- analysis is a DAG, so `make` is a good choice;
- data is immutable, always include raw data (or at least give a script to obtain it).
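
To make the "analysis is a DAG" point concrete, the sketch below writes a small pipeline as a dependency graph and asks the standard library's `graphlib.TopologicalSorter` (Python 3.9+) for a valid build order, which is essentially what `make` works out from its rules. The target names are hypothetical examples, not a layout prescribed by any of the materials above.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each target maps to the targets it depends on.
pipeline = {
    "data/raw": set(),
    "data/processed": {"data/raw"},
    "models/embedding": {"data/processed"},
    "reports/figures": {"models/embedding"},
    "reports/summary": {"data/processed", "reports/figures"},
}


def build_order(dependencies):
    """Return the targets in an order where every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


for target in build_order(pipeline):
    print(target)
```

Because raw data is never modified, every downstream target can be rebuilt from it by replaying the steps in this order.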
- Understanding resource timing: how to interpret timings in the Chrome developer tools panel. It says that over HTTP/1.1 only 6 images can be downloaded concurrently from a single web server, so we need to switch to HTTP/2;
- Guide on how to set up NGINX with HTTP/2 support.