The NCD (Normalized Compression Distance) algorithm lets you calculate the similarities between a set of files so you can plot them as a hierarchical cluster. The resulting image lets you easily detect relationships in your data.
Say you have a set of DNA sequences (there happens to be one in this directory). In an ideal world you could analyse the information stored in these sequences and decide on that basis which species are related. Unfortunately, this so-called information distance cannot be computed, because it is defined in terms of Kolmogorov complexity, which is uncomputable. However, there is an approximation: the normalized compression distance, which estimates the information distance by running a compression algorithm on the files. You might want to read the Wikipedia article on normalized compression distance for more details.
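To make the idea concrete, here is a minimal Python sketch of the usual NCD formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. It is only an illustration, not the code of this program: the function names `compressed_size` and `ncd` are made up for the example, and it assumes Python's standard `lzma` module with default settings as the compressor.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of the LZMA-compressed input."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Hypothetical helper for illustration; the compressor settings are
    an assumption and not necessarily what the program uses.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values close to 0 mean the two inputs share most of their information; values close to 1 mean the compressor found almost nothing in common. Because real compressors add some overhead, the result can be slightly above 0 for identical inputs and slightly above 1 for unrelated ones.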
This is what this program does. The result is a text file containing the similarities in a matrix, a huge pile of numbers. This data can then be plotted using a tool like R. If you don't know R, it might be best to start by having a look at a graphical frontend like RStudio.
This should give you an image like this one:
- Set up a build environment. I'm using LLVM/Clang, but g++ works just as well
- Install dependencies (LZMA library and Boost filesystem library)
- make all
- make run
$ ./cluster input-dir output-file
This reads all files in the specified input directory and performs an NCD calculation on every pair of them. The result is written to the specified output file, which is overwritten without warning if it already exists.
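The workflow described above (read every file in a directory, compute all pairwise distances, write a matrix) can be sketched in Python roughly as follows. This is an illustration only, not the program's actual implementation: `ncd_matrix` and `write_matrix` are hypothetical names, the standard `lzma` module stands in for the LZMA library, and the tab-separated layout with file names as row and column labels is an assumption that may differ from what `cluster` writes.

```python
import lzma
from pathlib import Path


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of the LZMA-compressed input."""
    return len(lzma.compress(data))


def ncd_matrix(input_dir):
    """Compute NCD for every pair of regular files in input_dir.

    Returns (names, matrix), with files sorted by path for a stable order.
    """
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    contents = [p.read_bytes() for p in files]
    sizes = [compressed_size(c) for c in contents]  # compress each file once
    matrix = []
    for i, x in enumerate(contents):
        row = []
        for j, y in enumerate(contents):
            cxy = compressed_size(x + y)
            low, high = min(sizes[i], sizes[j]), max(sizes[i], sizes[j])
            row.append((cxy - low) / high)
        matrix.append(row)
    return [p.name for p in files], matrix


def write_matrix(names, matrix, output_file):
    """Write the matrix as tab-separated text (assumed layout).

    Mode "w" truncates an existing file without asking.
    """
    with open(output_file, "w") as out:
        out.write("\t".join(names) + "\n")
        for name, row in zip(names, matrix):
            values = "\t".join(f"{v:.6f}" for v in row)
            out.write(f"{name}\t{values}\n")
```

A call such as `write_matrix(*ncd_matrix("input-dir"), "output-file")` would then mirror the command line shown above. Note that the number of compressions grows quadratically with the number of files, so large directories take noticeably longer.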
You can then use the included plot.R file to plot the NCD matrix.
No.
Feel free to just open a new issue.