Normalized Compression Distance

What is this?

The NCD (Normalized Compression Distance) algorithm lets you calculate the similarities between a set of files so you can plot them as a hierarchical cluster. The resulting image lets you easily detect relationships in your data.

For example?

Say, you have a set of DNA sequences. Coincidentally there is one in this directory. In an ideal world you could now analyse the information stored in these sequences and decide about which species are related on this basis. Unfortunately though, it's not possible to calculate this so-called information distance. However, there is an approximation, the normalised compression distance which estimates the information distance by using a compression algorithm on the files. You might want to read the Wikipedia article for more details.

This is what this program does. The result is a text file containing the similarities in a matrix, a huge pile of numbers. This data can then be plotted using a tool like R. If you don't know R, it might be best to start by having a look at a graphical frontend like RStudio.

This should give you an image like this one:

Is it any good?

Yes.

How do you build it?

Setup build environment. I'm using LLVM/Clang but g++ works just as fine
Install dependencies (LZMA library and Boost filesystem library)
make all
make run

How can I run run it?

$ ./cluster input-dir output-file

This reads all files in the specified input directory and performs an NCD calculation on them. The result are written to the specified output file which is overwritten without warning if it already exists.

You can then use the included plot.R file to plot the NCD matrix.

Any prebuilt executables available?

No.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data/dna		data/dna
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
cluster.cpp		cluster.cpp
example.png		example.png
plot.R		plot.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Normalized Compression Distance

What is this?

For example?

Is it any good?

How do you build it?

How can I run run it?

Any prebuilt executables available?

More questions?

About

Releases

Packages

Languages

License

MKleusberg/ncd

Folders and files

Latest commit

History

Repository files navigation

Normalized Compression Distance

What is this?

For example?

Is it any good?

How do you build it?

How can I run run it?

Any prebuilt executables available?

More questions?

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages