Merge branch 'release/0.0.1'

jmbr · May 20, 2017 · 6a021be · 6a021be
2 parents 706cb07 + 07d47ac
commit 6a021be
Show file tree

Hide file tree

Showing 29 changed files with 1,858 additions and 0 deletions.
diff --git a/GNUmakefile b/GNUmakefile
@@ -0,0 +1,10 @@
+#############################################################################
+# Development make file.
+#############################################################################
+
+SOURCE_DIR := diffusion_maps
+
+$(SOURCE_DIR)/version.py:
+	@scripts/make-version > $@
+
+.PHONY: $(SOURCE_DIR)/version.py
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2017 Juan M. Bello Rivas <jmbr@superadditive.com>
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.org b/README.org
@@ -0,0 +1,65 @@
+#+TITLE: Diffusion Maps and Geometric Harmonics for Python
+#+AUTHOR: Juan M. Bello Rivas
+#+EMAIL: jmbr@superadditive.com
+#+DATE: <2017-05-20 Sat>
+
+* Overview
+
+The =diffusion-maps= library for Python provides a fast and accurate implementation of diffusion maps[fn:1] and geometric harmonics[fn:2]. Its speed stems from the use of sparse linear algebra and (optionally) graphics processing units to accelerate computations.
+The included code routinely solves eigenvalue problems 3 x faster than SciPy using GPUs on matrices with over 200 million non-zero entries. 
+
+The package includes a command-line utility for the quick calculation of diffusion maps on data sets.
+
+Some of the features of the =diffusion-maps= module include:
+
+- Fast evaluation of distance matrices using nearest neighbors.
+
+- Fast and accurate computation of eigenvalue/eigenvector pairs using sparse linear algebra.
+
+- Optional GPU-accelerated sparse linear algebra routines.
+
+- Optional interface to the [[https://github.com/opencollab/arpack-ng][ARPACK-NG]] library.
+
+- Simple and easily modifiable code.
+
+[fn:1] Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21(1), 5–30. http://doi.org/10.1016/j.acha.2006.04.006
+
+[fn:2] Coifman, R. R., & Lafon, S. (2006). Geometric harmonics: A novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis, 21(1), 31–52. http://doi.org/10.1016/j.acha.2005.07.005
+
+#+CAPTION: Geometric harmonics for $z = sin(x^2 + y^2)$.
+#+NAME:   fig:geometric-harmonics
+[[./geometric-harmonics.png]]
+
+* Prerequisites
+
+The library is implemented in Python 3.5+ and uses [[http://www.numpy.org/][NumPy]] and [[https://www.scipy.org/][SciPy]]. It is recommended to install [[https://mathema.tician.de/software/pycuda/][PyCUDA]] to enable the GPU-accelerated eigenvalue solver.
+
+The =diffusion-maps= command can display the resulting diffusion maps using [[https://matplotlib.org/][Matplotlib]] if it is available.
+
+* Installation
+
+ Use ~python setup.py install~ to install on your system or ~python setup.py install --user~ for a user-specific installation.
+
+* Command-line utility
+
+The ~diffusion-maps~ command reads data sets stored in NPY format. The simplest way to use it is to invoke it as follows:
+
+#+BEGIN_SRC bash
+diffusion-maps DATA-SET.NPY EPSILON-VALUE
+#+END_SRC
+
+There exist parameters to save and visualize different types of results, to specify how many eigenvalue/eigenvector pairs to compute, etc. See the help page displayed by:
+
+#+BEGIN_SRC bash
+diffusion-maps --help
+#+END_SRC
+
+* Additional documentation
+
+[[http://www.sphinx-doc.org/en/stable/][Sphinx]]-based API documentation is available in the =doc/= folder. Run
+
+#+BEGIN_SRC bash
+make -C doc html
+#+END_SRC
+
+to build the documentation.
diff --git a/diffusion_maps/__init__.py b/diffusion_maps/__init__.py
@@ -0,0 +1,7 @@
+"""Diffusion maps module.
+
+"""
+
+from .diffusion_maps import *
+from .geometric_harmonics import *
+from .version import *
diff --git a/diffusion_maps/clock.py b/diffusion_maps/clock.py
@@ -0,0 +1,61 @@
+"""Clock for keeping track of the wall time.
+
+"""
+
+
+__all__ = ['ClockError', 'Clock']
+
+import datetime
+import time
+
+from typing import Optional     # noqa: F401. Used for mypy.
+
+
+class ClockError(Exception):
+    """Invalid clock operation."""
+    pass
+
+
+class Clock:
+    """Clock for keeping track of time.
+
+    """
+    def __init__(self) -> None:
+        self.start = None       # type: Optional[float]
+        self.stop = None        # type: Optional[float]
+
+    def tic(self) -> None:
+        """Start the clock."""
+        self.start = time.monotonic()
+        self.stop = None
+
+    def toc(self) -> None:
+        """Stop the clock."""
+        assert self.start is not None
+        self.stop = time.monotonic()
+
+    def __str__(self) -> str:
+        """Human-readable representation of elapsed time."""
+        if self.start is None:
+            raise ClockError('The clock has not been started')
+        else:
+            start = datetime.datetime.fromtimestamp(self.start)
+
+        if self.stop is None:
+            stop = datetime.datetime.fromtimestamp(time.monotonic())
+        else:
+            stop = datetime.datetime.fromtimestamp(self.stop)
+
+        delta = stop - start
+
+        return str(delta)
+
+    def __enter__(self):
+        if self.start is None and self.stop is None:
+            self.tic()
+
+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        if self.start is not None:
+            self.toc()
diff --git a/diffusion_maps/command_line_interface.py b/diffusion_maps/command_line_interface.py
@@ -0,0 +1,152 @@
+#!/usr/bin/env python
+# PYTHON_ARGCOMPLETE_OK                                -*- mode: python -*-
+
+"""Command line interface for the computation of diffusion maps.
+
+"""
+
+import argparse
+try:
+    import argcomplete
+except ImportError:
+    argcomplete = None
+import logging
+import os
+import sys
+
+import numpy as np
+
+import diffusion_maps.default as default
+import diffusion_maps.version as version
+from diffusion_maps import downsample, DiffusionMaps
+from diffusion_maps.profiler import Profiler
+from diffusion_maps.plot import plot_eigenvectors
+
+
+def output_eigenvalues(ew: np.array) -> None:
+    """Output the table of eigenvalues.
+
+    """
+    logging.info('List of eigenvalues:')
+    fields = ['Real part', 'Imaginary part']
+    fmt = '{:>12} | {:<21}'
+    logging.info('-' * 30)
+    logging.info(fmt.format(*fields))
+    logging.info('-' * 30)
+    fmt = '{:+2.9f} | {:+2.9f}'
+    for eigenvalue in ew:
+        logging.info(fmt.format(eigenvalue.real, eigenvalue.imag))
+
+
+def use_cuda(args: argparse.Namespace) -> bool:
+    """Determine whether to use GPU-accelerated code or not.
+
+    """
+    try:
+        import pycuda           # noqa
+        use_cuda = True and not args.no_gpu
+    except ImportError:
+        use_cuda = False
+
+    return use_cuda
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Diffusion maps')
+    parser.add_argument('data_file', metavar='FILE', type=str,
+                        help='process %(metavar)s (should be in NPY format)')
+    parser.add_argument('epsilon', metavar='VALUE', type=float,
+                        help='kernel bandwidth')
+    parser.add_argument('-n', '--num-samples', type=float, metavar='NUM',
+                        required=False, help='number of data points to use')
+    parser.add_argument('-e', '--num-eigenpairs', type=int, metavar='NUM',
+                        required=False, default=default.num_eigenpairs,
+                        help='number of eigenvalue/eigenvector pairs to '
+                        'compute')
+    parser.add_argument('-c', '--cut-off', type=float, required=False,
+                        metavar='DISTANCE', help='cut-off to use to enforce '
+                        'sparsity in the diffusion maps computation.')
+    parser.add_argument('-o', '--output-data', type=str, required=False,
+                        default='actual-data.npy', metavar='FILE', help='save '
+                        'actual data used in computation to %(metavar)s')
+    parser.add_argument('-w', '--eigenvalues', type=str,
+                        default='eigenvalues.dat', required=False,
+                        metavar='FILE', help='save eigenvalues to '
+                        '%(metavar)s')
+    parser.add_argument('-v', '--eigenvectors', type=str,
+                        default='eigenvectors.npy', required=False,
+                        metavar='FILE', help='save eigenvectors to '
+                        '%(metavar)s')
+    parser.add_argument('-m', '--matrix', type=str, required=False,
+                        metavar='FILE', help='save transition matrix to '
+                        '%(metavar)s')
+    parser.add_argument('-p', '--plot', action='store_true', default=False,
+                        help='plot first two eigenvectors')
+    parser.add_argument('--no-gpu', action='store_true', required=False,
+                        help='disable GPU eigensolver')
+    parser.add_argument('--debug', action='store_true', required=False,
+                        help='print debugging information')
+    parser.add_argument('--profile', required=False, metavar='FILE',
+                        type=argparse.FileType('w', encoding='utf-8'),
+                        help='run under profiler and save report to '
+                        '%(metavar)s')
+
+    args = parser.parse_args(sys.argv[1:])
+
+    if args.debug is True:
+        logging.basicConfig(level=logging.DEBUG, format='%(message)s')
+    else:
+        logging.basicConfig(level=logging.INFO, format='%(message)s')
+
+    prog_name = os.path.basename(sys.argv[0])
+    logging.info('{} {}'.format(prog_name, version.v_long))
+    logging.info('')
+    logging.info('Reading data from {!r}...'.format(args.data_file))
+
+    orig_data = np.load(args.data_file)
+    if args.num_samples:
+        data = downsample(orig_data, int(args.num_samples))
+    else:
+        data = orig_data
+
+    logging.info('Computing {} diffusion maps with epsilon = {:g} '
+                 'on {} data points...'
+                 .format(args.num_eigenpairs-1, args.epsilon, data.shape[0]))
+
+    with Profiler(args.profile):
+        dm = DiffusionMaps(data, args.epsilon,
+                           num_eigenpairs=args.num_eigenpairs,
+                           use_cuda=use_cuda(args))
+
+    if args.profile:
+        args.profile.close()
+
+    output_eigenvalues(dm.eigenvalues)
+
+    if args.matrix:
+        logging.info('Saving transition matrix to {!r}'
+                     .format(args.matrix))
+        import scipy.io
+        scipy.io.mmwrite(args.matrix, dm.kernel_matrix)
+
+    if args.eigenvalues:
+        logging.info('Saving eigenvalues to {!r}'
+                     .format(args.eigenvalues))
+        np.savetxt(args.eigenvalues, dm.eigenvalues)
+
+    if args.eigenvectors:
+        logging.info('Saving eigenvectors to {!r}'
+                     .format(args.eigenvectors))
+        np.save(args.eigenvectors, dm.eigenvectors)
+
+    if args.output_data and args.num_samples:
+        logging.info('Saving downsampled data to {!r}'
+                     .format(args.output_data))
+        np.save(args.output_data, data)
+
+    if args.plot is True:
+        plot_eigenvectors(data, dm.eigenvectors)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/diffusion_maps/cpu_eigensolver.py b/diffusion_maps/cpu_eigensolver.py
@@ -0,0 +1,49 @@
+"""Compute dominant eigenvectors of a sparse matrix.
+
+"""
+
+__all__ = ['eigensolver']
+
+from typing import Optional, Tuple
+
+import numpy as np
+import scipy.sparse
+import scipy.sparse.linalg
+
+from . import default
+
+
+def eigensolver(matrix: scipy.sparse.csr_matrix,
+                num_eigenpairs: int = default.num_eigenpairs,
+                sigma: Optional[float] = None,
+                initial_vector: Optional[np.array] = None)  \
+        -> Tuple[np.array, np.array]:
+    """Solve  eigenvalue problem for sparse matrix.
+
+    Parameters
+    ----------
+    matrix : scipy.sparse.csr_matrix
+        A matrix in compressed sparse row format.
+    num_eigenpairs : int, optional
+        Number of eigenvalue/eigenvector pairs to obtain.
+    sigma : float, optional
+        Find eigenvalues close to the value of sigma. The default value is
+        near 1.0.
+    initial_vector : np.array, optional
+        Initial vector to use in the Arnoldi iteration. If not set, a vector
+        with all entries equal to one will be used.
+
+    Returns
+    -------
+    ew : np.array
+        Eigenvalues in descending order of magnitude.
+    ev : np.array
+        Eigenvectors corresponding to the eigenvalues in `ew`.
+
+    """
+    if initial_vector is None:
+        initial_vector = np.ones(matrix.shape[0])
+    ew, ev = scipy.sparse.linalg.eigs(matrix, k=num_eigenpairs, which='LM',
+                                      sigma=sigma, v0=initial_vector)
+    ii = np.argsort(np.abs(ew))[::-1]
+    return ew[ii], ev[:, ii].T