Skip to content

Latest commit

 

History

History
650 lines (453 loc) · 15.3 KB

README.md

File metadata and controls

650 lines (453 loc) · 15.3 KB

Fast Exact Summation Using Small and Large Superaccumulators (XSUM)

Python package Windows Build status PyPI Anaconda-Server Badge License

In applications like optimization or finding the sample mean of data, it is desirable to use higher accuracy than a simple summation. Where in a simple summation, rounding happens after each addition. An exact summation is a way to achieve higher accuracy results.

XSUM is a library for performing exact summation using super-accumulators. It provides methods for exactly summing a set of floating-point numbers, where using a simple summation and the rounding which happens after each addition could be an important factor in many applications.

This library is an easy to use header-only cross-platform C++11 implementation and also contains Python bindings (please see the example).

The main algorithm is taken from the original C library FUNCTIONS FOR EXACT SUMMATION described in the paper "Fast Exact Summation Using Small and Large Superaccumulators," by Radford M. Neal.

The code is rewritten in C++ and amended with more functionalities with the goal of ease of use. The provided Python bindings provide the exact summation interface in a Python code.

The C++ code also includes extra summation functionalities, (parallel reduction on multi-core architectures) which are especially useful in high-performance message passing libraries (like OpenMPI and MPICH). Where binding a user-defined global summation operation to an op handle can subsequently be used in MPI_Reduce, MPI_Allreduce, MPI_Reduce_scatter, and MPI_Scan or a similar calls.

  • NOTE: To see or use or reproduce the results of the original implementation reported in the paper Fast Exact Summation Using Small and Large Superaccumulators, by Radford M. Neal, please refer to FUNCTIONS FOR EXACT SUMMATION.

Usage

XSUM library presents two objects, or super-accumulators xsum_small_accumulator and xsum_large_accumulator. It also provides methods for summing floating-point numbers and rounding to the nearest floating-point number.

A small superaccumulator uses sixty-seven 64-bit chunks, each with 32-bit overlap with the next one. This accumulator is the preferred method for summing a moderate number of terms. A large superaccumulator uses 4096 64-bit chunks and is suitable for big summations. A small superaccumulator is also a component of the large superaccumulator [1].

XSUM library provides two interfaces to use superaccumulators. The first one is a function interface, which takes input and produces output, and in the second one, supperaccumulators are represented as classes (xsum_small and xsum_large.)

C++

xsum_small_accumulator and xsum_large_accumulator, both have a default constructor, thus they do not need to be initialized. Addition operation is simply a xsum_add,

// A small superaccumulator
xsum_small_accumulator sacc;

// Adding values to the small accumulator sacc
xsum_add(&sacc, 1.0);
xsum_add(&sacc, 2.0);

// Large superaccumulator
xsum_large_accumulator lacc;

// Adding values to the large accumulator lacc
xsum_add(&lacc, 1.0e-3);
xsum_add(&lacc, 2.0e-3);

or xsum_small, and xsum_large objects can simply be used as,

// A small superaccumulator
xsum_small sacc;

// Adding values to the small accumulator sacc
sacc.add(1.0);
sacc.add(2.0);

// Large superaccumulator
xsum_large lacc;

// Adding values to the large accumulator lacc
lacc.add(1.0e-3);
lacc.add(2.0e-3);

One can also add a vector of numbers to a superaccumulator.

// A small superaccumulator
xsum_small_accumulator sacc;

// Adding a vector of numbers
double vec[] = {1.234e88, -93.3e-23, 994.33, 1334.3, 457.34, -1.234e88, 93.3e-23, -994.33, -1334.3, -457.34};

xsum_add(&sacc, vec, 10);

std::vector<double> v = {1.234e88, -93.3e-23, 994.33, 1334.3, 457.34, -1.234e88, 93.3e-23, -994.33, -1334.3, -457.34};

xsum_add(&sacc, v);

the same with xsum_small, and xsum_large objects as,

// A small superaccumulator
xsum_small sacc;

// Adding a vector of numbers
double vec[] = {1.234e88, -93.3e-23, 994.33, 1334.3, 457.34, -1.234e88, 93.3e-23, -994.33, -1334.3, -457.34};

sacc.add(vec, 10);

std::vector<double> v = {1.234e88, -93.3e-23, 994.33, 1334.3, 457.34, -1.234e88, 93.3e-23, -994.33, -1334.3, -457.34};

sacc.add(v);

The squared norm of a vector (sum of squares of elements) and the dot product of two vectors (sum of products of corresponding elements) can be added to a superaccumulator using xsum_add_sqnorm and xsum_add_dot respectively.

For exmaple,

// A small superaccumulator
xsum_small_accumulator sacc;

// Adding a vector of numbers
double vec1[] = {1.e-2, 2., 3.};
double vec2[] = {1.e-3, 2., 3.};

// Add dot product of vectors to a small superaccumulator
xsum_add_dot(&sacc, vec1, vec2, 3);

with xsum_small, and xsum_large objects as,

// A small superaccumulator
xsum_small sacc;

// Adding a vector of numbers
double vec1[] = {1.e-2, 2., 3.};
double vec2[] = {1.e-3, 2., 3.};

// Add dot product of vectors to a small superaccumulator
sacc.add_dot(vec1, vec2, 3);

When it is needed, one can simply use the xsum_init to reinitilize the superaccumulator.

xsum_small_accumulator sacc;

xsum_add(&sacc, 1.0e-2);
...

// Reinitilize the small accumulator
xsum_init(&sacc);
...

or with xsum_small object as,

xsum_small sacc;

sacc.add(1.0e-2);
...

// Reinitilize the small accumulator
sacc.init();
...

The superaccumulator can be rounded as,

xsum_small_accumulator sacc;
xsum_add(&sacc, 1.0e-15);

....

double s = xsum_round(&sacc);

where, xsum_round is used to round the superaccumulator to the nearest floating-point number.

With xsum_small, and xsum_large objects we do as,

xsum_small sacc;
sacc.add(1.0e-15);

....

double s = sacc.round();

where, round is used to round the superaccumulator to the nearest floating-point number.

Two small superaccumulators can be added together. xsum_add can be used to add the second superaccumulator to the first one without doing any rounding. Two large superaccumulator can also be added in the same way. In the case of the two large superaccumulators, the second one internally will be rounded to a small superaccumulator and then the addition is done.

// Small superaccumulators
xsum_small_accumulator sacc1;
xsum_small_accumulator sacc2;

xsum_add(&sacc1, 1.0);
xsum_add(&sacc2, 2.0);

xsum_add(&sacc1, &sacc2);

// Large superaccumulators
xsum_large_accumulator lacc1;
xsum_large_accumulator lacc2;

xsum_add(&lacc1, 1.0);
xsum_add(&lacc2, 2.0);

xsum_add(&lacc1, &lacc2);

or as,

// Small superaccumulators
xsum_small sacc1;
xsum_small sacc2;

sacc1.add(1.0);
sacc2.add(2.0);

// Add a small accumulator sacc2 to the first accumulator sacc1
sacc1.add(sacc2);

// Large superaccumulators
xsum_large lacc1;
xsum_large lacc2;

lacc1.add(1.0);
lacc2.add(2.0);

// Add a large accumulator lacc2 to the first accumulator lacc1
lacc1.add(lacc2);

A small superaccumulator can also be added to a large one,

// Small superaccumulator
xsum_small_accumulator sacc;

xsum_add(&sacc, 1.0e-10);

...

// Large superaccumulator
xsum_large_accumulator lacc;

xsum_add(&lacc, 2.0e-3);

...

// Addition of a small superaccumulator to a large one
xsum_add(&lacc, &sacc);

With xsum_small, and xsum_large objects we do as,

// Small superaccumulator
xsum_small sacc;

sacc.add(1.0e-10);

...

// Large superaccumulator
xsum_large lacc;

lacc.add(2.0e-3);

...

// Addition of a small superaccumulator to a large one
lacc.add(sacc);

The large superaccumulator can be rounded to a small one as,

xsum_large_accumulator lacc;

xsum_small_accumulator sacc = xsum_round_to_small(&lacc);

or as,

xsum_large lacc;

xsum_small_accumulator sacc = lacc.round_to_small();

Example

Two simple examples on how to use the library:

#include <iomanip>
#include <iostream>

#include "xsum/xsum.hpp"

using namespace xsum;

int main() {
  // Large superaccumulator
  xsum_large_accumulator lacc;
  double const a = 0.7209e-5;
  double s = 0;
  for (int i = 0; i < 10000; ++i) {
    xsum_add(&lacc, a);
    s += a;
  }
  std::cout << std::setprecision(20) << xsum_round(&lacc) << "\n"
            << std::setprecision(20) << s << "\n";
}

or a xsum_small or xsum_large objects can simply be used as,

#include <iomanip>
#include <iostream>

#include "xsum/xsum.hpp"

using namespace xsum;

int main() {
  // Large superaccumulator
  xsum_large lacc;
  double const a = 0.7209e-5;
  double s = 0;
  for (int i = 0; i < 10000; ++i) {
    lacc.add(a);
    s += a;
  }
  std::cout << std::setprecision(20) << lacc.round() << "\n"
            << std::setprecision(20) << s << "\n";
}

One can compile the code as,

g++ simple.cpp -std=c++11 -O3 -o simple

or

icpc simple.cpp -std=c++11 -O3 -fp-model=double -o simple

running the simple would result,

./simple

0.072090000000000001301
0.072089999999998641278

MPI reduction example (MPI_Allreduce)

To use a superaccumulator in high-performance message passing libraries, first we need an MPI datatype of a superaccumulator, and then we define a user-defined global operation XSUM that can subsequently be used in MPI_Reduce, MPI_Aallreduce, MPI_Reduce_scatter, and MPI_Scan.

The below example is a simple demonstration of the use of a superaccumulator on multiple processors, where the final summation across all processors is desired.

#include <mpi.h>

#include <iomanip>
#include <iostream>

#include "xsum/myxsum.hpp"
#include "xsum/xsum.hpp"

using namespace xsum;

int main() {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Create the MPI data type of the superaccumulator
  MPI_Datatype acc_mpi = create_mpi_type<xsum_large_accumulator>();

  // Create the XSUM user-op
  MPI_Op XSUM = create_XSUM<xsum_large_accumulator>();

  double const a = 0.239e-3;
  double s(0);

  xsum_large_accumulator lacc;

  for (int i = 0; i < 1000; ++i) {
    s += a;
    xsum_add(&lacc, a);
  }

  MPI_Allreduce(MPI_IN_PLACE, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &lacc, 1, acc_mpi, XSUM, MPI_COMM_WORLD);

  if (world_rank == 0) {
    std::cout << "Rank =  " << world_rank
              << ", sum   =  " << std::setprecision(20) << a * 1000 * world_size
              << ", sum 1 =  " << std::setprecision(20) << s
              << ", sum 2 =  " << std::setprecision(20) << xsum_round(&lacc)
              << std::endl;
  }

  // Free the created user-op
  destroy_XSUM(XSUM);

  // Free the created MPI data type
  destroy_mpi_type(acc_mpi);

  // Finalize the MPI environment.
  MPI_Finalize();
}
mpic++ mpi_simple.cpp -std=c++11 -O3 -o simple

running the above code using 4 processors would result,

mpirun -np 4 ./simple

Rank =  0, sum   =  0.95600000000000007194, sum 1 =  0.95599999999998419575, sum 2 =  0.95600000000000007194

Python

The provided Python bindings provide the exact summation interface in a Python code.

Python requirements

You need Python 3.6 or later to run xsum. You can have multiple Python versions (2.x and 3.x) installed on the same system without problems.

To install Python 3 for different Linux flavors, macOS and Windows, packages are available at
https://www.python.org/getit/

Using pip

PyPI

pip is the most popular tool for installing Python packages, and the one included with modern versions of Python.

xsum can be installed with pip:

pip install xsum

Note:

Depending on your Python installation, you may need to use pip3 instead of pip.

pip3 install xsum

Depending on your configuration, you may have to run pip like this:

python3 -m pip install xsum

Using pip (GIT Support)

pip currently supports cloning over git

pip install git+https://github.com/yafshar/xsum.git

For more information and examples, see the pip install reference.

Using conda

Anaconda-Server Badge Conda Platforms

conda is the package management tool for Anaconda Python installations.

Installing xsum from the conda-forge channel can be achieved by adding conda-forge to your channels with:

conda config --add channels conda-forge

Once the conda-forge channel has been enabled, xsum can be installed with:

conda install xsum

It is possible to list all of the versions of xsum available on your platform with:

conda search xsum --channel conda-forge

Python examples

from xsum import *
import numpy as np

# A small superaccumulator
sacc = xsum_small_accumulator()

# Adding values to the small accumulator sacc
xsum_add(sacc, 1.0)
xsum_add(sacc, 2.0)

# Large superaccumulator
lacc = xsum_large_accumulator()

# Adding values to the large accumulator lacc
xsum_add(lacc, 1.0e-3)
xsum_add(lacc, 2.0e-3)

One can also add a vector of numbers to a superaccumulator.

from xsum import *
import numpy as np

# A small superaccumulator
sacc = xsum_small_accumulator()

a = np.arange(0, 1, 0.1)

# Adding a vector of numbers
xsum_add(sacc, a)

print("sum       = {:.20f}".format(np.sum(a)))
print("Exact sum = {:.20f}".format(xsum_round(sacc)))

or a xsum_small or xsum_large objects can simply be used as,

from xsum import *
import numpy as np

# A small superaccumulator
sacc = xsum_small()

a = np.arange(0, 1, 0.1)

# Adding a vector of numbers
sacc.add(a)

print("sum       = {:.20f}".format(np.sum(a)))
print("Exact sum = {:.20f}".format(sacc.round()))

running the simple.py script would result,

python ./simple.py

sum       = 4.50000000000000088818
Exact sum = 4.50000000000000000000

References

  1. Neal, Radford M., "Fast exact summation using small and large superaccumulators," arXiv e-prints, (2015)
  2. https://www.cs.toronto.edu/~radford/xsum.software.html
  3. https://gitlab.com/radfordneal/xsum

Contributing

Copyright (c) 2020, Regents of the University of Minnesota.
All rights reserved.

Contributors:
      Yaser Afshar

License

LGPLv2.1