xray is a Python package for working with aligned sets of homogeneous, n-dimensional arrays. It implements flexible array operations and dataset manipulation for in-memory datasets within the Common Data Model widely used for self-describing scientific data (e.g., the NetCDF file format).
Adding dimension names and coordinate values to numpy's ndarray makes many powerful array operations possible:
- Apply operations over dimensions by name: `x.sum('time')`.
- Select values by label instead of integer location: `x.loc['2014-01-01']` or `x.labeled(time='2014-01-01')`.
- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions (known in numpy as "broadcasting") based on dimension names, regardless of their original order.
- Flexible split-apply-combine operations with groupby: `x.groupby('time.dayofyear').mean()`.
- Database-like alignment based on coordinate labels that smoothly handles missing values: `x, y = xray.align(x, y, join='outer')`.
- Keep track of arbitrary metadata in the form of a Python dictionary: `x.attrs`.
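For example, here is a minimal sketch of a few of these operations on a `DataArray` pulled from a netCDF file. The file and variable names are hypothetical, and `xray.open_dataset` is assumed to be the netCDF reading entry point:

```python
import xray

# Hypothetical file and variable names; open_dataset reads a netCDF file
# into a Dataset, from which we pull out a single DataArray.
ds = xray.open_dataset('weather.nc')
x = ds['temperature']  # e.g., dimensions ('time', 'lat', 'lon')

# Reduce over a dimension by name instead of by axis number.
time_total = x.sum('time')

# Select values by label rather than by integer position.
new_years_day = x.loc['2014-01-01']

# Split-apply-combine on a component of the time coordinate.
daily_climatology = x.groupby('time.dayofyear').mean()
```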
xray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. Indeed, much of its design and internal functionality (in particular, fast indexing) is shamelessly borrowed from pandas.
Because xray implements the same data model as the NetCDF file format, xray datasets have a natural and portable serialization format. But it's also easy to robustly convert an xray `DataArray` to and from a numpy `ndarray` or a pandas `DataFrame` or `Series`, providing compatibility with the full PyData ecosystem.
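As a rough sketch of what those conversions look like, given a `DataArray` `x` as above (the attribute and method names `values` and `to_series` are assumptions about the `DataArray` API):

```python
# Drop the labels and get a plain numpy ndarray (assumed attribute name).
arr = x.values

# Flatten into a pandas Series indexed by a MultiIndex over the
# dimensions (assumed method name).
series = x.to_series()
```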
pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier Python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?
xray provides two data structures that are missing in pandas:
- An extended array object (with labels) that is truly n-dimensional.
- A dataset object for holding a collection of these extended arrays aligned along shared coordinates.
Sometimes, we really want to work with collections of higher dimensional arrays (`ndim > 2`), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn't really matter. This is particularly common when working with climate and weather data, which is often natively expressed in 4 or more dimensions.
The use of datasets, which allow for simultaneous manipulation and indexing of many variables, actually handles most of the use cases for heterogeneously typed arrays. For example, if you want to keep track of latitude and longitude coordinates (numbers) as well as place names (strings) along your "location" dimension, you can simply toss both arrays into your dataset.
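For instance, here is a hypothetical sketch of such a mixed dataset (the exact `Dataset` constructor signature shown here is an assumption):

```python
import numpy as np
import xray

# Numeric coordinates and string labels share the "location" dimension.
# The (dimension, values) tuple form of the constructor is an assumption.
ds = xray.Dataset({
    'latitude':  ('location', np.array([37.8, 40.7, 51.5])),
    'longitude': ('location', np.array([-122.4, -74.0, -0.1])),
    'name':      ('location', np.array(['San Francisco', 'New York', 'London'])),
})
```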
This is a proven data model: the netCDF format has been around for decades.
Pandas does support N-dimensional panels, but the implementation is very limited:
- You need to create a new factory type for each dimensionality.
- You can't do math between NDPanels with different dimensionality.
- Each dimension in an NDPanel has a name (e.g., 'labels', 'items', 'major_axis', etc.) but the dimension names refer to order, not their meaning. You can't specify an operation to be applied along the "time" axis.
Fundamentally, the N-dimensional panel is limited by its context in the pandas data model, which treats 2D `DataFrame`s as collections of 1D `Series`, 3D `Panel`s as collections of 2D `DataFrame`s, and so on. Quite simply, we think the Common Data Model implemented in xray is better suited for working with many scientific datasets.
Iris (supported by the UK Met Office) is a similar package designed for working with weather data in Python. Iris provided much of the inspiration for xray (xray's `DataArray` is largely based on the Iris `Cube`), but it has several limitations that led us to build xray instead of extending Iris:
- Iris has essentially one first-class object (the `Cube`) on which it attempts to build all functionality (`Coord` supports a much more limited set of functionality). xray has its equivalent of the Cube (the `DataArray` object), but under the hood it is only a thin wrapper on the more primitive building blocks of `Dataset` and `Variable` objects.
- Iris has a strict interpretation of CF conventions, which, although a principled choice, we have found to be impractical for everyday use. With Iris, every quantity has physical (SI) units, all coordinates have cell-bounds, and all metadata (units, cell-bounds and other attributes) is required to match before merging or doing operations on multiple cubes. This means that a lot of time with Iris is spent figuring out why cubes are incompatible and explicitly removing possibly conflicting metadata.
- Iris can be slow and complex. Strictly interpreting metadata requires a lot of work, and (in our experience) it can be difficult to build mental models of how Iris functions work. Moreover, it means that a lot of logic (e.g., constraint handling) uses non-vectorized operations. For example, extracting all times within a range can be surprisingly slow (e.g., 0.3 seconds vs 3 milliseconds in xray to select along a time dimension with 10000 elements).
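For reference, the xray side of that time-range comparison is just pandas-style label indexing along the time dimension. This is a sketch; whether a label slice is accepted here, mirroring pandas `.loc` behavior, is an assumption:

```python
# Select every value whose time label falls within a range.
# Assumes x is a DataArray whose first dimension is "time" with datetime
# labels; label-slice support is an assumption.
subset = x.loc['2014-01-01':'2014-03-01']
```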
netCDF4-python provides a low-level interface for working with NetCDF and OpenDAP datasets in Python. We use netCDF4-python internally in xray, and have contributed a number of improvements and fixes upstream.
larry and datarray are other implementations of labeled numpy arrays that provided some guidance for the design of xray.
Our goals for xray:
- Whenever possible, build on top of and interoperate with pandas and the rest of the awesome scientific Python stack.
- Be fast. There shouldn't be a significant overhead for metadata aware manipulation of n-dimensional arrays, as long as the arrays are large enough. The goal is to be as fast as pandas or raw numpy.
- Support loading and saving labeled scientific data in a variety of formats (including streaming data).
For more details, see the full documentation, particularly the tutorial.
xray requires Python 2.7 or 3.3 and recent versions of numpy (1.7.0 or later) and pandas (0.13.1 or later). netCDF4-python, pydap and scipy are optional: they add support for reading and writing netCDF files and/or accessing OpenDAP datasets.
You can install xray from PyPI with pip:
pip install xray
Aspects of the API that we currently intend to change in future versions of xray:
- The constructor for `DataArray` objects will probably change, so that it is possible to create new `DataArray` objects without putting them into a `Dataset` first.
- Array reduction methods like `mean` may change to NA-skipping versions (like pandas).
- We will automatically align `DataArray` objects when doing math. Most likely, we will use an inner join (unlike pandas's outer join), because an outer join can result in ridiculous memory blow-ups when working with high dimensional arrays.
- Future versions of xray will add better support for working with datasets too big to fit into memory, probably by wrapping libraries like blaze/blz or biggus. More immediately, we intend to support `Dataset` objects linked to NetCDF or HDF5 files on disk to allow for incremental writing of data.
If you have questions or comments about any of this, please feel free to raise a GitHub issue or get in touch via the mailing list.
xray is an evolution of an internal tool developed at The Climate Corporation, and was written by current and former Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo. It is available under the open source Apache License.