Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrays with ndim != 1 #10

Open
ExpHP opened this issue Jun 2, 2019 · 4 comments
Open

Arrays with ndim != 1 #10

ExpHP opened this issue Jun 2, 2019 · 4 comments

Comments

@ExpHP
Copy link

ExpHP commented Jun 2, 2019

Is there any particular reason why is the current implementation restricted to 1-dimensional arrays?

In particular, right now I want to use this crate to read a bsr_matrix saved by scipy.sparse.save_npz. The output contains a data.npy which has a header like this:

{'descr': '<f8', 'fortran_order': False, 'shape': (1358814, 3, 3), }

I already have a type of my own that can represent this; all I really need from the npy crate are its shape and a raw Vec<f64> of all the data.

More generally, I think that the reasonable behavior for this crate when ndim != 1 is to continue doing exactly what it already does; give a flat, 1-dimensional iterator of the data's scalars in reading order, and have to_vec() produce a flat 1D Vec<_>. If anybody feels the need to make the dimensionality of the data more "substantial," they're free to inspect the shape attribute and go nuts.

@ExpHP
Copy link
Author

ExpHP commented Jun 12, 2019

A note: The public API I've been prototyping for n-dimensional arrays is:

Reading

The existing API is good enough, we just need a few more getters:

pub enum Order { C, Fortran }

impl<'a, T: Deserialize> NpyData<'a, T> {
    /// Get the shape as written in the file.
    pub fn shape(&self) -> &[usize] {
        &self.shape
    }

    /// Get strides for each of the dimensions.
    ///
    /// This is the amount by which the item index changes as you move along each dimension.
    /// It is a function of both [`NpyData::order`] and [`NpyData::shape`],
    /// provided for your convenience.
    pub fn strides(&self) -> &[usize] {
        &self.strides
    }

    /// Get whether the data is in C order or fortran order.
    pub fn order(&self) -> Order {
        self.order
    }
}

Documentation of get, to_vec and etc. are updated to clarify that they access the items in a flattened layout, and it is suggested that the user look at .shape() and .strides() to figure out how their n-dimensional data is arranged.

I did not bother adding anything like e.g. get_nd(nd_index: &[usize]) -> Option<&T> because it opens a rather large can of worms and I doubt it could be made performant (so I don't want to encourage working with the data in this manner).


Writing

With #16, #17, and this, I found myself wrestling with 2x2x2=8 constructors for NpyWriter. That's far too much! So I introduced a Builder API:

/// Configuration for an output `.NPY` file.
pub struct Builder<Row>;

// returns Builder::new()
impl<Row> Default for Builder<Row> { ... }

impl<Row> Builder<Row> {
    /// Construct a builder with default configuration.
    ///
    /// Data order will be initially set to C order.
    ///
    /// No dtype will be configured; the [`Builder::dtype`] method **must** be called.
    pub fn new() -> Self;

    /// Set the data order for arrays with more than one dimension.
    pub fn order(mut self, order: Order) -> Self;

    /// Use the specified dtype.
    pub fn dtype(mut self, dtype: DType) -> Self;

    /// Calls [`Builder::dtype`] with the default dtype for the type to be serialized.
    pub fn default_dtype(self) -> Self
    where Row: AutoSerialize;
}

impl<Row: Serialize> Builder<Row> {
    /// Begin writing an array of known shape.
    ///
    /// Panics if [`Builder::dtype`] was not called.
    pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize]) -> io::Result<NpyWriter<Row, W>>;

    /// Begin writing a 1d array, of length to be inferred.
    ///
    /// Panics if [`Builder::dtype`] was not called.
    pub fn begin_1d<W: Write + Seek>(&self, w: W) -> io::Result<NpyWriter<Row, W>>;
}
  • shape is provided at the begin functions rather than via a setter, so that the W: Seek bound can be removed. (No need to go back and write the shape if we know it from the very beginning!)
    • Technically speaking the drop impl requires W: Seek, but this technical limitation is easily worked around by introducing some private newtype wrapper that impls Seek by panicking.
  • open_with_dtype introduced in Support one rust type serializing as many dtypes #15 and NpyWriter::begin introduced in Allow OutFile to write to any Write + Seek #16 become redundant with the Builder API, which is why I would like to get this in before the next release so I can remove them! :P

@ExpHP
Copy link
Author

ExpHP commented Jun 16, 2019

Hmmm, eliminating the Seek bound from begin_with_shape ended up being tricker than I thought it would be. If you try the PanicSeek newtype strategy, it ends up having to become public:

impl<Row: Serialize> Builder<Row> {
    pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize])
        -> io::Result<NpyWriter<Row, PanicSeek<W>>>;   // <--------- ewww, gross

    pub fn begin_1d<W: Write + Seek>(&self, w: W)
        -> io::Result<NpyWriter<Row, W>>;
}

I had a bit of a clever idea to try hiding it with dynamic polymorphism, however, this introduces a lifetime:

pub(crate) enum MaybeSeek<'w, W> {
    Is(Box<dyn WriteSeek + 'w>),
    Isnt(W),
}

impl<Row: Serialize> Builder<Row> {
    pub fn begin_with_shape<'w, W: Write + 'w>(&self, w: W, shape: &[usize])
        -> io::Result<NpyWriter<'w, Row, W>>;

    pub fn begin_1d<'w, W: Write + Seek + 'w>(&self, w: W)
        -> io::Result<NpyWriter<'w, Row, W>>;
}

trait WriteSeek: Write + Seek {}
impl<W: Write + Seek> WriteSeek for W {}

I feel like it should be possible to remove this lifetime by turning the trait into WriteSeek<W> (which the compiler will assume borrows from W due to variance), however, the compiler doesn't seem to agree.

By the way, I benchmarked this, and the worst impact on performance I was able to register was a 5% slowdown, which occurs on plain dtypes (record dtypes are basically unaffected).

@bluenote10
Copy link

bluenote10 commented May 15, 2020

My use case was to read multidimensional tensors, so this limitation is also quite a showstopper. Passing the shape information around separately is so ugly and error prone, considering that it is available from the file.

Could we not start simple: Only add a shape getter (the strides can be inferred from that, right?) and as a first step only fill it on read (to side step having many arguments on writer side for now)?

Edit: Looking over the documentation/code a bit more, I'm getting confused. The DType and Field types actually seem to provide that information. But as far as I can see, both of these types are public without being ever used by any public interface, i.e., the user simply can't access them. Is there just a function missing?

@ExpHP
Copy link
Author

ExpHP commented May 16, 2020

N.B. shape() alone does not imply strides because arrays can either be stored as C-order or Fortran order. strides() is just a helper method to help write code that takes this into account.

When I was working on this, @potocpav seemed to have been very busy with other things. At some point in the past, I began to get a bit impatient waiting for my PRs to move, and had determined that if I didn't hear back within another week that I would start fixing up examples and documentation and preparing to release my own fork under a new name... and then suddenly I heard back from the author on my PR. (but then he disappeared again!)

Edit: Looking over the documentation/code a bit more, I'm getting confused. The DType and Field types actually seem to provide that information. But as far as I can see, both of these types are public without being ever used by any public interface, i.e., the user simply can't access them. Is there just a function missing?

My memory is foggy but I definitely do recall seeing some confusing things regarding what is and what isn't public API.

You may want to check out my fork and see if that API makes more sense. (try git clone and cargo doc --open). According to the commit list, it has both nd support as well as allowing one numpy type to be deserialized as multiple rust types. I'm pretty sure that I also tried to clean up parts of the API that I found confusing, though honestly, it has been a long while and I don't remember too much any more about what I did: https://github.com/ExpHP/nippy


Edit 2021/07/01: I have now released https://crates.io/crates/npyz which has this feature and a lot more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants