Add methods for the Arrow PyCapsule Protocol to DataFrame/Column interchange protocol objects #342
Conversation
What should the behavior be if the library has its data on, say, a GPU instead of the CPU? Throw? Copy to CPU? And if we support this CPU-only protocol, why don't we support other CPU-only protocols as well?
That's one thing I still needed to add: if a library does not support this protocol, I would propose to indicate this by simply not having the methods (so a consumer can check whether the method is present). So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.

Personally I don't have much experience with GPUs, but based on following earlier discussions, I think the general preference is that there is no silent copy? Certainly given that we want to expand this protocol to support GPU data as well (apache/arrow#38325), I think not implementing this right now is the best option for objects backed by GPU data.
Which other CPU-only protocol are you thinking about? I am not aware of any other dataframe protocol.
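To make "absence of the methods signals no support" concrete, a consumer-side check could look like the following minimal sketch (the helper name is hypothetical and not part of this PR):

```python
def supports_arrow_export(obj) -> bool:
    # A producer that cannot (or chooses not to) export Arrow data, e.g. a
    # GPU-backed library, would simply not define the dunder methods, so a
    # plain hasattr check is enough for feature detection.
    return hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")
```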
Looks good!
If these methods aren't strictly required, maybe this could be noted in the docstring? I think that would address Keith's concern
Sorry for having taken a while to respond here; it's taken me some time to fill gaps in my knowledge (thanks Joris for patiently explaining and answering questions!).
I agree - if GPU libraries started copying data to host memory now to support the CPU-only protocol, that would make it much harder to later add the device-agnostic version. If there were movement and a timeline on apache/arrow#38325, it'd be easier to say "let's not do the CPU-only version". However, it looks like that may take quite a while. Somehow work on these protocols is always slow going (DLPack was/is no exception). So it seems reasonable to me to start supporting the CPU protocol now.

There is indeed fragmentation on the array side, with too many different CPU-only protocols. However, on the dataframe side things are cleaner. If things could stay limited to the existing interchange protocol plus this Arrow PyCapsule protocol, then it looks like things are in pretty good shape.
Thanks @jorisvandenbossche! I don't have practical experience with this protocol, but it makes sense to me conceptually to add, and this PR looks like a good start to me. It's pretty self-contained, which is nice.
I have two main questions:
(1) Should these methods specify that they are only to be implemented if the data is natively stored in Arrow format under the hood? If not, it may turn into yet more work for implementers - and require copying on the producer side.
(2) Exporting PyCapsule objects always comes with the "who owns this" question. For DLPack we worked hard at eliminating that problem by specifying a public function for it and making clear that the PyCapsule was supposed to be consumed once after creation and then immediately discarded, so it wasn't possible for PyCapsules to hang around in user land. I think the intent here is similar, but it's not mentioned at all. Should there be a requirement on this?
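For context, the way those ownership issues are usually avoided is that the consumer imports the capsules immediately instead of holding on to them. A hedged sketch of that pattern, assuming a recent pyarrow (>= 14) as the consuming library (not something this PR prescribes):

```python
import pyarrow as pa

def to_pyarrow_table(interchange_df) -> pa.Table:
    # pa.table() detects __arrow_c_stream__ on the producer, imports the
    # ArrowArrayStream capsule right away, and never hands the raw PyCapsule
    # to user code, so no capsule lingers in user land with unclear ownership.
    return pa.table(interchange_df)
```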
Export the Column as an Arrow C array and schema PyCapsule.

If the Column consists of multiple chunks, this method should raise
an error.
Is this not supported at all? Or, if this is supported only at the dataframe level (since the same restriction isn't mentioned there), should this say why and/or refer to that support at the dataframe level?
Would be useful to add a section:
Raises
------
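For illustration, such a numpydoc-style section could look like the sketch below; the exact exception class is not decided in this PR, so RuntimeError is only a placeholder:

```python
def __arrow_c_array__(self, requested_schema=None) -> object:
    """
    Export the Column as an Arrow C array and schema PyCapsule.

    Raises
    ------
    RuntimeError
        If the Column consists of multiple chunks and therefore cannot
        be exported as a single Arrow C array.
    """
```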
Parameters
----------
requested_schema : PyCapsule, default None
    The schema to which the dataframe should be casted, passed as a
nit: "casted" -> "cast"
(also further down)
def __arrow_c_schema__(self) -> object:
    """
    Export the schema of the DataFrae to a Arrow C schema PyCapsule.
typo in DataFrame
""" | ||
pass | ||
|
||
def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object: |
Yet more conflicting terminology? Is "stream" supposed to mean "dataframe" here, rather than CUDA stream? If so, won't that conflict with device support later, and/or be confused with DLPack stream support?
""" | ||
pass | ||
|
||
def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object: |
How does one differentiate between a DataFrame and a struct column here (assuming those will be supported in the future)?
I am still a strong -1 on this proposal. The array API didn't implement the existing CPU-only array protocols either. We have already defined a new interchange protocol that can handle Arrow-formatted data zero copy, in addition to handling non-CPU memory and non-Arrow memory layouts.
That's OK @jorisvandenbossche, we can still use this in pandas/polars/pyarrow:

```python
def from_dataframe(df: SupportsInterchange, *, allow_copy: bool = True) -> DataFrame:
    if hasattr(df, '__arrow_c_schema__'):
        return fastpath_constructor(df, allow_copy=allow_copy)
    return _from_dataframe(
        df.__dataframe__(allow_copy=allow_copy),  # type: ignore[arg-type]
        allow_copy=allow_copy,
    )
```

even if it's not officially part of the interchange protocol
I understand why the array API didn't implement those, but I'm not sure how that is relevant for this discussion. I am not proposing to add this protocol based on some popularity metric, but because it adds missing features and convenience. The array protocols you mentioned also wouldn't add much additional value for the Array API, because it already had DLPack covering the same ground.

@kkraus14 what is not clear to me is whether your objection is mainly based on the lack of non-CPU device support, or more general. Assume for a moment that the Arrow community had already resolved the discussion around adding device support for the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?
@MarcoGorelli that's certainly possible and would already improve that part of the interchange protocol's use cases, but it still doesn't enable usage of the interchange protocol in places that require Arrow memory. Using duckdb as the typical example: they support Arrow memory, and I doubt they would ever support the interchange protocol directly. They could of course already support it easily if they wanted to by depending on pyarrow, but that dependency isn't needed with this PR.
I would be more okay with it, but I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less. I.e., if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.
We indeed already use Arrow for several things. But I would also say that Arrow deserves this special treatment because it is the most widely adopted standard for representing tabular data?
If the consuming library doesn't use Arrow-compatible memory, it doesn't need to consume the dataframe object through this protocol: the existing Column buffers access is still there, allowing you to get the data exactly as it is stored. On the producing side, if the dataframe library doesn't use Arrow-compatible memory, it indeed needs to make a copy if it wants to support this protocol. But that copy would need to be made anyhow if the consumer wants Arrow data.
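As a rough illustration of that consumer-side choice (using the existing interchange protocol API; the Arrow branch assumes the method added in this PR is present):

```python
def read_column(df, name: str):
    # Go through the interchange protocol to get the column object.
    col = df.__dataframe__().get_column_by_name(name)
    if hasattr(col, "__arrow_c_array__"):
        # Arrow-capable consumers can take the capsule route added in this PR.
        return col.__arrow_c_array__()
    # Everyone else still has the existing buffer-level access, which hands
    # out the data exactly as the producer stores it.
    return col.get_buffers()
```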
Note that cuDF currently works perfectly fine with the PyCapsule Interface, whereas its interchange protocol implementation fails with the most basic of test cases: rapidsai/cudf#17282 (comment). I'll also note that Vegafusion has replaced its usage of the dataframe interchange protocol with the PyCapsule Interface.
Addresses #279

Currently, I added schema and array support to Column, and schema and stream support to DataFrame. I think that's the typical way those objects are used (a chunked dataframe, from which you get chunks + columns, and then single-chunk columns), but for completeness (and because the spec doesn't distinguish between single- vs multi-chunk objects) we can probably add both the array and stream methods to both the DataFrame and Column objects.

I also added the schema method to both DataFrame and Column, although those are strictly speaking not "schema" objects. But we don't have a dedicated schema object in the interchange spec (DataFrame has no schema attribute, and the Column.dtype attribute is a plain tuple).
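In other words, the shape of the additions is roughly the following (a sketch for orientation only; the exact signatures and docstrings are the ones in the diff):

```python
from typing import Optional, Tuple

class Column:
    def __arrow_c_schema__(self) -> object:
        """Export the column's type as an ArrowSchema PyCapsule."""
        ...

    def __arrow_c_array__(self, requested_schema: Optional[object] = None) -> Tuple[object, object]:
        """Export the column as a pair of (ArrowSchema, ArrowArray) PyCapsules."""
        ...

class DataFrame:
    def __arrow_c_schema__(self) -> object:
        """Export the dataframe's schema as an ArrowSchema PyCapsule."""
        ...

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        """Export the dataframe as an ArrowArrayStream PyCapsule."""
        ...
```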