Question: idiomatic way of elegantly retrieving the underlying DataFrame type #1443

Open
elephaint opened this issue Nov 25, 2024 · 12 comments
Labels
enhancement New feature or request needs discussion


@elephaint

elephaint commented Nov 25, 2024

Currently I often have the following code:

```python
df_nw = nw.from_native(df)
is_pandas = nw.dependencies.is_pandas_dataframe(df)
...
if is_pandas:
    # do something different
...
```

What is difficult about this is that I need to keep track of `is_pandas` variables throughout the code, pass them into subfunctions, etc. If I have multiple DataFrames, I have multiple such `is_pandas` variables. Ideally, I'd be able to do something such as:

```python
is_pandas = df_nw.is_native_pandas
```

i.e., whether the underlying dataframe is pandas would simply be a boolean attribute of the Narwhals DataFrame. That would allow me to use `df_nw` everywhere, without needing the auxiliary variables or first converting to native.

Of course, I know I can also do `nw.dependencies.is_pandas_dataframe(df_nw.to_native())` everywhere, but that feels convoluted.

What is the cleanest way to do this?
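In the meantime, the repeated check can at least be hidden behind a small wrapper so the flag is computed once and travels with the object. A pure-Python sketch of that pattern; `Wrapper` and the `Fake*Frame` classes here are hypothetical stand-ins, not Narwhals (or pandas/polars) API:

```python
# Pure-Python sketch: compute the backend flag once and carry it with the
# wrapper, instead of threading `is_pandas` booleans through every function.
# FakePandasFrame, FakePolarsFrame, and Wrapper are stand-ins, not real
# Narwhals, pandas, or polars classes.

class FakePandasFrame:
    """Stand-in for pandas.DataFrame."""

class FakePolarsFrame:
    """Stand-in for polars.DataFrame."""

class Wrapper:
    def __init__(self, native):
        self._native = native

    @property
    def is_native_pandas(self) -> bool:
        # the single place where the backend check happens
        return isinstance(self._native, FakePandasFrame)

df_nw = Wrapper(FakePandasFrame())
print(df_nw.is_native_pandas)  # True
```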

@FBruzzesi
Member

Hey @elephaint, thanks for your request. This can certainly be a pain point for other libraries trying to adopt Narwhals.

I would say that the answer is it depends.

We have a set of utilities, namely `maybe_align_index`, `maybe_get_index`, `maybe_set_index`, `maybe_reset_index` and `maybe_convert_dtypes`, which are meant to help when working with pandas objects without having to check manually every time.

If that's not enough, `nw.dependencies.is_pandas_dataframe(df_nw.to_native())` is an option; another one would be `df_nw._compliant_frame._implementation is Implementation.PANDAS`, but that doesn't look any less convoluted to me.

In Plotly Express, I had to do something similar by adding an `is_pd_like` flag on the first encounter of the dataframe object, and passing it to various functions to branch the logic.
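That flag-threading approach can be sketched in plain Python; `detect_is_pd_like` and `process` are illustrative names, not Plotly's actual internals:

```python
# Sketch of the flag-threading pattern: detect the backend once at the entry
# point, then pass the boolean down. All names here are illustrative.

def detect_is_pd_like(df) -> bool:
    # one-time (toy) check: is this object a pandas DataFrame?
    cls = type(df)
    return cls.__name__ == "DataFrame" and cls.__module__.startswith("pandas")

def process(df, is_pd_like: bool) -> str:
    # downstream functions receive the flag instead of re-checking
    return "pandas path" if is_pd_like else "generic path"

class NotPandas:
    """Stand-in object that is clearly not a pandas DataFrame."""

obj = NotPandas()
flag = detect_is_pd_like(obj)  # False
print(process(obj, flag))      # generic path
```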

@MarcoGorelli
Member

Thanks for the request!

I think the two currently documented ways would be:

  • `if nw.get_native_namespace(df) is nw.dependencies.get_pandas()`
  • `if nw.dependencies.is_pandas_dataframe(df.to_native())`

I can see that it would be convenient to have something more ergonomic... 🤔 will think about this one. Thanks for highlighting it!

another one would be `df_nw._compliant_frame._implementation is Implementation.PANDAS` but it doesn't look less convoluted to me.

wait, this would be highly risky as it involves using private methods which may change at any time 😉 Better to stick with the public API, which we make some stability guarantees about

@MarcoGorelli MarcoGorelli added enhancement New feature or request needs discussion labels Nov 25, 2024
@elephaint
Author

Thanks for the discussion!

I think for now I'll go with `nw.dependencies.is_pandas_dataframe(df.to_native())`

Just to be clear - this is really a 'nice to have' but by no means very important to me, so don't make something crazy complex over this 😛

@FBruzzesi
Member

Just out of curiosity for now, could you point to an example that requires branching into a pandas-specific path?

@dangotbanned

Just out of curiosity for now, could you point to an example that requires branching into a pandas-specific path?

@FBruzzesi I haven't had to branch for pandas yet, but I have two examples for pyarrow.

They both relate to extracting scalar(s):

Although for these, I'd be happy with an API like `nw.to_py_scalar` but for `nw.Series` (and maybe `nw.DataFrame`)

@MarcoGorelli
Member

thanks for sharing

Although for these, I'd be happy with an API like nw.to_py_scalar but for nw.Series (and maybe nw.DataFrame)

Sorry, not sure I follow, could you clarify please?

@FBruzzesi
Member

FBruzzesi commented Nov 30, 2024

+1 on Marco's question, I am not sure I get the point.

but I have two examples for pyarrow

I share the pain of working with pyarrow objects.

For your specific case here, I think we should implement the `__contains__` operator using native functionality, without having to convert to Python lists. Currently:

```python
import narwhals as nw
import pandas as pd
import polars as pl
import pyarrow as pa

def func(value, series):
    return value in nw.from_native(series, series_only=True)

data = [1, 2, 3]
func(1, pd.Series(data))           # True
func(1, pl.Series(data))           # True
func(1, pa.chunked_array([data]))  # False
```
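A native `__contains__` along those lines could scan the chunks directly instead of converting to a Python list first. A toy sketch (this `ChunkedArray` is a stand-in, not pyarrow's class):

```python
# Toy sketch of a chunk-aware __contains__: check chunk by chunk, stopping
# early on a hit, with no full conversion to a Python list.
# This ChunkedArray is a stand-in, not pyarrow.ChunkedArray.

class ChunkedArray:
    def __init__(self, chunks):
        self.chunks = chunks

    def __contains__(self, value) -> bool:
        # short-circuits as soon as any chunk contains the value
        return any(value in chunk for chunk in self.chunks)

arr = ChunkedArray([[1, 2, 3], [4, 5]])
print(4 in arr)  # True
print(9 in arr)  # False
```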

@dangotbanned

thanks for sharing

Although for these, I'd be happy with an API like nw.to_py_scalar but for nw.Series (and maybe nw.DataFrame)

sorry not sure i follow, could you clarify please?

Sure @MarcoGorelli

So for the pandas and polars paths in these examples:

```python
# 1
nw.DataFrame.item(0, ...) -> PythonDataType
# 2
nw.Series.__iter__() -> Iterator[PythonDataType]
```

But with pyarrow:

  • # 1 needs an extra `nw.to_py_scalar`
  • # 2 instead uses `nw.Series.to_list`, which is an eager version of what I'm after

@MarcoGorelli
Member

thanks!

for #1 - yup, we could return Python scalars for all PyArrow aggregations (like Polars does), it would just require a bit of a refactor

for #2, I think you'd be better off using to_list for all backends. Polars example:

```
In [20]: s
Out[20]:
shape: (10_000_000,)
Series: '' [i64]
[
        4
        5
        7
        9
        0
        …
        8
        2
        9
        4
]

In [21]: %timeit _ = set(s)
327 ms ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]: %timeit _ = set(s.to_list())
164 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Though both solutions are eager so I'm not sure I understand "which is an eager version of what I'm after"
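The gap is unsurprising: element-wise iteration pays Python-call overhead on every item, while `to_list` pays the conversion cost once in bulk. A toy model of the two access paths (this `Series` is a stand-in, not the Narwhals or Polars class):

```python
# Toy model of the two access paths behind set(s) vs set(s.to_list()):
# iteration yields one item per Python-level step, to_list converts in bulk.
# This Series is a stand-in, not narwhals.Series or polars.Series.

class Series:
    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        # element-wise path: one Python-level step per item
        yield from self._data

    def to_list(self):
        # bulk path: a single conversion call
        return list(self._data)

s = Series([4, 5, 7, 9, 0])
assert set(s) == set(s.to_list()) == {0, 4, 5, 7, 9}
```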

@MarcoGorelli
Member

yup, we could return Python scalars for all PyArrow aggregations (like Polars) does, it would just require a bit of a refactor

actually, it's not as bad as I was expecting...gonna make a PR 😎

@dangotbanned

for #2, I think you'd be better off using to_list for all backends. Polars example:

```
In [21]: %timeit _ = set(s)
327 ms ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]: %timeit _ = set(s.to_list())
164 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Though both solutions are eager so I'm not sure I understand "which is an eager version of what I'm after"

Thanks @MarcoGorelli, good call on the to_list suggestion!

Seems I forgot about the performance impact 🤦‍♂️, even though I knew this back in vega/altair#3501 (comment)

Always fun to learn a refactor can be simpler and more performant 😄

@MarcoGorelli
Member

MarcoGorelli commented Nov 30, 2024

actually, it's not as bad as I was expecting...gonna make a PR 😎

done, available in Narwhals 1.15.0 🚀 #1471
