Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46689][SPARK-46690][PYTHON][CONNECT] Support v2 profiling in group/cogroup applyInPandas/applyInArrow #45050

Closed
wants to merge 8 commits into from

Conversation

xinrong-meng
Copy link
Member

@xinrong-meng xinrong-meng commented Feb 6, 2024

What changes were proposed in this pull request?

Support v2 (perf, memory) profiling in group/cogroup applyInPandas/applyInArrow, which rely on physical plan nodes FlatMapGroupsInBatchExec and FlatMapCoGroupsInBatchExec.

Why are the changes needed?

Complete v2 profiling support.

Does this PR introduce any user-facing change?

Yes. V2 profiling in group/cogroup applyInPandas/applyInArrow is supported.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [SPARK-46689][SPARK-46690][PYTHON][CONNECT] Support v2 profiling in group/cogroup applyInPandas [SPARK-46689][SPARK-46690][PYTHON][CONNECT] Support v2 profiling in group/cogroup applyInPandas/applyInArrow Feb 7, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review February 7, 2024 22:06
Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending tests.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

./python/pyspark/tests/test_memory_profiler.py:548:23: E741 ambiguous variable name 'l'
        def summarize(l, r):
                      ^
./python/pyspark/sql/tests/test_udf_profiler.py:501:23: E741 ambiguous variable name 'l'
        def summarize(l, r):
                      ^

df1 = self.spark.createDataFrame([(1, 1.0), (2, 2.0), (1, 3.0), (2, 4.0)], ("id", "v1"))
df2 = self.spark.createDataFrame([(1, "x"), (2, "y")], ("id", "v2"))

def summarize(l, r):
Copy link
Member

@ueshin ueshin Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we rename to left and right according to the lint error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering the reason and saw "In some fonts, these characters are indistinguishable from the numerals one and zero. When tempted to use 'l', use 'L' instead." That's good to learn :p

df1 = self.spark.createDataFrame([(1, 1.0), (2, 2.0), (1, 3.0), (2, 4.0)], ("id", "v1"))
df2 = self.spark.createDataFrame([(1, "x"), (2, "y")], ("id", "v2"))

def summarize(l, r):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@xinrong-meng
Copy link
Member Author

Merged to master, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants