Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50243][SQL][Connect] Cached classloader for ArtifactManager #49007

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

xupefei
Copy link
Contributor

@xupefei xupefei commented Nov 28, 2024

What changes were proposed in this pull request?

This PR implements a caching mechanism for the classloader of ArtifactManager, to avoid re-generating a new one every time when a SQL query runs. This change also fixes a longstanding bug where the codegen cache was broken for Spark Connect, due to new classloaders causing cache misses:

def compile(code: CodeAndComment): (GeneratedClass, ByteCodeStats) = try {
val classLoaderRef = new HashableWeakReference(Utils.getContextOrSparkClassLoader)
cache.get((classLoaderRef, code))
} catch {
.

The approach we use is to cache the generated classloader until a new artifact is added.

Why are the changes needed?

To improve performance and fix codegen caching.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added two new test cases.
Also, some existing tests will fail when the config ARTIFACTS_SESSION_ISOLATION_ALWAYS_APPLY_CLASSLOADER is enabled:

  • AdaptiveQueryExecSuite: SPARK-37753: Inhibit broadcast in left outer join when there are many empty partitions on outer/left side
  • DatasetOptimizationSuite: SPARK-27871: Dataset encoder should benefit from codegen cache

These tests will not fail after this PR.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 28, 2024
@xupefei xupefei changed the title [WIP] Cached classloader for ArtifactManager [SPARK-50243][WIP] Cached classloader for ArtifactManager Nov 29, 2024
@xupefei xupefei changed the title [SPARK-50243][WIP] Cached classloader for ArtifactManager [SPARK-50243][SQL][Connect] Cached classloader for ArtifactManager Dec 2, 2024
This reverts commit 089f064.
@xupefei xupefei marked this pull request as ready for review December 2, 2024 16:15
Comment on lines +91 to +94
protected val cachedClassLoader = new ThreadLocal[ClassLoader] {
override def initialValue(): ClassLoader = null
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why this is a ThreadLocal? Artifacts can be added from multiple threads so wouldn't the addition on one thread cause all the others cached classloaders to be invalidated?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants