Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files #22659

Open
wants to merge 6 commits into
base: master

Conversation

hantangwangd
Member

@hantangwangd hantangwangd commented May 3, 2024

Description

This PR expands the current procedure architecture in Presto to support defining, registering, and calling procedures that need to be executed in a distributed way. It then adds distributed procedure support to the Iceberg connector and implements a specific procedure, rewrite_data_files, on top of it.

Referring to: prestodb/rfcs#12

The whole PR is separated into 6 parts:

  1. Refactor the ProcedureRegistry/Procedure data structures to support the creation and registration of DistributedProcedure, and make ProcedureRegistry available in the presto-analyzer module so that distributed procedures can be recognized in call statements during the prepare and analyze stages.

  2. Handle call statements on distributed procedures in the preparer stage. In this stage, we determine the type of the procedure in the call statement and define a new query type, CALL_DISTRIBUTED_PROCEDURE, in BuiltInPreparedQuery for calls to distributed procedures. This way, a call distributed procedure statement is handled by SqlQueryExecutionFactory, then created and handled as a SqlQueryExecution.

  3. Analyze and plan the call distributed procedure statement, and finally generate a logical plan for it as follows:

OutputNode <- TableFinishNode <- CallDistributedProcedureNode <- FilterNode <- TableScanNode
  4. Handle optimization, segmentation, grouped-execution tagging, and local planning for the logical plan generated above. The handling logic for CallDistributedProcedureNode is similar to that of TableWriterNode. In addition, a new optimizer, RewriteWriterTarget, is added and placed after all other optimization rules. Once the entire optimization is complete, it updates the TableHandle held in TableFinishNode and CallDistributedProcedureNode based on the underlying TableScanNode, to account for possible filter pushdown.

  5. Refactor the Iceberg connector to support calling distributed procedures. Introduce a transaction context for Iceberg and extend IcebergSplitManager to support the split source planned by IcebergAbstractMetadata.beginCallDistributedProcedure(...). This split source is set on the transaction context, which also holds all the files to be rewritten.

  6. Support the Iceberg rewrite_data_files procedure. It builds a customized split source and sets it on the transaction context so that IcebergSplitManager can use it. It also registers a file scan task consumer that collects all the scanned files and holds them in the transaction context. Finally, in the commit stage, it gets all the data files and delete files that have been rewritten, along with all the newly generated files, and changes and commits their metadata through the Iceberg table's RewriteFiles transaction.
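
As a rough illustration of the RewriteWriterTarget idea described in the list above, here is a minimal sketch. It assumes a drastically simplified, single-source plan chain; `SketchNode`, `RewriteWriterTargetSketch`, and the string-valued table handles are hypothetical names made up for this example and are not the classes or signatures used in this PR or in Presto. The sketch only shows the direction of the update: after optimization, the handle on the underlying TableScanNode is copied up to TableFinishNode and CallDistributedProcedureNode.

```java
// Hypothetical, heavily simplified stand-in for a plan node.
// This is NOT Presto's PlanNode API; names and fields are assumptions for illustration.
final class SketchNode {
    final String kind;        // node type name, as used in the logical plan above
    String tableHandle;       // null when the node carries no table handle
    final SketchNode source;  // single upstream node, null for the leaf

    SketchNode(String kind, String tableHandle, SketchNode source) {
        this.kind = kind;
        this.tableHandle = tableHandle;
        this.source = source;
    }
}

final class RewriteWriterTargetSketch {
    // Builds Output <- TableFinish <- CallDistributedProcedure <- Filter <- TableScan.
    // plannedHandle is the handle assigned at planning time; scanHandle is the handle
    // on the scan after optimization (for example, with a filter pushed into it).
    static SketchNode buildPlan(String plannedHandle, String scanHandle) {
        SketchNode scan = new SketchNode("TableScanNode", scanHandle, null);
        SketchNode filter = new SketchNode("FilterNode", null, scan);
        SketchNode call = new SketchNode("CallDistributedProcedureNode", plannedHandle, filter);
        SketchNode finish = new SketchNode("TableFinishNode", plannedHandle, call);
        return new SketchNode("OutputNode", null, finish);
    }

    // Copies the leaf scan's handle onto the two nodes that hold the write target.
    static void rewriteWriterTarget(SketchNode root) {
        SketchNode leaf = root;
        while (leaf.source != null) {
            leaf = leaf.source;
        }
        if (!"TableScanNode".equals(leaf.kind)) {
            return; // nothing to propagate in this simplified model
        }
        for (SketchNode node = root; node != null; node = node.source) {
            boolean holdsTarget = "TableFinishNode".equals(node.kind)
                    || "CallDistributedProcedureNode".equals(node.kind);
            if (holdsTarget) {
                node.tableHandle = leaf.tableHandle;
            }
        }
    }

    public static void main(String[] args) {
        SketchNode plan = buildPlan("orders", "orders[filter pushed down]");
        rewriteWriterTarget(plan);
        for (SketchNode node = plan; node != null; node = node.source) {
            System.out.println(node.kind + " -> " + node.tableHandle);
        }
    }
}
```

Running the pass last matters in this model for the same reason given above: any earlier rule that changes the scan's handle would otherwise leave the two upper nodes pointing at a stale handle.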

Motivation and Context

N/A

Impact

N/A

Test Plan

N/A

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@hantangwangd hantangwangd marked this pull request as draft May 3, 2024 11:24
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch 4 times, most recently from 7ec819c to 9440737 Compare May 8, 2024 08:36
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch 4 times, most recently from f89dc40 to e796fa2 Compare May 13, 2024 12:56
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch 5 times, most recently from acb0351 to c3eaa96 Compare May 24, 2024 19:59
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch 3 times, most recently from 05de3c8 to 0dc3dbb Compare June 11, 2024 11:08
@hantangwangd hantangwangd changed the title [ForTest]Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files [WIP]Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files Jun 11, 2024
Contributor

@steveburnett steveburnett left a comment


Thanks for the draft doc! Some nits about punctuation, formatting, and some suggested rephrasing for readability and conciseness, but the content looks good.

presto-docs/src/main/sphinx/connector/iceberg.rst: 10 review comments (outdated, resolved)
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch from 0dc3dbb to a78c41c Compare June 13, 2024 18:47
@hantangwangd
Member Author

@steveburnett Thanks a lot for your suggestions, they have all been fixed. Please take a look when convenient!

steveburnett
steveburnett previously approved these changes Jun 13, 2024
Contributor

@steveburnett steveburnett left a comment


LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch from a78c41c to 2fdbab7 Compare July 13, 2024 03:17
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch from 2fdbab7 to befe9a7 Compare July 31, 2024 18:17
@hantangwangd hantangwangd marked this pull request as ready for review July 31, 2024 19:28
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch 2 times, most recently from efc388f to 0dfa54c Compare August 18, 2024 03:55
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch 3 times, most recently from 4286ff5 to ff0a4dc Compare September 9, 2024 09:33
@hantangwangd hantangwangd changed the title [WIP]Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files Oct 7, 2024
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch from ff0a4dc to 84989af Compare October 7, 2024 17:29
@hantangwangd hantangwangd force-pushed the support_call_distributed_procedure branch from 84989af to f756641 Compare November 3, 2024 15:49