feat: add a module to read all Substrait plan formats #45
base: main
ACTION NEEDED: Substrait follows the Conventional Commits specification. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
This PR depends on substrait-io/substrait-cpp#91 (the substrait-cpp submodule dependency), which has already been submitted. One concern will be keeping the version of Substrait that substrait-python and substrait-cpp depend on in sync.
Let's avoid the potential memory leak and then this is good from my perspective. Can't comment too much on the build.
    if result.contents.errorMessage:
        raise PlanFileException(result.contents.errorMessage)
    data = ctypes.string_at(result.contents.buffer, result.contents.size)
    plan = plan_pb2.Plan()
    plan.ParseFromString(data)
Is there any chance something goes wrong here that doesn't free the plan? For example, maybe the plan fails to deserialize into protobuf? Can you use a try / finally?
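A minimal sketch of the try / finally shape being asked for. The `SerializedPlan` layout and the `load`, `free`, and `parse` callables are hypothetical stand-ins for the planloader entry points and for protobuf deserialization; only the `errorMessage`, `buffer`, and `size` field names come from the snippet under review.

```python
import ctypes


class PlanFileException(Exception):
    """Raised when the native loader reports an error."""


class SerializedPlan(ctypes.Structure):
    # Hypothetical layout: only the field names are taken from the reviewed code.
    _fields_ = [
        ("buffer", ctypes.POINTER(ctypes.c_ubyte)),
        ("size", ctypes.c_int32),
        ("errorMessage", ctypes.c_char_p),
    ]


def read_plan(load, free, parse, path):
    """Load a plan via `load`, and release it with `free` on every exit path."""
    result = load(path)  # assumed to return a pointer to a SerializedPlan
    try:
        if result.contents.errorMessage:
            raise PlanFileException(result.contents.errorMessage.decode())
        # Copy the bytes out of native memory before the buffer is released.
        data = ctypes.string_at(result.contents.buffer, result.contents.size)
        # `parse` may raise (for example on malformed protobuf data).
        return parse(data)
    finally:
        # Runs whether we return normally or raise, so the plan is not leaked.
        free(result)
```

Because the bytes are copied by `ctypes.string_at` before `free` runs, the parsed result does not keep a reference into memory owned by the native library.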
Thanks for putting this together, @EpsilonPrime! Is the text format documented somewhere? I have a few questions:
We currently ship a single no-arch wheel. Once we're wrapping C++, we'll have to build wheels for every architecture and for every version of Python we aim to support.
I'll look into the other ways of including the C++ code. Performance really isn't a concern, though, because we have to serialize and deserialize the plan anyway to cross the language boundary (unless we want to enforce C++ protobufs in Python, which seems even more restrictive since we'd need to match versions more closely). While a native Python implementation would be nice, there's a full parser behind the text format that took 6 months to implement in C++ (take a look at the substrait-io/substrait-cpp change log). There is an ANTLR4 grammar file that specifies the language for the text format, so part of the work is available, and perhaps it'd be easier to power through given that groundwork. My availability is limited (probably no time for a second language version this year), so if it's possible in the short term, that'd be ideal. One thing we may want to consider is temporarily hiding the new library behind a flag so that systems not ready to take on the feature aren't burdened by an OS-specific build.
Very cool! It's great to see new features added to the Python implementation. I'm a big +1 on using Also, it might be nice to make these additional features optional to downstream dependencies. For example, would Ibis want this included by default? We could organize the python package such that Edit: these suggestions can be follow-up PRs, not necessarily required for merging.
AFAIK that will only work if we package the Python bindings for substrait-cpp separately (as a separate Python package), so that substrait-python can depend on that additional package. I think extras can only be used for dependencies, not for enabling features within the library itself, except by relying on an external dependency.
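To illustrate the point about extras, here is a rough sketch of how a library can gate a feature on an external, optionally installed package. The module name `substrait_planloader` is purely hypothetical (no such package is named in this thread), as is the wording of the error message.

```python
import importlib

# Hypothetical name for separately packaged substrait-cpp bindings.
OPTIONAL_BACKEND = "substrait_planloader"


def import_optional_backend(module_name=OPTIONAL_BACKEND):
    """Import the optional native backend, or explain why it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is not installed; text-format support is optional "
            "and requires installing the separate bindings package."
        ) from exc
```

With this pattern the extra only controls whether the dependency gets installed; the feature check itself still happens at import time inside the library.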
In the pyproject.toml file we might be able to get this working if we add cmake to the requires list under [build-system]. I believe that the setuptools.build_meta build-backend has at least some support for that.
If not, we can look at scikit-build to help with getting things building as part of the pip installation, and that will help us feed something like cibuildwheel for handling packaging for all the Python and architecture versions.
- name: Build Substrait planloader library
  run: |
    cd ${{ github.workspace }}/third_party/substrait-cpp
    make release
- name: Install Substrait planloader library
  run: |
    cd ${{ github.workspace }}/third_party/substrait-cpp/build-Release/export/planloader
    make install
We'll want to handle compilation via the pyproject.toml file; I'll leave some notes there.
In the event that the nested submodules make this extra painful (which seems... likely?) we could always use
@EpsilonPrime, can you please add an example to the README showing how to parse and produce a text plan? Also, I think it would be helpful to be able to load and save plans to a string instead of a file. Is that a feature the C++ library can expose? I guess we could implement that by using a
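If the C++ library only exposes file-based entry points, one possible workaround (an assumption for illustration, not something this thread settles on) is to route the string through a temporary file. In the sketch below, `load_file` is a hypothetical stand-in for whatever file-based loader the module ends up providing.

```python
import os
import tempfile


def load_from_string(text, load_file):
    """Feed `text` to a file-only loader by way of a temporary file."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "plan.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        # The loader must finish reading before the directory is cleaned up.
        return load_file(path)
```

A native string-based API would avoid the extra disk round trip, so this is best seen as a stopgap.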
…substrait used by substrait-python.
With the generated Substrait plan protobufs already in this repository, it is
possible to read and write binary protobufs, text protobufs, and JSON-encoded
protobufs. This PR adds the capability to read and write the Substrait text
format as well, and to auto-detect formats when reading. It does this by
wrapping the substrait-cpp implementation with ctypes.