-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI][C++] test-ubuntu-20.04-cpp fails sometimes with a segmentation fault on arrow-dataset-file-parquet-encryption-test #43057
Comments
I finally got a backtrace from https://github.com/ursacomputing/crossbow/actions/runs/10686494834/job/29654051500:
|
Disassembly of crashing code:
The
Its value comes from This probably means that, at this point, the |
Also, the backtrace in the main thread looks like this. Other threads are sleeping:
|
…/encryptor (apache#43947) This is to get a clearer error rather than an obscure crash, see apache#43057 for an example. * GitHub Issue: apache#43946 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…/encryptor (apache#43947) This is to get a clearer error rather than an obscure crash, see apache#43057 for an example. * GitHub Issue: apache#43946 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
I managed to make This reveals two issues:
Adding the reconstruction fixes 1., making the cipher context a thread local variable fixes 2. However, I think this should be fixed be rethinking the design. Why are the cipher contexts wiped out in the first place? Are this security risk concerns or is this to reduce memory footprint? I haven't yet found the cause of multiple threads getting hold of the same AesDecryptorImpl instance. It'd be great to get some pointers to the mechanics of that. |
@pitrou in #43947 (comment) you mentioned the test scans the dataset multi-threaded. Can you point me to the bit of the stacktrace that implies that? I think |
That was just an intuition, so I might be wrong indeed. |
I am not the original author, but I suppose this is for security concerns. |
I think this happens in multiple places. At the top-most level (the Parquet file reader), a single arrow/cpp/src/parquet/file_reader.cc Lines 275 to 276 in 9015a81
This Worse, most encrypted Parquet files will use the same (footer) key for decrypting all columns. This means all arrow/cpp/src/parquet/encryption/internal_file_decryptor.cc Lines 224 to 227 in 9015a81
As you point out, the decryptors should be created on an adhoc basic for each decryption operation. Hopefully this is a fast operation? |
cc @rok |
@andersonm-ibm in #12778 you mentioned that multi-threaded read requires an AesDecryptorImpl per column, but a column can be read by multiple threads in parallel, right? |
@thamht4190 can you shed some light on this? |
#39623 might provide some useful context. It looks like we can create multiple |
@adamreeve Do you disagree with the analysis in #43057 (comment) ? There still seems to be some amount of decryptor caching going on, at several levels. I think one radical solution would be to remove all kinds of caching and create a new decryptor for each encrypted module. We can later run perf tests to see if that really makes things slower. |
Yes I don't think the analysis in that comment is quite right. The linked code that creates a arrow/cpp/src/parquet/file_reader.cc Lines 265 to 266 in 9015a81
It looks like #39623 removed the caching of the decryptors within |
But |
And besides, even if we were creating a distinct decryptor per column chunk, the caller may still want to parallelize reading at the page level. |
Hmm yeah that could be a problem.
It looks like the arrow/cpp/src/parquet/column_reader.h Lines 156 to 158 in 990cdf7
|
It may not be obvious how to use it with other Parquet C++ APIs, but the OffsetIndex conceptually allows direct access to individual pages. So, ideally at least, and hopefully in the future, it will be possible to access individual data pages from a column in a non-sequential fashion. (cc @wgtmac @mapleFU ) arrow/cpp/src/parquet/page_index.h Lines 119 to 132 in 71389f8
|
Yes, I think we can implement a new PageReader which leverages OffsetIndex for page access. The challenge is to optimize the I/O since it is unpredictable. |
Ah OK interesting, thanks. If we support this in future by updating PageReader then maybe we'd need to document that PageReader isn't thread-safe, and that users need to create separate PageReader instances per thread? Otherwise I guess we could allow users to create per-thread decryptors and pass them to the PageReader methods. Another option would be using locking within the decryptors, and maybe the overhead of that would be OK if the locks would not usually be contended.
I tested doing a Dataset scan using uniform encryption and this does cause decryptor errors. I thought it's worth creating a separate issue for that as it's not exactly the same problem as this issue, so I've reported this as #44852. It's probably worth pointing out that specifying the same master key for columns as the footer master key isn't the same as uniform encryption. With uniform encryption the same data encryption key is used, but if you specify the same key name for columns then separate data encryption keys will be generated per column. And the uniform encryption option isn't exposed in PyArrow, so this scenario might not be that common. |
Describe the bug, including details regarding any error messages, version, and platform.
test-ubuntu-20.04-cpp seems to fail sometimes with a segmentation fault on
arrow-dataset-file-parquet-encryption-test
:It has also failed on PRs: https://github.com/ursacomputing/crossbow/actions/runs/9661748529/job/26650110979
On retry it succeeds sometimes: https://github.com/ursacomputing/crossbow/actions/runs/9671504639
Component(s)
C++, Continuous Integration
The text was updated successfully, but these errors were encountered: