[RFE]: Implement values compression for text and blob types #218
Comments
Column-based blobs compression
General idea
If a client stores big blobs of data, compressing the data that goes into that field will reduce the footprint of select/update operations on both the network and the server.
Interoperability issues
Possible implementation of serialization/deserialization
1. Global variable + hack into
@mykaul, @Lorak-mmk, let's move the discussion here.
@mykaul, this is a continuation of comment
I saw that network compression is done on the whole frame, which makes me think that when both features are working, the server decompresses the frame, extracts the column data and then compresses it back when writing to the sstable. Am I correct about that?
Best case if we leave it to the user to decide (…)
Yes
But it can change from workload to workload. What we should do is be able to determine, at the end of compressing a specific BLOB, whether it passed some threshold or not. If you compressed 1000 bytes to 950 bytes, it's worthless. Don't spend the cycles.
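Roughly, a minimal sketch of such a check (using `compress/zlib` here just for illustration; the algorithm and the `minSavings` knob are assumptions, not part of the proposal):

```go
package example

import (
	"bytes"
	"compress/zlib"
)

// maybeCompress compresses value and keeps the result only if the savings
// exceed minSavings (e.g. 0.1 == at least 10% smaller); otherwise the
// original bytes are returned and the second result is false.
func maybeCompress(value []byte, minSavings float64) ([]byte, bool, error) {
	if len(value) == 0 {
		return value, false, nil
	}
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(value); err != nil {
		return nil, false, err
	}
	if err := w.Close(); err != nil {
		return nil, false, err
	}
	saved := 1 - float64(buf.Len())/float64(len(value))
	if saved < minSavings {
		// e.g. 1000 bytes -> 950 bytes: not worth the extra CPU on reads.
		return value, false, nil
	}
	return buf.Bytes(), true, nil
}
```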
Do we have any benchmarks that show that this additional overhead of recompression is significant enough to warrant such a feature?
There is also another drawback: the risk of data corruption. The solution relies on prepending some prefix to the data, and assumes that this prefix will never occur in real data. Those are quite big drawbacks, and that's why I asked the previous question: the issue does not explain why the problem is severe enough that we are willing to tolerate such a drawback.
I assume this was meant to be number 4
If we decide to implement this, I'd be for option 3, because it has no global variables.
@karol-kokoszka
I believe that modern compression algorithms are robust enough to make such a situation unlikely. Even if the header matches a compressed archive, the subsequent attempt to decompress a blob that is not actually a compressed archive is going to fail. It definitely will for ASCII input. And if a user sends binary blobs as data and wants to prevent a potential collision, they can simply disable such compression.
In use cases that allow such situations, one should either always compress or never compress, which the driver should allow configuring. And look what I found: https://java-driver.docs.scylladb.com/stable/manual/core/compression/ (and the corresponding https://docs.datastax.com/en/developer/java-driver/4.0/manual/core/compression/index.html). It turns out that the Scylla and DataStax Java drivers already allow similar things.
Didn't know about this. I assume it's tribal knowledge scattered around various issues and there is no place I can learn about this?
I think we are talking about different things. Did you see the PR implementing this feature? #221 It uses a prefix (by default …). We could avoid using such mechanisms (prefix, length limit) and always compress, but that would severely limit the usability of such a feature, because you could only use it with a fresh table, not with existing data.
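For illustration, a rough sketch of how a prefix-based scheme works (not taken from #221; the prefix bytes and the use of zlib are placeholders):

```go
package example

import (
	"bytes"
	"compress/zlib"
	"io"
)

// compressedPrefix is a hypothetical marker; the actual default in the PR
// may be a different byte sequence.
var compressedPrefix = []byte("\x00gocql-compressed\x00")

// serialize compresses the value and prepends the marker prefix.
func serialize(value []byte) ([]byte, error) {
	var buf bytes.Buffer
	buf.Write(compressedPrefix)
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(value); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// deserialize decompresses only values carrying the prefix; anything else is
// returned as-is. Raw data that happens to start with the prefix would be
// misinterpreted, which is the corruption risk raised above.
func deserialize(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, compressedPrefix) {
		return data, nil
	}
	r, err := zlib.NewReader(bytes.NewReader(data[len(compressedPrefix):]))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}
```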
It's not a similar thing; you linked to docs about CQL protocol compression (which I'm pretty sure gocql already supports).
It is a draft to quickly test the feature out and is no good as a point of reference.
It could be mitigated.
+1
How could it be mitigated?
There are two cases to address:
https://opensource.docs.scylladb.com/stable/troubleshooting/large-rows-large-cells-tables.html
You are right. I missed that part.
We don't have to; see my comment above. We should be able to both avoid adding any custom prefix and be able to identify whether the blob you received is a compressed one or not.
Oh, I assumed that this is the same thing (values compression). My bad.
Thanks!
So you propose to deserialize as follows (pseudocode):
and to serialize all blobs, regardless of length?
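Something along these lines, I assume (a sketch of the "try to decompress, fall back to raw" idea, with zlib standing in for whatever algorithm is chosen):

```go
package example

import (
	"bytes"
	"compress/zlib"
	"io"
)

// deserialize attempts decompression first; if the bytes are not a valid
// compressed stream, they are returned untouched as raw data.
func deserialize(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		// Not a valid zlib header: assume raw, uncompressed data.
		return data, nil
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	if err != nil {
		// Header matched by coincidence, but the payload is not compressed.
		return data, nil
	}
	return out, nil
}
```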
Doesn't it have the same problem as you described in #218 (comment)?
Right. Not a DoS, but incorrect data would be returned, because the code would treat this as raw uncompressed data. It's probably worse :/
If you forge the prefix, header, and underlying structures properly, you can cause excessive memory and CPU consumption, which can lead to a DoS.
You can limit both.
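For example, a minimal sketch of bounding the decompressed size (the `maxDecompressed` limit is a hypothetical knob):

```go
package example

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// decompressLimited caps how far a single value may expand, bounding the
// memory a forged "zip bomb" value could consume.
func decompressLimited(data []byte, maxDecompressed int64) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	out, err := io.ReadAll(io.LimitReader(r, maxDecompressed+1))
	if err != nil {
		return nil, err
	}
	if int64(len(out)) > maxDecompressed {
		return nil, fmt.Errorf("decompressed value exceeds limit of %d bytes", maxDecompressed)
	}
	return out, nil
}
```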
If the feature requires careful consideration from the library user and some amount of supporting code in the application because of the aforementioned problems, is there a point in implementing it in the driver instead of in application code? The user will need modifications and safeguards around it anyway. Method 3, a custom type, is something that can easily be implemented by the user with a very limited amount of code.
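For reference, a minimal sketch of such a user-side type built on gocql's `Marshaler`/`Unmarshaler` interfaces (the `CompressedBlob` name and the zlib choice are my own, not anything proposed in this issue):

```go
package example

import (
	"bytes"
	"compress/zlib"
	"io"

	"github.com/gocql/gocql"
)

// CompressedBlob transparently compresses on write and decompresses on read
// via gocql's marshalling hooks.
type CompressedBlob []byte

// MarshalCQL compresses the blob before it is sent to the server.
func (b CompressedBlob) MarshalCQL(info gocql.TypeInfo) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(b); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// UnmarshalCQL decompresses the blob read from the server.
func (b *CompressedBlob) UnmarshalCQL(info gocql.TypeInfo, data []byte) error {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return err
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	*b = out
	return nil
}
```

A value of this type can then be bound and scanned like any other blob column, e.g. `session.Query("INSERT INTO t (id, payload) VALUES (?, ?)", id, CompressedBlob(raw)).Exec()`.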
We could add an example compressing a BLOB for all our driver examples.
See above.
See above.
Description
Implement an optional compression of text/ascii/blob cells: this way the DB will have to handle smaller buffers internally.
The compression should only be done for values with sizes above a configurable threshold.
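A hypothetical shape for such a setting (the names are illustrative only; nothing like this exists in the driver today):

```go
// ValueCompression holds hypothetical per-session settings for column-value
// compression of text/ascii/blob cells.
type ValueCompression struct {
	Enabled      bool
	MinValueSize int // values smaller than this many bytes are stored as-is
}

// shouldCompress applies the configurable size threshold described above.
func (c ValueCompression) shouldCompress(value []byte) bool {
	return c.Enabled && len(value) >= c.MinValueSize
}
```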