Add BlobCompressor to compress blob and text fields on fly #221
base: master
Conversation
I'm missing the design and the documentation for such a feature. It should not be done hastily. We need to think of interoperability and configuration. Let's have a document on this (on the issue) first.
A general comment: This new feature should be properly documented. Right now there is no documentation at all for it.
I know that gocql isn't very good in this regard, but the best way to improve that is to add good docs and comments for new / changed code.
compressor_test.go
Outdated
}
m := make(map[string]interface{})

BlobCompressor = lz4.NewBlobCompressor(100)
IIUC assigning to this global variable activates the blob compressor?
This is a very unintuitive and hard-to-use interface:
- it's not possible to have different compressors for different sessions / queries
- it's much harder to test, because tests now rely on global state, so you can't concurrently run tests that need different compressors
- it's harder for a programmer to understand, because this is another piece of state, but it's not present in any of the structs used (Session, Query, etc.)
This new compressor interface is very similar to several policies / configurations that the driver already has (like RetryPolicy) and should be set the same way: as a field on the session / query, with setters.
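A minimal sketch of what the suggested RetryPolicy-style configuration could look like. All names here (WithBlobCompressor, the BlobCompressor field, noopCompressor) are illustrative assumptions, not the actual gocql API:

```go
package main

// BlobCompressor is a hypothetical interface matching the feature under review.
type BlobCompressor interface {
	Compress([]byte) ([]byte, error)
	Decompress([]byte) ([]byte, error)
}

// ClusterConfig carries the compressor like other driver-wide policies,
// so each session inherits it instead of reading a package-level variable.
type ClusterConfig struct {
	BlobCompressor BlobCompressor
}

// Query holds a per-query override, mirroring how per-query policies work.
type Query struct {
	compressor BlobCompressor
}

// WithBlobCompressor overrides the compressor for a single query and
// returns the query for chaining, in the style of Query.RetryPolicy(...).
func (q *Query) WithBlobCompressor(c BlobCompressor) *Query {
	q.compressor = c
	return q
}

// noopCompressor is a placeholder implementation for the example.
type noopCompressor struct{}

func (noopCompressor) Compress(b []byte) ([]byte, error)   { return b, nil }
func (noopCompressor) Decompress(b []byte) ([]byte, error) { return b, nil }
```

With this shape, two concurrent tests can each build their own Query with different compressors and never touch shared state.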
lz4/blob_compressor.go
Outdated
limit int
}

var defaultPrefix = []byte{0x01, 0x11, 0x22, 0x33}
I don't think a default prefix should exist. It's entirely possible for stored data to begin with that prefix, so it's a new footgun for a driver user: if a user just uses the compressor as-is, like you did in the tests (BlobCompressor = lz4.NewBlobCompressor(100)), it may result in corrupt data. Correct usage requires the user to think about the data they store in the database and which prefix is safe.
The best way to make the user think about this is to require the prefix to be passed explicitly.
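A sketch of what an explicit-prefix constructor could look like; the signature and names are assumptions for illustration, not the PR's actual API:

```go
package main

import (
	"bytes"
	"errors"
)

// BlobCompressor frames compressed values with a user-chosen magic prefix.
type BlobCompressor struct {
	prefix []byte
	limit  int // values shorter than this are stored uncompressed
}

// NewBlobCompressor requires the caller to pass the magic prefix explicitly,
// forcing them to pick bytes that cannot collide with their own stored data.
func NewBlobCompressor(prefix []byte, limit int) (*BlobCompressor, error) {
	if len(prefix) == 0 {
		return nil, errors.New("blob compressor: an explicit, non-empty prefix is required")
	}
	// Copy the prefix so later mutation by the caller cannot corrupt framing.
	return &BlobCompressor{prefix: append([]byte(nil), prefix...), limit: limit}, nil
}

// HasPrefix reports whether data starts with the magic prefix,
// i.e. whether it should be treated as compressed on read.
func (c *BlobCompressor) HasPrefix(data []byte) bool {
	return bytes.HasPrefix(data, c.prefix)
}
```

Making the constructor return an error (rather than silently substituting a default) is what surfaces the decision to the user at the call site.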
compressor_test.go
Outdated
// Test for Iter.MapScan()
{
	testMap := make(map[string]interface{})
	if !session.Query(`SELECT * FROM test_blob_compressor`).Iter().MapScan(testMap) {
		t.Fatal("MapScan failed to work with one row")
	}
	if diff := cmp.Diff(sliceMap[0], testMap); diff != "" {
		t.Fatal("mismatch in returned map", diff)
	}
}

// Test for Query.MapScan()
{
	testMap := make(map[string]interface{})
	if session.Query(`SELECT * FROM test_blob_compressor`).MapScan(testMap) != nil {
		t.Fatal("MapScan failed to work with one row")
	}
	if diff := cmp.Diff(sliceMap[0], testMap); diff != "" {
		t.Fatal("mismatch in returned map", diff)
	}
}
IIUC this test only checks insert/select identity, so it would also pass if no compressor was set or if the marshal module didn't use the compressor, right?
If so, it should be improved: it's an integration test, so it should verify that the new feature works properly with the rest of the driver, which it currently doesn't do.
lz4/blob_compressor.go
Outdated
}
buf := make([]byte, len(c.prefix)+lz4.CompressBlockBound(len(data)+4))
copy(buf, c.prefix)
This +4 looks like a mistake: we compress data of size len(data), not len(data)+4.
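A sketch of the corrected sizing. Here compressBlockBound is a local stand-in assumed to mirror lz4.CompressBlockBound's documented worst-case formula, and the 4-byte header is assumed to hold the uncompressed length:

```go
package main

// compressBlockBound stands in for lz4.CompressBlockBound: the worst-case
// compressed size of an n-byte input (assumption: n + n/255 + 16, the
// bound documented for the pierrec/lz4 block format).
func compressBlockBound(n int) int {
	return n + n/255 + 16
}

// dstBufferSize computes the destination buffer size: room for the magic
// prefix, the 4-byte uncompressed-length header, and the worst-case
// compressed payload. The bound is taken over len(data) only -- the +4 for
// the header is added outside the bound, not inside it.
func dstBufferSize(prefixLen, dataLen int) int {
	const lenHeader = 4 // uint32 uncompressed-length header
	return prefixLen + lenHeader + compressBlockBound(dataLen)
}
```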
// According to the lz4.CompressBlock doc, it doesn't fail as long as the dst
// buffer length is at least lz4.CompressBlockBound(len(data)) bytes, but
// we check for the error anyway just to be thorough.
if err != nil {
	return nil, err
}
Failure here is a very unexpected condition, so maybe it deserves a log entry?
lz4/blob_compressor.go
Outdated
if len(data) < 4+len(c.prefix) {
	return nil, fmt.Errorf("compressed data should be >4, got=%d", len(data))
}
Suggestion: extract 4+len(c.prefix) into a named variable, e.g. prefixPlusLen.
This compressor should have unit tests for various scenarios:
- decompressing data without the correct prefix
- compress / decompress identity
- compress, and verify that the resulting buffer is correct
- decompress with an incorrect length in the buffer
and preferably more scenarios that you can think of.
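The listed scenarios could be exercised as below. To keep the sketch self-contained, frameCompressor is a toy that does only the prefix + 4-byte length framing (real lz4 compression omitted), so its length check compares the header against the raw payload; the scenarios themselves are the ones a real test suite would need:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// frameCompressor is a toy stand-in: prefix + big-endian uint32 length + payload.
type frameCompressor struct{ prefix []byte }

// Compress frames data with the magic prefix and a length header.
func (c frameCompressor) Compress(data []byte) []byte {
	buf := make([]byte, len(c.prefix)+4+len(data))
	copy(buf, c.prefix)
	binary.BigEndian.PutUint32(buf[len(c.prefix):], uint32(len(data)))
	copy(buf[len(c.prefix)+4:], data)
	return buf
}

// Decompress validates the prefix and the length header before returning
// the payload, covering the "wrong prefix" and "wrong length" scenarios.
func (c frameCompressor) Decompress(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, c.prefix) {
		return nil, errors.New("missing magic prefix")
	}
	rest := data[len(c.prefix):]	
	if len(rest) < 4 {
		return nil, errors.New("truncated length header")
	}
	n := binary.BigEndian.Uint32(rest)
	if int(n) != len(rest)-4 {
		return nil, fmt.Errorf("length header %d does not match payload %d", n, len(rest)-4)
	}
	return rest[4:], nil
}
```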
The cover letter for the PR could be a bit more elaborate.
For examples of good PR cover letters see:
@mykaul, @Lorak-mmk, thanks for reviewing it. Could you please take a look at the PR description? I have laid out all the options to implement this feature. Let's take a look and decide which way to go. I myself prefer N3 and N4. This PR was intended to be a draft just to test things out; unfortunately I misclicked and could not switch it back.
I converted it to draft
Both, btw, use LZ4 (or Snappy) by default, which is less suitable for JSON/TEXT blobs (as it has no entropy compression).
Again, it'll be easier to understand and discuss on the issue than on the PR.
Closes #218

Column-based blob compression

General idea

If a client stores big blobs of data, compressing the data that goes into such a field will reduce the footprint of select/update operations on both the network and the server.

Interoperability issues

Data compressed by the gocql driver won't be readable by cqlsh or other drivers.

Possible implementations of serialization/deserialization

1. Global variable + hack into marshalVarchar / unmarshalVarchar

Pros:
Cons:

2. Option in ClusterConfig + hack into NativeType

Pros:
- Global variable
Cons:
- NativeType

3. Custom type

Pros:
- lz4 package
Cons:

4. Custom type with global compressor

Pros:
- lz4 package
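For the custom-type options, the shape could look like the sketch below. The MarshalCQL/UnmarshalCQL method names mirror gocql's Marshaler/Unmarshaler interfaces, but TypeInfo here is a local stand-in for gocql.TypeInfo, the magic prefix is only an example value, and the compression itself is stubbed out (prefix framing only) so the sketch stays self-contained:

```go
package main

import (
	"bytes"
	"errors"
)

// TypeInfo is a local stand-in for gocql.TypeInfo.
type TypeInfo interface{}

// magic is an example prefix; a real user would choose bytes that cannot
// collide with their stored data (see the prefix discussion above).
var magic = []byte{0x01, 0x11, 0x22, 0x33}

// CompressedBlob is a dedicated Go type whose CQL (un)marshalling applies
// the compression, leaving plain []byte columns untouched.
type CompressedBlob []byte

// MarshalCQL frames the value with the magic prefix. A real implementation
// would lz4-compress b before framing.
func (b CompressedBlob) MarshalCQL(info TypeInfo) ([]byte, error) {
	return append(append([]byte(nil), magic...), b...), nil
}

// UnmarshalCQL checks the prefix and strips it. A real implementation
// would lz4-decompress the remaining payload.
func (b *CompressedBlob) UnmarshalCQL(info TypeInfo, data []byte) error {
	if !bytes.HasPrefix(data, magic) {
		return errors.New("compressed blob: missing magic prefix")
	}
	*b = append((*b)[:0], data[len(magic):]...)
	return nil
}
```

The appeal of this option is that compression is opt-in per column at the type level: only values bound as CompressedBlob go through the compressor, and no global or session state is consulted.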