-
Notifications
You must be signed in to change notification settings - Fork 0
[index] Add Batch method for inserting multiple documents at a time #34
Conversation
// Validate validates the given document and returns its ID if it has one. | ||
func (d Document) Validate() ([]byte, error) { | ||
// Validate returns a bool indicating whether the document is valid. | ||
func (d Document) Validate() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this call HasID()
? maybe add a test for this case too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to require an ID here, in fact, in the segment we currently check if a document is valid before checking its ID. We might want to remove the check for the fields though in case we ever want to index just an ID. Not sure if that would ever be needed though.
@@ -80,8 +80,9 @@ func (w *writer) Open() error { | |||
} | |||
|
|||
func (w *writer) Write(d doc.Document) error { | |||
w.enc.PutUvarint(uint64(len(d.Fields))) | |||
w.enc.PutBytes(d.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encoder/decoder is beginning to look super similar to the types in https://github.com/m3db/m3db/blob/master/serialize/types.go
Any chance we can consolidate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to consolidating, though given that they live in separate repos that may be difficult at the moment. Perhaps we can track in an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, #36
generated/generics/generate.sh
Outdated
| genny -out=${GENERATED_PATH}/postingsgen/generated_map.go \ | ||
-pkg=postingsgen gen "KeyType=[]byte ValueType=postings.MutableList" | ||
cat $GENERIC_MAP_IMPL \ | ||
| genny -out=${GENERATED_PATH}/postingsgen/generated_map.go \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're here, could you change the name to .../postingsgen/map_gen.go
(for this one and others below) . _gen.go
is the convention Rob started using in m3x/m3db recently.
And also update .excludecoverage
index/types.go
Outdated
|
||
// Batch inserts a batch of metrics into the index. The documents are guaranteed to be | ||
// searchable all at once when the Batch method returns. | ||
Batch(d []doc.Document) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need something to indicate it's an insert not a read, InsertBatch
maybe?
index/segment/types.go
Outdated
Seal() error | ||
// Batch inserts a batch of documents into the segment. The documents are guaranteed to | ||
// be searchable all at once when the Batch method returns. | ||
Batch(d []doc.Document) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as below, InsertBatch
?
index/segment/mem/segment.go
Outdated
// TODO: Consider supporting concurrent writes by relaxing the requirement that | ||
// inserted documents are immediately searchable. | ||
s.ids.Lock() | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hahah nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be overkill, but I think clearly indicating the scope of the lock visually is helpful :)
index/segment/mem/segment.go
Outdated
NoCopyKey: true, | ||
NoFinalizeKey: true, | ||
}) | ||
i++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:why not do i++
in the post completion part of the for loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, i see what you're doing in the ContainsTerm
check.
Suggestion(take/leave): change the for loop to make it more apparent:
i := 0
for i < len(docs) {
if ... {
continue
}
i++
}
index/segment/mem/segment.go
Outdated
|
||
// indexDocWithLock indexes the fields of a document in the segment's terms dictionary. It | ||
// must be called with the segment's state lock. | ||
func (s *segment) indexDocWithLock(id postings.ID, d doc.Document) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how bout renaming to indexDocWithStateRLock
to better indicate what lock you need.
PS this is why I really want to create m3db/build-tools#14
index/segment/mem/segment.go
Outdated
|
||
// storeDocWithLock stores a documents into the segment's mapping of postings IDs to | ||
// documents. It must be called with the segment's state lock. | ||
func (s *segment) storeDocWithLock(id postings.ID, d doc.Document) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, how bout renaming to storeDocWithStateRLock
to better indicate what lock you need.
// we're guaranteed to never have conflicts with docID (it's monotonically increasing), | ||
// and have checked `i.docs.data` is large enough. | ||
s.docs.data[idx] = d | ||
s.docs.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion(take/leave): i actually think the '{' convention you're following is hurting readability in this example.
index/segment/mem/segment.go
Outdated
s.state.Unlock() | ||
return sgmt.ErrClosed | ||
func (s *segment) prepareDocs(ds []doc.Document) error { | ||
ids := idsgen.New(len(ds)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it expensive to do this alloc for each batch insert? How would you feel about sorting the docs instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I wanted to avoid this and push it into the caller but I thought that could break some fundamental assumptions if the client wasn't careful. I also considered moving the prepareDocs
function into the writer lock so it could be reused. How does that sound as an alternative? I definitely think we're going to need to come back and optimize the segment.
index/types.go
Outdated
|
||
// InsertBatch inserts a batch of metrics into the index. The documents are guaranteed | ||
// to be searchable all at once when the Batch method returns. | ||
InsertBatch(d []doc.Document) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you indicate which documents were unable to be indexed in the return type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not reject the entire batch? The way I viewed it was that a batch was analogous to a transaction in a SQL database.
index/segment/mem/segment.go
Outdated
d := ds[i] | ||
err := d.Validate() | ||
if err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of early terminating, could you continue attempting to index the remainder of the docs
index/segment/mem/segment.go
Outdated
// we need to index. | ||
ds[i], ds[len(ds)] = ds[len(ds)], ds[i] | ||
ds = ds[:len(ds)-1] | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you indicate this case within the return'd type
index/segment/mem/segment.go
Outdated
} | ||
|
||
if _, ok := s.writer.idSet.Get(d.ID); ok { | ||
return errDuplicateID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, instead of early terminating, continue to insert other docs
af23ed8
to
833f0c7
Compare
833f0c7
to
55c2e2c
Compare
index/types.go
Outdated
Insert(d doc.Document) ([]byte, error) | ||
|
||
// InsertBatch inserts a batch of metrics into the index. The documents are guaranteed | ||
// to be searchable all at once when the Batch method returns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe document the types of errors this can return so it's not a complete surprise when people are downcasting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, good idea!
s.state.Unlock() | ||
return errSegmentSealed | ||
err = s.prepareDocsWithLocks(b) | ||
if err != nil && !index.IsBatchPartialError(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to check if partial updates are allowed and insert those still? would be good to add a testcase for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually handled in in prepareDocsWithLocks
. If we don't support partial updates it will return the error right away, otherwise it will continue validating documents and return a partial error (the corresponding test case is TestSegmentInsertBatchPartialError
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, good stuff.
index/segment/mem/segment.go
Outdated
|
||
if _, ok := s.writer.idSet.Get(d.ID); ok { | ||
if !b.AllowPartialUpdates { | ||
return errDuplicateID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make errDuplicateID
an exported type. I want to be able to distinguish it from other failures downstream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM save the couple of pending nits
This PR adds a
Batch
method to theindex
interface for inserting a batch of documents at a time and adds a corresponding implementation to the in-memorysegment
.