Skip to content
This repository has been archived by the owner on Oct 16, 2018. It is now read-only.

[index] Add Batch method for inserting multiple documents at a time #34

Merged
merged 8 commits into from
Apr 26, 2018

Conversation

jeromefroe
Copy link
Contributor

This PR adds a Batch method to the index interface for inserting a batch of documents at a time and adds a corresponding implementation to the in-memory segment.

@jeromefroe jeromefroe requested a review from prateek April 20, 2018 13:55
// Validate validates the given document and returns its ID if it has one.
func (d Document) Validate() ([]byte, error) {
// Validate returns a bool indicating whether the document is valid.
func (d Document) Validate() error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this call HasID()? maybe add a test for this case too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to require an ID here, in fact, in the segment we currently check if a document is valid before checking its ID. We might want to remove the check for the fields though in case we ever want to index just an ID. Not sure if that would ever be needed though.

@@ -80,8 +80,9 @@ func (w *writer) Open() error {
}

func (w *writer) Write(d doc.Document) error {
w.enc.PutUvarint(uint64(len(d.Fields)))
w.enc.PutBytes(d.ID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The encoder/decoder is beginning to look super similar to the types in https://github.com/m3db/m3db/blob/master/serialize/types.go

Any chance we can consolidate?

Copy link
Contributor Author

@jeromefroe jeromefroe Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to consolidating, though given that they live in separate repos that may be difficult at the moment. Perhaps we can track in an issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, #36

| genny -out=${GENERATED_PATH}/postingsgen/generated_map.go \
-pkg=postingsgen gen "KeyType=[]byte ValueType=postings.MutableList"
cat $GENERIC_MAP_IMPL \
| genny -out=${GENERATED_PATH}/postingsgen/generated_map.go \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're here, could you change the name to .../postingsgen/map_gen.go (for this one and others below) . _gen.go is the convention Rob started using in m3x/m3db recently.

And also update .excludecoverage

index/types.go Outdated

// Batch inserts a batch of metrics into the index. The documents are guaranteed to be
// searchable all at once when the Batch method returns.
Batch(d []doc.Document) error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need something to indicate it's an insert not a read, InsertBatch maybe?

Seal() error
// Batch inserts a batch of documents into the segment. The documents are guaranteed to
// be searchable all at once when the Batch method returns.
Batch(d []doc.Document) error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as below, InsertBatch?

// TODO: Consider supporting concurrent writes by relaxing the requirement that
// inserted documents are immediately searchable.
s.ids.Lock()
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hahah nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be overkill, but I think clearly indicating the scope of the lock visually is helpful :)

NoCopyKey: true,
NoFinalizeKey: true,
})
i++
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:why not do i++ in the post completion part of the for loop

Copy link
Collaborator

@prateek prateek Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, i see what you're doing in the ContainsTerm check.

Suggestion(take/leave): change the for loop to make it more apparent:

i := 0 
for i < len(docs) {
 if  ... {
  continue
 }
 i++
}


// indexDocWithLock indexes the fields of a document in the segment's terms dictionary. It
// must be called with the segment's state lock.
func (s *segment) indexDocWithLock(id postings.ID, d doc.Document) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how bout renaming to indexDocWithStateRLock to better indicate what lock you need.

PS this is why I really want to create m3db/build-tools#14


// storeDocWithLock stores a documents into the segment's mapping of postings IDs to
// documents. It must be called with the segment's state lock.
func (s *segment) storeDocWithLock(id postings.ID, d doc.Document) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, how bout renaming to storeDocWithStateRLock to better indicate what lock you need.

// we're guaranteed to never have conflicts with docID (it's monotonically increasing),
// and have checked `i.docs.data` is large enough.
s.docs.data[idx] = d
s.docs.RUnlock()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion(take/leave): i actually think the '{' convention you're following is hurting readability in this example.

s.state.Unlock()
return sgmt.ErrClosed
func (s *segment) prepareDocs(ds []doc.Document) error {
ids := idsgen.New(len(ds))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it expensive to do this alloc for each batch insert? How would you feel about sorting the docs instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I wanted to avoid this and push it into the caller but I thought that could break some fundamental assumptions if the client wasn't careful. I also considered moving the prepareDocs function into the writer lock so it could be reused. How does that sound as an alternative? I definitely think we're going to need to come back and optimize the segment.

index/types.go Outdated

// InsertBatch inserts a batch of metrics into the index. The documents are guaranteed
// to be searchable all at once when the Batch method returns.
InsertBatch(d []doc.Document) error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you indicate which documents were unable to be indexed in the return type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not reject the entire batch? The way I viewed it was that a batch was analogous to a transaction in a SQL database.

d := ds[i]
err := d.Validate()
if err != nil {
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of early terminating, could you continue attempting to index the remainder of the docs

// we need to index.
ds[i], ds[len(ds)] = ds[len(ds)], ds[i]
ds = ds[:len(ds)-1]
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you indicate this case within the return'd type

}

if _, ok := s.writer.idSet.Get(d.ID); ok {
return errDuplicateID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, instead of early terminating, continue to insert other docs

@jeromefroe jeromefroe force-pushed the jeromefroe/support-batch-inserts branch from af23ed8 to 833f0c7 Compare April 24, 2018 20:32
@jeromefroe jeromefroe force-pushed the jeromefroe/support-batch-inserts branch from 833f0c7 to 55c2e2c Compare April 24, 2018 22:45
index/types.go Outdated
Insert(d doc.Document) ([]byte, error)

// InsertBatch inserts a batch of metrics into the index. The documents are guaranteed
// to be searchable all at once when the Batch method returns.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe document the types of errors this can return so it's not a complete surprise when people are downcasting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, good idea!

s.state.Unlock()
return errSegmentSealed
err = s.prepareDocsWithLocks(b)
if err != nil && !index.IsBatchPartialError(err) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to check if partial updates are allowed and insert those still? would be good to add a testcase for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually handled in in prepareDocsWithLocks. If we don't support partial updates it will return the error right away, otherwise it will continue validating documents and return a partial error (the corresponding test case is TestSegmentInsertBatchPartialError )

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, good stuff.


if _, ok := s.writer.idSet.Get(d.ID); ok {
if !b.AllowPartialUpdates {
return errDuplicateID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make errDuplicateID an exported type. I want to be able to distinguish it from other failures downstream

Copy link
Collaborator

@prateek prateek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM save the couple of pending nits

@jeromefroe jeromefroe merged commit 42af1f7 into master Apr 26, 2018
@jeromefroe jeromefroe deleted the jeromefroe/support-batch-inserts branch April 26, 2018 22:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants