[index] Add Batch method for inserting multiple documents at a time #34

jeromefroe · 2018-04-20T13:55:08Z

This PR adds a Batch method to the index interface for inserting a batch of documents at a time and adds a corresponding implementation to the in-memory segment.

prateek · 2018-04-20T19:39:55Z

doc/document.go

-// Validate validates the given document and returns its ID if it has one.
-func (d Document) Validate() ([]byte, error) {
+// Validate returns a bool indicating whether the document is valid.
+func (d Document) Validate() error {


Should this call HasID()? Maybe add a test for this case too.

I don't think we need to require an ID here; in fact, the segment currently checks that a document is valid before checking its ID. We might want to remove the check on the fields, though, in case we ever want to index just an ID. Not sure that would ever be needed.
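To make the two positions concrete, here is a minimal sketch of a Validate that returns only an error and leaves the ID check to a separate HasID method. The Document and Field definitions are simplified stand-ins, and both errEmptyDocument and the "at least one field" rule are assumptions for illustration, not the actual contents of doc/document.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Field and Document are simplified stand-ins for the types in the doc package.
type Field struct {
	Name  []byte
	Value []byte
}

type Document struct {
	ID     []byte
	Fields []Field
}

// errEmptyDocument is a hypothetical error name used only in this sketch.
var errEmptyDocument = errors.New("document must contain at least one field")

// HasID reports whether the document carries an ID.
func (d Document) HasID() bool { return len(d.ID) > 0 }

// Validate returns an error if the document is invalid.
// It deliberately does not require an ID (assumed rule: at least one field).
func (d Document) Validate() error {
	if len(d.Fields) == 0 {
		return errEmptyDocument
	}
	return nil
}

func main() {
	d := Document{Fields: []Field{{Name: []byte("city"), Value: []byte("nyc")}}}
	fmt.Println("valid:", d.Validate() == nil, "has ID:", d.HasID())
}
```

With this split, a caller that needs an ID checks HasID separately, and dropping the field check later would not change the method's signature.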

prateek · 2018-04-20T19:42:29Z

index/encoding/doc/doc.go

@@ -80,8 +80,9 @@ func (w *writer) Open() error {
 }

 func (w *writer) Write(d doc.Document) error {
-	w.enc.PutUvarint(uint64(len(d.Fields)))
+	w.enc.PutBytes(d.ID)


The encoder/decoder is beginning to look super similar to the types in https://github.com/m3db/m3db/blob/master/serialize/types.go

Any chance we can consolidate?

I'm not opposed to consolidating, though given that they live in separate repos, that may be difficult at the moment. Perhaps we can track it in an issue?

prateek · 2018-04-20T19:44:04Z

generated/generics/generate.sh

-  | genny -out=${GENERATED_PATH}/postingsgen/generated_map.go            \
-    -pkg=postingsgen gen "KeyType=[]byte ValueType=postings.MutableList"
+cat $GENERIC_MAP_IMPL                                                  \
+| genny -out=${GENERATED_PATH}/postingsgen/generated_map.go            \


While you're here, could you change the name to .../postingsgen/map_gen.go (for this one and the others below)? _gen.go is the convention Rob started using in m3x/m3db recently.

Please also update .excludecoverage.

prateek · 2018-04-20T19:46:29Z

index/types.go

+
+	// Batch inserts a batch of metrics into the index. The documents are guaranteed to be
+	// searchable all at once when the Batch method returns.
+	Batch(d []doc.Document) error


This needs something to indicate it's an insert, not a read. InsertBatch, maybe?

prateek · 2018-04-20T19:46:40Z

index/segment/types.go

-	Seal() error
+	// Batch inserts a batch of documents into the segment. The documents are guaranteed to
+	// be searchable all at once when the Batch method returns.
+	Batch(d []doc.Document) error


Same as the comment on index/types.go: InsertBatch?

prateek · 2018-04-20T19:48:08Z

index/segment/mem/segment.go

-	// TODO: Consider supporting concurrent writes by relaxing the requirement that
-	// inserted documents are immediately searchable.
-	s.ids.Lock()
+	{


Might be overkill, but I think clearly indicating the scope of the lock visually is helpful :)
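For readers unfamiliar with the convention being discussed, the following is a small sketch of using a bare block to mark the region covered by a lock. The counter type is hypothetical and unrelated to the segment code.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical type used only to illustrate the brace convention.
type counter struct {
	mu sync.Mutex
	n  int
}

// incr increments the counter and returns the new value.
func (c *counter) incr() int {
	var v int
	c.mu.Lock()
	{
		// Everything inside the braces runs while holding c.mu.
		c.n++
		v = c.n
	}
	c.mu.Unlock()
	return v
}

func main() {
	c := &counter{}
	c.incr()
	fmt.Println(c.incr())
}
```

The braces have no effect on locking itself; they only introduce a scope that makes the critical section visible at a glance.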

prateek · 2018-04-20T19:53:29Z

index/segment/mem/segment.go

+			NoCopyKey:     true,
+			NoFinalizeKey: true,
+		})
+		i++


nit: why not do i++ in the post statement of the for loop?

nvm, I see what you're doing in the ContainsTerm check.

Suggestion (take/leave): change the for loop to make it more apparent:

i := 0
for i < len(docs) {
	if ... {
		continue
	}
	i++
}
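A runnable sketch of the suggested loop shape, combined with a swap-remove, might look like the following. The removeKnown function and its arguments are hypothetical; the point is that the index only advances when the current element is kept, and that the last valid index of a slice is len(ids)-1.

```go
package main

import "fmt"

// removeKnown drops every ID already present in seen, using swap-remove.
// It reorders ids in place and returns the shortened slice.
func removeKnown(ids []string, seen map[string]bool) []string {
	i := 0
	for i < len(ids) {
		if seen[ids[i]] {
			last := len(ids) - 1
			// Move the unwanted element to the end and trim it off.
			ids[i], ids[last] = ids[last], ids[i]
			ids = ids[:last]
			// Do not advance: slot i now holds an element not yet checked.
			continue
		}
		i++
	}
	return ids
}

func main() {
	ids := []string{"a", "b", "c", "d"}
	fmt.Println(removeKnown(ids, map[string]bool{"a": true, "c": true}))
}
```

Swap-remove does not preserve the original order of the slice, which is acceptable only when the caller does not depend on it.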

prateek · 2018-04-20T20:12:34Z

index/segment/mem/segment.go

+
+// indexDocWithLock indexes the fields of a document in the segment's terms dictionary. It
+// must be called with the segment's state lock.
+func (s *segment) indexDocWithLock(id postings.ID, d doc.Document) error {


How about renaming it to indexDocWithStateRLock to better indicate which lock you need?

PS: this is why I really want to create m3db/build-tools#14.

prateek · 2018-04-20T20:12:54Z

index/segment/mem/segment.go

+
+// storeDocWithLock stores a documents into the segment's mapping of postings IDs to
+// documents. It must be called with the segment's state lock.
+func (s *segment) storeDocWithLock(id postings.ID, d doc.Document) {


Same here: how about renaming it to storeDocWithStateRLock to better indicate which lock you need?

prateek · 2018-04-20T20:19:38Z

index/segment/mem/segment.go

+			// we're guaranteed to never have conflicts with docID (it's monotonically increasing),
+			// and have checked `i.docs.data` is large enough.
+			s.docs.data[idx] = d
+			s.docs.RUnlock()


Suggestion (take/leave): I actually think the '{' convention you're following is hurting readability in this example.

prateek · 2018-04-20T20:22:20Z

index/segment/mem/segment.go

-		s.state.Unlock()
-		return sgmt.ErrClosed
+func (s *segment) prepareDocs(ds []doc.Document) error {
+	ids := idsgen.New(len(ds))


Isn't it expensive to do this alloc for each batch insert? How would you feel about sorting the docs instead?

Yeah, I wanted to avoid this and push it into the caller, but I thought that could break some fundamental assumptions if the client wasn't careful. I also considered moving the prepareDocs function into the writer lock so it could be reused. How does that sound as an alternative? I definitely think we're going to need to come back and optimize the segment.
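As an illustration of the sorting alternative, here is a minimal sketch that detects duplicate IDs by sorting instead of building a set per batch. It assumes the allocation in question is used to find duplicate IDs within a batch; hasDuplicateIDs is a hypothetical helper, not code from this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// hasDuplicateIDs sorts ids in place and reports whether any two are equal.
// After sorting, equal IDs are adjacent, so one linear pass is enough.
func hasDuplicateIDs(ids [][]byte) bool {
	sort.Slice(ids, func(i, j int) bool {
		return bytes.Compare(ids[i], ids[j]) < 0
	})
	for i := 1; i < len(ids); i++ {
		if bytes.Equal(ids[i-1], ids[i]) {
			return true
		}
	}
	return false
}

func main() {
	ids := [][]byte{[]byte("b"), []byte("a"), []byte("b")}
	fmt.Println(hasDuplicateIDs(ids))
}
```

The trade-off is that sorting reorders the caller's slice and costs O(n log n) comparisons, whereas a set costs an allocation per batch unless it is pooled or reused.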

prateek · 2018-04-23T18:30:12Z

index/types.go

+
+	// InsertBatch inserts a batch of metrics into the index. The documents are guaranteed
+	// to be searchable all at once when the Batch method returns.
+	InsertBatch(d []doc.Document) error


Could you indicate in the return type which documents could not be indexed?

Any reason not to reject the entire batch? The way I viewed it, a batch is analogous to a transaction in a SQL database.

prateek · 2018-04-23T19:58:02Z

index/segment/mem/segment.go

+		d := ds[i]
+		err := d.Validate()
+		if err != nil {
+			return err


Instead of terminating early, could you continue attempting to index the remainder of the docs?

prateek · 2018-04-23T19:58:34Z

index/segment/mem/segment.go

+				// we need to index.
+				ds[i], ds[len(ds)] = ds[len(ds)], ds[i]
+				ds = ds[:len(ds)-1]
+				continue


Could you indicate this case in the returned type?

prateek · 2018-04-23T19:58:51Z

index/segment/mem/segment.go

+			}
+
+			if _, ok := s.writer.idSet.Get(d.ID); ok {
+				return errDuplicateID


Same as above: instead of terminating early, continue to insert the other docs.

prateek · 2018-04-25T15:39:36Z

index/types.go

+	Insert(d doc.Document) ([]byte, error)
+
+	// InsertBatch inserts a batch of metrics into the index. The documents are guaranteed
+	// to be searchable all at once when the Batch method returns.


Maybe document the types of errors this can return so it's not a complete surprise when people are downcasting.

Definitely, good idea!

prateek · 2018-04-25T15:43:06Z

index/segment/mem/segment.go

-		s.state.Unlock()
-		return errSegmentSealed
+		err = s.prepareDocsWithLocks(b)
+		if err != nil && !index.IsBatchPartialError(err) {


Don't you need to check whether partial updates are allowed and still insert those? It would be good to add a test case for this.

This is actually handled in prepareDocsWithLocks. If we don't support partial updates, it returns the error right away; otherwise it continues validating documents and returns a partial error (the corresponding test case is TestSegmentInsertBatchPartialError).

Cool, good stuff.
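The behavior described in this thread can be sketched roughly as follows. Only the names IsBatchPartialError and AllowPartialUpdates appear in the PR; the field layout of Batch, BatchError, and BatchPartialError, the validateBatch helper, and errNoFields are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Document is a simplified stand-in for doc.Document.
type Document struct {
	ID     string
	Fields []string
}

// Batch groups documents with the partial-update option (assumed layout).
type Batch struct {
	Docs                []Document
	AllowPartialUpdates bool
}

// BatchError records which document failed and why (assumed layout).
type BatchError struct {
	Idx int
	Err error
}

// BatchPartialError collects the per-document failures of a batch.
type BatchPartialError struct {
	Errs []BatchError
}

func (e *BatchPartialError) Error() string {
	return fmt.Sprintf("%d document(s) in the batch failed", len(e.Errs))
}

// IsBatchPartialError reports whether err is a *BatchPartialError.
func IsBatchPartialError(err error) bool {
	_, ok := err.(*BatchPartialError)
	return ok
}

var errNoFields = errors.New("document has no fields")

// validateBatch fails fast unless partial updates are allowed, in which case
// it keeps going and reports every invalid document by index.
func validateBatch(b Batch) error {
	var partial *BatchPartialError
	for i, d := range b.Docs {
		if len(d.Fields) > 0 {
			continue
		}
		if !b.AllowPartialUpdates {
			return errNoFields
		}
		if partial == nil {
			partial = &BatchPartialError{}
		}
		partial.Errs = append(partial.Errs, BatchError{Idx: i, Err: errNoFields})
	}
	if partial == nil {
		return nil // avoid returning a typed nil pointer as a non-nil error
	}
	return partial
}

func main() {
	b := Batch{
		Docs:                []Document{{ID: "a", Fields: []string{"x"}}, {ID: "b"}},
		AllowPartialUpdates: true,
	}
	err := validateBatch(b)
	fmt.Println(IsBatchPartialError(err), err)
}
```

A caller in this sketch treats a partial error as "some documents were skipped" and any other non-nil error as "the whole batch was rejected", which matches the `err != nil && !index.IsBatchPartialError(err)` check in the hunk above.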

prateek · 2018-04-26T15:13:53Z

index/segment/mem/segment.go

+
+			if _, ok := s.writer.idSet.Get(d.ID); ok {
+				if !b.AllowPartialUpdates {
+					return errDuplicateID


Could you make errDuplicateID an exported type? I want to be able to distinguish it from other failures downstream.
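To show what the request enables, here is a small sketch in which the error is exported as a sentinel value so downstream code can compare against it. The name ErrDuplicateID, its message, and the insert helper are assumptions for illustration; the PR may have chosen a different shape.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDuplicateID is a hypothetical exported sentinel for duplicate inserts.
var ErrDuplicateID = errors.New("a document with the same ID already exists")

// insert records id in seen, or returns ErrDuplicateID if it is already there.
func insert(seen map[string]struct{}, id string) error {
	if _, ok := seen[id]; ok {
		return ErrDuplicateID
	}
	seen[id] = struct{}{}
	return nil
}

func main() {
	seen := map[string]struct{}{}
	_ = insert(seen, "doc-1")
	if err := insert(seen, "doc-1"); err == ErrDuplicateID {
		// Downstream code can treat duplicates differently from other failures.
		fmt.Println("duplicate, skipping")
	}
}
```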

prateek

LGTM, save the couple of pending nits.

[index] Add Batch method for inserting multiple documents at a time

839c1ef

jeromefroe requested a review from prateek April 20, 2018 13:55

Jerome Froelich added 3 commits April 20, 2018 10:18

Fix metalint issues

278e968

Add generic map for IDs

02bc291

Minor style fixes from self-review

2933f25

prateek reviewed Apr 20, 2018

Jerome Froelich added 2 commits April 20, 2018 18:47

Address feedback from code review

9bdbb22

Don't let generated ids map file

09f4917

prateek reviewed Apr 23, 2018

jeromefroe force-pushed the jeromefroe/support-batch-inserts branch from af23ed8 to 833f0c7 Compare April 24, 2018 20:32

Support partial updates

55c2e2c

jeromefroe force-pushed the jeromefroe/support-batch-inserts branch from 833f0c7 to 55c2e2c Compare April 24, 2018 22:45

prateek reviewed Apr 25, 2018

prateek reviewed Apr 26, 2018

prateek approved these changes Apr 26, 2018

Explicitly state that a BatchPartialError may be returned

044873f

jeromefroe merged commit 42af1f7 into master Apr 26, 2018

jeromefroe deleted the jeromefroe/support-batch-inserts branch April 26, 2018 22:40
