Performance improvements for data loading process #4

Open
aleSuglia opened this issue Oct 13, 2019 · 2 comments

@aleSuglia

Hi,

First of all, thanks a lot for releasing this codebase to the public. While playing with the code, I realised that it assumes all the data must be loaded into memory before training starts. I suspect this approach will not scale well to very large datasets. A solution could be to define Img2TextDataset as an IterableDataset that supports streaming data.
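
For example, something along these lines (a rough sketch only; `StreamingImg2TextDataset`, the line-per-example manifest file, and the `transform` hook are assumptions for illustration, not the actual API of this repo):

```python
import torch
from torch.utils.data import IterableDataset

class StreamingImg2TextDataset(IterableDataset):
    """Streams examples from disk instead of materialising the
    whole dataset in memory before training starts."""

    def __init__(self, manifest_path, transform):
        super().__init__()
        self.manifest_path = manifest_path  # one serialized example per line
        self.transform = transform          # raw line -> model-ready tensors

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        with open(self.manifest_path) as f:
            for i, line in enumerate(f):
                # With num_workers > 0, shard the stream so each worker
                # yields a disjoint subset of the examples.
                if worker_info is not None and i % worker_info.num_workers != worker_info.id:
                    continue
                yield self.transform(line)
```

A `DataLoader` could then batch from this stream with `DataLoader(dataset, batch_size=...)` without ever holding the full dataset in memory.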

I've noticed that the current implementation of the dataset already has an __iter__ method. However, there seems to be an issue in the way you sample the elements of a given batch. Specifically, in seq2seq_loader, each example index in a batch is drawn with randint(0, len(self.ex_list)-1). This is incorrect because randint samples with replacement, so the elements within a batch are not guaranteed to be unique.
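
To make the problem concrete, here is a minimal sketch of two standard fixes (the function names are mine, and the commented-out line only paraphrases the pattern described above, not the exact repo code):

```python
import random

# Current pattern (sampling WITH replacement) -- a batch can contain
# the same example several times:
#   batch = [ex_list[randint(0, len(ex_list) - 1)] for _ in range(batch_size)]

def sample_batch(ex_list, batch_size):
    # Fix 1: draw all indices for the batch without replacement.
    indices = random.sample(range(len(ex_list)), batch_size)
    return [ex_list[i] for i in indices]

def epoch_batches(ex_list, batch_size):
    # Fix 2: shuffle once per epoch and slice, so every example is
    # visited exactly once per epoch and no batch has duplicates.
    order = list(range(len(ex_list)))
    random.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [ex_list[i] for i in order[start:start + batch_size]]
```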

I might have a fix for this soon, so I can send you a PR if you like :)

Thank you in advance for your answer!

Alessandro

@LuoweiZhou (Owner) commented Oct 14, 2019

Hi @aleSuglia, yes, you're right. With the current implementation (same as in UniLM), the sampled examples are not guaranteed to be unique. I haven't seen this affect training much, but please feel free to send your PR! Thanks.

@darkmatter08 (Contributor)

@aleSuglia I'd be interested in these improvements as well! Please do create a PR!
