Performance improvements for data loading process #4

Open
aleSuglia opened this issue Oct 13, 2019 · 2 comments

@aleSuglia

Hi,

First of all, thanks a lot for releasing this codebase to the public. While playing with the code, I realised that it assumes all the data must be loaded into memory before training starts. I suspect this approach will not scale well to very large datasets. A solution could be to define Img2TextDataset as an IterableDataset that supports streaming data.
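
For example, something along these lines (a rough sketch only; `StreamingImg2TextDataset`, the line-per-example manifest file, and the `transform` hook are assumptions for illustration, not the actual API of this repo):

```python
import torch
from torch.utils.data import IterableDataset

class StreamingImg2TextDataset(IterableDataset):
    """Streams examples from disk instead of materialising the
    whole dataset in memory before training starts."""

    def __init__(self, manifest_path, transform):
        super().__init__()
        self.manifest_path = manifest_path  # one serialized example per line
        self.transform = transform          # raw line -> model-ready tensors

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        with open(self.manifest_path) as f:
            for i, line in enumerate(f):
                # With num_workers > 0, shard the stream so each worker
                # yields a disjoint subset of the examples.
                if worker_info is not None and i % worker_info.num_workers != worker_info.id:
                    continue
                yield self.transform(line)
```

A `DataLoader` could then batch from this stream with `DataLoader(dataset, batch_size=...)` without ever holding the full dataset in memory.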

I've noticed that the current implementation of the dataset already has an __iter__ method. However, there seems to be an issue in the way you sample the elements of a given batch. Specifically, in seq2seq_loader, each example index in a batch is drawn with randint(0, len(self.ex_list)-1). This is incorrect because randint samples with replacement, so the elements within a batch are not guaranteed to be unique.
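
To make the problem concrete, here is a minimal sketch of two standard fixes (the function names are mine, and the commented-out line only paraphrases the pattern described above, not the exact repo code):

```python
import random

# Current pattern (sampling WITH replacement) -- a batch can contain
# the same example several times:
#   batch = [ex_list[randint(0, len(ex_list) - 1)] for _ in range(batch_size)]

def sample_batch(ex_list, batch_size):
    # Fix 1: draw all indices for the batch without replacement.
    indices = random.sample(range(len(ex_list)), batch_size)
    return [ex_list[i] for i in indices]

def epoch_batches(ex_list, batch_size):
    # Fix 2: shuffle once per epoch and slice, so every example is
    # visited exactly once per epoch and no batch has duplicates.
    order = list(range(len(ex_list)))
    random.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [ex_list[i] for i in order[start:start + batch_size]]
```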

I might have a fix for this soon, so I can send you a PR if you like :)

Thank you in advance for your answer!

Alessandro

@LuoweiZhou (Owner) commented Oct 14, 2019

Hi @aleSuglia, yes, you're right. With the current implementation (same as in UniLM), the sampled examples are not guaranteed to be unique. I haven't seen this affect training much, but please feel free to send your PR! Thanks.

@darkmatter08 (Contributor)

@aleSuglia I'd be interested in these improvements as well! Please do create a PR!
