First of all, thanks a lot for releasing this codebase to the public. I was playing with the code and I realised that it assumes all the data must be loaded into memory before training starts. I suspect this will not scale well to very large datasets. One solution could be to define `Img2TextDataset` as an `IterableDataset` that supports streams of data.
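To make the streaming idea concrete, here is a minimal sketch, not the repository's code. It assumes a hypothetical JSON Lines file with one example per line, and the class name `StreamingImg2TextDataset` and its parameters are made up for illustration. Examples are read lazily and grouped into batches, so the whole file never has to sit in memory.

```python
import json

try:
    from torch.utils.data import IterableDataset
except ImportError:
    # Fallback so the sketch still runs where PyTorch is not installed.
    IterableDataset = object


class StreamingImg2TextDataset(IterableDataset):
    """Hypothetical streaming dataset: one JSON-encoded example per line."""

    def __init__(self, path, batch_size):
        self.path = path
        self.batch_size = batch_size

    def _read_examples(self):
        # Read the file lazily, one line at a time.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    def __iter__(self):
        batch = []
        for example in self._read_examples():
            batch.append(example)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            # Emit the final, possibly smaller, batch.
            yield batch
```

Since the dataset already yields whole batches, it would presumably be wrapped in a `DataLoader` with `batch_size=None`. With several workers, each worker would iterate the full file unless the stream is sharded per worker, so that part would need extra care.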
I've noticed that the current implementation of the dataset already has an `__iter__` method. However, it seems to me that there might be an issue in the way the elements of a given batch are sampled. Specifically, in `seq2seq_loader`, every batch uses `randint(0, len(self.ex_list)-1)` to sample each example index. This is problematic because `randint` samples with replacement, so the sampled examples are not guaranteed to be unique.
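To illustrate the difference, here is a small sketch using only the standard library. The function names and the plain-list `ex_list` are stand-ins for illustration, not the actual loader code. The first function mimics the `randint` pattern described above. The second shuffles a permutation of indices once per epoch and slices it into batches, so every example appears exactly once.

```python
import random


def batches_with_replacement(ex_list, batch_size, n_batches, rng):
    """Mimics the randint-based pattern: indices may repeat."""
    for _ in range(n_batches):
        # randint is inclusive on both ends, hence len(ex_list) - 1.
        yield [ex_list[rng.randint(0, len(ex_list) - 1)]
               for _ in range(batch_size)]


def batches_without_replacement(ex_list, batch_size, rng):
    """Shuffles the indices once, then slices them into batches."""
    indices = list(range(len(ex_list)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        # The last batch may be smaller than batch_size.
        yield [ex_list[i] for i in indices[start:start + batch_size]]
```

With the second approach, one pass over the generator is a full epoch in which no example is repeated or skipped.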
I might have a fix for this soon, so I can send you a PR if you like :)
Thank you in advance for your answer!
Alessandro
Hi @aleSuglia, yes, you're right. With the current implementation (same for UniLM), the samples are not guaranteed to be unique. I haven't seen this affect training much, but please feel free to send your PR! Thanks.