Gradual slowdown of training in bigger batch sizes #3050
Comments
I think the issue still exists. The training also seems to be more or less as slow as before. How could I tackle this issue? I am mostly certain that it's the increased time of creating each batch that is causing the slowdown.
Hmm. It is indeed possible that the NoDuplicatesBatchSampler is introducing a larger overhead on larger batches. I don't see a clear path for improving this, however. I'm also not sure if increasing the number of dataloader workers would help here.
Yeah, it seems that increasing the number of dataloader workers does not affect the performance. And I have mostly made sure that NoDuplicatesBatchSampler is causing the issue, since the slowdown goes away when I use a standard batch sampler. Would there be any possible fixes for this?
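(For reference, a minimal sketch of how the two samplers can be swapped through the v3-style training arguments; the output directories and batch size below are placeholders, not values from this issue.)

```python
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

# Avoids in-batch duplicates (useful for MNRL), but adds sampling overhead.
args_no_duplicates = SentenceTransformerTrainingArguments(
    output_dir="output/mnrl-no-duplicates",
    per_device_train_batch_size=256,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

# Plain batching with no duplicate filtering, for comparison.
args_standard = SentenceTransformerTrainingArguments(
    output_dir="output/mnrl-standard",
    per_device_train_batch_size=256,
    batch_sampler=BatchSamplers.BATCH_SAMPLER,
)
```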
The fix would be to speed up the batch sampler, but I don't know if there's room for improvement. Perhaps it's faster to hash each text and compare based on that rather than doing set overlap with strings; see `sentence_transformers/sampler.py`, lines 188 to 198 at commit `e28f97d`.
Or perhaps set overlap to begin with is slower than e.g. doing set membership a few times over. I'm not sure.
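As a rough illustration of the hashing idea (this is not the library's actual sampler, just a sketch of the approach): precompute one set of integer hashes per sample, so the per-batch duplicate check intersects sets of ints rather than sets of full strings. Whether this actually wins would need benchmarking, since CPython already caches string hashes.

```python
from typing import Dict, Iterator, List

def no_duplicates_batches(samples: List[Dict[str, str]], batch_size: int) -> Iterator[List[int]]:
    # Hash every text once up front; later comparisons only touch small ints.
    sample_hashes = [
        frozenset(hash(value) for value in sample.values())
        for sample in samples
    ]
    remaining = list(range(len(samples)))
    while remaining:
        batch_indices: List[int] = []
        batch_hashes: set = set()
        skipped: List[int] = []
        for index in remaining:
            if len(batch_indices) == batch_size:
                skipped.append(index)  # batch already full, keep for a later batch
            elif batch_hashes & sample_hashes[index]:
                skipped.append(index)  # would duplicate a text in this batch
            else:
                batch_indices.append(index)
                batch_hashes |= sample_hashes[index]
        # The final, possibly incomplete batch is yielded too; drop_last handling is omitted.
        yield batch_indices
        remaining = skipped
```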
Okay, I will try looking into this. Along with that, the bottleneck could also be avoided if you have fewer duplicates in your training data, right?
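(As a side note, a hedged sketch of deduplicating a Hugging Face dataset up front; the column names "anchor" and "positive" are placeholders for whatever the dataset actually uses. With fewer repeated texts, NoDuplicatesBatchSampler has to skip and re-scan fewer samples per batch.)

```python
from datasets import Dataset

def deduplicate(dataset: Dataset, columns=("anchor", "positive")) -> Dataset:
    seen = set()

    def is_new(example: dict) -> bool:
        values = [example[column] for column in columns]
        if any(value in seen for value in values):
            return False  # drop rows that repeat an already-seen text
        seen.update(values)
        return True

    # Stateful filter: run single-process so the `seen` set is shared.
    return dataset.filter(is_new)
```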
I appreciate it!
Yeah, that makes sense. Thank you for the replies.
I am facing a very weird issue here.
Issue
As you can see, the training slows down for larger batch sizes.
Experiment Details
Training Code
Deepspeed Config (standard config, but providing it in case it helps)
GPU utilisation for different batch sizes
The VRAM usage is fairly constant throughout training and does not fluctuate much (barely 0.5-1%).
Conclusion
I am unable to understand why the training slows down as shown for larger batch sizes. For MNR loss, bigger batch sizes are preferred, and training should ideally also finish faster, provided it runs without this issue.
I have spent quite some time trying to understand what the issue is, but have been unable to do so. Any help will be appreciated. Thanks!
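(One way to confirm whether batch creation is the bottleneck is to iterate the DataLoader on its own, with no model forward or backward pass. The snippet below is only a sketch and assumes `trainer` is an already-constructed SentenceTransformerTrainer.)

```python
import time

# No GPU work happens in this loop; we only pull batches from the dataloader,
# which exercises the batch sampler and the collator.
loader = trainer.get_train_dataloader()

window_start = time.perf_counter()
for step, _batch in enumerate(loader, start=1):
    if step % 100 == 0:
        elapsed = time.perf_counter() - window_start
        # If this number keeps growing over the epoch, the slowdown comes from
        # batch sampling/collation rather than from the model or the GPU.
        print(f"steps {step - 99}-{step}: {elapsed:.2f}s for 100 batches")
        window_start = time.perf_counter()
```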