@pawopawo I've had no issues with distributed (multi-process) training on one node (machine). Multi-node (multiple GPUs + multiple machines) I've also tested briefly by hacking my distributed_train.sh with a master IP. I do 2x GPU all of the time, and more recently I've been doing 4x and 8x setups with no problems. I know others have used 8x with this codebase.
I doubt your problem has anything to do with the default code here. If a model or training process has some sort of conditional path that isn't followed equivalently on all nodes, you can end up with nodes going out of sync and getting stuck on a distributed gather/broadcast/reduce primitive. I don't think such a path exists without additional modifications. Your driver and CUDA version are a bit old; you could try updating, or using the latest NGC containers to see if that helps.
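
For reference, here is a minimal sketch of that out-of-sync scenario. This is generic PyTorch, not code from this repo; the launch method and the env vars it reads (torchrun / torch.distributed.launch setting `RANK`, `LOCAL_RANK`, `WORLD_SIZE`) are assumptions.

```python
import os
import torch
import torch.distributed as dist

# Minimal sketch of a rank-divergent collective (assumptions: NCCL backend,
# one GPU per process, launched with torchrun so RANK/LOCAL_RANK/WORLD_SIZE
# are set in the environment).
def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()

    loss = torch.ones(1, device="cuda") * rank

    # Deadlock pattern: a collective guarded by a rank- or data-dependent branch.
    # Ranks that take the branch block inside all_reduce waiting for ranks that
    # never call it; GPU utilization drops to 0 and the job hangs.
    #
    # if loss.item() > 0:                        # condition differs per rank
    #     dist.all_reduce(loss, op=dist.ReduceOp.SUM)

    # Correct pattern: every rank executes the same collectives in the same order.
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    if rank == 0:
        print("summed loss:", loss.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nnodes=2 --node_rank=<0|1> --nproc_per_node=8 --master_addr=<ip> script.py` on each node, this runs cleanly; uncomment the guarded all_reduce and the job hangs exactly as described above.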
Hi, the code always gets stuck when I use multi-node training and reach the fifth epoch; the GPU utilization suddenly drops to 0.