@pawopawo I've had no issues with distributed (multi-process) training on one node (machine). Multi-node (multiple GPUs + multiple machines) I've also tested briefly by hacking my distributed_train.sh with a master IP. I do 2x GPU all of the time, and more recently I've been doing 4x and 8x setups with no problems. I know others have used 8x with this codebase.
I doubt your problem has anything to do with the default code here. If a model or training process has some sort of conditional path that isn't followed equivalently on all nodes, you can end up with nodes going out of sync and getting stuck on a distributed gather/broadcast/reduce primitive. I don't think such a path exists without additional modifications. Your driver and CUDA version are a bit old; you could try updating, or using the latest NGC containers to see if that helps.
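
For reference, here is a minimal sketch of that out-of-sync scenario. This is generic PyTorch, not code from this repo; the launch method and the env vars it reads (torchrun / torch.distributed.launch setting `RANK`, `LOCAL_RANK`, `WORLD_SIZE`) are assumptions.

```python
import os
import torch
import torch.distributed as dist

# Minimal sketch of a rank-divergent collective (assumptions: NCCL backend,
# one GPU per process, launched with torchrun so RANK/LOCAL_RANK/WORLD_SIZE
# are set in the environment).
def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()

    loss = torch.ones(1, device="cuda") * rank

    # Deadlock pattern: a collective guarded by a rank- or data-dependent branch.
    # Ranks that take the branch block inside all_reduce waiting for ranks that
    # never call it; GPU utilization drops to 0 and the job hangs.
    #
    # if loss.item() > 0:                        # condition differs per rank
    #     dist.all_reduce(loss, op=dist.ReduceOp.SUM)

    # Correct pattern: every rank executes the same collectives in the same order.
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    if rank == 0:
        print("summed loss:", loss.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nnodes=2 --node_rank=<0|1> --nproc_per_node=8 --master_addr=<ip> script.py` on each node, this runs cleanly; uncomment the guarded all_reduce and the job hangs exactly as described above.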
Hi, the code always gets stuck when I use multi-node training and reach the fifth epoch; the GPU utilization suddenly drops to 0.