Hello all, hoping someone can help me. I’m having a problem training with DDP on two GPUs.
Training hangs at what looks to be the final batch of my second epoch. I'm fairly sure something is deadlocked, but I don't think it's anywhere in my code, based on the hook pseudocode here. I say this because I added print statements to most of those hooks, such as on_train_batch_start, as well as a print statement on the first line of training_step. The code gets into on_train_batch_start but never reaches training_step.
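In case it helps, here is a stripped-down sketch of how I instrumented the hooks. The model, loss, and optimizer below are placeholders rather than my real code, and the hook signatures are the Lightning 2.x ones; my actual module just adds these prints on top of the real logic.

```python
import pytorch_lightning as pl
import torch
from torch import nn


class LitModel(pl.LightningModule):
    """Stand-in module: only the print statements in the hooks matter here."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)  # placeholder model

    def on_before_batch_transfer(self, batch, dataloader_idx):
        print(f"on_before_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_after_batch_transfer(self, batch, dataloader_idx):
        print(f"on_after_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_train_batch_start(self, batch, batch_idx):
        print("on_train_batch_start")

    def training_step(self, batch, batch_idx):
        print("starting training_step")
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)  # placeholder loss
        self.log("train_loss", loss)
        return loss

    def on_before_optimizer_step(self, optimizer):
        print("on_before_optimizer")

    def on_train_batch_end(self, outputs, batch, batch_idx):
        print("on_train_batch_end")

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```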
Here is some output from the last completed iteration and the final, hung iteration:
Epoch 1: 100%|█████████████████████████████████████████████▉| 9417/9418 [15:28<00:00, 10.14it/s, v_num=16, train_loss=0.0147, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
starting training_step
starting training_step
on_before_optimizer
on_before_optimizer
on_train_batch_end
Epoch 1: 100%|███████████████████████████████████████████████| 9418/9418 [15:29<00:00, 10.14it/s, v_num=16, train_loss=0.014, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
A few curious things I noticed. Most importantly, only one GPU appears to take part in this final step. Second, on_before_batch_transfer and on_after_batch_transfer seem to run before on_train_batch_start, which is a different order from the linked docs above. I'm less worried about the latter, but since it's inconsistent with the docs I figured it was worth noting.
This does not happen when I use a single GPU.
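For reference, here is roughly how I launch the two-GPU run. This is simplified: MyDataModule is a stand-in for my real LightningDataModule, and the Trainer arguments are trimmed to the ones relevant here.

```python
import pytorch_lightning as pl

# Simplified launch; MyDataModule is a placeholder for my real datamodule,
# and LitModel is the instrumented module sketched above.
datamodule = MyDataModule()
model = LitModel()

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,        # the two-GPU run that hangs
    strategy="ddp",
    max_epochs=10,    # placeholder
)
trainer.fit(model, datamodule=datamodule)
```

The single-GPU run that works fine is the same script with devices=1 and no DDP strategy.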
Can anyone think of a reason this would hang, a place it could be hanging, or further debugging I could do to figure this out? Thank you so much in advance.