Hello all, hoping someone can help me. I’m having a problem training with DDP on two GPUs.
Training hangs at what looks to be the final batch of my second epoch. I suspect it is deadlocked somewhere, but I don't think the deadlock is in my code, based on the pseudocode here. I say this because I implemented most of those hooks, such as `on_train_batch_start`, with print statements, as well as a print statement on the first line of `training_step`. The code reaches `on_train_batch_start` but never `training_step`. Here is the output of the last completed iteration and the final, hung iteration:
```
Epoch 1: 100%|█████████████████████████████████████████████▉| 9417/9418 [15:28<00:00, 10.14it/s, v_num=16, train_loss=0.0147, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
starting training_step
starting training_step
on_before_optimizer
on_before_optimizer
on_train_batch_end
Epoch 1: 100%|███████████████████████████████████████████████| 9418/9418 [15:29<00:00, 10.14it/s, v_num=16, train_loss=0.014, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
```
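For reference, the instrumentation looks roughly like this. This is a minimal, self-contained sketch rather than my actual model: the class name, the toy data, and the `rank` prefixes in the prints are placeholders, the Trainer arguments just reflect a standard two-GPU DDP setup, and exact hook signatures differ slightly between Lightning versions.

```python
import torch
import pytorch_lightning as pl
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ProbeModule(pl.LightningModule):
    """Toy LightningModule whose hooks print so each rank's progress is visible."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def on_before_batch_transfer(self, batch, dataloader_idx):
        print(f"rank {self.global_rank}: on_before_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_after_batch_transfer(self, batch, dataloader_idx):
        print(f"rank {self.global_rank}: on_after_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_train_batch_start(self, batch, batch_idx):
        print(f"rank {self.global_rank}: on_train_batch_start, batch_idx {batch_idx}")

    def training_step(self, batch, batch_idx):
        print(f"rank {self.global_rank}: starting training_step, batch_idx {batch_idx}")
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def on_before_optimizer_step(self, optimizer):
        # Note: on Lightning 1.x this hook also takes an optimizer_idx argument.
        print(f"rank {self.global_rank}: on_before_optimizer_step")

    def on_train_batch_end(self, outputs, batch, batch_idx):
        print(f"rank {self.global_rank}: on_train_batch_end, batch_idx {batch_idx}")

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # Two-GPU DDP setup, the same strategy used for the real model.
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=3)
    trainer.fit(ProbeModule(), DataLoader(data, batch_size=4))
```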
A few curious things I noticed. Most importantly, only one GPU (one rank) seems to reach this final step: the hook prints appear once instead of twice. Second, `on_before_batch_transfer`/`on_after_batch_transfer` appear to run before `on_train_batch_start`, which differs from the order in the linked docs above. I'm less worried about the latter, but it is inconsistent with the docs, so I figured it was worth noting.
This does not happen when I use a single GPU.
Can anyone think of a reason this would be hanging, a place it could be hanging, or suggest more debugging I can do to figure this out? Thank you so much in advance.