DDP training hangs after `on_train_batch_start` and before `training_step`

Hello all, hoping someone can help me. I’m having a problem training with DDP on two GPUs.

Training hangs at what looks to be the final batch of my second epoch. I suspect a deadlock somewhere, but based on the hook pseudocode here I don’t think it’s anywhere in my code. I say this because I added print statements to most of those hooks, such as on_train_batch_start, as well as a print statement on the first line of training_step. The code gets into on_train_batch_start but never reaches training_step.
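For context, the instrumented hooks look roughly like this (a minimal sketch; my real model, layers, and loss are omitted, and exact hook signatures depend on the Lightning version):

```python
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    # ... layers, configure_optimizers, dataloaders, etc. omitted ...

    def on_before_batch_transfer(self, batch, dataloader_idx):
        print(f"on_before_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_after_batch_transfer(self, batch, dataloader_idx):
        print(f"on_after_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_train_batch_start(self, batch, batch_idx):
        print("on_train_batch_start")

    def training_step(self, batch, batch_idx):
        print("starting training_step")
        # ... actual forward pass and loss computation ...

    def on_before_optimizer_step(self, optimizer):
        print("on_before_optimizer")

    def on_train_batch_end(self, outputs, batch, batch_idx):
        print("on_train_batch_end")
```

Here is some output of the last iteration and the final hung iteration: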

Epoch 1: 100%|█████████████████████████████████████████████▉| 9417/9418 [15:28<00:00, 10.14it/s, v_num=16, train_loss=0.0147, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
starting training_step
starting training_step
on_before_optimizer
on_before_optimizer
on_train_batch_end
Epoch 1: 100%|███████████████████████████████████████████████| 9418/9418 [15:29<00:00, 10.14it/s, v_num=16, train_loss=0.014, val_loss/dataloader_idx_0=1.360]
on_train_batch_end 
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start

A few curious things I noticed. Most importantly, only one GPU seems to reach the hooks in this final step: each hook prints once instead of twice. Second, on_before_batch_transfer and on_after_batch_transfer are executed before on_train_batch_start, which differs from the order in the linked docs above. I’m less worried about the latter, but since it’s inconsistent with the docs I figured it was worth noting.

This does not happen when I use a single GPU.

Can anyone think of a reason this would be hanging, a place it could be hanging, or some more debugging I can do to figure this out? Thank you so much in advance.

Fixed my own problem. It turned out that my GPUs got different versions of the training dataloader, so one GPU was waiting for more batches when there were none left.

This happened because I set reload_dataloaders_every_n_epochs to 1. When reloading, I used an instance variable whose value differed between the GPUs. I changed that variable to a MeanMetric, it synced across processes automatically, and my code was good to go.
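Roughly what the change looked like (a minimal sketch, not my actual code; the metric name, the toy model, and the way the synced value feeds into the dataloader are just illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics import MeanMetric
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)
        # Before: a plain Python float, updated independently on each rank.
        # With reload_dataloaders_every_n_epochs=1, each rank rebuilt its
        # dataloader from a different value, got a different number of
        # batches, and one rank ended up waiting forever.
        # self.curriculum_score = 0.0

        # After: a MeanMetric. As a module attribute it is moved to the right
        # device by Lightning, and compute() syncs its state across DDP ranks,
        # so every rank rebuilds the dataloader from the same value.
        self.curriculum_score = MeanMetric()
        self.curriculum_score.update(0.0)  # seed so compute() is defined before the first update

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.curriculum_score.update(loss.detach())
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        # Re-invoked every epoch because of reload_dataloaders_every_n_epochs=1.
        # compute() reduces across ranks, so the value (and therefore the
        # dataloader length) is identical on every GPU.
        score = float(self.curriculum_score.compute())
        n = 1024 + int(100 * score)  # stand-in for whatever the synced value controls
        dataset = TensorDataset(torch.randn(n, 16), torch.randn(n, 1))
        return DataLoader(dataset, batch_size=32)
```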

Hey @mgaschenbeck

That’s awesome. Thanks for coming back to describe the solution! Indeed, dealing with uneven data across processes is nasty, and the easiest fix is usually to truncate the data to an evenly divisible size.
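For example, something along these lines (a minimal sketch with a made-up helper, assuming every rank builds its dataloader from the same underlying dataset):

```python
from torch.utils.data import Subset


def truncate_to_even_shards(dataset, world_size):
    # Drop the tail so len(dataset) divides evenly by world_size. If every
    # rank builds its dataloader from this truncated dataset, each rank gets
    # the same number of batches and no rank is left waiting at a collective
    # op for data that never arrives.
    usable = (len(dataset) // world_size) * world_size
    return Subset(dataset, range(usable))


# Usage sketch, e.g. inside train_dataloader():
# dataset = truncate_to_even_shards(full_dataset, torch.distributed.get_world_size())
```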