Hello all, hoping someone can help me. I’m having a problem training with DDP on two GPUs.
Training hangs at what looks to be the final batch of my second epoch. I suspect it is deadlocked somewhere, but I don't think the deadlock is in my code, based on the pseudocode here. I say this because I implemented most of those hooks, such as `on_train_batch_start`, with print statements, as well as a print statement on the first line of `training_step`. The code reaches `on_train_batch_start` but never `training_step`. Here is the output of the last completed iteration and the final, hung iteration:
```
Epoch 1: 100%|█████████████████████████████████████████████▉| 9417/9418 [15:28<00:00, 10.14it/s, v_num=16, train_loss=0.0147, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
starting training_step
starting training_step
on_before_optimizer
on_before_optimizer
on_train_batch_end
Epoch 1: 100%|███████████████████████████████████████████████| 9418/9418 [15:29<00:00, 10.14it/s, v_num=16, train_loss=0.014, val_loss/dataloader_idx_0=1.360]
on_train_batch_end
on_before_batch_transfer for dataloader_idx 0
on_after_batch_transfer for dataloader_idx 0
on_train_batch_start
```
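For reference, the instrumentation looks roughly like this. This is a minimal, self-contained sketch rather than my actual model: the class name, the toy data, and the `rank` prefixes in the prints are placeholders, the Trainer arguments just reflect a standard two-GPU DDP setup, and exact hook signatures differ slightly between Lightning versions.

```python
import torch
import pytorch_lightning as pl
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ProbeModule(pl.LightningModule):
    """Toy LightningModule whose hooks print so each rank's progress is visible."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def on_before_batch_transfer(self, batch, dataloader_idx):
        print(f"rank {self.global_rank}: on_before_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_after_batch_transfer(self, batch, dataloader_idx):
        print(f"rank {self.global_rank}: on_after_batch_transfer for dataloader_idx {dataloader_idx}")
        return batch

    def on_train_batch_start(self, batch, batch_idx):
        print(f"rank {self.global_rank}: on_train_batch_start, batch_idx {batch_idx}")

    def training_step(self, batch, batch_idx):
        print(f"rank {self.global_rank}: starting training_step, batch_idx {batch_idx}")
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def on_before_optimizer_step(self, optimizer):
        # Note: on Lightning 1.x this hook also takes an optimizer_idx argument.
        print(f"rank {self.global_rank}: on_before_optimizer_step")

    def on_train_batch_end(self, outputs, batch, batch_idx):
        print(f"rank {self.global_rank}: on_train_batch_end, batch_idx {batch_idx}")

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # Two-GPU DDP setup, the same strategy used for the real model.
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=3)
    trainer.fit(ProbeModule(), DataLoader(data, batch_size=4))
```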
A few curious things I noticed. Most importantly, only one GPU (one rank) seems to reach this final step: the hook prints appear once instead of twice. Second, `on_before_batch_transfer`/`on_after_batch_transfer` appear to run before `on_train_batch_start`, which differs from the order in the linked docs above. I'm less worried about the latter, but it is inconsistent with the docs, so I figured it was worth noting.
This does not happen when I use a single GPU.
Can anyone think of a reason this would be hanging, a place it could be hanging, or suggest more debugging I can do to figure this out? Thank you so much in advance.