DDP Training Stuck while GPU utilization is 100%


I have recently been using 16 GPUs to train a model with the DDP strategy.

Sometimes the program gets stuck at one training step while the utilization of all 16 GPUs stays at 100%. What makes it stranger is that this does not happen on every run.

I printed the pstack of the process for one GPU, and it seems to be waiting in NCCL's synchronize function, so my guess is that some information goes missing during the reduce step or something similar.
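As a Python-level complement to pstack, here is a minimal sketch that dumps the Python stack of every thread when a step does not finish in time, using the standard library's `faulthandler`. The helper name `run_step_with_dump` and the `step_fn` argument are hypothetical, chosen only for illustration; they are not part of Lightning or PyTorch.

```python
import faulthandler
import sys


def run_step_with_dump(step_fn, timeout_s, log_file=None):
    """Run step_fn(); if it is still running after timeout_s seconds,
    write the Python traceback of all threads to log_file.

    Hypothetical helper: step_fn stands in for one training step.
    """
    if log_file is None:
        log_file = sys.stderr
    # Arms a watchdog thread inside the interpreter. exit=False means the
    # process keeps running after the dump, so the hang can still be inspected.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=log_file, exit=False
    )
    try:
        return step_fn()
    finally:
        # Disarm the watchdog once the step has returned (or raised).
        faulthandler.cancel_dump_traceback_later()
```

This only shows where the Python side is blocked. For the NCCL side, setting the environment variable `NCCL_DEBUG=INFO` before launching makes NCCL log more detail, which may help narrow down which collective is stalling.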

So my first question is: has anyone else run into something like this, and how did you handle it?

My second question is: if my guess is correct, I could track the time of every training step and, if it exceeds a threshold, simply rerun that step. Are there any suggestions for how to implement this? A Callback does not seem to be a very simple way to achieve it.
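To make the idea concrete, here is a rough sketch of the detection half only: a watchdog that fires when a step runs longer than a threshold. The class `StepWatchdog` and its methods are hypothetical names, not a Lightning API. I assume `start_step` and `end_step` would be called around each step, for example from a callback's batch start and end hooks.

```python
import threading
import time


class StepWatchdog:
    """Calls on_timeout(step_idx) if a step runs longer than threshold_s.

    Hypothetical helper for illustration; not part of Lightning.
    """

    def __init__(self, threshold_s, on_timeout):
        self.threshold_s = threshold_s
        self.on_timeout = on_timeout
        self.durations = []  # wall-clock duration of every finished step
        self._timer = None
        self._start = None

    def start_step(self, step_idx):
        self._start = time.monotonic()
        # The timer runs on its own thread, so it can fire even while the
        # training thread is blocked inside a stuck step.
        self._timer = threading.Timer(
            self.threshold_s, self.on_timeout, args=(step_idx,)
        )
        self._timer.daemon = True
        self._timer.start()

    def end_step(self):
        # The step finished, so cancel the pending timeout and record the time.
        self._timer.cancel()
        self.durations.append(time.monotonic() - self._start)
```

A separate timer thread is used because a simple "measure after the step" check never runs if the step itself never returns. Note that this only detects the stall. Rerunning the step is harder, since every rank takes part in the gradient all-reduce and one rank cannot redo a step on its own.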

Thanks a lot

Hi @lsy643, may I ask which Lightning version you are using? Some bugs were identified with DDP that have already been fixed.
If you are still facing this issue with the latest version, please create an issue on GitHub.

Hi, I ran into the same problem described in the title. My PyTorch Lightning version is 1.4.2. Is this problem caused by the version?

Same problem on A100 with version 1.8.0.post1. It only happens sometimes, but everything works well on an RTX 3090.