I'm currently using 16 GPUs to train a model with the DDP strategy.
Sometimes the program gets stuck at one step of training while the utilization of all 16 GPUs stays at 100%. What makes it stranger is that this doesn't happen every time.
I printed the stack trace of the process for one GPU with pstack, and it seems to be waiting in an NCCL synchronize call, so my guess is that some message is lost during the reduce step or something similar.
So my first question is: has anyone else run into something like this, and how did you handle it?
My second question: if my guess is correct, I could track the time of every training step and, whenever it exceeds a threshold, just rerun that step. Any suggestions on the implementation? A Callback doesn't seem like a very simple way to achieve that.
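To make the idea concrete, here is a minimal sketch of what I have in mind (the class name, threshold value, and callback are placeholders I made up; this only detects a stalled step, it doesn't yet handle tearing down and re-initializing the NCCL process group, which I suspect is the hard part):

```python
import threading

class StepWatchdog:
    """Fire `on_stall` if a training step runs longer than `threshold_s` seconds."""

    def __init__(self, threshold_s, on_stall):
        self.threshold_s = threshold_s
        self.on_stall = on_stall
        self._timer = None

    def step_started(self):
        # Arm a one-shot timer at the beginning of each step.
        self._cancel()
        self._timer = threading.Timer(self.threshold_s, self.on_stall)
        self._timer.daemon = True  # don't keep the process alive on exit
        self._timer.start()

    def step_finished(self):
        # Step completed in time: disarm the timer so on_stall never fires.
        self._cancel()

    def _cancel(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

I would call `step_started()` / `step_finished()` around each training step (e.g. from `on_train_batch_start` / `on_train_batch_end` if this were a Lightning Callback), but I'm not sure the stall handler can actually interrupt a hung collective from another thread, which is why I'm asking.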
Thanks a lot