Training freezes at "initializing ddp: GLOBAL_RANK ..."

I’m trying to train LoFTR on 4 RTX 3090 GPUs on Ubuntu 18.04. When I start training, the output gets stuck at “initializing ddp: GLOBAL_RANK” and the terminal freezes (Ctrl+C no longer works).

I saw that others had this problem with certain PyTorch / PyTorch Lightning versions; however, I’m using pytorch-lightning==1.3.5 and pytorch=1.8.1, which no one else seemed to have problems with. Also, the authors trained LoFTR with the same environment, so I think the problem has to be somewhere else.

Does anyone have an idea what, apart from the PyTorch / PyTorch Lightning versions, could be the problem?

Thanks!

P.S. LoFTR does not use Slurm, which would otherwise have been another possible source of error.

Hey man, I had the same issue some time ago, but in my case I was working with 6 RTX A6000s in a Supermicro server. The issue was related to the way the GPUs intercommunicate.

I solved the problem by setting NCCL_P2P_DISABLE to 1 with the following:

export NCCL_P2P_DISABLE=1

and then running the .py file from the command line.

From my understanding, the GPUs try to communicate over NVLink (which I don’t have), so they just hang at that step, and the only way to end it is to kill the process.
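If you’d rather not export the variable in the shell every time, a minimal sketch of the same fix from inside Python is below (train.py and the Trainer arguments are just placeholders for whatever your entry point looks like):

import os

# NCCL reads this at process-group initialization, so it must be set at the
# very top of the training entry point, before any DDP setup runs.
os.environ["NCCL_P2P_DISABLE"] = "1"

# ... rest of the training script, e.g.
# trainer = pl.Trainer(gpus=4, accelerator="ddp")
# trainer.fit(model)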

I hope this helps you, good luck.


Thank you! In the end, my GPU wasn’t compatible with the CUDA version I was using.
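(For anyone who hits the same wall: a quick way to check this is to compare the GPU’s compute capability against the architectures your installed PyTorch build was compiled for; this is just an illustrative check, not part of LoFTR itself.)

import torch

print(torch.version.cuda)                   # CUDA version the wheel was built with
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6) for an RTX 3090
print(torch.cuda.get_arch_list())           # compiled architectures; should include sm_86 for a 3090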

@LuisVMoura thank you so much for this, it fixed the training freeze with DDP and NCCL for me.

Holy sh*t, I spent the whole week on this and then found your solution. Thanks, man.