I’m trying to train LoFTR on 4 RTX 3090 GPUs on Ubuntu 18.04. When I start training, the output gets stuck on “initializing ddp: GLOBAL_RANK” and the terminal freezes (Ctrl + C won’t work anymore).
I saw that others had this problem with certain PyTorch / PyTorch Lightning versions; however, I'm using pytorch-lightning==1.3.5 and pytorch=1.8.1, a combination no one else seemed to have problems with. Also, the authors trained LoFTR with the same environment, so I think the problem has to lie somewhere else.
Does anyone have an idea what, apart from the PyTorch / PyTorch Lightning versions, could be causing this?
Thanks!
P.S. LoFTR does not use SLURM, which would have been another possible source of error.
Hey man, I had the same issue some time ago, but in my case I was working with 6 RTX A6000s in a Supermicro server. The issue was related to the way the GPUs intercommunicate.
I solved the problem by setting NCCL_P2P_DISABLE to 1 with the following:
export NCCL_P2P_DISABLE=1
and then running the .py file from that same shell.
From what I understand, the GPUs try to communicate over NVLink (which I don't have), so they just hang while trying to do that, and the only way to end it is to kill the process.
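In case you can't (or don't want to) export the variable in the shell, you can also set it from inside the training script itself. This is just a minimal sketch, not LoFTR's actual entry point; the Trainer arguments in the comments are illustrative. The important part is that the assignment happens before any DDP/NCCL initialization:

import os

# Must be set before PyTorch Lightning / NCCL sets up the DDP process group,
# otherwise NCCL may have already chosen a P2P transport and the flag is ignored.
os.environ["NCCL_P2P_DISABLE"] = "1"

import pytorch_lightning as pl

# ... build your LoFTR LightningModule / DataModule as usual, then for example:
# trainer = pl.Trainer(gpus=4, accelerator="ddp", max_epochs=30)
# trainer.fit(model, datamodule=data_module)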