I’m trying to train LoFTR on 4 RTX 3090 GPUs on Ubuntu 18.04. When I start training, the output gets stuck on “initializing ddp: GLOBAL_RANK” and the terminal freezes (Ctrl + C won’t work anymore).
I saw that others had this problem with certain PyTorch / PyTorch Lightning versions; however, I’m using pytorch-lightning==1.3.5 and pytorch=1.8.1, which no one else seems to have reported problems with. Also, the authors trained LoFTR with the same environment, so I think the problem has to be somewhere else.
Does anyone have an idea what, apart from the PyTorch / PyTorch Lightning versions, could be causing this?
P.S. LoFTR does not use SLURM, so that can be ruled out as another possible source of error.
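To help narrow this down, here is a minimal sketch of how the training command could be relaunched with NCCL's diagnostic logging turned on, so the output shows where DDP initialization stalls. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are real NCCL environment variables; the helper names and the example command are hypothetical placeholders, not part of LoFTR, and whether disabling peer-to-peer transfers changes anything on this machine is only an assumption to test.

```python
import os
import subprocess


def nccl_debug_env(base=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics enabled.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # NCCL_DEBUG=INFO makes NCCL print its initialization and transport logs.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Assumption to test: GPU peer-to-peer transfers are involved in the hang.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_with_nccl_debug(cmd, disable_p2p=False):
    """Run `cmd` (a list of arguments) with the diagnostic environment."""
    return subprocess.run(cmd, env=nccl_debug_env(disable_p2p=disable_p2p))


# Hypothetical usage; replace the command with the actual training invocation:
# run_with_nccl_debug(["python", "train.py"], disable_p2p=True)
```

If the hang disappears with `disable_p2p=True`, that would point at GPU-to-GPU communication rather than the PyTorch / PyTorch Lightning versions.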