Training freezes at "initializing ddp: GLOBAL_RANK ..."

I’m trying to train LoFTR on 4 RTX 3090 GPUs on Ubuntu 18.04. When I start training, the output gets stuck at “initializing ddp: GLOBAL_RANK” and the terminal freezes (Ctrl+C no longer works).

I saw that others had this problem with certain PyTorch / PyTorch Lightning versions; however, I’m using pytorch-lightning==1.3.5 and pytorch=1.8.1, which no one else seemed to have problems with. Also, the authors trained LoFTR with the same environment, so I think the problem has to be somewhere else.

Does anyone have an idea what, apart from the PyTorch / PyTorch Lightning versions, could be the problem?

Thanks!

P.S. LoFTR does not use Slurm, which would otherwise have been another possible source of error.

Hey man, I had the same issue some time ago, but in my case I was working with 6 RTX A6000s in a Supermicro server. The issue was related to the way the GPUs intercommunicate.

I solved the problem by setting NCCL_P2P_DISABLE to 1 with the following:

export NCCL_P2P_DISABLE=1

and then running the .py file from the command line.

From my understanding, the GPUs try to communicate over NVLink (which I don’t have), so they just hang at that step, and the only way to end it is to kill the process.
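If you’d rather not export the variable in the shell every time, a minimal sketch of the same fix from inside Python is below (train.py and the Trainer arguments are just placeholders for whatever your entry point looks like):

import os

# NCCL reads this at process-group initialization, so it must be set at the
# very top of the training entry point, before any DDP setup runs.
os.environ["NCCL_P2P_DISABLE"] = "1"

# ... rest of the training script, e.g.
# trainer = pl.Trainer(gpus=4, accelerator="ddp")
# trainer.fit(model)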

I hope this helps you, good luck.


Thank you! In the end, my GPU wasn’t compatible with the CUDA version I was using.
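(For anyone who hits the same wall: a quick way to check this is to compare the GPU’s compute capability against the architectures your installed PyTorch build was compiled for; this is just an illustrative check, not part of LoFTR itself.)

import torch

print(torch.version.cuda)                   # CUDA version the wheel was built with
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6) for an RTX 3090
print(torch.cuda.get_arch_list())           # compiled architectures; should include sm_86 for a 3090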

@LuisVMoura thank you so much for this, it fixed the training freeze with DDP and NCCL for me.

Holy sh*t, I spent the whole week on this and then found your solution. Thanks, man.