DDP on 2 GPUs: No rendezvous handler for env://

jimtorch · January 28, 2021, 8:09am

I am testing a model with Lightning, and it has been working fine with 1 GPU. After adding a second GPU today, however, the following error occurred
(with gpus=2, distributed_backend='ddp' added to pl.Trainer):

raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

I am on Windows 10, PyTorch 1.7.1, pytorch_lightning 1.1.4, CUDA 11.0.

How should I fix or work around this problem?
Thanks!

jimtorch · February 5, 2021, 11:37am

I fixed this problem myself; it required hacking ddp_plugin.py.
Basically, you need to use the gloo backend and create a local rendezvous file instead of relying on env://.
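
For anyone landing here, a minimal sketch of the general idea in plain PyTorch follows. It is not the actual ddp_plugin.py patch, which was not posted. It assumes the process group is initialized with torch.distributed.init_process_group using backend="gloo" and a file:// init_method. The helper names rendezvous_url and init_gloo and the file name ddp_rendezvous are made up for illustration.

```python
import tempfile
from pathlib import Path


def rendezvous_url(path):
    """Turn a local path into a file:// URL usable as init_method.

    Hypothetical helper: resolves the path to an absolute one and lets
    pathlib build the URL (e.g. file:///C:/... on Windows).
    """
    return Path(path).resolve().as_uri()


def init_gloo(rank, world_size, rendezvous_file):
    """Initialize the default process group with gloo and a shared file.

    Every process must call this with the same rendezvous_file and
    world_size, and with its own rank (0 .. world_size - 1).
    """
    # Imported here so rendezvous_url can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",  # used here in place of the default NCCL backend
        init_method=rendezvous_url(rendezvous_file),
        rank=rank,
        world_size=world_size,
    )


if __name__ == "__main__":
    # Assumed location for the rendezvous file; any path that all
    # processes on the machine can read and write should do.
    shared_file = Path(tempfile.gettempdir()) / "ddp_rendezvous"
    print(rendezvous_url(shared_file))
```

Each of the two GPU processes would call init_gloo with world_size=2 and its own rank. The rendezvous file should not be left over from a previous run, so delete it before starting a new one. How to wire this into Lightning 1.1.4 depends on ddp_plugin.py and is not shown here.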

carlomarxdk · March 3, 2021, 9:31pm

@jimtorch what exactly did you change? I am in a similar situation and I have no idea what to do.
Thanks in advance.
