Multi-GPU with SLURM failed at initialization

Shiheng_Duan · March 9, 2022, 11:56pm

Hi all,

I’m using PyTorch Lightning on a server that uses SLURM for job submission. When I request 1 GPU and train on a single GPU, everything works well. When I request 2 GPUs and use DDP, the script gets stuck while initializing the processes and eventually fails with a time-out error.

Any idea why?

Thanks!
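
One thing that may be worth checking for a hang like this is whether the number of tasks SLURM starts matches the number of processes DDP expects. As far as I know, the Lightning SLURM docs ask for `--ntasks-per-node` to match `Trainer(devices=...)` and `--nodes` to match `Trainer(num_nodes=...)`, with the script launched through `srun`. Below is a minimal diagnostic sketch, not a confirmed fix for this thread. `check_slurm_ddp_env` is a hypothetical helper written for illustration; the only real names it relies on are the standard SLURM environment variables.

```python
import os


def check_slurm_ddp_env(devices, num_nodes=1, env=None):
    """Hypothetical helper: compare SLURM's task layout with the planned DDP layout.

    Returns a list of human-readable mismatch messages (empty if none found).
    """
    env = os.environ if env is None else env
    problems = []

    # DDP expects one process per device across all nodes.
    expected_tasks = devices * num_nodes
    ntasks = env.get("SLURM_NTASKS")
    if ntasks is None:
        problems.append("SLURM_NTASKS is not set; is this running inside a SLURM job?")
    elif int(ntasks) != expected_tasks:
        problems.append(
            f"SLURM_NTASKS={ntasks}, but devices * num_nodes = {expected_tasks}"
        )

    # The node count SLURM allocated should match what the Trainer is told.
    nnodes = env.get("SLURM_NNODES")
    if nnodes is not None and int(nnodes) != num_nodes:
        problems.append(f"SLURM_NNODES={nnodes}, but num_nodes={num_nodes}")

    return problems


if __name__ == "__main__":
    # Print what SLURM actually exported for this process.
    for key in ("SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_NNODES",
                "SLURM_PROCID", "SLURM_LOCALID"):
        print(f"{key}={os.environ.get(key)}")
    # Assumes the 2-GPU, single-node setup described in the post.
    for problem in check_slurm_ddp_env(devices=2, num_nodes=1):
        print("WARNING:", problem)
```

Running this inside the same job allocation, launched the same way as the training script, would show whether SLURM is only starting one task while DDP waits for two. A mismatch like that is one possible explanation for a time-out during initialization, but not the only one.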

goku · April 4, 2022, 1:59pm

We have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only after some time.

Thank you