Multi-GPU with SLURM fails at initialization

Hi all,

I’m using PyTorch Lightning on a server with SLURM as the job scheduler. When I request 1 GPU and run single-GPU training, everything works fine. When I request 2 GPUs and use DDP, the script gets stuck during process initialization and eventually fails with a timeout error.

Any idea why?
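For context, a hang like this often comes down to a mismatch between the SLURM task layout and the number of DDP processes Lightning expects (e.g. `Trainer(accelerator="gpu", devices=2, num_nodes=1, strategy="ddp")` with only one SLURM task, so the second rank never joins the rendezvous). A minimal sketch of a matching submission script, with partition, time limit, and script name as placeholders:

```shell
#!/bin/bash
# Hypothetical sbatch script for 2-GPU DDP on a single node.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # must equal devices= in the Trainer
#SBATCH --gres=gpu:2          # request 2 GPUs on the node
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Launch with srun so SLURM starts one task per DDP rank;
# Lightning picks up rank/world-size from the SLURM environment.
srun python train.py
```

If `--ntasks-per-node` is left at 1 while the Trainer expects 2 processes, rank 0 waits for a peer that was never launched and times out, which matches the symptom described above.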


We have moved the discussions to GitHub Discussions. You might want to post there instead to get a quicker response; the forums will be marked read-only after some time.

Thank you