Hi all,
I’m using PyTorch Lightning on a server with SLURM as the job scheduler. When I request 1 GPU and run single-GPU training, everything works fine. When I request 2 GPUs and switch to DDP, the script gets stuck during process initialization and eventually fails with a time-out error.
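For reference, here is roughly what my setup looks like (the toy model and data stand in for my real code; the exact Trainer flags below are my best reconstruction of the configuration I'm describing). The job is submitted through SLURM with 2 GPUs requested:

```python
import torch
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def main():
    dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 32), torch.randn(256, 1)
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)

    # Works with devices=1; hangs at process-group init
    # with devices=2 + DDP under SLURM.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        num_nodes=1,
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(ToyModel(), loader)


if __name__ == "__main__":
    main()
```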
Any idea why?
Thanks!