DDP MultiGPU Training does not reduce training time


I want to do multi-GPU training with a model. I have a node with 4 GPUs. Training with just 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch also takes 9 hours. There is no reduction whatsoever, neither in the number of batches per epoch nor in the time. It seems the extra GPUs are not having any effect.

The way I am calling the trainer is:

trainer = pl.Trainer(
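(The call is truncated in the original post; the actual arguments are not shown. For reference, a minimal sketch of what a 4-GPU DDP configuration looks like in Lightning 2.x, using the strategy string mentioned later in this post, with placeholder values elsewhere:)

```python
import pytorch_lightning as pl

# Sketch only -- the original Trainer arguments are not shown in the post.
# devices=4 requests all four GPUs explicitly instead of relying on defaults.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp_find_unused_parameters_true",
    max_epochs=10,  # assumption: placeholder value
)
```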

Lines I see in the logs when training with 4 GPUs are:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

All distributed processes registered. Starting with 1 processes


How can I actually check whether I am indeed working with 4 GPUs? I know that my system can see them (I checked this with torch.cuda.device_count()).
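One sanity check that needs nothing beyond the standard library: distributed launchers (torchrun, and Lightning's DDP subprocess launcher) set per-process environment variables such as `WORLD_SIZE` and `LOCAL_RANK`, so printing them from inside the training process shows how many processes were actually started. A minimal sketch (the defaults chosen below simply mean "single-process run"):

```python
import os

def dist_env_summary() -> dict:
    """Read the env vars that torch's distributed launchers set per process.

    In a healthy 4-GPU DDP run every process should report WORLD_SIZE=4 and
    a distinct LOCAL_RANK; the fallback values mean no launcher was involved.
    """
    return {
        "WORLD_SIZE": os.environ.get("WORLD_SIZE", "1"),
        "RANK": os.environ.get("RANK", "0"),
        "LOCAL_RANK": os.environ.get("LOCAL_RANK", "0"),
    }

print(dist_env_summary())
```

If `WORLD_SIZE` comes back as 1 (matching the "Starting with 1 processes" log line), Lightning never spawned the other ranks, regardless of how many GPUs `torch.cuda.device_count()` reports.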

Finally, I use ddp_find_unused_parameters_true instead of ddp because I use a torch.nn.Embedding, and not every minibatch retrieves all embedding indices, which apparently causes problems:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

Torch version: '2.0.1+cu117'
Pytorch lightning version: '2.0.6'


Hello @alejandrotejada

To make sure your training is using all GPUs, you can set devices=4 explicitly and check the output of nvidia-smi in your terminal.

Hi, thank you!

Actually, I solved that issue, but now the script gets stuck at the point of setting up the parallelization; it seems to be a problem with the backend.

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:15094 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to []:15094 (errno: 97 - Address family not supported by protocol).

I cannot solve this. I have seen some similar issues reported, but none of them had a clear answer.
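(Not verified for this exact setup, but these `errno: 97 - Address family not supported by protocol` lines come from the c10d TCP store trying to open an IPv6 socket (`[::]`) on a host without IPv6 support, and a commonly suggested workaround is to force an IPv4 rendezvous address before launching. Note that the `[W]` lines are warnings, so if the hang persists it may instead be NCCL-related, which `NCCL_DEBUG=INFO` can help diagnose. The interface name below is an assumption; replace it with one from `ip addr` on your machine:)

```shell
# Force an IPv4 rendezvous address for the c10d TCP store.
export MASTER_ADDR=127.0.0.1
# Pin NCCL to a known IPv4 interface (eth0 is an assumption -- check `ip addr`).
export NCCL_SOCKET_IFNAME=eth0
# Verbose NCCL logging, useful if the processes still hang after rendezvous.
export NCCL_DEBUG=INFO
```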


I have the same issue. Have you managed to solve it?