DDP MultiGPU Training does not reduce training time

Hello!

I want to do multi-GPU training with a model. I have a node with 4 GPUs. Training with 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch also takes 9 hours. There is no reduction whatsoever, neither in the number of batches per epoch nor in the time, so it seems the distribution is not having any effect.

The way I am calling the trainer is:

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=100,
    check_val_every_n_epoch=2,
    logger=wandb_logger,
    accelerator='gpu',
    devices=-1,
    strategy="ddp_find_unused_parameters_true")

Lines I see in the logs when training with 4 GPUs are:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

How can I actually check whether I am indeed working with 4 GPUs? I know that my system can see them (I checked this with torch.cuda.device_count()).
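
(For reference, this is roughly the check I ran; the device-name printout is just extra illustration.)

import torch

# Sanity check: PyTorch sees all four GPUs on this node.
print(torch.cuda.device_count())  # prints 4 here
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))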

Finally, I use ddp_find_unused_parameters_true instead of ddp because I use a torch.nn.Embedding and not every minibatch retrieves all of its indices, which apparently causes this problem:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
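
(For completeness, the explicit form the error message mentions would look something like the sketch below; DDPStrategy comes from pytorch_lightning.strategies, and the other Trainer arguments are the same as above.)

from pytorch_lightning.strategies import DDPStrategy

# Equivalent to strategy="ddp_find_unused_parameters_true"
trainer = pl.Trainer(
    accelerator='gpu',
    devices=-1,
    strategy=DDPStrategy(find_unused_parameters=True))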

Torch version: '2.0.1+cu117'
PyTorch Lightning version: '2.0.6'

Thanks!!!

Hello @alejandrotejada

To make sure your training is using all GPUs, you can set devices=4 explicitly and check the output of nvidia-smi in your terminal.
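
Something like this, keeping the rest of your arguments (logger omitted here for brevity):

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=100,
    check_val_every_n_epoch=2,
    accelerator='gpu',
    devices=4,  # explicit instead of devices=-1
    strategy="ddp_find_unused_parameters_true")

While it trains, nvidia-smi should show one process per GPU with non-zero utilization on all four.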

Hi, thank you!

Actually, I solved that issue, but now the script gets stuck at the point of initializing the distributed processes; it seems to be a problem with the backend.

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:15094 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to []:15094 (errno: 97 - Address family not supported by protocol).

I cannot solve this, even though I have seen some similar issues reported; none of them had a clear answer.

Thanks!!

I have the same issue. Have you managed to solve it?