I want to do multi-GPU training with a model. I have a node with 4 GPUs. Training with just 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch still takes 9 hours. There is no reduction at all, neither in the number of batches per epoch nor in the wall-clock time. It seems the extra GPUs are having no effect.
The way I am calling the trainer is:
trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=100,
    check_val_every_n_epoch=2,
    logger=wandb_logger,
    accelerator='gpu',
    devices=-1,
    strategy="ddp_find_unused_parameters_true",
)
Lines I see in the logs when training with 4 GPUs are:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
How can I actually check whether I am indeed working with 4 GPUs? I know that my system can see them (I checked this with
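One way to verify this independently of the startup banner is to look at the DDP environment variables that the launcher sets in each spawned worker process (inside a `LightningModule` you can also print `self.trainer.world_size`). A minimal sketch; the simulated env dict at the bottom is only an illustration of what a correctly launched rank-1 worker on a 4-GPU node would see:

```python
import os

def report_distributed_setup(env=os.environ):
    """Return the DDP-related variables set per worker process.

    With 4 GPUs actually in use under DDP, each of the 4 processes
    should see WORLD_SIZE=4 and a distinct LOCAL_RANK in 0..3.
    If these are missing (or WORLD_SIZE is 1), only one process
    is training.
    """
    return {
        "WORLD_SIZE": env.get("WORLD_SIZE", "not set"),
        "LOCAL_RANK": env.get("LOCAL_RANK", "not set"),
        "NODE_RANK": env.get("NODE_RANK", "not set"),
    }

# Simulated example: what rank 1 of a healthy 4-GPU run would report
print(report_distributed_setup({"WORLD_SIZE": "4", "LOCAL_RANK": "1", "NODE_RANK": "0"}))
```

Calling `report_distributed_setup()` with no argument inside each worker (e.g. from `on_train_start`) shows whether 4 processes were really spawned; `torch.cuda.device_count()` only tells you what is visible, not what is used.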
Finally, I use `ddp_find_unused_parameters_true` instead of `ddp` because I use a `torch.nn.Embedding` and not every minibatch retrieves all of its indices, which apparently causes this error:
RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
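For reference, the second form the error message suggests, passing a `DDPStrategy` object, would look roughly like the sketch below. Passing `devices=4` explicitly (an assumption on my part, instead of `-1`) also rules out device auto-detection as a variable; the other Trainer arguments are the ones from the call above, minus the logger:

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=100,
    check_val_every_n_epoch=2,
    accelerator="gpu",
    devices=4,  # explicit GPU count instead of -1 auto-detection
    strategy=DDPStrategy(find_unused_parameters=True),
)
```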
PyTorch Lightning version: