Hello!
I want to do multi-GPU training of a model on a node with 4 GPUs. Training with just 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch also takes 9 hours: there is no reduction at all, neither in the number of batches per epoch nor in the wall-clock time. It seems the extra GPUs are having no effect.
The way I am calling the trainer is:
```python
trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=100,
    check_val_every_n_epoch=2,
    logger=wandb_logger,
    accelerator='gpu',
    devices=-1,
    strategy="ddp_find_unused_parameters_true",
)
```
The lines I see in the logs when training with 4 GPUs are:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
```
How can I actually check whether I am indeed training on all 4 GPUs? I know my system can see them (I checked this with `torch.cuda.device_count()`).
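For instance, would something like this be a reasonable check? A minimal sketch, assuming a hypothetical `MyModel` LightningModule; `on_train_start`, `self.trainer.world_size`, `self.global_rank` and `self.local_rank` are standard Lightning hooks/attributes as far as I understand.

```python
import torch
import torch.distributed as dist
import pytorch_lightning as pl


class MyModel(pl.LightningModule):  # hypothetical module, just to show the check
    def on_train_start(self):
        # With DDP on 4 GPUs I would expect world_size == 4 and one process per rank 0..3,
        # so this line should be printed four times, once per rank.
        print(
            f"world_size={self.trainer.world_size}, "
            f"global_rank={self.global_rank}, "
            f"local_rank={self.local_rank}, "
            f"cuda_device={torch.cuda.current_device()}"
        )
        if dist.is_available() and dist.is_initialized():
            print(f"torch.distributed world size: {dist.get_world_size()}")
```

I could also watch `nvidia-smi` while training and check that all four GPUs show a process with non-trivial utilization, but I would prefer a programmatic check as well.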
Finally, I use `ddp_find_unused_parameters_true` instead of `ddp` because I use a `torch.nn.Embedding` and not every minibatch retrieves all of its indices, which apparently causes this problem:

```
RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
```
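For completeness, the equivalent strategy-object form mentioned in that error message would be something like this (a sketch; I am currently using the string form, and I assume both behave the same):

```python
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    accelerator='gpu',
    devices=-1,
    strategy=DDPStrategy(find_unused_parameters=True),  # same effect as the string value
    # ... same remaining arguments as above
)
```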
Torch version: `2.0.1+cu117`
PyTorch Lightning version: `2.0.6`
Thanks!!!