DistributedDataParallel multi GPU barely faster than single GPU

I’ve added a second RTX 3090 to my system to speed up fine-tuning of a BERT model. I’ve added sync_dist=True to the self.log calls in training_step and validation_step, and configured the Trainer as follows:

import pytorch_lightning as pl

trainer = pl.Trainer(
  logger=logger,
  max_epochs=N_EPOCHS,
  callbacks=[checkpoint_callback],
  accelerator="gpu",
  devices="auto",  # use every visible GPU
  strategy="ddp_find_unused_parameters_false"  # DDP without the unused-parameter search
)

As far as I’m aware, this is all that needs to be done for PyTorch Lightning to use multiple GPUs. With one GPU (RTX 3090) I was able to run batch_size=8 and lr=2e-5, which took roughly 41 minutes per epoch. After adding a second 3090 with the same batch_size (which I believe now gives a global batch size of 16, since batch_size is per GPU under DDP), it takes ~36 minutes per epoch, only a 5-minute difference. What am I doing wrong?
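To put numbers on that comparison, here is a small arithmetic sketch. It assumes batch_size is per process under DDP and that each process sees its own shard of the data; the function names and `dataset_size` are hypothetical placeholders, not values from the notebook.

```python
import math


def ddp_epoch_stats(dataset_size, per_gpu_batch_size, num_gpus):
    """Return (global_batch_size, optimizer_steps_per_epoch) under DDP.

    Assumes every process gets a 1/num_gpus shard of the dataset and
    that batch_size is applied per process.
    """
    global_batch = per_gpu_batch_size * num_gpus
    samples_per_rank = math.ceil(dataset_size / num_gpus)
    steps = math.ceil(samples_per_rank / per_gpu_batch_size)
    return global_batch, steps


def scaling_efficiency(t_single, t_multi, num_gpus):
    """Return (observed speedup, speedup relative to ideal linear scaling)."""
    speedup = t_single / t_multi
    return speedup, speedup / num_gpus


# dataset_size is a made-up placeholder, not the real dataset size.
dataset_size = 16_000
print(ddp_epoch_stats(dataset_size, 8, 1))  # (8, 2000)
print(ddp_epoch_stats(dataset_size, 8, 2))  # (16, 1000)

# Epoch times from the post: 41 min on one GPU, 36 min on two.
speedup, efficiency = scaling_efficiency(41, 36, 2)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions the number of optimizer steps per epoch halves, so ideal scaling would be about 20.5 minutes per epoch; 36 minutes is a speedup of roughly 1.14x, or about 57% of linear.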

Both GPUs are at ~100% utilisation, and I can see that ~22GB of VRAM is in use on each during training.

I’ve tried adding pin_memory=True to the DataLoader, as well as increasing num_workers, both of which had negligible effects.

I assumed that doubling the VRAM would allow double the batch size and so nearly halve the training time. The full code I’m following is in a Colab notebook here. I’m using bert-large-cased.

Interestingly, I can run batch_size=10 on a single GPU, but when I try to run that on 2 GPUs I run out of memory and training never starts. I would think that if 10 fits on one, 10 would fit on each of two?
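One possible contributor, offered as a rough sketch rather than a diagnosis: besides weights, gradients and optimizer state, DDP by default keeps all-reduce communication buckets about the same total size as the gradients, so each GPU has a little less headroom for activations than in single-GPU training. The parameter count (about 335M for a BERT-large-sized model), fp32 storage and an Adam-style optimizer are all assumptions here, not details taken from the notebook.

```python
def static_memory_gib(num_params, bytes_per_value=4, ddp=False):
    """Rough per-GPU memory (GiB) that does not depend on batch size.

    Counts weights, gradients and two Adam moment buffers. With ddp=True
    it adds one more gradient-sized copy for DDP's communication buckets
    (the default behaviour when gradient_as_bucket_view is False).
    """
    copies = 4  # weights + grads + Adam exp_avg + Adam exp_avg_sq
    if ddp:
        copies += 1  # all-reduce buckets, same total size as the grads
    return num_params * bytes_per_value * copies / 1024**3


NUM_PARAMS = 335_000_000  # assumed, approximate size of a BERT-large model

single = static_memory_gib(NUM_PARAMS)
multi = static_memory_gib(NUM_PARAMS, ddp=True)
print(f"single GPU: {single:.2f} GiB, DDP: {multi:.2f} GiB, "
      f"extra: {multi - single:.2f} GiB")
```

With ~22GB of a 24GB card already in use at batch_size=8, an extra gradient-sized buffer could plausibly be enough to tip batch_size=10 over the limit. If that is the cause, DistributedDataParallel's gradient_as_bucket_view=True option is meant to avoid the extra copy.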

I’ve tried using precision=16, which improved things slightly but required switching to BCEWithLogitsLoss(). That makes the loss higher, so it would need more epochs to converge. I ran profiler='simple' on a reduced dataset and got the following results:

[Profiler output: single GPU]

[Profiler output: multi GPU]

Nothing immediately jumps out as being incorrect, but someone else may be able to parse these results better than me.
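On the precision=16 point above: BCEWithLogitsLoss expects raw logits and applies the sigmoid internally, so on its own it should give the same value as a sigmoid followed by BCELoss. The pure-Python sketch below illustrates this with the standard numerically stable formula. Whether the notebook's forward pass still applies a sigmoid before the loss is an assumption that has not been checked; if it does, the sigmoid is applied twice, which would be one way to end up with a higher loss.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, target):
    """Binary cross-entropy on a probability (the quantity BCELoss computes)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logit, target = 2.0, 1.0
correct = bce_with_logits(logit, target)          # loss on the raw logit
same = bce(sigmoid(logit), target)                # sigmoid, then plain BCE
double = bce_with_logits(sigmoid(logit), target)  # sigmoid applied twice

print(f"{correct:.4f} {same:.4f} {double:.4f}")
```

The first two values agree, while the double-sigmoid value is noticeably larger for this confident, correct prediction.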

I tried using Colossal-AI as I thought maybe the model was too complex, but that also requires precision=16 and doesn’t resolve my initial confusion as to why doubling the resources doesn’t roughly halve the training time.

I was skimming this and clicked through to your Colab notebook, and I noticed you have it set so that anyone can edit. You’ll probably want to change it so that it’s read-only for those with the link.

If I was evil I could erase it all!