I’ve added a second 3090 to my system to try to speed up fine-tuning of a BERT model. I’ve added sync_dist=True to the logging calls in training_step and validation_step (a rough sketch of those calls is just below the Trainer config) and configured the Trainer like this:
trainer = pl.Trainer(
logger=logger,
max_epochs=N_EPOCHS,
callbacks=[checkpoint_callback],
accelerator="gpu",
devices="auto",
strategy="ddp_find_unused_parameters_false"
)
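For reference, the sync_dist=True part is just on the self.log calls inside my LightningModule; a rough sketch of what I mean (method bodies trimmed, and the class/field names are placeholders rather than the exact ones from the notebook):
import pytorch_lightning as pl

class BertClassifier(pl.LightningModule):  # placeholder name
    def training_step(self, batch, batch_idx):
        loss, outputs = self(batch["input_ids"], batch["attention_mask"], batch["labels"])
        # sync_dist=True averages the logged value across both GPU processes
        self.log("train_loss", loss, prog_bar=True, logger=True, sync_dist=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, outputs = self(batch["input_ids"], batch["attention_mask"], batch["labels"])
        self.log("val_loss", loss, prog_bar=True, logger=True, sync_dist=True)
        return loss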
As far as I’m aware, this is all that needs to be done for PyTorch Lightning to use multiple GPUs. With one GPU (RTX 3090) I was able to run batch_size=8 and lr=2e-5, which gave roughly 41 minutes per epoch. After adding a second 3090, with the same batch_size (although I believe this now technically becomes a global batch size of 16), it’s ~36 minutes per epoch, only a 5 minute difference. What am I doing wrong?
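If my understanding of DDP is right, the batch_size I pass to each DataLoader is per device, so the arithmetic I’m assuming looks like this (a sketch; the linear LR scaling line is just a common rule of thumb, not something I’ve actually applied):
import torch

BATCH_SIZE = 8                                # per-device batch size I'm using
LR = 2e-5

n_devices = max(torch.cuda.device_count(), 1)
effective_batch = BATCH_SIZE * n_devices      # 16 with both 3090s visible
scaled_lr = LR * n_devices                    # linear scaling rule (assumption)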
Both GPUs sit at ~100% utilisation and each uses ~22GB of VRAM during training.
I’ve tried adding pin_memory=True to the DataLoader, as well as increasing num_workers, both of which had a negligible effect.
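Roughly what those DataLoader changes looked like (a sketch; the dummy dataset just stands in for my real tokenised data):
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.zeros(100, 512, dtype=torch.long))  # stand-in

train_loader = DataLoader(
    train_dataset,
    batch_size=8,       # per device
    shuffle=True,
    num_workers=8,      # tried a few values here
    pin_memory=True,    # negligible effect on epoch time
)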
I assumed that doubling the VRAM would allow double the batch size and therefore nearly halve the training time. The full code I’m following is in a Colab notebook here. I’m using bert-large-cased.
Interestingly, I can run batch_size=10 on a single GPU, but when I try that on two GPUs I run out of memory and training never starts. I would have thought that if 10 fits on one GPU, 10 would fit on each of two?
I’ve tried using precision=16, which improved things slightly but required BCEWithLogitsLoss(), which makes the loss higher and so would need more epochs to converge. I ran profiler='simple' on a reduced dataset and got the following results:
Nothing immediately jumps out as being incorrect, but someone else may be able to parse these results better than me.
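For clarity, the precision=16 change I’m referring to is just these two pieces (a sketch; my actual loss/head code is in the notebook):
import torch.nn as nn
import pytorch_lightning as pl

# loss on raw logits instead of sigmoid + BCELoss, so fp16/autocast is happy
criterion = nn.BCEWithLogitsLoss()

trainer = pl.Trainer(
    accelerator="gpu",
    devices="auto",
    strategy="ddp_find_unused_parameters_false",
    precision=16,   # mixed precision
)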
I tried using ColossalAI, thinking the model might just be too complex, but that also requires precision=16 and doesn’t resolve my original confusion as to why doubling the resources doesn’t roughly halve the training time.