I’ve added a second 3090 to my system to try to speed up fine-tuning of a BERT model. I’ve added a validation_step and configured the Trainer as:

trainer = pl.Trainer(
    logger=logger,
    max_epochs=N_EPOCHS,
    callbacks=[checkpoint_callback],
    accelerator="gpu",
    devices="auto",
    strategy="ddp_find_unused_parameters_false",
)
As far as I’m aware, this is all that needs to be done for PyTorch Lightning to use multiple GPUs. With one GPU (RTX 3090) I was able to run batch_size=8 and lr=2e-5, which resulted in roughly ~41 minutes per epoch. After adding a second 3090, with the same per-GPU batch_size (although I believe it should now technically be a global batch size of 16), it’s ~36 minutes per epoch, only a 5 minute improvement. What am I doing wrong?
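For what it’s worth, here’s my understanding of why the step count (not the per-step time) is what should roughly halve. This is a minimal sketch with assumed numbers (a per-GPU batch of 8 and a made-up dataset size of 100k samples), not my actual data:

```python
# Hypothetical numbers: per-GPU batch_size=8, dataset of 100k samples (assumed).
def steps_per_epoch(num_samples, per_gpu_batch, num_gpus):
    # Under DDP the DistributedSampler splits the dataset across ranks,
    # so the effective (global) batch is per_gpu_batch * num_gpus and
    # each epoch needs correspondingly fewer optimiser steps.
    global_batch = per_gpu_batch * num_gpus
    return -(-num_samples // global_batch)  # ceiling division

print(steps_per_epoch(100_000, 8, 1))  # 12500 steps on 1 GPU
print(steps_per_epoch(100_000, 8, 2))  # 6250 steps on 2 GPUs, i.e. ~half
```

So with DDP the per-step work per GPU is unchanged; the speedup should come from halving the number of steps, minus gradient all-reduce and data-loading overhead.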
Both GPUs are at ~100% utilisation and I can see that ~22GB of VRAM is being used each during training.
I’ve tried adding pin_memory=True to the DataLoader, as well as increasing num_workers; both had negligible effect.
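For reference, this is roughly how I’m constructing the DataLoader. The dataset here is a stand-in tensor dataset rather than my actual tokenised BERT inputs, and the specific values (batch_size=8, num_workers=4) are the ones I’ve been experimenting with:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 64 fake samples of length-128 feature vectors.
dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))

loader = DataLoader(
    dataset,
    batch_size=8,      # per-GPU batch size under DDP
    num_workers=4,     # parallel CPU workers for loading/collation
    pin_memory=True,   # page-locked host memory for faster host-to-device copies
    shuffle=True,      # Lightning replaces this with a DistributedSampler under DDP
)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([8, 128])
```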
I assumed that doubling the VRAM would allow double the batch size and so nearly halve the training time. The full code I’m following is in a Colab notebook here.
Interestingly, I can run batch_size=10 on a single GPU, but when I try that on 2 GPUs I run out of memory and training never starts. I would have thought that if 10 fits on one GPU, 10 would fit on each of two?
I’ve tried precision=16, which improved things slightly but required switching to BCEWithLogitsLoss(), which makes the reported loss higher, so convergence would take more epochs. I ran
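On the loss change: my understanding is that autocast under precision=16 rejects BCELoss as numerically unsafe in half precision, while BCEWithLogitsLoss fuses the sigmoid into a stable log-sum-exp. A minimal sketch (made-up logits and targets) showing the two agree in fp32 when BCEWithLogitsLoss is fed raw logits:

```python
import torch
import torch.nn as nn

# Made-up example values, not from my training run.
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss takes raw logits and applies sigmoid internally
# in a numerically stable way; BCELoss needs probabilities.
stable = nn.BCEWithLogitsLoss()(logits, targets)
unfused = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(stable, unfused))  # True
```

So if the loss value jumped after the switch, one thing worth checking is whether the model still applies a sigmoid before the loss (it shouldn’t with BCEWithLogitsLoss).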
profiler='simple' on a reduced dataset and got the following results:
Nothing immediately jumps out as being incorrect, but someone else may be able to parse these results better than me.
I tried ColossalAI in case the model was too complex, but that also requires precision=16 and doesn’t resolve my original confusion as to why doubling the resources doesn’t roughly halve the training time.