I started training a model on two GPUs, using the following trainer:
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=[0, 2], accelerator='gpu', precision=16, max_epochs=2000,
    callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
    gradient_clip_val=5.0, gradient_clip_algorithm='norm')
The checkpoint callback is set to save the three best checkpoints (ranked by validation loss) as well as the last one:
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    save_top_k=3,
    mode="min",
    save_last=True,
)
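For reference, the checkpoint path I resume from can be read off the callback itself after a run; a minimal sketch (best_model_path and last_model_path are attributes of ModelCheckpoint, the actual path values are specific to my setup):

# After training has run, the callback exposes where it wrote its checkpoints.
best_ckpt_path = checkpoint_callback.best_model_path   # best of the top-3 checkpoints by val_loss
last_ckpt_path = checkpoint_callback.last_model_path   # written because save_last=True
print(best_ckpt_path, last_ckpt_path)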
Training halted unexpectedly, and I now want to resume it. I tried to do so by configuring the trainer as follows:
trainer = pl.Trainer(
    devices=[2, 0], accelerator="gpu", precision=16, max_epochs=2000,
    callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
    gradient_clip_val=5.0, gradient_clip_algorithm='norm',
    resume_from_checkpoint="path/to/checkpoint.ckpt")
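I then launch the resumed run the same way as the original one; schematically (model and dm here are placeholder names for my own LightningModule and LightningDataModule):

# `model` is my LightningModule and `dm` my LightningDataModule (placeholder names).
trainer.fit(model, datamodule=dm)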
But after the two distributed processes are initialized and the validation sanity check completes, this crashes at the first step of the resumed training epoch, with a long traceback that ends in:
File "/home/username/miniconda3/lib/
python3.8/site-packages/torch/optim/_functional.py", line 86, in adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!
So it seems that not all tensors end up on the same device: judging from the traceback, Adam's restored state (exp_avg) and the gradients it is applied to sit on different GPUs. I suspect this has to do with how the checkpoint is loaded. Am I doing something wrong here? Is resuming a multi-GPU run like this even possible, and if so, how do I do it correctly?
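For what it's worth, this is a sketch of how I can check the device placement of the saved optimizer state directly; I'm assuming the Lightning checkpoint keeps the optimizer state under the 'optimizer_states' key and that Adam's per-parameter buffers (exp_avg, exp_avg_sq) are stored as tensors:

import torch

# Load the checkpoint without map_location so tensors come back on the
# devices they were saved from, then print where each piece of Adam state lives.
ckpt = torch.load("path/to/checkpoint.ckpt")

# 'optimizer_states' is assumed to be a list with one state_dict per optimizer.
for opt_state in ckpt.get("optimizer_states", []):
    for param_id, state in opt_state.get("state", {}).items():
        for name, value in state.items():
            if torch.is_tensor(value):
                print(param_id, name, value.device)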
(If I try to resume with a trainer that’s set to use just one GPU, there’s no problem.)
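For completeness, the single-GPU resume that does work looks roughly like this; the only change is the devices argument, and the specific index 0 is just an example:

# Same configuration, but restricted to a single device; this resumes fine.
trainer = pl.Trainer(
    devices=[0], accelerator="gpu", precision=16, max_epochs=2000,
    callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
    gradient_clip_val=5.0, gradient_clip_algorithm='norm',
    resume_from_checkpoint="path/to/checkpoint.ckpt")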