I started training a model on two GPUs, using the following trainer:
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=[0, 2], accelerator='gpu', precision=16, max_epochs=2000,
    callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
    gradient_clip_val=5.0, gradient_clip_algorithm='norm')
The checkpoint callback is set to save the three best checkpoints (ranked by validation loss) as well as the last one:
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    save_top_k=3,
    mode="min",
    save_last=True,
)
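For reference, the checkpoint path I resume from can be read off the callback itself after a run; a minimal sketch (best_model_path and last_model_path are attributes of ModelCheckpoint, the actual path values are specific to my setup):

# After training has run, the callback exposes where it wrote its checkpoints.
best_ckpt_path = checkpoint_callback.best_model_path   # best of the top-3 checkpoints by val_loss
last_ckpt_path = checkpoint_callback.last_model_path   # written because save_last=True
print(best_ckpt_path, last_ckpt_path)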
Training halted unexpectedly, and I now want to resume it. I tried to do so by configuring the trainer as follows:
trainer = pl.Trainer(
    devices=[2, 0], accelerator="gpu", precision=16, max_epochs=2000,
    callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
    gradient_clip_val=5.0, gradient_clip_algorithm='norm',
    resume_from_checkpoint="path/to/checkpoint.ckpt")
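I then launch the resumed run the same way as the original one; schematically (model and dm here are placeholder names for my own LightningModule and LightningDataModule):

# `model` is my LightningModule and `dm` my LightningDataModule (placeholder names).
trainer.fit(model, datamodule=dm)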
But after the two distributed processes are initialized and the validation sanity check completes, this crashes at the first step of the resumed training epoch, with a long traceback that ends in:
File "/home/username/miniconda3/lib/
python3.8/site-packages/torch/optim/_functional.py", line 86, in adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!
So it seems that not all tensors end up on the same device: judging from the traceback, Adam's restored state (exp_avg) and the gradients it is applied to sit on different GPUs. I suspect this has to do with how the checkpoint is loaded. Am I doing something wrong here? Is resuming a multi-GPU run like this even possible, and if so, how do I do it correctly?
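For what it's worth, this is a sketch of how I can check the device placement of the saved optimizer state directly; I'm assuming the Lightning checkpoint keeps the optimizer state under the 'optimizer_states' key and that Adam's per-parameter buffers (exp_avg, exp_avg_sq) are stored as tensors:

import torch

# Load the checkpoint without map_location so tensors come back on the
# devices they were saved from, then print where each piece of Adam state lives.
ckpt = torch.load("path/to/checkpoint.ckpt")

# 'optimizer_states' is assumed to be a list with one state_dict per optimizer.
for opt_state in ckpt.get("optimizer_states", []):
    for param_id, state in opt_state.get("state", {}).items():
        for name, value in state.items():
            if torch.is_tensor(value):
                print(param_id, name, value.device)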
(If I try to resume with a trainer that’s set to use just one GPU, there’s no problem.)
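For completeness, the single-GPU resume that does work looks roughly like this; the only change is the devices argument, and the specific index 0 is just an example:

# Same configuration, but restricted to a single device; this resumes fine.
trainer = pl.Trainer(
    devices=[0], accelerator="gpu", precision=16, max_epochs=2000,
    callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
    gradient_clip_val=5.0, gradient_clip_algorithm='norm',
    resume_from_checkpoint="path/to/checkpoint.ckpt")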