I'm running a multi-GPU setup where the model was trained for 12 epochs. On resuming training from the checkpoint, the run gets stuck right after printing "resuming checkpoint" and makes no further progress. What could be the issue?
Here is the Trainer object:
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp_notebook",
    callbacks=[lr_monitor, checkpoint_callback, RichProgressBar()],
    precision=32,
    sync_batchnorm=True,
    log_every_n_steps=50,
    max_epochs=20,
    resume_from_checkpoint="output/isnet_checkpoints/epoch=11-step=21240.ckpt",
    logger=wandb_logger,
)
trainer.fit(model, train_dataloader)
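
In case the Lightning version is relevant: on Lightning >= 2.0 the resume_from_checkpoint argument no longer exists on the Trainer, and the checkpoint path is passed to fit() instead. A minimal sketch of the equivalent resume call, assuming the same model, dataloader, and Trainer arguments as above (minus resume_from_checkpoint):

# Lightning >= 2.0 sketch: remove resume_from_checkpoint from pl.Trainer(...)
# and pass the checkpoint path to fit() via ckpt_path instead.
trainer.fit(
    model,
    train_dataloader,
    ckpt_path="output/isnet_checkpoints/epoch=11-step=21240.ckpt",
)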