Training stuck on resume

I'm running a multi-GPU setup where the model was trained for 12 epochs.
On resuming training from the checkpoint, training gets stuck and does not progress past the “resuming checkpoint” message.

What could be the issue?

Here is the trainer object:

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp_notebook",
        callbacks=[lr_monitor, checkpoint_callback, RichProgressBar()],
        precision=32,
        sync_batchnorm=True,
        log_every_n_steps=50,
        max_epochs=20,
        resume_from_checkpoint="output/isnet_checkpoints/epoch=11-step=21240.ckpt",
        logger=wandb_logger
    )

    trainer.fit(model, train_dataloader)
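For context (and this is an assumption about which Lightning version is involved): the `resume_from_checkpoint` Trainer argument was deprecated in Lightning 1.5 and removed in 2.0, in favour of passing `ckpt_path` to `trainer.fit()`. A minimal sketch of the newer form, keeping the same checkpoint path:

    import pytorch_lightning as pl

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp_notebook",
        # ... same callbacks/logger/etc. as above, minus resume_from_checkpoint ...
        max_epochs=20,
    )
    # ckpt_path replaces resume_from_checkpoint (Lightning >= 1.5 / 2.x)
    trainer.fit(
        model,
        train_dataloader,
        ckpt_path="output/isnet_checkpoints/epoch=11-step=21240.ckpt",
    )

If you're on a 2.x release, the old argument would normally raise an error rather than hang, so this is only worth checking, not a confirmed cause.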

Hi, I’ve got a similar issue: the resumed training hangs one epoch after loading the checkpoint.