I'm running a multi-GPU setup where the model was trained for 12 epochs. On resuming training from the checkpoint, the run gets stuck right after printing "resuming checkpoint" and makes no further progress. What could be the issue?
Here is the Trainer object:
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp_notebook",
    callbacks=[lr_monitor, checkpoint_callback, RichProgressBar()],
    precision=32,
    sync_batchnorm=True,
    log_every_n_steps=50,
    max_epochs=20,
    resume_from_checkpoint="output/isnet_checkpoints/epoch=11-step=21240.ckpt",
    logger=wandb_logger,
)
trainer.fit(model, train_dataloader)
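
In case the Lightning version is relevant: on Lightning >= 2.0 the resume_from_checkpoint argument no longer exists on the Trainer, and the checkpoint path is passed to fit() instead. A minimal sketch of the equivalent resume call, assuming the same model, dataloader, and Trainer arguments as above (minus resume_from_checkpoint):

# Lightning >= 2.0 sketch: remove resume_from_checkpoint from pl.Trainer(...)
# and pass the checkpoint path to fit() via ckpt_path instead.
trainer.fit(
    model,
    train_dataloader,
    ckpt_path="output/isnet_checkpoints/epoch=11-step=21240.ckpt",
)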