I don’t understand how to resume training from the last checkpoint.
The following: trainer = pl.Trainer(gpus=1, default_root_dir=save_dir)
saves checkpoints but does not resume from the last one.
The following code starts the training from scratch (but I read that it should resume):
@davide Try creating a new Trainer instance with the parameter “resume_from_checkpoint” set to the path of the .ckpt file you stored after your previous training:
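A minimal sketch of that (the model class and checkpoint path here are illustrative, not from your setup):

```python
import pytorch_lightning as pl

# MyModel is a placeholder for your own LightningModule subclass.
model = MyModel()

trainer = pl.Trainer(
    gpus=1,
    resume_from_checkpoint="./checkpoints/last.ckpt",  # .ckpt from the previous run
)
# fit() restores the weights, optimizer state, and epoch counter
# from the checkpoint before continuing.
trainer.fit(model)
```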
If you want to automatically resume from the best weights according to some metric, you can set up ModelCheckpoint to monitor a particular metric and track the best one; then you can use glob.glob('./checkpoints/*.ckpt') and do some parsing to get the path of the best checkpoint.
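One way to do that parsing (assuming the common ModelCheckpoint filename style that embeds the metric, e.g. 'epoch=3-val_loss=0.25.ckpt'):

```python
import glob
import os
import re

def best_checkpoint(paths, metric="val_loss", mode="min"):
    """Pick the checkpoint whose filename embeds the best metric value.

    Assumes filenames like 'epoch=3-val_loss=0.25.ckpt', i.e. the
    '{epoch}-{val_loss:.2f}' naming pattern. Files that do not contain
    the metric (such as 'last.ckpt') are skipped.
    """
    scored = []
    for p in paths:
        m = re.search(rf"{metric}=(\d+(?:\.\d+)?)", os.path.basename(p))
        if m:
            scored.append((float(m.group(1)), p))
    if not scored:
        raise ValueError(f"no checkpoint filename contains '{metric}='")
    pick = min if mode == "min" else max
    return pick(scored)[1]

# Usage: scan the checkpoint directory and resume from the best file.
# ckpt = best_checkpoint(glob.glob("./checkpoints/*.ckpt"))
# trainer = pl.Trainer(resume_from_checkpoint=ckpt)
```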
Hello,
I think you forgot to mention that you need to give the trainer more epochs (e.g. pl.Trainer(max_epochs=7, resume_from_checkpoint='./checkpoints/last.ckpt')). For example, if your last checkpoint was saved at epoch 3 (max_epochs=3), then you need to raise the limit (max_epochs=7) for training to resume; otherwise it will not do anything (I tested that, and it took me hours to figure out).
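To make the pitfall concrete, a sketch (the path and epoch counts are illustrative):

```python
import pytorch_lightning as pl

# Suppose './checkpoints/last.ckpt' was saved by a run with max_epochs=3.
# Resuming with max_epochs=3 again does nothing: the restored epoch
# counter already equals the target, so fit() returns immediately.
trainer = pl.Trainer(
    max_epochs=7,  # must exceed the epoch stored in the checkpoint
    resume_from_checkpoint="./checkpoints/last.ckpt",
)
# trainer.fit(model)  # model: your LightningModule instance
```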
Hey,
There’s also the argument ckpt_path in trainer.fit(), where:
ckpt_path: Path/URL of the checkpoint from which training is resumed. Could also be one of two special keywords “last” and “hpc”. If there is no checkpoint file at the path, an exception is raised. If resuming from mid-epoch checkpoint, training will start from the beginning of the next epoch.
Hope it helps.