I don’t understand how to resume the training (from the last checkpoint).
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir)
saves but does not resume from the last checkpoint.
The following code starts the training from scratch (but I read that it should resume):
logger = TestTubeLogger(save_dir=save_dir, name="default", version=0)
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir, logger = logger)
This is not what I wanted; I would like to resume automatically from the last checkpoint.
I don’t think that’s possible since a new Trainer instance won’t have any info regarding the checkpoint state saved in the previous training.
@davide Try creating a new Trainer instance with the “resume_from_checkpoint” parameter set to the path of the .ckpt file saved during your previous training run:
trainer = pl.Trainer(gpus=1, logger=logger, resume_from_checkpoint="path/to/ckpt/file/checkpoint.ckpt")
Training should then continue from the epoch stored in the checkpoint.
@davide +1 to the above. You’ll need to tackle this in separate parts:
- locate path to checkpoint
- pass in a consistent filepath into
- automatically load that checkpoint (+1 to @andrey_s)
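As a rough sketch of the "locate, then load" idea above: the helper below picks the most recently modified .ckpt file under a directory. The name latest_checkpoint is made up for illustration, and using file modification time as the meaning of "last" is an assumption, not something Lightning guarantees.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the most recently modified .ckpt file under ckpt_dir, or None."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None
    # Search subdirectories too, since loggers often nest checkpoints.
    candidates = list(root.rglob("*.ckpt"))
    if not candidates:
        return None
    return str(max(candidates, key=lambda p: p.stat().st_mtime))


# Usage sketch (not executed here; save_dir and logger are from the question):
#   ckpt = latest_checkpoint(save_dir)  # None on the very first run
#   trainer = pl.Trainer(gpus=1, logger=logger, resume_from_checkpoint=ckpt)
```

Returning None when nothing is found means the same script can be used for the first run and for every restart.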
If you want to automatically resume from the best weights according to some metric, you can set up
ModelCheckpoint to monitor a particular metric and track the best one. Then you can use
glob.glob('./checkpoints/*.ckpt') and do some parsing to get the path of the checkpoint with the best metric value.
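The parsing step could look roughly like this. It assumes the checkpoints were saved with the metric in the filename, e.g. a ModelCheckpoint filename template such as "{epoch}-{val_loss:.2f}", which yields names like epoch=2-val_loss=0.25.ckpt. The function name best_checkpoint and the metric name val_loss are illustrative only.

```python
import glob
import os
import re
from typing import Optional


def best_checkpoint(ckpt_dir: str, metric: str = "val_loss", mode: str = "min") -> Optional[str]:
    """Pick the checkpoint whose filename carries the best `metric=<number>` value."""
    pattern = re.compile(re.escape(metric) + r"=(-?\d+(?:\.\d+)?)")
    scored = []
    for path in glob.glob(os.path.join(ckpt_dir, "*.ckpt")):
        match = pattern.search(os.path.basename(path))
        if match:  # files without the metric (e.g. last.ckpt) are skipped
            scored.append((float(match.group(1)), path))
    if not scored:
        return None
    pick = min if mode == "min" else max
    return pick(scored)[1]
```

Use mode="min" for losses and mode="max" for scores such as accuracy.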
I think you forgot to mention that you need to add more epochs to the trainer (e.g.
pl.Trainer(max_epochs=7, resume_from_checkpoint='./checkpoints/last.ckpt')). For example, if your last checkpoint was saved at epoch 3
(max_epochs=3), then you need to raise the limit
(max_epochs=7) for training to begin; otherwise it will not do anything (I tested this, and it took me hours to figure out).
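A small guard can catch this before you waste time on it. The sketch below assumes the checkpoint is a dictionary with an "epoch" entry, as Lightning checkpoints have; the exact meaning of that number may differ between versions, so treat the result as approximate. The helper name epochs_left is hypothetical.

```python
from typing import Any, Mapping


def epochs_left(checkpoint: Mapping[str, Any], max_epochs: int) -> int:
    """Epochs that would still run when resuming with the given max_epochs."""
    return max(0, max_epochs - int(checkpoint["epoch"]))


# Usage sketch (not executed here; needs torch and a real checkpoint file,
# and torch.load may need extra arguments depending on the torch version):
#   import torch
#   checkpoint = torch.load("./checkpoints/last.ckpt", map_location="cpu")
#   if epochs_left(checkpoint, max_epochs=7) == 0:
#       raise ValueError("max_epochs is too small; resuming would train nothing")
```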
Hope it helps,
Peace and out!
Thanks for mentioning the
max_epochs argument. I am able to resume training from the last saved checkpoint (.ckpt file).
There’s also the ckpt_path argument of trainer.fit(), documented as:
ckpt_path: Path/URL of the checkpoint from which training is resumed. Could also be one of two special
keywords “last” and “hpc”. If there is no checkpoint file at the path, an exception is raised. If resuming from mid-epoch checkpoint, training will start from the beginning of the next epoch.
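Since a missing file raises an exception, a first run with a hard-coded path would fail. One possible workaround, sketched below, is to pass the path only when the file exists and None otherwise, so that training starts from scratch. The helper name ckpt_path_if_exists and the variable model are hypothetical.

```python
import os
from typing import Optional


def ckpt_path_if_exists(path: str) -> Optional[str]:
    """Return path if it points to an existing file, otherwise None."""
    return path if os.path.isfile(path) else None


# Usage sketch (not executed here; model is your LightningModule):
#   trainer = pl.Trainer(gpus=1, max_epochs=7, default_root_dir=save_dir)
#   trainer.fit(model, ckpt_path=ckpt_path_if_exists("./checkpoints/last.ckpt"))
```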
Hope it helps.