My version of PyTorch Lightning is 1.6.
First I trained the model for 100000 steps (12500 steps per epoch), with the last state saved by a checkpoint callback. Now I want to continue training for 5000 more steps, but with some regularization added to the loss function. So the changes I made were: modify the loss computation in the training_step function, increase the Trainer's max_steps, and set the resume_from_checkpoint path (I don't know if this is the right way to do it).
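For concreteness, this is roughly what my resume code looks like. `MyModel`, `train_loader`, and the checkpoint path are placeholders for my actual code; also, as far as I understand, in Lightning 1.6 the `resume_from_checkpoint` Trainer argument is deprecated and passing `ckpt_path` to `trainer.fit()` is the recommended form, so I show that variant here:

```python
import pytorch_lightning as pl

# Placeholder: MyModel's training_step now adds the regularization
# term to the loss; everything else is unchanged from the first run.
model = MyModel()

trainer = pl.Trainer(
    # max_steps must cover the 100000 steps already trained
    # plus the 5000 new ones, otherwise training stops immediately.
    max_steps=105_000,
)

trainer.fit(
    model,
    train_loader,
    # Placeholder path: in 1.6, ckpt_path here replaces the deprecated
    # Trainer(resume_from_checkpoint=...) argument and should restore
    # epoch, global_step, and optimizer state.
    ckpt_path="lightning_logs/version_0/checkpoints/last.ckpt",
)
```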
But when I restart training, it does not continue from the checkpoint: it starts at epoch=0, and the loss is nan. Shouldn't it start at epoch=8? Does this mean the training was not successfully resumed?
Also, a new version directory is generated under lightning_logs for each run.
How can I change the code to meet my needs?