My version of PyTorch Lightning is 1.6.
First I trained the model for 100000 steps (12500 steps per epoch), with the last state saved by a checkpoint callback. Now I want to continue training for 5000 more steps, but with some regularization added to the loss function. So the changes I made were: modify the loss computation in the training_step function, increase the Trainer's max_steps, and set the resume_from_checkpoint path (I don't know if this is the right way to do it).
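For concreteness, this is roughly what my resume code looks like. `MyModel`, `train_loader`, and the checkpoint path are placeholders for my actual code; also, as far as I understand, in Lightning 1.6 the `resume_from_checkpoint` Trainer argument is deprecated and passing `ckpt_path` to `trainer.fit()` is the recommended form, so I show that variant here:

```python
import pytorch_lightning as pl

# Placeholder: MyModel's training_step now adds the regularization
# term to the loss; everything else is unchanged from the first run.
model = MyModel()

trainer = pl.Trainer(
    # max_steps must cover the 100000 steps already trained
    # plus the 5000 new ones, otherwise training stops immediately.
    max_steps=105_000,
)

trainer.fit(
    model,
    train_loader,
    # Placeholder path: in 1.6, ckpt_path here replaces the deprecated
    # Trainer(resume_from_checkpoint=...) argument and should restore
    # epoch, global_step, and optimizer state.
    ckpt_path="lightning_logs/version_0/checkpoints/last.ckpt",
)
```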
But when I restart training, it does not continue from the checkpoint: it starts at epoch=0, and the loss is nan. Shouldn't it start at epoch=8? Does this mean the training was not successfully resumed?
Also, a new version directory is generated under lightning_logs for each run.
How can I change the code to meet my needs?