I’m currently training a model on a large dataset and need to save checkpoints at regular intervals during each training epoch. I’ve set up ModelCheckpoint
with the following configuration:
checkpoint_callback = ModelCheckpoint(every_n_train_steps=1000, save_top_k=-1, save_on_train_epoch_end=False)
This saves a checkpoint every 1000 training steps as expected. However, when I resume from one of these mid-epoch checkpoints, training restarts from the beginning of that epoch instead of continuing from the exact global step at which the checkpoint was saved.
Does anyone know what might be causing this issue? Any guidance would be much appreciated!
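For context, here is a minimal sketch of my setup and how I'm resuming (assuming PyTorch Lightning; `MyModel`, `train_loader`, and the checkpoint path are placeholders, not my actual names):

```python
# Sketch of the checkpoint/resume flow, assuming PyTorch Lightning.
# MyModel, train_loader, and the checkpoint filename are placeholders.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    every_n_train_steps=1000,       # save a checkpoint every 1000 training steps
    save_top_k=-1,                  # keep every checkpoint, not just the best k
    save_on_train_epoch_end=False,  # don't also save at epoch boundaries
)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=10)

# Initial run:
# trainer.fit(MyModel(), train_loader)

# Resuming: pass the saved checkpoint via ckpt_path so the Trainer restores
# the model weights, optimizer state, and loop counters from that file.
# trainer.fit(MyModel(), train_loader, ckpt_path="checkpoints/step=1000.ckpt")
```

The `trainer.fit(..., ckpt_path=...)` call is what I use to resume; I expected it to restore the global step within the epoch, not just the epoch counter.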