I’m currently training a model on a large dataset and need to save checkpoints at regular intervals during each training epoch. I’ve set up ModelCheckpoint
with the following configuration:
checkpoint_callback = ModelCheckpoint(every_n_train_steps=1000, save_top_k=-1, save_on_train_epoch_end=False)
This saves a checkpoint every 1000 training steps as expected. However, when I resume from one of these mid-epoch checkpoints, training restarts from the beginning of that epoch instead of continuing from the exact global step at which the checkpoint was saved.
Does anyone know what might be causing this issue? Any guidance would be much appreciated!
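For context, here is a minimal sketch of my setup and how I'm resuming (assuming PyTorch Lightning; `MyModel`, `train_loader`, and the checkpoint path are placeholders, not my actual names):

```python
# Sketch of the checkpoint/resume flow, assuming PyTorch Lightning.
# MyModel, train_loader, and the checkpoint filename are placeholders.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    every_n_train_steps=1000,       # save a checkpoint every 1000 training steps
    save_top_k=-1,                  # keep every checkpoint, not just the best k
    save_on_train_epoch_end=False,  # don't also save at epoch boundaries
)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=10)

# Initial run:
# trainer.fit(MyModel(), train_loader)

# Resuming: pass the saved checkpoint via ckpt_path so the Trainer restores
# the model weights, optimizer state, and loop counters from that file.
# trainer.fit(MyModel(), train_loader, ckpt_path="checkpoints/step=1000.ckpt")
```

The `trainer.fit(..., ckpt_path=...)` call is what I use to resume; I expected it to restore the global step within the epoch, not just the epoch counter.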