When I reduce the size of my dataset (so that one epoch takes only about 10 seconds), the saved model checkpoints get the suffix ‘-v1’, e.g. ‘mycheckpoint-v1.ckpt’. According to the docs this should only happen if save_top_k >= 2:
If save_top_k >= 2 and the callback is called multiple times inside an epoch, the name of the saved file will be appended with a version count starting with v1.
My checkpoint callback and trainer look as follows:
checkpoint_cb = ModelCheckpoint(save_top_k=1,
                                monitor=None,
                                dirpath=f"checkpoints/{dataset}/{run_name}/",
                                filename=f"mycheckpoint",
                                save_weights_only=True)
trainer = Trainer(max_epochs=epochs,
                  accelerator='cpu', devices=1,
                  check_val_every_n_epoch=1,
                  callbacks=[lr_monitor_cb, checkpoint_cb],
                  log_every_n_steps=10)
Any ideas why this may happen? I rely on the checkpoints having a specific name without a version suffix.
In your code the checkpoint path is parameterized with run_name. Did you change this between trainer runs? You wouldn’t want two different runs writing checkpoints to the same directory.
We generally append the -v version suffix to the names because we don’t want to overwrite (and thereby delete) a user’s existing checkpoints, which could be very damaging. Lightning can’t know whether a colliding checkpoint name belongs to a different run or the current one.
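For illustration, a minimal sketch of giving every run its own checkpoint directory so filenames can never collide across runs (the run name below is just a placeholder, e.g. a timestamp or a wandb run id):

import time
from pytorch_lightning.callbacks import ModelCheckpoint

run_name = f"run-{int(time.time())}"  # placeholder: any identifier unique per run
checkpoint_cb = ModelCheckpoint(
    save_top_k=1,
    monitor=None,
    dirpath=f"checkpoints/my_dataset/{run_name}/",  # unique directory per run
    filename="mycheckpoint",
    save_weights_only=True,
)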
Thanks for the answer.
The run_name is determined by wandb and will never be the same for different runs.
Currently, for a single run, there is only one checkpoint (with -v1) in the directory after all epochs have completed; the directory is empty otherwise. So I would be surprised if -v1 were appended to avoid overwriting something else, since nothing else is there.
The problem also only occurs when I reduce the size of the dataset: if I train on the full dataset (where a single epoch takes much longer), the checkpoints get the correct name.
Is it possible for you to provide a minimal (single file) runnable example that shows this happening (can also be a Google Colab)? I can’t quite see which setting would contribute to these observations, or why it would relate to the dataset size.
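As a rough starting point, a skeleton along these lines would already help (a toy random dataset standing in for yours; the model, sizes, and paths here are just placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    # Very small dataset so that one epoch only takes a moment.
    data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    train_loader = DataLoader(data, batch_size=8)
    val_loader = DataLoader(data, batch_size=8)

    checkpoint_cb = ModelCheckpoint(save_top_k=1,
                                    monitor=None,
                                    dirpath="checkpoints/tiny_dataset/tiny_run/",
                                    filename="mycheckpoint",
                                    save_weights_only=True)
    trainer = pl.Trainer(max_epochs=3,
                         accelerator='cpu', devices=1,
                         check_val_every_n_epoch=1,
                         callbacks=[checkpoint_cb],
                         log_every_n_steps=10)
    trainer.fit(TinyModel(), train_loader, val_loader)
    # Afterwards, inspect checkpoints/tiny_dataset/tiny_run/ to see whether the
    # file is named mycheckpoint.ckpt or mycheckpoint-v1.ckpt.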
Note that with save_top_k > 1, the checkpoint path needs to include a value (a metric) that relates to the score of the checkpoint being saved (val_acc, for example). With your current path, checkpoints/{dataset}/{run_name}, the checkpoints would all get the same name, so all top-k checkpoints would overwrite each other in a single file.
I suggest you define it as something like checkpoints/{dataset}/{run_name}-{val_loss}, assuming you log a metric val_loss.
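Concretely, with ModelCheckpoint the metric goes into the filename template as a literal {val_loss} placeholder that Lightning fills in at save time, so that part must not be an f-string (dataset and run_name below are placeholder values):

from pytorch_lightning.callbacks import ModelCheckpoint

dataset = "my_dataset"   # placeholders for illustration
run_name = "my_run"

checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",   # metric logged via self.log("val_loss", ...)
    save_top_k=3,         # keep the 3 best checkpoints
    dirpath=f"checkpoints/{dataset}/{run_name}/",
    # NOT an f-string: Lightning fills {epoch} and {val_loss} when it saves
    filename="mycheckpoint-{epoch:02d}-{val_loss:.2f}",
    save_weights_only=True,
)

With the metric in the filename, each of the top-k checkpoints gets a distinct name and they no longer overwrite each other.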