ModelCheckpoint saves checkpoint with '-v1' suffix despite save_top_k=1

When I reduce the size of my dataset (so that one epoch takes only about 10 seconds), the saved model checkpoints get the suffix '-v1', e.g. 'mycheckpoint-v1.ckpt'. According to the docs, this should only happen if save_top_k >= 2:

If save_top_k >= 2 and the callback is called multiple times inside an epoch, the name of the saved file will be appended with a version count starting with v1.

My checkpoint callback and trainer look as follows:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(save_top_k=1,
                                monitor=None,
                                dirpath=f"checkpoints/{dataset}/{run_name}/",
                                filename="mycheckpoint",
                                save_weights_only=True)

trainer = Trainer(max_epochs=epochs,
                  accelerator='cpu', devices=1,
                  check_val_every_n_epoch=1,
                  callbacks=[lr_monitor_cb, checkpoint_cb],
                  log_every_n_steps=10)

Any ideas why this may happen? I rely on the checkpoints having a specific name with no suffix.

In your code the checkpoint path is parameterized with run_name. Did you change this between trainer runs? You wouldn’t want two different runs writing checkpoints to the same directory.

We generally append the -v version suffix because we don't want to overwrite (and thereby delete) a user's existing checkpoints, as that could be very damaging. If the names collide, Lightning can't know whether the existing checkpoint came from a different run or the current one.
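
Roughly, the naming behaves like the sketch below (a simplified illustration, not the actual ModelCheckpoint implementation; the helper name is made up):

import os

def versioned_checkpoint_path(dirpath: str, filename: str, ext: str = ".ckpt") -> str:
    """Pick a checkpoint path that does not clobber a file already on disk."""
    path = os.path.join(dirpath, filename + ext)
    version = 0
    while os.path.exists(path):
        # an existing 'mycheckpoint.ckpt' leads to 'mycheckpoint-v1.ckpt',
        # an existing 'mycheckpoint-v1.ckpt' to 'mycheckpoint-v2.ckpt', and so on
        version += 1
        path = os.path.join(dirpath, f"{filename}-v{version}{ext}")
    return path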

Thanks for the answer.

The run_name is determined by wandb and will never be the same for different runs.

Currently, for a single run, there is only one checkpoint (with -v1) in the directory after all epochs have completed; the directory contains nothing else. So I would be surprised if -v1 were appended to avoid overwriting something else, as nothing else is there.

The problem also only occurs when I reduce the size of the dataset: if I train on the full dataset (where a single epoch takes much longer), the checkpoints have the correct name.

Is it possible for you to provide a minimal (single-file) runnable example that shows this happening (a Google Colab also works)? I can't quite see which setting would contribute to these observations, or why it would relate to the dataset size.
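
For reference, a bare-bones skeleton for such a repro could look like the sketch below (RandomDataset, TinyModel and the paths are placeholders I made up, not taken from your setup); the idea is to adapt something this small until it reproduces the -v1 naming:

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    """Tiny random dataset so a full epoch finishes in seconds."""

    def __init__(self, size: int = 64):
        self.data = torch.randn(size, 32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    checkpoint_cb = ModelCheckpoint(save_top_k=1,
                                    monitor=None,
                                    dirpath="checkpoints/debug/",
                                    filename="mycheckpoint",
                                    save_weights_only=True)
    trainer = Trainer(max_epochs=3,
                      accelerator="cpu", devices=1,
                      callbacks=[checkpoint_cb],
                      log_every_n_steps=10)
    trainer.fit(TinyModel(), DataLoader(RandomDataset(), batch_size=8))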

Note that with save_top_k, the checkpoint filename needs to include a value (a logged metric) that relates to the score of the checkpoint being saved (val_acc, for example). With your current path, checkpoints/{dataset}/{run_name}, the checkpoints would all have the same name, so the top-k checkpoints would keep overwriting a single file.

I suggest you define it as something like

checkpoints/{dataset}/{run_name}-{val_loss}

assuming you log a metric called val_loss.
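
A sketch of what that could look like in the ModelCheckpoint arguments (assuming the dataset and run_name variables from your snippet, that you log val_loss with self.log, and using save_top_k=3 purely for illustration):

from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(monitor="val_loss",   # rank checkpoints by this logged metric
                                mode="min",           # lower val_loss is better
                                save_top_k=3,         # keep the three best checkpoints
                                dirpath=f"checkpoints/{dataset}/{run_name}/",
                                # not an f-string: {epoch} and {val_loss} are filled in
                                # by Lightning when the checkpoint is written, so each
                                # of the top-k files gets a distinct name
                                filename="mycheckpoint-{epoch:02d}-{val_loss:.3f}",
                                save_weights_only=True)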