I am using strategy=ddp
and saving checkpoints with the ModelCheckpoint callback,
which I pass to the Trainer through its callbacks argument.
A saved checkpoint looks like, for example, epoch=204-val_cer=0.0459-val_loss=1.1404.ckpt,
since I configured the filename to include the validation CER and validation loss. However, when I load the checkpoint and evaluate it on the validation set, the validation CER comes out different (with the same batch size and number of GPUs as during training).
The TensorBoard logs agree with the validation CER shown in the checkpoint filename.
The CER is computed with the torchmetrics.CharErrorRate() implementation.
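For context on why the two numbers could drift: as far as I understand, CharErrorRate is a ratio metric, so torchmetrics accumulates the total edit distance and the total number of reference characters across batches (and DDP ranks) and divides only once at compute() time. Taking the mean of per-batch CER floats instead gives a different value. A plain-Python sketch of that distinction (the helper names here are mine, not torchmetrics APIs):

```python
def edit_distance(pred: str, target: str) -> int:
    """Levenshtein distance via a rolling one-row DP table."""
    dp = list(range(len(target) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, t in enumerate(target, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (p != t))   # substitution (or match)
            prev = cur
    return dp[-1]

def corpus_cer(preds, targets):
    """Pool errors and reference lengths first, divide once (ratio-metric style)."""
    errors = sum(edit_distance(p, t) for p, t in zip(preds, targets))
    chars = sum(len(t) for t in targets)
    return errors / chars

preds = ["ab", "abcdefghij"]
targets = ["xb", "abcdefghij"]
# Mean of per-sample CERs: (0.5 + 0.0) / 2 = 0.25
per_sample = [edit_distance(p, t) / len(t) for p, t in zip(preds, targets)]
print(sum(per_sample) / len(per_sample))
# Pooled (corpus-level) CER: 1 error / 12 chars ≈ 0.0833
print(corpus_cer(preds, targets))
```

So if the reload-time evaluation pools the metric differently than the logged value did (e.g. different batch boundaries with a float average), the two CERs will not match even on identical data.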
So, what am I doing wrong?
Here is the code for the checkpoint callback:
checkpoint_callback_val_cer = ModelCheckpoint(
    save_top_k=1,
    monitor="val_cer",
    mode="min",  # CER should be minimized; "min" is also the default
    filename='{epoch}-{val_cer:.4f}-{val_loss:.4f}',
)
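For reference, the relevant wiring looks roughly like this (a simplified configuration sketch, not my actual code: the module name, the `_decode` helper, and `devices=2` are placeholders):

```python
import lightning.pytorch as pl  # or `pytorch_lightning`, depending on version
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_cer = torchmetrics.CharErrorRate()

    def validation_step(self, batch, batch_idx):
        preds, targets = self._decode(batch)  # placeholder for the decoding logic
        self.val_cer(preds, targets)
        # Logging the Metric object (not a float) lets Lightning accumulate and
        # sync it correctly across DDP ranks before ModelCheckpoint reads it.
        self.log("val_cer", self.val_cer, on_step=False, on_epoch=True)

trainer = pl.Trainer(
    strategy="ddp",
    devices=2,  # placeholder; I use the same count at train and eval time
    callbacks=[checkpoint_callback_val_cer],
)
```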