I am trying to save checkpoints of my model during training (monitoring the validation loss metric), but no .ckpt files are being saved under the checkpoints directory, as you can see in the sample screenshot below.
Here is what my validation_step and validation_epoch_end functions look like:
def validation_step(self, val_batch, batch_idx):
    grouped_pooled_outs = val_batch['grouped_pooled_outs']
    src_key_padding_mask = val_batch['src_key_padding_mask']
    targets = val_batch['success_label']
    logits = self.forward(grouped_pooled_outs, src_key_padding_mask)
    y_prob = self.softmaxer(logits)[:, 1]
    y_pred = (y_prob > 0.5).float()
    loss = self.cross_entropy_loss(logits, targets)
    return {'val_loss': loss, 'preds': y_pred, 'targets': targets.tolist()}
def validation_epoch_end(self, val_step_outputs):
    y_pred = []
    y_true = []
    for x in val_step_outputs:
        y_pred.extend(x['preds'].tolist())
        y_true.extend(x['targets'])
    f1_res = f1_score(y_true, y_pred, average='weighted')
    avg_val_loss = torch.tensor([x['val_loss'] for x in val_step_outputs]).mean()
    log_dict = {
        'val_loss': avg_val_loss,
        'val_f1': f1_res
    }
    self.log('val_loss', avg_val_loss, prog_bar=True)
    self.log('val_f1', f1_res, prog_bar=True)
    return {'val_loss': avg_val_loss, 'log': log_dict}
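To rule out the aggregation itself, here is a plain-Python sketch of the same epoch-end logic, with `statistics.mean` standing in for the torch tensor mean and made-up batch outputs (the dict values are illustrative only, not my real data):

```python
from statistics import mean

def aggregate_validation_outputs(val_step_outputs):
    """Flatten per-batch predictions/targets and average the per-batch
    losses, mirroring the loop in validation_epoch_end."""
    y_pred, y_true = [], []
    for x in val_step_outputs:
        y_pred.extend(x['preds'])
        y_true.extend(x['targets'])
    avg_val_loss = mean(x['val_loss'] for x in val_step_outputs)
    return y_pred, y_true, avg_val_loss

# Dummy outputs for two validation batches (illustrative values only)
outs = [
    {'val_loss': 0.5, 'preds': [1.0, 0.0], 'targets': [1, 0]},
    {'val_loss': 1.5, 'preds': [1.0], 'targets': [0]},
]
y_pred, y_true, avg = aggregate_validation_outputs(outs)
# avg is 1.0, y_pred is [1.0, 0.0, 1.0], y_true is [1, 0, 0]
```

So the averaged val_loss comes out as expected; the question is why the logged value never produces a checkpoint.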
Then I instantiate a checkpoint callback to monitor val_loss (averaged across the epoch):
checkpoint_callback = pl.callbacks.ModelCheckpoint(monitor="val_loss")
trainer = pl.Trainer(log_every_n_steps=1, gpus=1, max_epochs=2, callbacks=[checkpoint_callback], num_sanity_val_steps=0)
model = LightningToBERT(nhead=1, num_layers=1, dropout=0.3)
datamodule = GoodReadsDataModule()
trainer.fit(model, datamodule)
Why aren’t any checkpoints being saved?