on_training_epoch_end does not get called

I’m training a CNN and I wanted to log some metrics at the end of each training epoch. However, I’ve noticed that on_training_epoch_end is never called, while on_validation_epoch_end works just fine. Here’s an excerpt of the model containing those two hooks:

      def training_step(self, batch, batch_idx):
          images, labels = batch
          pred = self(images)
          train_loss = F.cross_entropy(pred, labels)
          correct = pred.argmax(dim=1).eq(labels).sum().item()
          total = len(labels)
          batch_dictionary = {
                "loss": train_loss,
                "acc": correct / total,  # stored like in validation_step; on_training_epoch_end reads it
                "correct": correct,
                "total": total
          }
          self.training_step_outputs.append(batch_dictionary)
          return batch_dictionary

      def validation_step(self, batch, batch_idx):
          images, labels = batch
          pred = self(images)
          val_loss = F.cross_entropy(pred, labels)
          correct = pred.argmax(dim=1).eq(labels).sum().item()
          total = len(labels)
          val_acc = correct / total
          batch_dictionary = {
                "loss": val_loss,
                "acc": val_acc,
                "correct": correct,
                "total": total
          }
          self.validation_step_outputs.append(batch_dictionary)
          return batch_dictionary

      def on_training_epoch_end(self):
          print('training epoch')
          outputs = self.training_step_outputs
          batch_losses = [x['loss'] for x in outputs]
          epoch_loss = torch.stack(batch_losses).mean()  # combine losses
          epoch_acc = sum(x['acc'] for x in outputs) / len(outputs)
          print("Training accuracy:", epoch_acc)
          print("Training loss:", epoch_loss)
          self.training_step_outputs.clear()  # free memory
          
      def on_validation_epoch_end(self):
          outputs = self.validation_step_outputs
          batch_losses = [x['loss'] for x in outputs]
          epoch_loss = torch.stack(batch_losses).mean()  # combine losses
          epoch_acc = sum(x['acc'] for x in outputs) / len(outputs)
          print("\nValidation accuracy:", epoch_acc)
          print("Validation loss:", epoch_loss)
          self.val_acc.append(epoch_acc)  # assumes self.val_acc is a list defined in __init__
          self.validation_step_outputs.clear()  # free memory

I’ve looked around and couldn’t find any explanation as to why this is happening or how to fix it.

Hey @bavshehata

The hook is actually called on_train_epoch_end. If you rename it in your code, it will get called. I see that we got that wrong in our release notes; I will fix that. If you got this info from somewhere else, let me know and I can double-check that it is correct.
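
For reference, here is a minimal sketch of the renamed hook (the class name and the printed metric are illustrative, not taken from your code):

      import torch
      import pytorch_lightning as pl

      class LitModel(pl.LightningModule):
          def __init__(self):
              super().__init__()
              self.training_step_outputs = []

          def on_train_epoch_end(self):  # note: "train", not "training"
              outputs = self.training_step_outputs
              epoch_loss = torch.stack([x["loss"] for x in outputs]).mean()
              print("Training loss:", epoch_loss.item())
              self.training_step_outputs.clear()  # free memory

Note the naming is asymmetric: on_validation_epoch_end is already the correct name on the validation side, which is why that hook fires for you.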

Works like a charm. Thanks @awaelchli!

The only other instance of on_training_epoch_end I could find is in the paragraph above the code here in the docs.

Thanks for checking! This was recently fixed. You will see it if you replace “stable” with “latest” in the docs link you posted. So we should be good.
