I’m training a CNN with PyTorch Lightning and I want to log some metrics at the end of each training epoch. However, I’ve noticed that on_training_epoch_end is never called, while on_validation_epoch_end works just fine. Here’s an excerpt of the model containing those two hooks:
def training_step(self, batch, batch_idx):
    images, labels = batch
    pred = self(images)
    train_loss = F.cross_entropy(pred, labels)
    correct = pred.argmax(dim=1).eq(labels).sum().item()
    total = len(labels)
    batch_dictionary = {
        "loss": train_loss,
        "correct": correct,
        "total": total,
    }
    self.training_step_outputs.append(batch_dictionary)
    return batch_dictionary
def validation_step(self, batch, batch_idx):
    images, labels = batch
    pred = self(images)
    val_loss = F.cross_entropy(pred, labels)
    correct = pred.argmax(dim=1).eq(labels).sum().item()
    total = len(labels)
    val_acc = correct / total
    batch_dictionary = {
        "loss": val_loss,
        "acc": val_acc,
        "correct": correct,
        "total": total,
    }
    self.validation_step_outputs.append(batch_dictionary)
    return batch_dictionary
def on_training_epoch_end(self):
    print('training epoch')
    outputs = self.training_step_outputs
    batch_losses = [x['loss'] for x in outputs]
    epoch_loss = torch.stack(batch_losses).mean()  # combine per-batch losses
    # training_step stores 'correct' and 'total' (there is no 'acc' key), so aggregate those
    epoch_acc = sum(x['correct'] for x in outputs) / sum(x['total'] for x in outputs)
    print("Training accuracy : ", epoch_acc)
    print("Training loss : ", epoch_loss)
    self.training_step_outputs.clear()  # free memory
def on_validation_epoch_end(self):
    outputs = self.validation_step_outputs
    batch_losses = [x['loss'] for x in outputs]
    epoch_loss = torch.stack(batch_losses).mean()  # combine per-batch losses
    epoch_acc = sum(x['acc'] for x in outputs) / len(outputs)
    print("\nValidation accuracy : ", epoch_acc)
    print("Validation loss : ", epoch_loss)
    val_acc.append(epoch_acc)  # val_acc is a list defined outside this excerpt
    self.validation_step_outputs.clear()  # free memory
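For completeness, training_step_outputs and validation_step_outputs are plain Python lists created in __init__, which isn't shown in the excerpt; it looks roughly like this:

def __init__(self):
    super().__init__()
    # ... layer definitions omitted ...
    self.training_step_outputs = []
    self.validation_step_outputs = []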
I’ve looked around and couldn’t find any explanation as to why this is happening or how to fix it.
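In case it's relevant, this is roughly how I run training (the model and dataloader names below are placeholders for my actual objects):

import pytorch_lightning as pl

model = MyCNN()  # the LightningModule excerpted above
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)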