Confusion matrix in on_test_epoch_end() - argument error

I am doing binary classification for 2 outputs and want to create a confusion matrix for each of them, plus some other metrics, so I was looking for the best way to do this (still learning PyTorch Lightning).

My current approach is to put them in on_test_epoch_end(), but after the Lightning update I installed today this stopped working and gives me an error. It worked before, when the hook was called test_epoch_end(). The error is:

    on_test_epoch_end() missing 1 required positional argument: 'outputs'

I tried the signature (self, trainer, module_pl), but that gives errors too, all about the arguments. Any ideas? Or am I putting this confusion matrix in completely the wrong place?

My code is:

    def test_step(self, batch, batch_idx):
        x, y = batch  #not sure if instead should be batch[0], batch[1]
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        # result = pl.EvalResult()
        self.log('test_loss', loss)
        self.log('length y hat', len(y_hat))
        
        # accuracy = functional.accuracy(y_hat, y, task = 'binary')
        # f1_score_pred = functional.f1_score(y_hat, y, task = 'binary')  # gives me zero, so something is wrong
        # auroc = functional.auroc(y_hat, y, task = 'binary')
        self.log("train_loss", loss)
        # self.log("train_accuracy", accuracy)
        # self.log("train_f1", f1_score_pred)
        # self.log("train_auroc", auroc)

        return {'preds' : y_hat, 'targets' : y}

    def on_test_epoch_end(self, outputs):
        preds = torch.cat([tmp['preds'] for tmp in outputs])
        targets = torch.cat([tmp['targets'] for tmp in outputs])
        
        confusion_matrix = torchmetrics.ConfusionMatrix(task = 'binary', num_classes=2)
        confusion_matrix(preds, targets.int())

        confusion_matrix_computed = confusion_matrix.compute().detach().cpu().numpy().astype(int)

        df_cm = pd.DataFrame(confusion_matrix_computed)
        plt.figure(figsize = (10,7))
        fig_ = sns.heatmap(df_cm, annot=True, cmap='Spectral').get_figure()
        plt.close(fig_)
        # self.logger("Confusion matrix: ") 
        self.loggers[0].experiment.add_figure("Confusion matrix", fig_, self.current_epoch)

The API changed in 2.0.0; I noticed that when I upgraded too.
From the documentation I understand that you now have to accumulate the predictions yourself.
See LightningModule — PyTorch Lightning 2.0.0 documentation.
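
Roughly like this (a minimal sketch; the class, the placeholder linear layer and the test_step_outputs attribute are just illustrative, but the accumulate-and-clear pattern is the one shown in the docs):

    import torch
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(10, 1)  # placeholder model
            self.test_step_outputs = []          # accumulate batch results yourself

        def forward(self, x):
            return self.layer(x)

        def test_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self(x)
            self.test_step_outputs.append({'preds': y_hat, 'targets': y})

        def on_test_epoch_end(self):  # note: no `outputs` argument in 2.0
            preds = torch.cat([o['preds'] for o in self.test_step_outputs])
            targets = torch.cat([o['targets'] for o in self.test_step_outputs])
            # ... compute the confusion matrix / other metrics on preds and targets here ...
            self.test_step_outputs.clear()  # free memory before the next run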


Yes exactly, as @hynky said.
Here you'll find a quick explanation and an easy-to-follow guide on how to upgrade the “x_epoch_end” code:
https://lightning.ai/pages/releases/2.0.0/#bc-changes-pytorch

Thanks, that was helpful. For the record, this is how I am doing it now:

    def __init__(self,
                 n_features,
                 hidden_size,
                 seq_len,
                 batch_size,
                 num_layers,
                 dropout,
                 learning_rate,
                 criterion):
        ...
        self.validation_step_y_hats = []
        self.validation_step_ys = []

    ...

    def test_step(self, batch, batch_idx):
        x, y = batch  # not sure if this should be batch[0], batch[1] instead
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.log('test_loss', loss)
        self.log('length y hat', len(y_hat))

        threshold = 0.05

        accuracy = functional.accuracy(y_hat, y, task = 'binary')
        f1_score_pred = functional.f1_score(y_hat, y, task = 'binary', average = 'weighted', threshold = threshold)
        confmat = functional.confusion_matrix(y_hat, y, task="binary")
        precision = functional.precision(y_hat, y, task = 'binary', threshold = threshold)
        recall = functional.recall(y_hat, y, task = 'binary', threshold = threshold)
        self.log("Precision", precision)
        self.log("Recall", recall)
        self.log("test_accuracy", accuracy)
        self.log("test_f1", f1_score_pred)
        self.validation_step_y_hats.append(y_hat)
        self.validation_step_ys.append(y)

        return {'preds' : y_hat, 'targets' : y}

    def on_test_epoch_end(self):
        y_hat = torch.cat(self.validation_step_y_hats)
        y = torch.cat(self.validation_step_ys)

        confusion_matrix = torchmetrics.ConfusionMatrix(task = 'binary', num_classes=2, threshold=0.05)
        confusion_matrix(y_hat, y.int())

        confusion_matrix_computed = confusion_matrix.compute().detach().cpu().numpy().astype(int)

        df_cm = pd.DataFrame(confusion_matrix_computed)
        plt.figure(figsize = (10,7))
        fig_ = sns.heatmap(df_cm, annot=True, cmap='Spectral').get_figure()
        plt.close(fig_)
        self.loggers[0].experiment.add_figure("Confusion matrix", fig_, self.current_epoch)

Hello,

What is the purpose of calculating the confusion matrix twice? Once in test_step with functional (and what is functional here?):

    confmat = functional.confusion_matrix(y_hat, y, task="binary")

and a second time in on_test_epoch_end:

    confusion_matrix = torchmetrics.ConfusionMatrix(task = 'binary', num_classes=2, threshold=0.05)
    confusion_matrix(y_hat, y.int())
    ...

So is test_step essentially useless now that test_epoch_end (the hook that received the accumulated outputs) has been removed? If one can’t combine the predictions/targets from all batches in order to correctly measure a metric of interest, what is the point of test_step? If I understand correctly, at the end of trainer.test(model) one gets back a dictionary with each metric and its value, the latter being average(Bi), where Bi is the metric value calculated on the i-th batch. However, reporting such an average doesn’t make sense, at least not if we want to report it as the performance of the model on the test set.
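
For example, with binary precision (made-up numbers, just to illustrate that the per-batch average differs from the value computed on the pooled predictions):

    import torch
    from torchmetrics.functional import precision

    # batch 1: 1 TP, 0 FP -> precision 1.0
    preds_1  = torch.tensor([1, 0, 0, 0])
    target_1 = torch.tensor([1, 0, 0, 0])
    # batch 2: 1 TP, 3 FP -> precision 0.25
    preds_2  = torch.tensor([1, 1, 1, 1])
    target_2 = torch.tensor([1, 0, 0, 0])

    p1 = precision(preds_1, target_1, task='binary')
    p2 = precision(preds_2, target_2, task='binary')
    print((p1 + p2) / 2)  # 0.625, the average of the per-batch values

    # pooling the two batches first gives a different number
    preds_all  = torch.cat([preds_1, preds_2])
    target_all = torch.cat([target_1, target_2])
    print(precision(preds_all, target_all, task='binary'))  # 0.4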