Hi lightning devs and users,
I’m using Lightning to train some models for work, and I’m having trouble understanding how the epoch-level metrics get aggregated and computed. In my model, acc_train_step hits perfect accuracy and holds it for thousands of steps, while acc_train_epoch stays below 0.7. From reading the documentation, I would expect acc_train_epoch to be the average of acc_train_step over the steps of the epoch, but then shouldn’t acc_train_epoch be 1 as well?
Can someone help me understand why these two graphs would be so different?
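To make my expectation concrete, here is a toy calculation of what I assumed the epoch value means (just my reading of the docs, with made-up per-step accuracies and batch sizes, not Lightning’s actual reduction code):

    # Toy sketch of my expectation: the epoch metric as a batch-size-weighted
    # average of the per-step values passed to self.log(..., on_epoch=True).
    step_accs = [1.0, 1.0, 1.0]   # hypothetical per-step accuracies (all perfect)
    batch_sizes = [8, 8, 4]       # hypothetical batch sizes passed via batch_size=

    epoch_acc = sum(a * b for a, b in zip(step_accs, batch_sizes)) / sum(batch_sizes)
    print(epoch_acc)  # 1.0 -- so if every step is at 1.0, I'd expect the epoch value to be 1.0 too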
I’m using pytorch-lightning 1.6.3 and Python 3.9.10.
Thanks!
Extra details:
I’m training with the dp parallel strategy on 4 GPUs, and my training_step looks like this:
def training_step(self, batch, batch_idx):
    x = batch["image"][tio.DATA]
    y = batch["label"]
    preds = self(x)
    y = y.view(y.shape[0], 1).float()
    loss = self.criterion(preds, y)
    # per-step accuracy: fraction of predictions on the correct side of 0.5
    acc = ((y > 0.5) == (preds > 0.5)).type(torch.FloatTensor).mean()
    # log both the step-level and the epoch-aggregated values
    self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True, batch_size=x.shape[0])
    self.log("train_acc", acc, on_step=True, on_epoch=True, prog_bar=True, logger=True, batch_size=x.shape[0])
    return loss