Hi lightning devs and users,
I’m using Lightning to train some models for work, and I’m having trouble understanding how the epoch-level metrics get aggregated and computed. In my model, acc_train_step hits perfect accuracy and holds it for thousands of steps, while acc_train_epoch stays below 0.7. From reading the documentation, I would expect acc_train_epoch to be the average of acc_train_step over the steps of the epoch, but then shouldn’t acc_train_epoch be 1 as well?
Can someone help me understand why these two graphs would be so different?
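To make my expectation concrete, here is a toy calculation of what I assumed the epoch value means (just my reading of the docs, with made-up per-step accuracies and batch sizes, not Lightning’s actual reduction code):

    # Toy sketch of my expectation: the epoch metric as a batch-size-weighted
    # average of the per-step values passed to self.log(..., on_epoch=True).
    step_accs = [1.0, 1.0, 1.0]   # hypothetical per-step accuracies (all perfect)
    batch_sizes = [8, 8, 4]       # hypothetical batch sizes passed via batch_size=

    epoch_acc = sum(a * b for a, b in zip(step_accs, batch_sizes)) / sum(batch_sizes)
    print(epoch_acc)  # 1.0 -- so if every step is at 1.0, I'd expect the epoch value to be 1.0 too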
I’m using pytorch-lightning 1.6.3 and Python 3.9.10.
Thanks!
Extra details:
I’m training with the dp parallel strategy on 4 GPUs, and my training_step looks like this:
def training_step(self, batch, batch_idx):
    x = batch["image"][tio.DATA]
    y = batch["label"]
    preds = self(x)
    y = y.view(y.shape[0], 1).float()
    loss = self.criterion(preds, y)
    # per-step accuracy: fraction of predictions on the correct side of 0.5
    acc = ((y > 0.5) == (preds > 0.5)).type(torch.FloatTensor).mean()
    # log both the step-level and the epoch-aggregated values
    self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True, batch_size=x.shape[0])
    self.log("train_acc", acc, on_step=True, on_epoch=True, prog_bar=True, logger=True, batch_size=x.shape[0])
    return loss