I am using the following PyTorch Lightning code with a WandbLogger (the code is inside a LightningModule):
# Imports used below (the methods themselves live inside my LightningModule)
import torch
import torch.nn.functional as F
from torchmetrics.functional import accuracy

def training_step(self, batch, batch_idx):
    """Training step"""
    loss, acc, bleu = self._step(batch)
    self.log_dict(
        {"train/loss": loss, "train/accuracy": acc, "train/bleu_score": bleu},
        on_epoch=True,
        batch_size=batch[0].shape[1],
    )
    return loss

def validation_step(self, batch, batch_idx):
    """Validation step"""
    loss, acc, bleu = self._step(batch)
    self.log_dict(
        {"val/loss": loss, "val/accuracy": acc, "val/bleu_score": bleu},
        on_epoch=True,
        on_step=False,
        batch_size=batch[0].shape[1],
    )
    return loss

def test_step(self, batch, batch_idx):
    """Test step"""
    loss, acc, bleu = self._step(batch)
    self.log_dict(
        {"test/loss": loss, "test/accuracy": acc, "test/bleu_score": bleu},
        on_epoch=True,
        on_step=False,
        batch_size=batch[0].shape[1],
    )
    return loss
def _step(self, batch: tuple[torch.Tensor, torch.Tensor]):
    source, target = batch
    # Teacher forcing: feed the target shifted right (drop the last token)
    logits = self(source, target[:-1, :])
    with torch.no_grad():
        bleu = self._batch_bleu(logits, target)
    # Flatten to (seq_len * batch, vocab) and (seq_len * batch,) for the loss
    logits = logits.reshape(-1, logits.shape[2])
    target = target[1:].reshape(-1)
    loss = F.cross_entropy(logits, target, ignore_index=self.source_pad_idx)
    with torch.no_grad():
        acc = accuracy(
            logits,
            target,
            task="multiclass",
            num_classes=self.hparams.target_vocab_size,
            ignore_index=self.source_pad_idx,
            top_k=1,
        )
    return loss, acc.item() * 100, bleu.item() * 100
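For completeness, the logger is attached to the Trainer in the usual way, roughly like this (the project name is a placeholder):

from lightning.pytorch import Trainer
from lightning.pytorch.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-project")  # placeholder project name
trainer = Trainer(logger=wandb_logger)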
The weird thing is that I get significantly worse values for all three metrics during training than during validation and testing. When I ran a validation epoch with the training loader, I got what one would expect: slightly better results on the training data. That tells me the problem is not in how the metrics are computed, but in how training_step logs them. I looked at the documentation for Trainer, WandbLogger, and LightningModule, but found nothing. I also tried logging with TensorBoard, with no change in the result.
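For reference, this is roughly how I ran that check, with the same Trainer and model as above (train_loader stands in for my actual training DataLoader):

# Runs validation_step (and its logging) over the training data,
# with the model in eval mode, just like a normal validation epoch.
trainer.validate(model, dataloaders=train_loader)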
What is Lightning doing differently in training vs. evaluation when it comes to logging?
I am using: lightning 2.0.0, pytorch 2.0.
Loss graph: