I am trying to train a new network with PyTorch Lightning (testing out the framework) and am seeing very strange behavior: it looks like the checkpoint is not loaded correctly and the learning rate is changing under my feet somehow.
The graph shows the training loss for two consecutive runs.
The optimizer is configured as follows:
def configure_optimizers(self):
    # plain Adam; no LR scheduler is returned
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    return optimizer
The training is then run twice:
trainer = pl.Trainer(gpus=4)
trainer.fit(net, train_loader, test_loader)

# second run: reload the best checkpoint and fit again with the same trainer
net = Net.load_from_checkpoint('mlruns/1/0081837fe90f4eeebf806752f31af51d/checkpoints/epoch=655-test_loss=3.051074266433716.ckpt')
trainer.fit(net, train_loader, test_loader)
There are two weird behaviors:
(1) There is a very big jump in the loss about halfway through. Its location is very consistent across experiments, which suggests that the learning rate or some other parameter is being changed at that point behind my back.
(2) The second run, after loading the checkpoint, seems to show that the checkpointed weights are not actually used (see the sanity check sketched below).
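A quick way to check (2) would be to compare one weight tensor before and after loading the checkpoint (just a sketch, the parameter picked is arbitrary):

import torch

# sketch: grab one parameter before loading the checkpoint ...
name, p_before = next(iter(net.named_parameters()))
p_before = p_before.detach().cpu().clone()

net = Net.load_from_checkpoint('mlruns/1/0081837fe90f4eeebf806752f31af51d/checkpoints/epoch=655-test_loss=3.051074266433716.ckpt')

# ... and compare it with the same parameter after loading;
# True here means this weight is identical before and after the load
_, p_after = next(iter(net.named_parameters()))
print(name, torch.allclose(p_before, p_after.detach().cpu()))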
I am using this callback to save the checkpoints (it is passed to the trainer's callbacks as sketched below, and based on the command line output it is being used):
from pytorch_lightning.callbacks import ModelCheckpoint

chkpnt_cb = ModelCheckpoint(
    monitor='test_loss',
    verbose=True,
    save_top_k=3,
    save_weights_only=False,
    mode='min',
    period=1,
    filename='{epoch}-{test_loss}')
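The callback is passed to the trainer roughly like this (only the callbacks argument is the relevant part; the rest is as shown earlier):

trainer = pl.Trainer(gpus=4, callbacks=[chkpnt_cb])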
What am I missing here? (I also tried passing LearningRateMonitor(logging_interval='step') to the callbacks to get feedback on the learning rate, but I do not see anything about it in the logs.)
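The LearningRateMonitor attempt is wired roughly like this (again just a sketch; only the callbacks list changes compared to the setup above):

from pytorch_lightning.callbacks import LearningRateMonitor

# the LR monitor sits next to the checkpoint callback in the same callbacks list
lr_cb = LearningRateMonitor(logging_interval='step')
trainer = pl.Trainer(gpus=4, callbacks=[chkpnt_cb, lr_cb])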