I am trying to train a new network with PyTorch Lightning (testing out the framework) and am seeing very strange behavior: it looks like the checkpoint is not loaded correctly and the learning rate is changing under my feet somehow.
The graph shows the training loss for two consecutive runs.
The optimizer is configured as follows:
def configure_optimizers(self):
    # plain Adam; no LR scheduler is returned
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    return optimizer
The training is then run twice:
trainer = pl.Trainer(gpus=4)
trainer.fit(net, train_loader, test_loader)

# second run: reload the best checkpoint and fit again with the same trainer
net = Net.load_from_checkpoint('mlruns/1/0081837fe90f4eeebf806752f31af51d/checkpoints/epoch=655-test_loss=3.051074266433716.ckpt')
trainer.fit(net, train_loader, test_loader)
There are two weird behaviors:
(1) There is a very big jump in the loss about halfway through. Its location is very consistent across experiments, which suggests that the learning rate or some other parameter is being changed at that point behind my back.
(2) The second run, after loading the checkpoint, seems to show that the checkpointed weights are not actually used (see the sanity check sketched below).
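A quick way to check (2) would be to compare one weight tensor before and after loading the checkpoint (just a sketch, the parameter picked is arbitrary):

import torch

# sketch: grab one parameter before loading the checkpoint ...
name, p_before = next(iter(net.named_parameters()))
p_before = p_before.detach().cpu().clone()

net = Net.load_from_checkpoint('mlruns/1/0081837fe90f4eeebf806752f31af51d/checkpoints/epoch=655-test_loss=3.051074266433716.ckpt')

# ... and compare it with the same parameter after loading;
# True here means this weight is identical before and after the load
_, p_after = next(iter(net.named_parameters()))
print(name, torch.allclose(p_before, p_after.detach().cpu()))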
I am using this callback to save the checkpoints (it is passed to the trainer's callbacks as sketched below, and based on the command line output it is being used):
from pytorch_lightning.callbacks import ModelCheckpoint

chkpnt_cb = ModelCheckpoint(
    monitor='test_loss',
    verbose=True,
    save_top_k=3,
    save_weights_only=False,
    mode='min',
    period=1,
    filename='{epoch}-{test_loss}')
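The callback is passed to the trainer roughly like this (only the callbacks argument is the relevant part; the rest is as shown earlier):

trainer = pl.Trainer(gpus=4, callbacks=[chkpnt_cb])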
What am I missing here? (I also tried passing LearningRateMonitor(logging_interval='step') to the callbacks to get feedback on the learning rate, but I do not see anything about it in the logs.)
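The LearningRateMonitor attempt is wired roughly like this (again just a sketch; only the callbacks list changes compared to the setup above):

from pytorch_lightning.callbacks import LearningRateMonitor

# the LR monitor sits next to the checkpoint callback in the same callbacks list
lr_cb = LearningRateMonitor(logging_interval='step')
trainer = pl.Trainer(gpus=4, callbacks=[chkpnt_cb, lr_cb])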