What does PyTorch Lightning module do with logged validation losses?

Suppose we are training a classification neural network for a certain number of epochs. In each epoch, we compare the validation loss to that of the previous epoch and, if it is lower, save the model as the new best. During testing we use the saved model as our best-fit model.
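
Roughly, this is the pattern I mean, in plain PyTorch (model, num_epochs, the loaders, train_one_epoch, and evaluate are just placeholders for my own code):

import torch

best_val_loss = float("inf")
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)        # placeholder training loop
    val_loss = evaluate(model, val_loader)      # placeholder validation pass
    if val_loss < best_val_loss:                # improved on the best seen so far
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")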

I was wondering, when we log our validation loss in the PyTorch Lightning validation_step, does PyTorch / PyTorch Lightning do something similar? Does it save the best model somewhere and return it as the final model?

Please excuse me if the question was naive on my part. Just starting out.

Thanks in advance!

You can enable checkpointing of the best model, or of the top-K best models, in Lightning. See an example here.

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Keep the two checkpoints with the lowest logged "val_loss"
checkpoint_callback = ModelCheckpoint(dirpath="my/path/", save_top_k=2, monitor="val_loss")
trainer = Trainer(callbacks=[checkpoint_callback])
trainer.fit(model)

This will save checkpoint files to dirpath for the best models (the ones that have the lowest val_loss).
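
Note that monitor="val_loss" refers to a metric you log yourself, so your validation_step needs a matching self.log call. A minimal sketch (the class name and loss computation are just placeholders for your own model):

import torch.nn.functional as F
import lightning.pytorch as pl

class LitClassifier(pl.LightningModule):
    # __init__, forward, training_step, configure_optimizers omitted for brevity

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)   # placeholder classification loss
        self.log("val_loss", loss)           # ModelCheckpoint monitors this logged value
        return loss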

You can then load back the best model if you want by doing

best_model = YourModel.load_from_checkpoint(checkpoint_callback.best_model_path)
trainer.test(best_model)

Thank you for your help. I appreciate it!

I am running into the following issue when trying to use checkpointing.

model_checkpoint = ModelCheckpoint(save_top_k=2,
                                   dirpath=os.path.join(tb_logger.log_dir, "checkpoints"),
                                   monitor="val_loss",
                                   save_last=True)
trainer = Trainer(logger=tb_logger,
                  callbacks=[
                      model_checkpoint,
                  ],
                  **config['trainer_params'])
trainer.fit(process, datamodule=data)
best_model = process.load_from_checkpoint(model_checkpoint.best_model_path)

The error is:
TypeError: __init__() missing 2 required positional arguments: 'model' and 'params'

model and params are the two constructor parameters of process. Do I need to supply them somewhere when calling load_from_checkpoint?

When you call process.load_from_checkpoint, you need to supply the arguments that were used to instantiate the model.

Note that load_from_checkpoint is a class method. When you call it, it re-instantiates your model. Change your code to:

best_model = Process.load_from_checkpoint(model_checkpoint.best_model_path, model=..., params=...)

Note the Process vs. process change. Process is the class of your model.

Documentation
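
To make it concrete, assuming your module looks roughly like this (the names only mirror the error message; my_backbone and my_params stand for whatever you originally passed in):

import lightning.pytorch as pl

class Process(pl.LightningModule):
    def __init__(self, model, params):    # the two arguments the error is complaining about
        super().__init__()
        self.model = model
        self.params = params
    # training_step / validation_step / configure_optimizers omitted

# load_from_checkpoint re-runs __init__, so the same kind of arguments are needed again
best_model = Process.load_from_checkpoint(
    model_checkpoint.best_model_path,
    model=my_backbone,    # whatever was passed as `model` during training
    params=my_params,     # whatever was passed as `params` during training
)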

The following code also worked. Will there be any issue with doing it this way:

import torch

# Load the raw checkpoint dict and copy the saved weights into the existing `process` module
best_model = torch.load(model_checkpoint.best_model_path)
process.load_state_dict(best_model["state_dict"])
process.some_method(process.model, "Best_Model.png")

From what I understand, it is taking the best model and copying its state into the process object that is already available, so process now becomes the best model. Is that so, or am I interpreting it incorrectly?

Yes that is also going to work.

Thank you so much for your help!

One last question

I see that the saved checkpoints have file names like epoch=140-step=8459. What does the step indicate? Initially I thought it might be batch runs. However, I have validation data = 12000 and batch size = 1000, which means 12 validation batches per epoch, and 140*12=1680. The numbers don't match, so I got confused about what the step might be.

The step number indicates how many times the LightningModule training_step was executed. Validation has no influence on this.
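
As a rough sanity check with your numbers (this assumes about 60 training batches per epoch, e.g. roughly 60,000 training samples at batch size 1000; the actual training-set size isn't stated here):

epochs_completed = 140 + 1            # epoch=140 is zero-based, so 141 epochs have finished
train_batches_per_epoch = 60          # assumption, not from the thread
print(epochs_completed * train_batches_per_epoch)   # 8460, which lines up with step=8459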

Thanks! I appreciate your help.

What val_loss does the trainer use to “rank” models? The val_loss at the end of every epoch?