What does PyTorch Lightning module do with logged validation losses?

Curious · December 3, 2022, 9:40pm

Suppose we are running some classification neural network for a certain number of epochs. In each epoch, we compare the validation loss to that of the previous epoch and, if it is better, save the model. During testing we use the saved model as our best-fit model.

I was wondering, when we log our validation loss in the PyTorch Lightning validation_step, does PyTorch / PyTorch Lightning do something similar? Does it save the best model somewhere and return it as the final model?

Please excuse me if the question was naive on my part. Just starting out.

Thanks in advance!

awaelchli · December 4, 2022, 11:04pm

You can enable checkpointing of the best model, or the top-K best models, in Lightning. For example:

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Keep the two checkpoints with the lowest logged "val_loss"
checkpoint_callback = ModelCheckpoint(dirpath="my/path/", save_top_k=2, monitor="val_loss")
trainer = Trainer(callbacks=[checkpoint_callback])
trainer.fit(model)

This will save files to dirpath with the best models (the ones that have the lowest val_loss).

You can then load back the best model if you want by doing:

best_model = YourModel.load_from_checkpoint(checkpoint_callback.best_model_path)
trainer.test(best_model)
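
To make the top-K idea above concrete, here is a minimal plain-Python sketch of the bookkeeping. It is not Lightning's implementation; `track_top_k` and the loss values are made up for illustration, and it assumes one monitored val_loss value per epoch.

```python
def track_top_k(val_losses, k=2):
    """Keep the k epochs with the lowest validation loss (hypothetical helper)."""
    kept = {}  # epoch index -> val_loss of a checkpoint we still hold
    for epoch, loss in enumerate(val_losses):
        kept[epoch] = loss
        if len(kept) > k:
            # Drop the worst (highest-loss) entry; in a real run the
            # corresponding checkpoint file would be removed as well.
            worst_epoch = max(kept, key=kept.get)
            del kept[worst_epoch]
    best_epoch = min(kept, key=kept.get)
    return kept, best_epoch


# Illustrative per-epoch validation losses (not real results)
losses = [0.90, 0.70, 0.75, 0.60, 0.65]
kept, best_epoch = track_top_k(losses, k=2)
print(kept)        # the two surviving epochs and their losses
print(best_epoch)  # the epoch that best_model_path would correspond to
```

In this sketch, `best_epoch` plays the role of `checkpoint_callback.best_model_path`: it points at the single best entry among the ones kept.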

Curious · December 7, 2022, 2:22am

Thank you for your help. I appreciate it!

I am running into the following issue when trying to use checkpointing.

model_checkpoint = ModelCheckpoint(save_top_k=2,
                                   dirpath=os.path.join(tb_logger.log_dir, "checkpoints"),
                                   monitor="val_loss",
                                   save_last=True)
trainer = Trainer(logger=tb_logger,
                  callbacks=[
                      model_checkpoint,
                  ],
                  **config['trainer_params'])
trainer.fit(process, datamodule=data)
best_model = process.load_from_checkpoint(model_checkpoint.best_model_path)

The error is:
TypeError: __init__() missing 2 required positional arguments: 'model' and 'params'

model and params are the two parameters of process. Do I need to supply them somewhere when calling load_from_checkpoint?

awaelchli · December 7, 2022, 5:54am

When you call process.load_from_checkpoint you need to supply the arguments that were used to instantiate the model.

Note that load_from_checkpoint is a class method. When you call it, it re-instantiates your model. Change your code to:

best_model = Process.load_from_checkpoint(model_checkpoint.best_model_path, model=..., params=...)

Note the Process vs. process change. Process is the class of your model.
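
Why a class method needs the constructor arguments can be shown with a toy stand-in. This is only a sketch: `ToyProcess` is a hypothetical class, it uses pickle rather than Lightning's checkpoint format, and it assumes the constructor arguments are not stored in the checkpoint.

```python
import pickle
import tempfile
from pathlib import Path


class ToyProcess:
    """Hypothetical stand-in for a module whose __init__ needs two arguments."""

    def __init__(self, model, params):
        self.model = model
        self.params = params
        self.state = {"weight": 0.0}

    @classmethod
    def load_from_checkpoint(cls, path, **init_kwargs):
        # A class method has no existing instance to reuse, so it builds a
        # fresh one with cls(...); required __init__ arguments must be passed.
        instance = cls(**init_kwargs)
        instance.state = pickle.loads(Path(path).read_bytes())
        return instance


trained = ToyProcess(model="net", params={"lr": 0.1})
trained.state["weight"] = 1.5

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "best.ckpt"
    ckpt.write_bytes(pickle.dumps(trained.state))

    try:
        # Missing model/params: the fresh instance cannot be constructed
        ToyProcess.load_from_checkpoint(ckpt)
        error_raised = False
    except TypeError:
        error_raised = True

    # Supplying the constructor arguments lets the new instance be built
    restored = ToyProcess.load_from_checkpoint(ckpt, model="net", params={"lr": 0.1})

print(error_raised, restored.state)
```

Note that `restored` is a new object, separate from `trained`, which mirrors the point that the call re-instantiates the model rather than updating the existing one.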


Curious · December 7, 2022, 7:03am

Writing the following code also worked. Is there any issue with doing it this way?

best_model = torch.load(model_checkpoint.best_model_path)
process.load_state_dict(best_model["state_dict"])
process.some_method(process.model, "Best_Model.png")

From what I understand, it is taking the best model, and copying its state to the current process that is available. So now, process becomes the best model. Is that so? Or am I interpreting it incorrectly?

awaelchli · December 7, 2022, 8:35am

Yes, that is also going to work.

Curious · December 7, 2022, 4:23pm

Thank you so much for your help!

Curious · December 7, 2022, 6:31pm

One last question

I see that the saved checkpoints have file names such as epoch=140-step=8459. What does the step indicate? Initially I thought it might be the number of batches run. However, I have 12,000 validation samples and a batch size of 1,000, which means 12 validation batches per epoch, and 140*12=1680. The numbers don’t match, so I got confused as to what it might be.

awaelchli · December 7, 2022, 7:32pm

The step number indicates how many times the LightningModule training_step was executed. Validation has no influence on this.
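
As a rough sketch of that arithmetic: the step count follows from the number of training batches, not validation batches. The training set size is not stated in this thread, so the 60,000 below is purely a hypothetical figure, and the sketch assumes one step per training batch, a kept partial last batch, and a zero-based epoch index in the file name (so epoch=140 means 141 completed epochs).

```python
import math


def completed_training_steps(num_train_samples, batch_size, epochs_completed):
    """Count training batches processed so far (hypothetical helper)."""
    # One step per training batch, assuming the last partial batch is kept
    batches_per_epoch = math.ceil(num_train_samples / batch_size)
    return batches_per_epoch * epochs_completed


# Hypothetical training set of 60,000 samples, batch size 1,000, 141 epochs
total = completed_training_steps(60_000, 1_000, 141)
print(total)

# The validation-based estimate from the question, for comparison
validation_estimate = completed_training_steps(12_000, 1_000, 140)
print(validation_estimate)
```

Under these assumed numbers the total is one more than the 8459 in the file name, which would be consistent with a zero-based step counter; whether the counter is zero-based may depend on the Lightning version, so treat that as a guess rather than a confirmed explanation.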

Curious · December 7, 2022, 7:40pm

Thanks! I appreciate your help.

diamantidisno3 · March 6, 2024, 10:31pm

What val_loss does the trainer use to “rank” models? The val_loss at the end of every epoch?
