Suppose we train a classification neural network for a certain number of epochs. In each epoch, we compare the validation loss to the best validation loss seen so far and, if it is lower, save the model. During testing, we use the saved model as our best-fit model.
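A minimal sketch of the keep-the-best bookkeeping described here, in plain Python so it runs without any framework. It compares each epoch against the lowest validation loss so far rather than only the previous epoch, since otherwise a later, worse model could overwrite the best one. The function name `keep_best_state`, the toy `history` data, and the dict used as a stand-in for model weights are all hypothetical.

```python
import copy


def keep_best_state(epoch_results):
    """Track the best epoch from an iterable of (state, val_loss) pairs.

    `state` stands in for the model weights after that epoch
    (hypothetical; in PyTorch this would be something like a state dict).
    """
    best_epoch, best_loss, best_state = None, float("inf"), None
    for epoch, (state, val_loss) in enumerate(epoch_results):
        # Lower validation loss is better; compare with the best so far.
        if val_loss < best_loss:
            best_epoch, best_loss = epoch, val_loss
            # Copy, so later training updates do not mutate the saved state.
            best_state = copy.deepcopy(state)
    return best_epoch, best_loss, best_state


# Toy run: made-up weights and validation losses for four epochs.
history = [
    ({"w": 0.1}, 0.90),
    ({"w": 0.4}, 0.55),
    ({"w": 0.7}, 0.61),
    ({"w": 0.9}, 0.58),
]
best_epoch, best_loss, best_state = keep_best_state(history)
print(best_epoch, best_loss, best_state)  # 1 0.55 {'w': 0.4}
```

Note that epoch 3 (loss 0.58) improves on epoch 2 (loss 0.61) but is still worse than epoch 1, so comparing only with the previous epoch would have replaced the best model.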
I was wondering: when we log the validation loss in PyTorch Lightning's validation_step, does PyTorch / PyTorch Lightning do something similar? Does it save the best model somewhere and return it as the final model?
Please excuse me if the question was naive on my part. Just starting out.
From what I understand, it takes the best model and copies its state into the model in the current process, so that model now becomes the best model. Is that so, or am I interpreting it incorrectly?
I see that the saved checkpoints have file names such as epoch=140-step=8459. What does step indicate? Initially I thought it might be the number of batches run. However, I have 12000 validation samples and a batch size of 1000, which means 12 validation batches per epoch, so 140*12=1680. The numbers don’t match, so I am confused about what step counts.
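The arithmetic can be checked with a short sketch. The first part reproduces the validation-batch count from the post. The second part tests a guess that is not confirmed anywhere in this thread: that step counts training batches rather than validation batches, and that both epoch and step in the file name are 0-indexed. The training set size is not given, so the per-epoch figure below is only what the file name would imply under that guess.

```python
import math

# Numbers stated in the post.
val_samples, batch_size = 12000, 1000
epoch_in_name, step_in_name = 140, 8459

# Validation batches per epoch, and the total the post expected to see.
val_batches = math.ceil(val_samples / batch_size)
val_steps = epoch_in_name * val_batches
print(val_batches, val_steps)  # 12 1680

# ASSUMPTION (unconfirmed): epoch and step are both 0-indexed, so the
# checkpoint was written after 141 epochs and 8460 steps.
epochs_done = epoch_in_name + 1
steps_done = step_in_name + 1
per_epoch, remainder = divmod(steps_done, epochs_done)
print(per_epoch, remainder)  # 60 0
```

Under that assumption the step count divides evenly into 60 steps per epoch, which would fit 60 training batches per epoch, but this remains a guess until checked against the actual training set size.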