Reposting this question here because I read in another discussion that Lightning wants to move from GitHub Discussions to this forum:
I have a LightningModule, DataModule, and Trainer that I am using on a regression problem. I have observed that as epochs increase, the iterations/s on the tqdm bar decrease significantly, by a factor of about 2-5. To look into this I used the SimpleProfiler and recorded `run_training_epoch` at each epoch inside `on_train_epoch_end()`:
```python
def on_train_epoch_end(self):
    # duration of the most recent epoch, as measured by the SimpleProfiler
    run_training_epoch = self.trainer.profiler.recorded_durations["run_training_epoch"][-1]
    self.log("run_training_epoch", run_training_epoch)
```
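For reference, this assumes the SimpleProfiler was enabled on the Trainer; a minimal sketch (all other Trainer arguments omitted here, the real run uses more):

```python
import pytorch_lightning as pl

# profiler="simple" attaches the SimpleProfiler, whose recorded_durations
# dict is what on_train_epoch_end() reads from above
trainer = pl.Trainer(max_epochs=1000, profiler="simple")
```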
When I plot these after 1000 epochs, I get the following:
I cannot share the full example that produced the plot above, but I tried to create a small toy example in a Google Colab notebook. The trend is not as severe as in the picture above, but it is still there, so I am wondering what else the source could be, since an individual training batch or optimization step does not show this stark linear trend.
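To quantify the trend instead of eyeballing the plot, one can fit a straight line to the logged epoch durations. A framework-agnostic sketch (the `linear_trend` helper and the synthetic durations are illustrative, not part of my actual setup):

```python
def linear_trend(durations):
    """Ordinary least-squares fit of duration vs. epoch index.

    Returns (slope, intercept); the slope is the average extra
    seconds each successive epoch takes.
    """
    n = len(durations)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(durations) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, durations))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Synthetic example: epochs that start at 1.0 s and creep up by 5 ms each.
durations = [1.0 + 0.005 * e for e in range(1000)]
slope, intercept = linear_trend(durations)
print(f"slowdown: {slope * 1000:.2f} ms per epoch")
```

In practice `durations` would be `trainer.profiler.recorded_durations["run_training_epoch"]` after training; a clearly nonzero slope confirms the per-epoch slowdown is systematic.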
I have tried with `lightning==2.0.2` and `pytorch_lightning==1.9.5`.