Reposting this question here because I read in another discussion that Lightning wants to move from GitHub Discussions to this forum:
I have a LightningModule, DataModule, and Trainer that I am using on a regression problem. I have observed that as epochs increase, the iterations/s on the tqdm bar decrease significantly, by a factor of about 2-5. To look into this I used the SimpleProfiler and recorded `run_training_epoch` at each epoch inside `on_train_epoch_end()`:
```python
def on_train_epoch_end(self):
    # duration of the most recent epoch, as measured by the SimpleProfiler
    run_training_epoch = self.trainer.profiler.recorded_durations["run_training_epoch"][-1]
    self.log("run_training_epoch", run_training_epoch)
```
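For reference, this assumes the SimpleProfiler was enabled on the Trainer; a minimal sketch (all other Trainer arguments omitted here, the real run uses more):

```python
import pytorch_lightning as pl

# profiler="simple" attaches the SimpleProfiler, whose recorded_durations
# dict is what on_train_epoch_end() reads from above
trainer = pl.Trainer(max_epochs=1000, profiler="simple")
```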
When I plot these after 1000 epochs, I get the following:
I cannot share the full example that produced the plot above, but I tried to create a small toy example in a Google Colab notebook. The trend is not as severe as in the picture above, but it is still there, so I am wondering what else the source could be, since an individual training batch or optimization step does not show this stark linear trend.
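To quantify the trend instead of eyeballing the plot, one can fit a straight line to the logged epoch durations. A framework-agnostic sketch (the `linear_trend` helper and the synthetic durations are illustrative, not part of my actual setup):

```python
def linear_trend(durations):
    """Ordinary least-squares fit of duration vs. epoch index.

    Returns (slope, intercept); the slope is the average extra
    seconds each successive epoch takes.
    """
    n = len(durations)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(durations) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, durations))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Synthetic example: epochs that start at 1.0 s and creep up by 5 ms each.
durations = [1.0 + 0.005 * e for e in range(1000)]
slope, intercept = linear_trend(durations)
print(f"slowdown: {slope * 1000:.2f} ms per epoch")
```

In practice `durations` would be `trainer.profiler.recorded_durations["run_training_epoch"]` after training; a clearly nonzero slope confirms the per-epoch slowdown is systematic.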
I have tried with `lightning==2.0.2` and `pytorch_lightning==1.9.5`.