Limit steps per epoch

Is there a way to limit the number of steps per epoch?
The epoch completion time is much longer than the Slurm task time, and whenever training resumes the epoch is restarted. Although the global step resumes properly, the epoch is always reset, which I assume will be a problem for plotting the training curve. Also, the validation epoch never starts, but this could be solved by the val_check_interval flag.
Please let me know if there are any ways to solve this issue.

Hi @sunagadbr

The epoch size can be limited using

trainer = Trainer(limit_train_batches=100)  # 100 batches
# or a fraction
trainer = Trainer(limit_train_batches=0.5)  # half of the dataset

Documentation

The trainer does not support resuming mid-epoch, so I think your approach of limiting the epoch size is reasonable.

Hello @awaelchli,
Thank you for your response. I did go through this parameter; however, in the documentation it seemed more like a debugging feature. I was just confused about whether the subset of batches sampled per epoch would remain fixed during the entire training, as it wouldn't cover the entire dataset.
If the subset of batches is sampled randomly for every epoch, then that is exactly what I’m looking for.

I was just confused about whether the subset of batches sampled per epoch would remain fixed during the entire training

That would depend on whether random sampling is enabled (shuffle=True in the torch DataLoader). Lightning doesn’t control the sampling, so I think what you could do is

  1. Enable shuffling
  2. Don’t set a seed, or set a different seed every time you resume. This ensures the sampling of the previous (partial) epoch is not repeated (see the sketch below).
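
A rough sketch of the idea (train_dataset and model here are placeholders for your own objects):

from torch.utils.data import DataLoader
from lightning.pytorch import Trainer

# With shuffle=True, the subset of batches seen in a limited epoch changes
# between runs, as long as the seed is not fixed to the same value on resume.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

trainer = Trainer(limit_train_batches=100)
trainer.fit(model, train_loader)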

Could this work?

Yes, that should work. Shuffling was set to true but I was using the same seed, so not setting a seed would be the best option.

Thank you.

I also found out that using
# 90 seconds before training ends
#SBATCH --signal=SIGUSR1@90
https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html

resumes the training at the exact step the job was terminated at, i.e., the epoch resumes.
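
For completeness, the Lightning side of this can be made explicit through the SLURMEnvironment plugin (a sketch, assuming a recent Lightning release; auto_requeue is already the default):

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# With auto_requeue=True (the default), Lightning saves a checkpoint and
# requeues the job when it receives the signal set via #SBATCH --signal=SIGUSR1@90.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=True)])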

This is an ideal solution for my case. But it saves the temporary checkpoint in the root directory (even though there is a separate checkpoint directory), and this is a problem: since I run multiple training instances, it would be ambiguous if all models saved a temporary checkpoint in the same root directory. Is there a way to change the temporary checkpoint dir?

I resolved it by setting default_root_dir for the trainer.
Works perfectly now.
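
In case it helps anyone else, this is roughly what that looks like (the path is just an example):

trainer = Trainer(default_root_dir="/path/to/this/run")  # temporary/HPC checkpoints go here instead of the current working directory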


Hi,
The issue with point 2 is that I use DDP, and it is necessary to set a seed for it. I tried seed_everything(random.randint(1, 100)) so that a different seed is set whenever training resumes, but with this the seed set on every GPU is different.
Is there a way to set a random seed when resuming training while keeping the seed the same across all GPUs?

I don’t know a good way to do that automatically. Perhaps you could use a fixed value as seed, for example the job_id:

seed_everything(job_id)  # job_id is the same across ranks, but will be different for every run.
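
One way to get hold of the job id inside the training script (a sketch; SLURM exposes it through the SLURM_JOB_ID environment variable):

import os
from lightning.pytorch import seed_everything

# SLURM_JOB_ID is identical on every rank of the job, so all DDP processes
# end up with the same seed.
seed_everything(int(os.environ["SLURM_JOB_ID"]))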

I might be wrong here, but doesn’t SLURM auto-requeue resubmit the job with the same job ID?

How about reload_dataloaders_every_n_epochs?

Another possible solution would be to not use limit_train_batches (since SLURM auto-requeue can resume mid-epoch) and use val_check_interval instead. But I should still use limit_val_batches, otherwise each validation run would be very long.
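
Something along these lines (the numbers are just examples):

trainer = Trainer(
    val_check_interval=1000,  # run validation every 1000 training batches
    limit_val_batches=100,    # use only 100 batches per validation run
)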

I solved it by generating the random number in the SLURM script; whenever the script is re-submitted, it passes a different random number, which is then set as the seed for all DDP processes.
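
Roughly like this (a sketch; passing the value as a --seed argument is just one way to hand it to the script, not something Lightning requires):

# In the sbatch script (illustrative): srun python train.py --seed $RANDOM
import argparse
from lightning.pytorch import seed_everything

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, required=True)
args = parser.parse_args()

# The same value reaches every DDP process, so the seed matches across GPUs.
seed_everything(args.seed)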