Is there a way to limit the number of steps per epoch?
The epoch completion time is much larger than the SLURM task time, and whenever training resumes the epoch is restarted. Although the global step resumes properly, the epoch counter is always reset, which I assume will be a problem for plotting the training curve. Also, the validation epoch never starts, but this could be solved with the val_check_interval flag.
Please let me know if there are any ways to solve this issue.
Hi @sunagadbr
The epoch size can be limited using
trainer = Trainer(limit_train_batches=100) # 100 batches
# or a fraction
trainer = Trainer(limit_train_batches=0.5) # half of the dataset
The trainer does not support resuming mid-epoch. So I think your approach to limit the size is reasonable.
Hello @awaelchli ,
Thank you for your response. I did go through this parameter, but in the documentation it seemed more like a debugging feature. I was just confused about whether the subset of batches sampled per epoch would remain fixed during the entire training, since that would mean never covering the entire dataset.
If the subset of batches is sampled randomly for every epoch, then that is exactly what I’m looking for.
I was just confused about whether the subset of batches sampled per epoch would remain fixed during the entire training
That would depend on whether random sampling is enabled (shuffle=True in the torch DataLoader). Lightning doesn’t control the sampling, so I think what you could do is
- Enable shuffling
- Don’t set a seed, or set a different seed every time you resume. This ensures the sampling of the previous (partial) epoch is not repeated.
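For illustration, a minimal sketch of those two points (train_dataset and model here are placeholders, not from this thread):
import time
from torch.utils.data import DataLoader
from lightning.pytorch import Trainer, seed_everything

# Either skip seed_everything entirely, or derive a fresh seed per run
# (here from the clock) so a resumed run does not repeat the sampling
# order of the previous partial epoch.
seed_everything(int(time.time()) % 2**31)

# Shuffle so each epoch draws a fresh random ordering of the dataset.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

trainer = Trainer(limit_train_batches=100)
trainer.fit(model, train_loader)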
Could this work?
Yes, that should work. Shuffling was set to true but I was using the same seed, so not setting a seed would be the best option.
Thank you.
I also found out that using
# 90 seconds before training ends
#SBATCH --signal=SIGUSR1@90
https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html
resumes the training at the exact step the job was terminated at, i.e., the epoch itself resumes.
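For reference, the Lightning side of this can be made explicit too (a sketch; I believe Lightning detects SLURM automatically and auto_requeue is on by default, so this is mostly for visibility):
from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# When SLURM delivers SIGUSR1 (per the #SBATCH line above), Lightning saves a
# checkpoint and requeues the job; on restart it resumes from that checkpoint.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=True)])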
This is an ideal solution for my case. But it saves the temporary checkpoint in the root directory (even though there is a separate checkpoint directory), which is a problem: I run multiple training instances, and it would be ambiguous if all models saved their temporary checkpoint in the same root directory. Is there a way to change the temporary checkpoint dir?
I resolved it by setting default_root_dir for the trainer.
Works perfectly now.
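In case it helps others, a sketch of what that looks like (the directory layout and the use of SLURM_JOB_ID are just examples):
import os
from lightning.pytorch import Trainer

# Give each training instance its own root directory so the temporary SLURM
# checkpoint (and the default logs) do not collide across runs. Using the
# SLURM job id in the path is one way to get a unique per-run name.
run_dir = os.path.join("runs", os.environ.get("SLURM_JOB_ID", "local"))
trainer = Trainer(default_root_dir=run_dir)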
Hi,
The issue with point 2 is that I use DDP, and it is necessary to set a seed for it. I tried seed_everything(random.randint(1, 100)) so that a different seed is set whenever the training resumes, but with this the seed set on every GPU is different.
Is there a way to set a random seed when we resume the training while having the same seed across all GPUs?
I don’t know a good way to do that automatically. Perhaps you could use a fixed value as seed, for example the job_id:
seed_everything(job_id) # job_id is same across ranks, but will be different every run.
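For example, reading it from the environment (a sketch; SLURM exposes the id as SLURM_JOB_ID):
import os
from lightning.pytorch import seed_everything

# SLURM sets SLURM_JOB_ID identically for every process of the job, so all
# DDP ranks end up with the same seed.
seed_everything(int(os.environ["SLURM_JOB_ID"]) % 2**31)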
I might be wrong here, but doesn’t SLURM auto-requeue requeue the job with the same job id?
How about reload_dataloaders_every_n_epochs?
Another possible solution would be to not use limit_train_batches (since SLURM auto-requeue can resume mid-epoch) and to use val_check_interval instead. But I should still use limit_val_batches, otherwise each validation pass will be very long.
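Roughly, that configuration would be (the numbers are arbitrary):
from lightning.pytorch import Trainer

# Run validation every 1000 training batches instead of once per (very long)
# epoch, and cap each validation pass at 100 batches.
trainer = Trainer(val_check_interval=1000, limit_val_batches=100)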
I solved it by generating the random seed in the SLURM script; whenever the script is re-submitted, it passes a different random number, which is then set for all DDP processes.
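A sketch of that setup with hypothetical names (train.py and the --seed flag are illustrations, not the actual script):
# In the sbatch script (illustrative):
#   SEED=$RANDOM
#   srun python train.py --seed "$SEED"
import argparse
from lightning.pytorch import seed_everything

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, required=True)
args = parser.parse_args()

# srun passes the same --seed value to every DDP process, so all ranks are
# seeded identically, while each (re)submission of the script draws a new one.
seed_everything(args.seed)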