This may be more of a suggestion than a question, unless there is a solution I have missed. In some scenarios, it would be nice to run a fixed number of steps per epoch, say 1000, regardless of the actual dataset size. If my dataset has 220 instances, I would like the dataloader to simply sample randomly from them until 1000 steps have been reached. It could also work the other way around: if the dataset contains more than 1000 instances, the trainer would undersample it.
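To illustrate the behaviour I am after, here is a minimal sketch of the closest thing I can build today with plain PyTorch, using `RandomSampler` with replacement (the dataset, `steps_per_epoch`, and `batch_size` values are just placeholders for illustration):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Hypothetical numbers for illustration: 220 instances, 1000 steps per "epoch".
dataset = TensorDataset(torch.randn(220, 8), torch.randint(0, 2, (220,)))
steps_per_epoch = 1000
batch_size = 32

# Sampling with replacement lets us draw more (or fewer) samples than the
# dataset actually contains, so one "epoch" becomes exactly steps_per_epoch
# batches regardless of len(dataset).
sampler = RandomSampler(
    dataset,
    replacement=True,
    num_samples=steps_per_epoch * batch_size,
)
train_loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```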
A lot of hooks in Lightning are tied to the beginning/end of an epoch, and IMO it seems a bit arbitrary that one epoch coincides with exactly one pass over the training set. I think it would be nice to have the option to completely disconnect dataset size from the frequency of callbacks and validation. If I pretrain on a corpus and then finetune on a dataset 1/10th the size, I do not want my epoch callbacks and validation loop to fire 10 times as frequently.
I know that there are certain workarounds such as `limit_train_batches`, `check_val_every_n_epoch`, wrappers, and custom callbacks that keep track of the number of steps (see the sketch below). However, I feel like these solutions treat the symptoms of the same underlying problem: the frequency of validation and epoch callbacks depends on the size of the training dataset, and this "arbitrary" number often does not align with the desired frequency.
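To be concrete about what I mean by treating the symptoms, this is roughly the workaround I use today (a sketch only; `model`, `train_loader`, and `val_loader` are assumed to be defined elsewhere, and the flag values are arbitrary):

```python
import pytorch_lightning as pl

# Workaround: cap the number of batches per epoch and tune how often
# validation runs, instead of being able to define "epoch length" directly.
trainer = pl.Trainer(
    max_epochs=50,
    limit_train_batches=1000,    # caps batches per epoch, but cannot oversample a 220-instance dataset
    check_val_every_n_epoch=1,   # validation frequency is still expressed in dataset-sized epochs
)

# `model`, `train_loader`, and `val_loader` are placeholders here.
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```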
What are your thoughts? Perhaps there is a simple solution to my problem that I completely missed.