How frequently are train_dataloader and val_dataloader called?

justusschock · September 3, 2020, 2:53pm

One way to avoid repeating an expensive data-loading step is to use a data module that loads the datasets once in __init__ and later creates each loader on the fly, so the loader only wraps the dataset that already exists:

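from pytorch_lightning import LightningDataModule
from torch.utils.data import DataLoader


class MyFancyDataModule(LightningDataModule):
    def __init__(self, *args, **kwargs):
        super().__init__()
        # Load all datasets once, up front, and keep them on the module
        self.train_ds, self.valid_ds, self.test_ds = self.create_datasets(*args, **kwargs)

    def create_datasets(self, *args, **kwargs):
        # DO YOUR DATASET CREATION LOGIC HERE
        return train_ds, valid_ds, test_ds

    def train_dataloader(self):
        # Cheap: only wraps the dataset that was loaded in __init__
        return DataLoader(self.train_ds)

    def val_dataloader(self):
        return DataLoader(self.valid_ds)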

This has the advantage that your data is only ever loaded once, but the disadvantage that the whole training set is also loaded before the sanity checks run (annoying during debugging).

You can, however, work around this by caching each dataset the first time it is loaded and reusing it afterwards. Just make sure not to reuse the data loader itself.
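
To illustrate the caching idea, here is a minimal sketch of the pattern. It is deliberately framework-free so it runs on its own: LazyDataModule, _load_train_ds and load_count are hypothetical names, the list stands in for a real dataset, and the plain iterator stands in for a DataLoader. In a real project the class would presumably subclass LightningDataModule and train_dataloader would return DataLoader(self.train_ds).

```python
from typing import Iterator, List, Optional


class LazyDataModule:
    """Framework-free stand-in for a LightningDataModule subclass."""

    def __init__(self, num_samples: int = 5):
        # Store only cheap configuration; nothing is loaded yet.
        self.num_samples = num_samples
        self._train_ds: Optional[List[int]] = None
        self.load_count = 0  # counts how often the expensive load ran

    def _load_train_ds(self) -> List[int]:
        # Hypothetical placeholder for the slow dataset creation logic.
        self.load_count += 1
        return list(range(self.num_samples))

    @property
    def train_ds(self) -> List[int]:
        # Load on first access, then reuse the cached dataset.
        if self._train_ds is None:
            self._train_ds = self._load_train_ds()
        return self._train_ds

    def train_dataloader(self) -> Iterator[int]:
        # Build a fresh loader on every call; only the dataset is cached.
        # With PyTorch this line would be: return DataLoader(self.train_ds)
        return iter(self.train_ds)


dm = LazyDataModule()
first = dm.train_dataloader()   # triggers the one and only load
second = dm.train_dataloader()  # new loader, same cached dataset
```

Constructing the module is now cheap, the expensive load happens on the first loader request, and every later request gets a new loader over the same dataset.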