I have my training and validation datasets stored as Parquet files (the output of a PySpark-based data prep pipeline). I am currently loading them with Petastorm like so:
import pytorch_lightning as pl
from petastorm import make_reader
from petastorm.pytorch import DataLoader  # Petastorm's loader; torch's DataLoader does not accept a Reader

class ItemsageDataModule(pl.LightningDataModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.train_path = kwargs['train_path']
        self.val_path = kwargs['val_path']
        self.batch_size = kwargs['bsz']

    def setup(self, stage=None):
        pass

    def train_dataloader(self):
        # Each call builds a fresh single-pass reader over the Parquet dataset
        self.reader_train = make_reader(self.train_path, num_epochs=1, seed=1, shuffle_rows=True)
        return DataLoader(self.reader_train, batch_size=self.batch_size)

    def val_dataloader(self):
        self.reader_val = make_reader(self.val_path, num_epochs=1, seed=1, shuffle_rows=False)
        return DataLoader(self.reader_val, batch_size=self.batch_size)
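One thing I am unsure about: with devices=4, I believe each rank will read the full dataset unless the reader is sharded per process. This sketch is what I had in mind, using make_reader's cur_shard/shard_count arguments and the trainer handle Lightning attaches to the data module (untested; the rank/world-size wiring is my assumption):

    def train_dataloader(self):
        # Sketch: give each of the 4 processes a disjoint shard of the dataset,
        # so the ranks don't all iterate over the same ~13B examples.
        self.reader_train = make_reader(
            self.train_path,
            num_epochs=1,
            seed=1,
            shuffle_rows=True,
            cur_shard=self.trainer.global_rank,   # this process's shard index
            shard_count=self.trainer.world_size,  # total number of processes
        )
        return DataLoader(self.reader_train, batch_size=self.batch_size)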
I read in a related post that when using Petastorm I might need to reload the datasets every epoch. This is how my trainer looks:
trainer = Trainer(
    callbacks=callbacks,
    max_epochs=model_args['n_epochs'],
    max_steps=1000,
    # num_sanity_val_steps=0,
    accelerator="gpu",
    devices=4,
    # num_nodes=-1,
    strategy="deepspeed",
    deterministic=True,
    # precision='16-mixed',
    default_root_dir=log_dir,
    reload_dataloaders_every_n_epochs=1,
    benchmark=True,
    # use_distributed_sampler=True,
    enable_progress_bar=True,
    enable_model_summary=True,
    check_val_every_n_epoch=1,
    # precision='32-mixed',
    logger=logger,
)
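The alternative I have seen suggested is to skip the per-epoch reload entirely: make the reader infinite (num_epochs=None is Petastorm's "repeat forever" setting) and cap the epoch length on the Lightning side instead. A sketch of that variant (untested; 5000 is a placeholder value):

    def train_dataloader(self):
        # num_epochs=None makes the reader loop over the data indefinitely,
        # so the dataloader never exhausts and never needs reloading
        self.reader_train = make_reader(self.train_path, num_epochs=None, seed=1, shuffle_rows=True)
        return DataLoader(self.reader_train, batch_size=self.batch_size)

    # ...and bound the "epoch" in the Trainer instead of reloading dataloaders:
    trainer = Trainer(
        max_epochs=model_args['n_epochs'],
        limit_train_batches=5000,  # defines how many batches count as one "epoch"
        accelerator="gpu",
        devices=4,
        strategy="deepspeed",
        logger=logger,
    )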
Is this the right way to go about it, or is there something more efficient?
For context, my training set has ~13B examples and my val set is about 5% of that.