Dealing with large datasets

I have an extremely large dataset that cannot fit in memory, so I split it into smaller parts. As a result, I end up with multiple dataset instances that can be trained on sequentially, only loading the underlying data when needed. But I don't know how to make the Trainer.fit() method work the way I expect. Can anyone help me?

Hey @Nguy_n_Van_Nga_Adtec

It is very common in deep learning for datasets not to fit into memory. In PyTorch, people usually keep their data on disk and load it with a Dataset wrapped in a DataLoader, so only the current batch is ever in memory. Here is a tutorial from PyTorch on how to do that.
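To make that concrete, here is a minimal sketch of a map-style Dataset that loads each sample from disk only when it is indexed; the directory layout, file format, and names like `DiskBackedDataset` are assumptions for illustration, not part of your setup:

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

class DiskBackedDataset(Dataset):
    def __init__(self, data_dir):
        # Only the list of file paths is kept in memory, never the data itself.
        self.paths = sorted(
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Assumes each file holds one (input, target) pair saved with torch.save.
        sample, target = torch.load(self.paths[idx])
        return sample, target

# The DataLoader streams batches from disk; workers load samples in parallel.
dataloader = DataLoader(DiskBackedDataset("data/"), batch_size=32, num_workers=4)
```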

I suggest you go with that approach, since it is the simplest. When you call trainer.fit(), pass your LightningModule along with your training dataloader, like so:

trainer.fit(model, dataloader)

The introduction tutorial here shows how to use Lightning with your dataloader, using MNIST as an example.
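For completeness, here is a hedged sketch of how the pieces fit together; `LitModel` and its layer are placeholders for your own model, and `dataloader` is the disk-backed DataLoader from the sketch above:

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)  # placeholder architecture

    def training_step(self, batch, batch_idx):
        # Each batch comes straight from the disk-backed DataLoader.
        x, y = batch
        logits = self.layer(x.view(x.size(0), -1))
        return nn.functional.cross_entropy(logits, y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = LitModel()
trainer = pl.Trainer(max_epochs=1)
# Lightning iterates over the dataloader for you, one batch at a time.
trainer.fit(model, dataloader)
```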