Dealing with large datasets

I have an extremely large dataset that cannot fit in memory, so I split it into smaller parts. As a result, I end up with multiple dataset instances that can be trained on sequentially, only loading the underlying data when needed. But I don't know how to make the Trainer.fit() method work the way I expect. Can anyone help me?

Hey @Nguy_n_Van_Nga_Adtec

It is very common in deep learning for datasets not to fit into memory. In PyTorch, people usually keep their data on disk and load it with a Dataset wrapped in a DataLoader, so only the current batch is ever in memory. Here is a tutorial from PyTorch on how to do that.
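To make that concrete, here is a minimal sketch of a map-style Dataset that loads each sample from disk only when it is indexed; the directory layout, file format, and names like `DiskBackedDataset` are assumptions for illustration, not part of your setup:

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

class DiskBackedDataset(Dataset):
    def __init__(self, data_dir):
        # Only the list of file paths is kept in memory, never the data itself.
        self.paths = sorted(
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Assumes each file holds one (input, target) pair saved with torch.save.
        sample, target = torch.load(self.paths[idx])
        return sample, target

# The DataLoader streams batches from disk; workers load samples in parallel.
dataloader = DataLoader(DiskBackedDataset("data/"), batch_size=32, num_workers=4)
```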

I suggest you go with that approach, since it is the simplest. When you call trainer.fit(), pass your LightningModule along with your training dataloader, like so:

trainer.fit(model, dataloader)

The introduction tutorial here shows how to use Lightning with your dataloader, using MNIST as an example.
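For completeness, here is a hedged sketch of how the pieces fit together; `LitModel` and its layer are placeholders for your own model, and `dataloader` is the disk-backed DataLoader from the sketch above:

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)  # placeholder architecture

    def training_step(self, batch, batch_idx):
        # Each batch comes straight from the disk-backed DataLoader.
        x, y = batch
        logits = self.layer(x.view(x.size(0), -1))
        return nn.functional.cross_entropy(logits, y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = LitModel()
trainer = pl.Trainer(max_epochs=1)
# Lightning iterates over the dataloader for you, one batch at a time.
trainer.fit(model, dataloader)
```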