How do I control where datasets are located in system memory when using a LightningDataModule? I understand that ptl automatically manages data location (meaning “in CPU-side RAM or in GPU-side RAM”) when training and testing models. I am using large datasets that I want to keep CPU-side, to train models that I want to keep in GPU RAM. I only want to copy a single data batch to the GPU during each training iteration.
Can I do this with pytorch-lightning? At several points the documentation warns against manually setting data locations with tensor_name.to(device), but that would be required for what I’m doing. Also, should I use the prepare_data() or setup() method to do this? The example DataModules all define the datasets within setup(), but this is called on each separate GPU. This would cause overwrites of CPU-side data tensors, right?
Please let me know if I’m way off base with my understanding of ptl data handling or if this is a possibility.
I have the same question. I’m completely baffled why there’s an assumption in LightningDataModules that data processing (particularly train/val/test splits) is happening on GPUs. I want a DataLoader to send batches to the GPUs as needed, but to do the actual loading on the CPU (what’s the point of num_workers otherwise?). I’m trying to figure out how to do that now, and it feels like no matter what I do, the dataset loads itself again for each GPU under DDP.
I had forgotten about this; thank you for the ping.
My solution for this was to move my data into HDF5 records. The trial data can then be indexed like a torch tensor or a numpy array without first loading the entire dataset into CPU memory. You can write your dataloader to send each trial to the desired device.
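For anyone landing here later, a minimal sketch of what that can look like, assuming h5py is installed and the trials are stored as a single float32 array. The class name H5TrialDataset, the helper make_loader, and the dataset key "trials" are made up for illustration and are not from my actual code.

```python
import os
import tempfile

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class H5TrialDataset(Dataset):
    """Map-style dataset that reads one trial at a time from an HDF5 file.

    Hypothetical layout: one array under `key`, shaped (num_trials, ...).
    """

    def __init__(self, path, key="trials"):
        self.path = path
        self.key = key
        self._file = None  # opened lazily, so each DataLoader worker opens its own handle
        with h5py.File(path, "r") as f:
            self._length = f[key].shape[0]  # reads metadata only, not the data

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        # Indexing an h5py dataset reads just the requested trial from disk.
        trial = self._file[self.key][index]
        return torch.from_numpy(trial)  # CPU tensor


def make_loader(path, batch_size=4, num_workers=0):
    # Batches come out as CPU tensors; moving each batch to the GPU is left
    # to the training loop.
    return DataLoader(H5TrialDataset(path), batch_size=batch_size,
                      shuffle=True, num_workers=num_workers)


if __name__ == "__main__":
    # Tiny demo file with made-up data: 10 trials of 4 values each.
    path = os.path.join(tempfile.mkdtemp(), "trials.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("trials", data=np.arange(40, dtype=np.float32).reshape(10, 4))
    for batch in make_loader(path):
        print(batch.shape, batch.device)
```

The file handle is opened inside __getitem__ rather than __init__ so that it is not shared across worker processes when num_workers > 0. A loader like this can be returned from train_dataloader() in a LightningDataModule; Lightning then moves each batch it yields to the right device, so no manual .to(device) call is needed for plain tensor batches.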