I am still quite new to PyTorch. I read (in the docs) that you should create a tensor directly on the target device to avoid transfers.
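My understanding of that recommendation, as a tiny sketch (the device fallback is only so it also runs on CPU-only machines):

```python
import torch

# pick whatever device is available; 'cuda' is what I actually care about
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# create the tensor directly on the target device (what the docs recommend, as I read them)
a = torch.ones(1000, device=device)

# create on the CPU first and then transfer -- the extra copy I'd like to avoid
b = torch.ones(1000).to(device)
```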
I have a custom Dataset class that handles my data and looks like this:
import numpy as np
import torch
import torchvision
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, labels: np.ndarray, data_path):
        self.targets = torch.from_numpy(labels)
        # one read-only memmap per file; all files share the same shape
        self.list_of_arrays = [
            np.memmap(memmap_path, dtype='float32', mode='r',
                      shape=(30316, 1, 160, 392))
            for memmap_path in data_path
        ]
        self.transform = torchvision.transforms.GaussianBlur(
            kernel_size=(51, 51), sigma=(0.1, 2))

    def __getitem__(self, index):
        # stack the index-th slice from every memmap into one tensor
        x = torch.from_numpy(np.stack([item[index] for item in self.list_of_arrays]))
        y = self.targets[index]
        return self.transform(x), y

    def __len__(self):
        return len(self.targets)
To give a short explanation: My data is distributed over several memmaps. Whenever data is requested I have to collect all the different pieces from those files.
In __init__ I build a list that holds all the numpy memmaps.
This Dataset is then used by a LightningDataModule to provide all the necessary data.
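To make the setup concrete, here is a stripped-down stand-in for my dataset and the DataLoader that the LightningDataModule wraps around it (shapes and names here are placeholders, not my real data):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TinyDataset(Dataset):
    """Stand-in for MyDataset: random data instead of memmaps."""
    def __init__(self, labels: np.ndarray):
        self.targets = torch.from_numpy(labels)
        self.data = torch.randn(len(labels), 1, 8, 8)

    def __getitem__(self, index):
        return self.data[index], self.targets[index]

    def __len__(self):
        return len(self.targets)

loader = DataLoader(TinyDataset(np.arange(10, dtype='float32')), batch_size=4)
xb, yb = next(iter(loader))
# as far as I can tell, batches come out on the CPU here and Lightning moves them per batch
```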
I therefore have several questions:
- should I create the targets/labels on the device in __init__, or leave it as it is?
- in __getitem__ I first load my data and then apply the transformation. Can I create the tensor directly on the GPU at this point?
- Does Lightning actually stop me from messing up the device on which things are created? I am worried about something like "create tensor on GPU -> moved to CPU for some calculation -> moved back to GPU". I have no idea whether all calculations stay on the target device (in this case the GPU) once the tensor is created there, or whether, e.g., the transforms might cause trouble.
- Is this actually the right way to apply the transformation to my data?
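To clarify what I mean in the first two questions, these are the two variants I am weighing (purely illustrative; the device fallback is only so the snippet runs anywhere):

```python
import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
labels = np.zeros(4, dtype='float32')

# variant A (what I do now): create on the CPU and let Lightning/the DataLoader handle devices
targets_cpu = torch.from_numpy(labels)

# variant B (what I am asking about): put the tensor on the target device right away,
# e.g. also for the stacked x inside __getitem__
targets_dev = torch.as_tensor(labels, device=device)
```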
All advice is appreciated.