Cannot pickle torch._C.Generator object — Multi-GPU training

I would like to train a LightningModule model on a machine with multiple (3) GPUs.
In my program I create the train/val DataLoaders, specifying the generator as follows:
train_loader = DataLoader(dataset, batch_size, generator = torch.Generator(device = 'cuda'))
otherwise the random number generator produces items on the CPU, which legitimately raises a TypeError.
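
For reference, a self-contained version of that construction (a sketch: the TensorDataset here is only a stand-in for the real dataset, which is described further down):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real map-style dataset (shapes chosen arbitrarily);
# the CUDA generator requires a CUDA-capable machine.
dataset = TensorDataset(torch.randn(64, 3, 32, 32))
batch_size = 8

# The generator is stored on the loader and used by its sampling/seeding
# machinery; here it is deliberately created on the GPU.
train_loader = DataLoader(dataset, batch_size, generator = torch.Generator(device = 'cuda'))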

In doing so I get the error:
TypeError: cannot pickle 'torch._C.Generator' object
which I think comes from the fact that the generator object instantiated in the DataLoader constructor cannot be serialized and pickled, since it lives on the GPU.
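
The pickling limitation can be reproduced in isolation (a minimal sketch; it needs a CUDA-capable machine, and the exact behaviour may vary across PyTorch versions):

import pickle
import torch

gen = torch.Generator(device = 'cuda')
try:
    # The spawn-based multi-device launcher pickles the objects passed to
    # fit() (model, dataloaders, ...) when it ships them to child processes.
    pickle.dumps(gen)
except TypeError as err:
    print(err)   # e.g. cannot pickle 'torch._C.Generator' object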

If I use only 1 GPU (passing gpus = 1 to the Trainer constructor and declaring the environment variable CUDA_VISIBLE_DEVICES="0"), no error occurs.
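
For comparison, the single-GPU configuration that runs without error looks roughly like this (a sketch; the model, data and fit call are omitted):

import os
import pytorch_lightning as pl

# Make only the first GPU visible; set this before CUDA is initialised
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

trainer = pl.Trainer(gpus = 1, max_epochs = 5)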

I use a custom class inheriting from torch.utils.data.Dataset as the dataset. It is a map-style dataset containing only numerical, structured data. The overridden __getitem__ method returns a sequence of 2D images.
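
Since the dataset class itself is not shown, here is a hypothetical minimal version of such a map-style dataset (the name DSet and the tensor layout are assumptions based on the snippets below):

import torch
from torch.utils.data import Dataset

class DSet(Dataset):
    # Map-style dataset wrapping a tensor of shape (N, T, H, W)
    def __init__(self, x):
        self.x = x

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        # A sequence of T 2D images, shape (T, H, W)
        return self.x[idx]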

Apparently none of the solutions I found by googling the error matches this problem. Thanks in advance for any hint.

Hi @MatteoZambra

This is a simple limitation of the default multi-device strategy we use to launch processes. You have two options to avoid the problem:

  1. Return your dataloader from the LightningModule.train_dataloader() hook
  2. Use Trainer(strategy="ddp", ...) when using multiple devices (see the sketch below).
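
A rough sketch of both options, assuming a recent Lightning release where the multi-device flag is called strategy (the toy dataset and module here are placeholders, and training_step/configure_optimizers are omitted):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

toy_set = TensorDataset(torch.randn(64, 3, 32, 32))   # placeholder data

class MyModule(pl.LightningModule):
    # Option 1: build the DataLoader (and its CUDA generator) inside the hook,
    # so it is created in each training process instead of being pickled.
    def train_dataloader(self):
        return DataLoader(toy_set, batch_size = 8, generator = torch.Generator(device = 'cuda'))

# Option 2: DDP re-launches the script in each process, so the objects passed
# to the Trainer are not pickled the way the spawn-based launcher pickles them.
trainer = pl.Trainer(accelerator = "gpu", devices = 3, strategy = "ddp")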

Hope this helps!

Hi @awaelchli,
Thanks for your answer. I tried both of the options you suggested, but I am afraid that neither works. I forgot to mention a detail that may be crucial: the machine I use has CUDA 11.3 installed and the PyTorch Lightning version is 1.4.6 (constrained by the CUDA release), so the strategy flag is called accelerator instead.

To give more context, this is what happens in my program:

import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl

DEVICE = 'cuda'   # device requested for the DataLoader generators

class CDmod(pl.LightningDataModule):
    def __init__(self, data):
        super(CDmod, self).__init__()
       
        self.data = data
        self.setup()
    #end
     
    def setup(self, stage = None):
        
        # prepare x_train, x_val, x_test as tensors
        
        # The following are map-style torch.utils.data.Dataset instances
        self.train_set = DSet(x_train)
        self.val_set   = DSet(x_val)
        self.test_set  = DSet(x_test)
    #end
    
    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size = 8, generator = torch.Generator(DEVICE))
    #end
    
    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size = 8, generator = torch.Generator(DEVICE))
    #end
    
    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size = 8, generator = torch.Generator(DEVICE))
    #end
#end

# Definition of LitModel ...

if __name__ == '__main__':
    
    EPOCHS = 5
    GPUS = 3
    
    x = torch.normal(0, 1, (1500, 24, 200, 200))
    cdm = CDmod(x)
    
    model = Net()
    lit_model = LitModel(model, cdm)
    
    profiler_kwargs = {
        'max_epochs' : EPOCHS, 
        'log_every_n_steps' : 1
    }
    
    if torch.cuda.is_available():
        profiler_kwargs.update({'accelerator'  : 'ddp'})
        profiler_kwargs.update({'gpus'         : GPUS})
    #end
    
    trainer = pl.Trainer(**profiler_kwargs)
    trainer.fit(lit_model, cdm.train_dataloader(), cdm.val_dataloader())
    trainer.test(lit_model, cdm.test_dataloader())
    
#end

Which produces the error:
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
The same error is obtained if I use the lit_model.train_dataloader() hook, as in
trainer.fit(lit_model, lit_model.train_dataloader(), lit_model.val_dataloader())

Thanks again for your help.


EDIT: The difference I could spot with respect to the issue raised in my previous post is that, using accelerator = 'ddp' in the Trainer constructor, the pickling error for the Generator no longer appears, but at the price of having the generator on the CPU again.
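
For what it is worth, a quick check of which device the generator attached to the returned loader lives on (a small diagnostic sketch using the cdm instance from the listing above, not a fix):

loader = cdm.train_dataloader()
# DataLoader keeps a reference to the generator it was constructed with
print(loader.generator.device)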