Cannot pickle torch._C.Generator object — Multi-GPU training

I would like to train a LightningModule model on a machine with multiple (3) GPUs.
In my program I create the train/val DataLoaders, specifying the generator as follows:
train_loader = DataLoader(dataset, batch_size, generator = torch.Generator(device = 'cuda'))
otherwise the random number generator produces items on the CPU, which legitimately raises a TypeError.
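
For reference, a self-contained version of that construction (a sketch: the TensorDataset here is only a stand-in for the real dataset, which is described further down):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real map-style dataset (shapes chosen arbitrarily);
# the CUDA generator requires a CUDA-capable machine.
dataset = TensorDataset(torch.randn(64, 3, 32, 32))
batch_size = 8

# The generator is stored on the loader and used by its sampling/seeding
# machinery; here it is deliberately created on the GPU.
train_loader = DataLoader(dataset, batch_size, generator = torch.Generator(device = 'cuda'))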

In doing so I get the error:
TypeError: cannot pickle 'torch._C.Generator' object
which I think comes from the fact that the generator object instantiated in the DataLoader constructor cannot be serialized and pickled, since it lives on the GPU.
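
The pickling limitation can be reproduced in isolation (a minimal sketch; it needs a CUDA-capable machine, and the exact behaviour may vary across PyTorch versions):

import pickle
import torch

gen = torch.Generator(device = 'cuda')
try:
    # The spawn-based multi-device launcher pickles the objects passed to
    # fit() (model, dataloaders, ...) when it ships them to child processes.
    pickle.dumps(gen)
except TypeError as err:
    print(err)   # e.g. cannot pickle 'torch._C.Generator' object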

If I use only 1 GPU (passing gpus = 1 to the Trainer constructor and declaring the environment variable CUDA_VISIBLE_DEVICES="0"), no error occurs.
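
For comparison, the single-GPU configuration that runs without error looks roughly like this (a sketch; the model, data and fit call are omitted):

import os
import pytorch_lightning as pl

# Make only the first GPU visible; set this before CUDA is initialised
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

trainer = pl.Trainer(gpus = 1, max_epochs = 5)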

I use a custom class inheriting from torch.utils.data.Dataset as the dataset. It is a map-style dataset containing only numerical, structured data. The overridden __getitem__ method returns a sequence of 2D images.
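
Since the dataset class itself is not shown, here is a hypothetical minimal version of such a map-style dataset (the name DSet and the tensor layout are assumptions based on the snippets below):

import torch
from torch.utils.data import Dataset

class DSet(Dataset):
    # Map-style dataset wrapping a tensor of shape (N, T, H, W)
    def __init__(self, x):
        self.x = x

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        # A sequence of T 2D images, shape (T, H, W)
        return self.x[idx]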

Apparently none of the solutions I found by googling the error matches this problem. Thanks in advance for any hint.

Hi @MatteoZambra

This is a simple limitation of the default multi-device strategy we use to launch processes. You have two options to avoid the problem:

  1. Return your dataloader from the LightningModule.train_dataloader() hook
  2. Use Trainer(strategy="ddp", ...) when using multiple devices (see the sketch below).
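
A rough sketch of both options, assuming a recent Lightning release where the multi-device flag is called strategy (the toy dataset and module here are placeholders, and training_step/configure_optimizers are omitted):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

toy_set = TensorDataset(torch.randn(64, 3, 32, 32))   # placeholder data

class MyModule(pl.LightningModule):
    # Option 1: build the DataLoader (and its CUDA generator) inside the hook,
    # so it is created in each training process instead of being pickled.
    def train_dataloader(self):
        return DataLoader(toy_set, batch_size = 8, generator = torch.Generator(device = 'cuda'))

# Option 2: DDP re-launches the script in each process, so the objects passed
# to the Trainer are not pickled the way the spawn-based launcher pickles them.
trainer = pl.Trainer(accelerator = "gpu", devices = 3, strategy = "ddp")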

Hope this helps!

Hi @awaelchli,
Thanks for your answer. I tried both of the options you suggested, but I am afraid that neither works. I forgot to mention a detail that may be crucial: the machine I use has CUDA 11.3 installed and the PyTorch Lightning version is 1.4.6 (constrained by the CUDA release), so the strategy flag is called accelerator instead.

To give more context, this is what happens in my program:

import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl

DEVICE = 'cuda'   # device requested for the DataLoader generators

class CDmod(pl.LightningDataModule):
    def __init__(self, data):
        super(CDmod, self).__init__()
       
        self.data = data
        self.setup()
    #end
     
    def setup(self, stage = None):
        
        # prepare x_train, x_val, x_test as tensors
        
        # The following are map-style torch.utils.data.Dataset instances
        self.train_set = DSet(x_train)
        self.val_set   = DSet(x_val)
        self.test_set  = DSet(x_test)
    #end
    
    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size = 8, generator = torch.Generator(DEVICE))
    #end
    
    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size = 8, generator = torch.Generator(DEVICE))
    #end
    
    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size = 8, generator = torch.Generator(DEVICE))
    #end
#end

# Definition of LitModel ...

if __name__ == '__main__':
    
    EPOCHS = 5
    GPUS = 3
    
    x = torch.normal(0, 1, (1500, 24, 200, 200))
    cdm = CDmod(x)
    
    model = Net()
    lit_model = LitModel(model, cdm)
    
    profiler_kwargs = {
        'max_epochs' : EPOCHS, 
        'log_every_n_steps' : 1
    }
    
    if torch.cuda.is_available():
        profiler_kwargs.update({'accelerator'  : 'ddp'})
        profiler_kwargs.update({'gpus'         : GPUS})
    #end
    
    trainer = pl.Trainer(**profiler_kwargs)
    trainer.fit(lit_model, cdm.train_dataloader(), cdm.val_dataloader())
    trainer.test(lit_model, cdm.test_dataloader())
    
#end

Which produces the error:
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
The same error is obtained if I use the lit_model.train_dataloader() hook, as in
trainer.fit(lit_model, lit_model.train_dataloader(), lit_model.val_dataloader())

Thanks again for your help.


EDIT: The difference I could spot with respect to the issue raised in my previous post is that, using accelerator = 'ddp' in the Trainer constructor, the pickling error for the Generator no longer appears, but at the price of having the generator on the CPU again.
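
For what it is worth, a quick check of which device the generator attached to the returned loader lives on (a small diagnostic sketch using the cdm instance from the listing above, not a fix):

loader = cdm.train_dataloader()
# DataLoader keeps a reference to the generator it was constructed with
print(loader.generator.device)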