RuntimeError: Cannot re-initialize CUDA in forked subprocess

Hello all,
I was trying to run a PyTorch Lightning Trainer on multiple GPUs in a Kaggle notebook, like this:

import random
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import albumentations as A
import pytorch_lightning as pl

class StarsDataset(Dataset):
    def __init__(self, split, transform=None):        
        self.img = data_split[split]
        random.shuffle(self.img)
        self.img_dir = im_dir
        self.transform = transform

    def __len__(self):
        return len(self.img)

    def __getitem__(self, idx):
        im_id = self.img[idx]
        anno = annotations[im_id]
        bboxes = anno['box_examples_coordinates']

        rects = list()
        for bbox in bboxes:
            x1 = bbox[0][0]
            y1 = bbox[0][1]
            x2 = bbox[2][0]
            y2 = bbox[2][1]
            rects.append([y1, x1, y2, x2])

        dots = np.array(anno['points'])
        image = np.array(Image.open(im_dir + im_id))
        density = np.load(gt_dir + im_id[:-4] + '.npy').astype('float32')   
        m_flag = 0

        boxes = list()
        for box in rects:
            y1, x1, y2, x2 = [int(k) for k in box]  
            bbox = Image.fromarray(image[y1:y2+1, x1:x2+1, :])
            bbox = transforms.Resize((64, 64))(bbox)
            boxes.append(transforms.ToTensor()(bbox))
        boxes = torch.stack(boxes)

        if self.transform is not None:
            aug = self.transform(image=image, mask=density)
            image = aug['image']
            density = aug['mask']
        
        # boxes shape [3,3,64,64], image shape [3,384,384], density shape[384,384]   
        norm = A.Normalize()(image=image, mask=density)

        return norm['image'].transpose(2, 0, 1), norm['mask'], boxes, m_flag

batch_size = 8
train_dataset = StarsDataset('train', t_transform)
val_dataset = StarsDataset('val')
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
val_dl = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)

class CounTrModel(pl.LightningModule):
    def __init__(self, model, optimizer, criterion, metric=None):
        super().__init__()
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.metric = metric
        
    def forward(self, x, boxes, shot_num):
        return self.model(x, boxes, shot_num)
    
    def shared_step(self, batch, stage):
        samples, gt_density, boxes, m_flag = batch
        shot_num = random.randint(0, 3)
        output = self.forward(samples, boxes, shot_num)
        loss = self.criterion(output, gt_density)
        mae = self.metric(output, gt_density)
        self.log(f'{stage}_loss', loss, prog_bar=True)
        self.log(f'{stage}_mae', mae, prog_bar=True)
        return {"loss": loss, "mae": mae, "boxes": boxes[0], "samples": samples[0], "output": output[0], "gt_density": gt_density[0]}
    
    def shared_epoch_end(self, outputs, stage):
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
        avg_mae = torch.tensor([x["mae"] for x in outputs]).mean()
        output = outputs[0]["output"]
        gt_density = outputs[0]["gt_density"]
        boxes = outputs[0]["boxes"]
        samples = outputs[0]["samples"]
        fig = output[0].unsqueeze(0).repeat(3,1,1)
        f1 = gt_density[0].unsqueeze(0).repeat(3,1,1)
        self.logger.experiment.add_scalar(f"mae/{stage}", avg_mae, self.current_epoch)
        self.logger.experiment.add_scalar(f"loss/{stage}", avg_loss, self.current_epoch)
        self.logger.experiment.add_images('bboxes', (boxes[0]), self.current_epoch, dataformats='CHW')
        self.logger.experiment.add_images('gt_density', (samples[0]/2 + f1/10), self.current_epoch, dataformats='CHW')
        self.logger.experiment.add_images('density map', (fig/20), self.current_epoch, dataformats='CHW')
        self.logger.experiment.add_images('density map overlay', (samples[0]/2+fig/10), self.current_epoch, dataformats='CHW')
        epoch_dictionary={f'{stage}_loss': avg_loss}
        return epoch_dictionary

    def training_step(self, batch, batch_idx):
        return self.shared_step(batch, "train") 

    def training_epoch_end(self, outputs):
        return self.shared_epoch_end(outputs, "train")    

    def validation_step(self, batch, batch_idx):
        return self.shared_step(batch, "val")

    def validation_epoch_end(self, outputs):
        return self.shared_epoch_end(outputs, "val")

    def configure_optimizers(self):
        optimizer = self.optimizer
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0)
        return [optimizer], [scheduler]

pl_model = CounTrModel(model, optimizer, criterion, metric)
trainer = pl.Trainer(callbacks=cbs, accelerator='gpu', devices=2, max_epochs=20, logger=logger)
trainer.fit(pl_model, train_dl, val_dl)

But when I run the code it gives me the following error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

What am I doing wrong?

@santurini Which version of Lightning are you using? The support for forking in notebooks was a recent addition, and some fixes have already been made.
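
On a recent enough release you can also request the notebook-friendly, fork-based strategy explicitly. This is just a sketch, assuming a version where the "ddp_notebook" strategy string is available; the default selection usually does the right thing on its own:

# Sketch: explicitly selecting the fork-based DDP strategy intended for Jupyter/Kaggle notebooks
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp_notebook",  # fork-based DDP for notebooks (available in recent Lightning versions)
    max_epochs=20,
)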

I am using version 1.7.7, should I upgrade? If yes, do I have to change the code or is it correct like this?

Yes, there shouldn’t be anything that you have to change in your code.

I changed the code but now it gives me this error:

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

I think the error comes from this line of code, but the documentation for NativeMixedPrecisionPlugin says it can optionally be passed a torch.cuda.amp.GradScaler():

plugin = pl.plugins.precision.NativeMixedPrecisionPlugin(precision=16, device=device, scaler=torch.cuda.amp.GradScaler())

trainer = pl.Trainer(callbacks=cbs, accelerator="gpu", devices=2, max_epochs=50, plugins=[plugin])

Is there any option to use a loss scaler together with multiple GPUs?
Without multiprocessing I was just using the scaler inside the train step as follows and it worked:

torch.cuda.amp.GradScaler() and self.scaler.scale(loss)
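
For context, the manual pattern I mean looks roughly like this (a sketch, not my exact training step, using the standard torch.cuda.amp API):

# Rough sketch of manual mixed-precision scaling (single GPU, no Lightning multiprocessing)
scaler = torch.cuda.amp.GradScaler()

for samples, gt_density, boxes, m_flag in train_dl:
    optimizer.zero_grad()
    shot_num = random.randint(0, 3)
    with torch.cuda.amp.autocast():
        output = model(samples.cuda(), boxes.cuda(), shot_num)
        loss = criterion(output, gt_density.cuda())
    scaler.scale(loss).backward()   # scale the loss before backward
    scaler.step(optimizer)          # unscale gradients and step the optimizer
    scaler.update()                 # adjust the scale factor for the next iteration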


Update:

I removed the plugin and all the CUDA calls; I also removed the scaler, but the error stays the same.

I think the error appears after upgrading because there are some .to(device) calls in the model definition; I will remove them and update you!
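
The device-agnostic pattern I plan to switch to looks roughly like this (a sketch with a hypothetical module, not the actual model):

class MyModule(pl.LightningModule):  # hypothetical module, just to illustrate the pattern
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(16, 16)
        # no .to(device) here: Lightning moves the module to the right GPU itself

    def forward(self, x):
        # create new tensors relative to the input (or self.device) instead of a fixed device
        noise = torch.randn_like(x)
        mask = torch.ones(x.shape[0], device=self.device)
        return self.backbone(x + noise) * mask.unsqueeze(1)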


You don’t need to do this:

plugin = pl.plugins.precision.NativeMixedPrecisionPlugin(precision=16, device=device, scaler=torch.cuda.amp.GradScaler())

The grad scaler gets created and used automatically when precision=16 is specified in the Trainer. The call to torch.cuda.amp.GradScaler() is the reason you are seeing the error. In more recent versions of Lightning, the error message includes a hint to avoid torch.cuda.* calls in the notebook before spawning processes. There is nothing else we can do, unfortunately.
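
Something like this should be enough (a sketch based on your earlier call; keep your callbacks and logger as they are):

# Let Lightning create the GradScaler internally via precision=16;
# no manual plugin and no torch.cuda.* calls before the Trainer starts.
trainer = pl.Trainer(
    callbacks=cbs,
    accelerator="gpu",
    devices=2,
    precision=16,
    max_epochs=50,
)
trainer.fit(pl_model, train_dl, val_dl)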

If we want to use multiple GPUs in Jupyter notebooks, we have to live with these restrictions; otherwise, run the experiments as a regular script.
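
For completeness, the script alternative would look roughly like this (a sketch with a hypothetical train.py; the model and dataloader construction is the same as in the notebook):

# train.py -- hypothetical standalone script, launched with `python train.py`
import pytorch_lightning as pl

def main():
    # build the datasets, dataloaders, model, optimizer, criterion, etc. exactly as in the notebook
    pl_model = CounTrModel(model, optimizer, criterion, metric)
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp",
                         precision=16, max_epochs=50)
    trainer.fit(pl_model, train_dl, val_dl)

if __name__ == "__main__":
    main()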