I am trying to use DDP for multi-GPU training of my model; however, I am running into the following error:
ProcessExitedException: process 0 terminated with signal SIGSEGV
I am using PyTorch Lightning with the following configuration for the trainer:
devices = -1, num_nodes = 1, strategy = 'ddp_notebook'
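In code, the trainer setup looks roughly like this (a sketch only; I import from lightning.pytorch here, and the model/data setup is omitted):
from lightning.pytorch import Trainer

trainer = Trainer(
    devices=-1,              # use all available GPUs
    num_nodes=1,
    strategy="ddp_notebook",
)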
My code works perfectly on a single-GPU machine.
My system environment is as follows:
Python: 3.10.6
CUDA: 11.4
Torch: 2.0.1+cu117
I searched online and found people talking about downgrading Python from 3.9 to 3.8, but all of those posts are old. Is there another solution to this problem? Downgrading Python may not be an option for me, especially going all the way back to 3.8.
Below is some more information about this error:
@manitadayon
It looks like you are running this from inside a Jupyter notebook?
Does the following torch code snippet result in the same error (please run it in a new notebook)?
import torch
import torch.multiprocessing as mp
def run(rank):
    print(rank)
    device = torch.device("cuda", rank)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(2, 2).to(device)
    loss = model(torch.randn(2, 2).to(device)).sum()
    loss.backward()
mp.start_processes(run, nprocs=2, start_method="fork")
@awaelchli
Yes I am running it from notebook.
Your code works well in my notebook and prints 0 and 1.
Hey @manitadayon
I just debugged some code from another user who had the same error as you. Did you set up your training like this?
train_dataloader = ...
trainer = Trainer(...)
trainer.fit(model, train_dataloader)
It seems that there is a memory sharing issue that prevents us from passing the dataloaders to trainer.fit in the main process. So my suggestion is to define the dataloaders in the LightningModule:
# in LightningModule, define
def train_dataloader(self):
    return DataLoader(...)

trainer = Trainer(...)
trainer.fit(model)  # don't pass dataloader in here
This seems to resolve the memory sharing issue. I’ll think about how we can document this better!
@awaelchli, thanks. I am currently defining my dataloader exactly the way you described:
train_dataloader = ...
trainer = Trainer(...)
trainer.fit(model, train_dataloader)
Now, in the new definition, when you say not to pass the dataloader there: how would my trainer know about train_dataloader/val_dataloader?
This is what I have:
def train(self):
    self.trainer.fit(self.model, train_dataloader=self.train_dataloader, val_dataloader=val_dataloader)
Not this way. Sorry that my answer wasn't clear. Here is a full code skeleton:
import torch
from torch.utils.data import DataLoader, Dataset
from lightning.pytorch import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def train_dataloader(self):
        # Return your dataloader here
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    model = BoringModel()
    trainer = Trainer(max_epochs=1, strategy="ddp_notebook", devices=2)
    trainer.fit(model)


if __name__ == "__main__":
    run()
As you can see, the dataloader is defined in the LightningModule and gets requested by the Trainer automatically after the multi-GPU processes have been created. This should avoid your segmentation fault error.
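To answer the earlier question about the validation loader: it works the same way. Define a val_dataloader hook (and a validation_step) on the LightningModule so that the validation data is also created inside the DDP processes. A minimal sketch, reusing the RandomDataset from the skeleton above:
# add to BoringModel
def val_dataloader(self):
    # built inside each DDP process, just like train_dataloader
    return DataLoader(RandomDataset(32, 64), batch_size=2)

def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("val_loss", loss)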
@awaelchli Is this also related to the size of shared memory on the system? Does it help to increase /dev/shm?
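(For context, the current shared-memory capacity can be checked with something like the following; whether increasing it actually helps here is the open question.)
import shutil

# Check the size of the shared-memory filesystem (Linux)
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")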