I am trying to use DDP for multi-GPU training of my model; however, I am running into the following error:
ProcessExitedException: process 0 terminated with signal SIGSEGV
I am using PyTorch Lightning with the following configuration for the trainer:
devices = -1, num_nodes = 1, strategy = 'ddp_notebook'
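In code, the trainer setup looks roughly like this (a sketch only; I import from lightning.pytorch here, and the model/data setup is omitted):
from lightning.pytorch import Trainer

trainer = Trainer(
    devices=-1,              # use all available GPUs
    num_nodes=1,
    strategy="ddp_notebook",
)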
My code works perfectly on a single-GPU machine.
My system environment is as follows:
Python: 3.10.6
CUDA: 11.4
Torch: 2.0.1+cu117
I searched online and found people talking about downgrading Python from 3.9 to 3.8, but all of those posts are old. Is there another solution to this problem? Downgrading Python may not be an option for me, especially going all the way back to 3.8.
Below is some more information about this error:
@manitadayon
It looks like you are running this from inside a Jupyter notebook?
Does the following torch code snippet result in the same error (please run it in a new notebook)?
import torch
import torch.multiprocessing as mp
def run(rank):
    print(rank)
    device = torch.device("cuda", rank)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(2, 2).to(device)
    loss = model(torch.randn(2, 2).to(device)).sum()
    loss.backward()
mp.start_processes(run, nprocs=2, start_method="fork")
@awaelchli
Yes I am running it from notebook.
Your code works well in my notebook and prints 0 and 1.
Hey @manitadayon
I just debugged some code from another user who had the same error as you. Did you set up your training like this?
train_dataloader = ...
trainer = Trainer(...)
trainer.fit(model, train_dataloader)
It seems that there is a memory sharing issue that prevents us from passing the dataloaders to trainer.fit in the main process. So my suggestion is to define the dataloaders in the LightningModule:
# in LightningModule, define
def train_dataloader(self):
    return DataLoader(...)

trainer = Trainer(...)
trainer.fit(model)  # don't pass dataloader in here
This seems to resolve the memory sharing issue. I’ll think about how we can document this better!
@awaelchli, thanks. I am currently defining my dataloader exactly the way you described:
train_dataloader = ...
trainer = Trainer(...)
trainer.fit(model, train_dataloader)
Now, in the new definition, when you say not to pass the dataloader there: how would my trainer know about train_dataloader/val_dataloader?
This is what I have:
def train(self):
    self.trainer.fit(self.model, train_dataloader=self.train_dataloader, val_dataloader=val_dataloader)
Not this way. Sorry that my answer wasn't clear. Here is a full code skeleton:
import torch
from torch.utils.data import DataLoader, Dataset
from lightning.pytorch import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def train_dataloader(self):
        # Return your dataloader here
        return DataLoader(RandomDataset(32, 64), batch_size=2)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    model = BoringModel()
    trainer = Trainer(max_epochs=1, strategy="ddp_notebook", devices=2)
    trainer.fit(model)


if __name__ == "__main__":
    run()
As you can see, the dataloader is defined in the LightningModule and gets requested by the Trainer automatically after the multi-GPU processes have been created. This should avoid your segmentation fault error.
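To answer the earlier question about the validation loader: it works the same way. Define a val_dataloader hook (and a validation_step) on the LightningModule so that the validation data is also created inside the DDP processes. A minimal sketch, reusing the RandomDataset from the skeleton above:
# add to BoringModel
def val_dataloader(self):
    # built inside each DDP process, just like train_dataloader
    return DataLoader(RandomDataset(32, 64), batch_size=2)

def validation_step(self, batch, batch_idx):
    loss = self(batch).sum()
    self.log("val_loss", loss)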
@awaelchli Is this also related to the size of shared memory on the system? Does it help to increase /dev/shm?
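(For context, the current shared-memory capacity can be checked with something like the following; whether increasing it actually helps here is the open question.)
import shutil

# Check the size of the shared-memory filesystem (Linux)
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")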