I’ve tried to run a very basic example from one of the tutorials on a small fraction of the MNIST dataset with 'ddp', but I encounter RuntimeError: CUDA error: out of memory.
It works fine with 2 GPUs, but crashes with 4 GPUs.
On the machine I am running on, there are 8 Tesla K40 GPUs with 12 GB of RAM each, and CUDA version 11.1.
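For reference, this is the quick sanity check I would run to confirm what PyTorch actually sees on this machine (standard torch.cuda calls, nothing specific to my setup; output omitted):

import torch

print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.device_count())      # number of GPUs visible to this process
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))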
Here is the minimal example:
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    def __init__(self):
        super(MNISTModel, self).__init__()
        # not the best model…
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def train_dataloader(self):
        return DataLoader(MNIST("~/data", train=True, download=True,
                                transform=transforms.ToTensor()), batch_size=4)


mnist_model = MNISTModel()
trainer = pl.Trainer(max_epochs=5, limit_train_batches=0.1, gpus=4, accelerator='ddp')
trainer.fit(mnist_model)
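In case it is relevant: I understand the gpus argument also accepts explicit device indices, so the run could be pinned to specific cards if some of the eight GPUs are busy with other jobs. Just a sketch, the indices below are arbitrary:

trainer = pl.Trainer(max_epochs=5, limit_train_batches=0.1,
                     gpus=[4, 5, 6, 7], accelerator='ddp')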
Could someone please help me understand what I am doing wrong, or what the problem is and how to fix it?
These are my packages:
cudatoolkit 11.0.221
python 3.7.4
pytorch 1.7.1
pytorch-lightning 1.1.4
torchvision 0.8.2
Thank you in advance!