Lightning giving CUDA out-of-memory error

Chetan_Pandey · April 3, 2022, 6:47am

Hi everyone,
I recently tried PyTorch Lightning. I converted my PyTorch code to PyTorch Lightning, but it gives a CUDA out-of-memory error after a few iterations, whereas the same code ran fine in plain PyTorch. What mistake could I be making here?

aniketmaurya · April 3, 2022, 7:57am

Hi @Chetan_Pandey, would it be possible to share a reproducible code example?
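
For a script to be reproducible by others, it usually has to run without the original dataset. Below is a minimal sketch of a stand-in data source, assuming batches are dictionaries with 'image' and 'target' keys, which is the shape the code posted later in this thread uses. The class name `RandomImageDataset`, the image shape (3, 224, 224), the class count of 10, and the batch size are all assumptions chosen for illustration, not values from the original post.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomImageDataset(Dataset):
    """Hypothetical stand-in dataset yielding random images and labels."""

    def __init__(self, length=64, image_shape=(3, 224, 224), num_classes=10):
        self.length = length
        self.image_shape = image_shape
        self.num_classes = num_classes

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Seed per index so every run produces the same samples.
        gen = torch.Generator().manual_seed(idx)
        image = torch.randn(self.image_shape, generator=gen)
        target = torch.randint(0, self.num_classes, (1,), generator=gen).item()
        # Dict samples are collated into dict batches by the default collate function.
        return {"image": image, "target": target}


train_dataloader = DataLoader(RandomImageDataset(), batch_size=8, shuffle=True)
val_dataloader = DataLoader(RandomImageDataset(length=16), batch_size=8)

# Pull one batch to confirm the shapes the model will receive.
batch = next(iter(train_dataloader))
print(batch["image"].shape, batch["target"].shape)
```

If the memory error still appears with random data like this, the dataset can be ruled out as the cause, and the script is short enough to paste into a reply.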

goku · April 4, 2022, 2:07pm

Hey @Chetan_Pandey, as @aniketmaurya suggested, it'd be great to get a reproducible script to check the issue.

We have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only after some time.

Thank you

Chetan_Pandey · April 4, 2022, 6:17pm

import pytorch_lightning as pl
import torch
from torch import nn
from torch.optim import Adam


class MNISTTrainer(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Pretrained EfficientNet-B4 loaded from NVIDIA's Torch Hub entry.
        self.model = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_efficientnet_b4', pretrained=True)
        # NAC is a custom module that is not defined in this post.
        self.model.fc = NAC(2, 2048, 1000, 28)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, X_batch):
        return self.model(X_batch)

    def training_step(self, batch, batch_idx):
        # Batches are dicts with 'image' and 'target' entries.
        img = batch['image']
        target = batch['target']
        preds = self.model(img)

        loss_val = self.loss(preds, target)
        self.log("train_loss", loss_val)

        return loss_val

    def validation_step(self, batch, batch_idx):
        img = batch['image']
        target = batch['target']
        preds = self.model(img)

        loss_val = self.loss(preds, target)
        self.log("val_loss", loss_val)

        return loss_val

    def test_step(self, batch, batch_idx):
        img = batch['image'].float()
        target = batch['target']
        preds = self.model(img)

        loss_val = self.loss(preds, target)
        self.log("test_loss", loss_val)

        return loss_val

    def predict_step(self, batch, batch_idx):
        # Only the images are needed for prediction.
        img = batch['image']

        return self.model(img)

    def configure_optimizers(self):
        optimizer = Adam(self.model.parameters(), lr=5e-4)
        return optimizer


classifier = MNISTTrainer()

# gpus=1 was the flag at the time of this post; newer Lightning
# versions use accelerator="gpu", devices=1 instead.
trainer = pl.Trainer(max_epochs=30, log_every_n_steps=20, gpus=1)

# train_dataloader and val_dataloader are built elsewhere (not shown in this post).
trainer.fit(classifier, train_dataloader, val_dataloader)

Sorry for the late reply.
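
Since the error only shows up after a few iterations, one way to narrow it down is to watch how GPU memory changes from step to step. Below is a minimal sketch using PyTorch's `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`. The helper name `gpu_memory_report` and its return format are assumptions made for illustration; it is a diagnostic aid, not a fix, and the thread does not establish what the actual cause is.

```python
import torch


def gpu_memory_report(device=None):
    """Return current and peak GPU memory held by tensors, in MiB."""
    if not torch.cuda.is_available():
        # No GPU visible: report zeros so the helper is still safe to call.
        return {"allocated_mib": 0.0, "peak_mib": 0.0}
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "peak_mib": torch.cuda.max_memory_allocated(device) / mib,
    }


report = gpu_memory_report()
print(f"allocated={report['allocated_mib']:.1f} MiB, peak={report['peak_mib']:.1f} MiB")
```

Calling this every few batches inside `training_step` (and printing or logging the result) shows whether memory climbs steadily across iterations or is already close to the limit on the first step. A steady climb usually suggests that tensors are being kept alive between steps, while a high first-step peak points more toward the batch size or model size.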
