RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

When using PyTorch Lightning 1.5.8, I got this error:
“RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)”
The traceback is as follows:


Here is my code.

def training_step(self, batch):
    x = batch['feature'].float()
    y_app = batch['app_label'].long()
    x_tra_all, y_tra_all = drop_na(batch)

    app_out = self(x)
    tra_all_out = self(x_tra_all)
    out_app = nn.Linear(in_features=50, out_features=17)
    out_tra = nn.Linear(in_features=50, out_features=12)
    out_all = nn.Linear(in_features=50, out_features=6)

    y_hat_app = out_app(app_out)
    y_hat_tra = out_tra(tra_all_out)
    y_hat_all = out_all(tra_all_out)

    entropy_app = F.cross_entropy(y_hat_app, y_app)
    entropy_tra = F.cross_entropy(y_hat_tra, y_tra_all)
    entropy_all = F.cross_entropy(y_hat_all, y_tra_all)
    entropy = (entropy_app + entropy_tra + entropy_all) / 3.0
    self.log('training_loss', entropy, prog_bar=True, logger=True, on_step=True, on_epoch=True)
    loss = {'loss': entropy}

    return loss

Then I tried another version of the code and got the same error.

def training_step(self, batch):
    x = batch['feature'].float()
    y_app = batch['app_label'].long()
    x_tra_all, y_tra_all = drop_na(batch)

    app_out = self(x)
    app_out = app_out.type_as(x)

    tra_all_out = self(x_tra_all)
    out_app = nn.Linear(in_features=50, out_features=17)
    out_tra = nn.Linear(in_features=50, out_features=12)
    out_all = nn.Linear(in_features=50, out_features=6)

    # .to(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
    y_hat_app = torch.zeros(batch['app_label'].shape).long()
    y_hat_app = y_hat_app.type_as(y_app)
    y_hat_app = out_app(app_out)

    y_hat_tra = out_tra(tra_all_out)
    y_hat_all = out_all(tra_all_out)

    entropy_app = F.cross_entropy(y_hat_app, y_app)
    entropy_tra = F.cross_entropy(y_hat_tra, y_tra_all)
    entropy_all = F.cross_entropy(y_hat_all, y_tra_all)
    entropy = (entropy_app + entropy_tra + entropy_all) / 3.0
    self.log('training_loss', entropy, prog_bar=True, logger=True, on_step=True, on_epoch=True)
    loss = {'loss': entropy}

    return loss

Any advice on what happened?
Thank you very much!

Hi @Wen_C, when you initialize nn.Linear inside training_step, it will not be moved to the correct device by PL, so its weights stay on the CPU while your activations are on cuda:0. Also, is there any specific reason to create it in training_step? I would recommend creating any modules inside the __init__ method so that you can take advantage of all the features of PL.
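For example, here is a minimal sketch of what that could look like (assuming your existing drop_na helper and a feature size of 50; the backbone and its input size of 100 are just placeholders for your real model):

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class MultiHeadModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Placeholder feature extractor: replace with your real layers.
        self.backbone = nn.Linear(in_features=100, out_features=50)
        # Registering the heads here lets Lightning move them to the same
        # device as the rest of the module and train their weights.
        self.out_app = nn.Linear(in_features=50, out_features=17)
        self.out_tra = nn.Linear(in_features=50, out_features=12)
        self.out_all = nn.Linear(in_features=50, out_features=6)

    def forward(self, x):
        return self.backbone(x)

    def training_step(self, batch, batch_idx):
        x = batch['feature'].float()
        y_app = batch['app_label'].long()
        x_tra_all, y_tra_all = drop_na(batch)  # helper from your snippet

        app_out = self(x)
        tra_all_out = self(x_tra_all)

        # The heads already live on self.device, so no manual .to() / .type_as() is needed.
        entropy_app = F.cross_entropy(self.out_app(app_out), y_app)
        entropy_tra = F.cross_entropy(self.out_tra(tra_all_out), y_tra_all)
        entropy_all = F.cross_entropy(self.out_all(tra_all_out), y_tra_all)
        entropy = (entropy_app + entropy_tra + entropy_all) / 3.0

        self.log('training_loss', entropy, prog_bar=True, logger=True, on_step=True, on_epoch=True)
        return {'loss': entropy}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Because the heads are registered as submodules in __init__, their weights end up on the same device as the activations, which is what the error is complaining about. It also fixes a second problem: a fresh nn.Linear created inside training_step is re-initialized on every step and its weights are never seen by the optimizer, so it would not learn anything.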

Also, we have migrated from this forum to Github Discussions so I would recommend you to ask your questions there for quicker response.

Thanks :slight_smile: