Creating torch.Tensor in callback does not use pl_module.device by default

Falco · March 28, 2023, 8:12am

Hello there, I encountered an unexpected error while setting up my callbacks.

When I do not explicitly place tensors on a device when I create them, they end up on the wrong device. While the Trainer in my case runs on cuda:0, a newly created tensor is on cpu by default.

My workaround is to use the pl_module to identify the chosen device, but I wonder whether this is intended.

    def on_train_start(self, trainer, pl_module):
        """when training starts, best value is set to inf"""
        self.best_metric_value = torch.Tensor([float("Inf")]).to(pl_module.device)

awaelchli · March 29, 2023, 9:37am

The code you posted is correct. Newly created tensors land on CPU by default. This is the default in PyTorch as well; Lightning doesn’t change that. When you create tensors, you can also pass the device argument directly, which avoids the host-to-device transfer. Example: torch.rand(2, 2, device=pl_module.device)
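To make that concrete, here is a minimal sketch of the callback hook from the first post with the tensor allocated directly on the module's device. The class name `BestMetricTracker` and the `SimpleNamespace` stand-in for `pl_module` are assumptions for illustration only; in real code the class would subclass Lightning's `Callback` and the Trainer would pass in the actual module.

```python
from types import SimpleNamespace

import torch


class BestMetricTracker:  # hypothetical name; a real callback would subclass Lightning's Callback
    def on_train_start(self, trainer, pl_module):
        """When training starts, the best value is set to inf."""
        # Allocate on the target device right away instead of creating
        # the tensor on CPU and copying it over with .to(...).
        self.best_metric_value = torch.tensor(
            [float("inf")], device=pl_module.device
        )


# Stand-in for a LightningModule: only the .device attribute is used here.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
fake_module = SimpleNamespace(device=device)

callback = BestMetricTracker()
callback.on_train_start(trainer=None, pl_module=fake_module)
print(callback.best_metric_value.device)
```

On a machine without a GPU the sketch falls back to CPU, so it runs either way; the point is only that the tensor ends up wherever `pl_module.device` says.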

Falco · March 29, 2023, 10:30am

Thanks for the reply!

Yes, I fixed it myself. I was just wondering because I use the callbacks within the Trainer structure and expected everything to be moved onto the right device. Maybe this is the wrong place to discuss that and I should open a GitHub issue.

Nice to know that I do not have to use to(pl_module.device) but can set the device at tensor construction already. Also, I should use lowercase "torch.tensor" instead of "torch.Tensor", because the latter is a legacy constructor, as mentioned here
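A small sketch of how the two constructors differ in practice, assuming PyTorch's default dtype has not been changed from float32. The variable names are only for illustration.

```python
import torch

# torch.tensor infers the dtype from the data it is given.
inferred = torch.tensor([1, 2])
print(inferred.dtype)  # integer data stays integer

# torch.Tensor ignores the input dtype and uses the default float type.
legacy = torch.Tensor([1, 2])
print(legacy.dtype)

# With a single int argument the two mean different things:
scalar = torch.tensor(3)         # 0-dim tensor holding the value 3
uninitialized = torch.Tensor(3)  # 1-dim tensor of size 3 with arbitrary contents
print(scalar.shape, uninitialized.shape)
```

The last case is the one most likely to cause surprises, since `torch.Tensor(3)` treats the argument as a size rather than as data.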

awaelchli · March 29, 2023, 11:12am

If you think this is beneficial to Lightning users, yes, feel free to open a feature request. For this one, it would probably require some convincing, though.

I personally don’t think the behavior should be changed. It would probably break a lot of code and since it is not the default behavior in PyTorch itself, it could throw people off.
