Hello there, I encountered an unexpected error while setting up my callbacks.
If I don't handle the device manually when initializing tensors at the start, they land on the wrong device: while the Trainer in my case runs on cuda:0, a newly initialized tensor is on cpu by default.
My workaround is to use pl_module to identify the chosen device, but I was wondering whether this is intended behavior?
def on_train_start(self, trainer, pl_module):
    """When training starts, the best value is set to inf."""
    self.best_metric_value = torch.Tensor([float("Inf")]).to(pl_module.device)
The code you posted is correct. Newly created tensors land on CPU by default; that is the default in PyTorch itself, Lightning doesn't change it. When you create tensors, you can also just specify the device argument so you avoid the host-to-device transfer, e.g. torch.rand(2, 2, device=pl_module.device).
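For example, something like this (a minimal sketch; it assumes a CUDA device is available, inside a callback you would pass pl_module.device instead):

import torch

# Newly created tensors land on the CPU by default,
# no matter where the Trainer / LightningModule lives.
x = torch.rand(2, 2)
print(x.device)  # cpu

# Passing device= at construction creates the tensor on the
# target device directly, so no extra host-to-device copy is needed.
y = torch.rand(2, 2, device="cuda:0")
print(y.device)  # cuda:0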
Yes, I fixed it myself; I was just wondering because I use the callbacks within the Trainer structure and expected everything to be moved onto the right device. Maybe this is the wrong place to discuss it and I should open a GitHub issue instead.
Good to know that I don't have to use .to(pl_module.device) but can set the device at tensor construction already. I should also use lowercase torch.tensor instead of torch.Tensor, since the latter is a legacy constructor, as mentioned here
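Putting both together, my callback would now look roughly like this (the class name is just for illustration):

import torch
from pytorch_lightning.callbacks import Callback

class BestMetricCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        """When training starts, the best value is set to inf."""
        # torch.tensor (lowercase) is the recommended factory function,
        # and device= creates the tensor on the module's device directly.
        self.best_metric_value = torch.tensor(
            [float("inf")], device=pl_module.device
        )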
If you think this would be beneficial to Lightning users, yes, feel free to open a feature request. This one would probably require some convincing, though.
I personally don't think the behavior should be changed. It would probably break a lot of code, and since it is not the default behavior in PyTorch itself, it could throw people off.