Training on TPU

Hi all, I created this blog post (and Colab notebook) using Lightning, but I could not get it to run on TPUs just by setting tpu_cores=8. It works perfectly fine on GPUs.
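For reference, this is essentially the only change between the two runs (MyLightningModule and the rest of the setup are stand-ins for my actual code):

import pytorch_lightning as pl

model = MyLightningModule()  # stand-in for my actual model

# this run works perfectly fine
trainer = pl.Trainer(gpus=1)
trainer.fit(model)

# switching to TPU is the only change, and this fails
trainer = pl.Trainer(tpu_cores=8)
trainer.fit(model)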

Is there anything you can suggest for converting it to run on TPUs?

My first thought was that calling the tokenizer inside training_step was where I went wrong, but it’s probably not what’s actually going wrong.

from typing import List, Tuple

import torch

def common_step(self, batch: Tuple[torch.Tensor, List[str]]) -> torch.Tensor:
    images, text = batch
    device = images.device
    # tokenize on the fly and move the token tensors onto the same device as the images
    text_dev = {k: v.to(device) for k, v in self.tokenizer(text).items()}
    ...

I’m using images.device rather than specifying a device explicitly, since I won’t know which of the 8 TPU cores I’m on.

I also tried moving the tokenizing step inside my Dataset class without much luck either, unless I did that wrong.
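In case I just did it wrong, here’s roughly what that attempt looked like (simplified; MyDataset, the stored images, and the tokenizer call are stand-ins for my actual code):

from typing import List, Tuple

import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, images: List[torch.Tensor], texts: List[str], tokenizer):
        self.images = images
        self.texts = texts
        self.tokenizer = tokenizer

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, dict]:
        # tokenize per sample here instead of in training_step, so the batch
        # already contains tensors and Lightning can move them to the right core
        tokens = self.tokenizer(self.texts[idx])
        return self.images[idx], tokens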

Any thoughts on how to debug this as well? TPU mode doesn’t give me an understandable stack trace to follow.
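For example, would running on a single core first be a sensible way to get a cleaner traceback? Something like this (just a guess on my part):

import pytorch_lightning as pl

# single core: no multiprocessing across 8 cores, so hopefully the real
# exception surfaces as a readable traceback
trainer = pl.Trainer(tpu_cores=1)
trainer.fit(model)  # model as in the earlier sketch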