Hi,
I am very new to PyTorch Lightning and to deep learning as well! I am converting a PyTorch project to Lightning. On Google Colab, when I run the trainer on CPU or GPU, it trains the model as expected; I have not inspected the output model yet, but it clearly does something. The batch size finder and the initial learning rate finder both work, and fast_dev_run also runs smoothly.
But when I try to run it on a TPU, it hangs at:
Epoch 0: 0% 0/2 [00:00<?, ?it/s]
I have tried with and without fast_dev_run, with 1 and with 8 TPU cores, and with a batch_size of 32 and of 2, but it always hangs there. I let it run for 45 minutes and it was still stuck. How can I find out where the code is hanging and what I need to change?
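In case it helps, here is a simplified sketch of the kind of setup I am running; the toy model and random data below are only placeholders standing in for my real LightningModule and DataLoader:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitModel(pl.LightningModule):
    # Tiny placeholder module; my real model is more involved.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random placeholder data in place of my real DataLoader.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
    batch_size=32,  # also tried batch_size=2
)

trainer = pl.Trainer(
    tpu_cores=8,        # also tried tpu_cores=1
    fast_dev_run=True,  # also tried without fast_dev_run
)
trainer.fit(LitModel(), train_loader)
```

The equivalent setup trains fine on CPU and GPU; it is only the TPU run that hangs at epoch 0.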
Thank you very much for your help!