Hi,
I am very new to PyTorch Lightning and to deep learning as well! I am converting a PyTorch project to Lightning. On Google Colab, when I run the trainer on CPU or GPU, it trains the model as expected; I have not inspected the output model yet, but it clearly does something. The batch size finder and the initial learning rate finder both work, and fast_dev_run also runs smoothly.
But when I try to run it on a TPU, it hangs at:
Epoch 0: 0% 0/2 [00:00<?, ?it/s]
I have tried with and without fast_dev_run, with 1 and with 8 TPU cores, and with a batch_size of 32 and of 2, but it always hangs there. I let it run for 45 minutes and it was still stuck. How can I find out where the code is hanging and what I need to change?
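In case it helps, here is a simplified sketch of the kind of setup I am running; the toy model and random data below are only placeholders standing in for my real LightningModule and DataLoader:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitModel(pl.LightningModule):
    # Tiny placeholder module; my real model is more involved.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random placeholder data in place of my real DataLoader.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
    batch_size=32,  # also tried batch_size=2
)

trainer = pl.Trainer(
    tpu_cores=8,        # also tried tpu_cores=1
    fast_dev_run=True,  # also tried without fast_dev_run
)
trainer.fit(LitModel(), train_loader)
```

The equivalent setup trains fine on CPU and GPU; it is only the TPU run that hangs at epoch 0.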
Thank you very much for your help!