Hello,
I am using the PyTorch Lightning framework to train a text-to-text Transformer model (google/mt5-base).
I trained it in 1-, 4-, 5-, and 8-GPU environments using DDP.
However, all of the 8-GPU and 5-GPU training attempts got stuck and failed at the same point in the same epoch (54).
The issue occurred regardless of the num_workers setting in the DataLoader and with different batch sizes (32, 16).
Below is the last log before the hang. Since it appears to be the end of an epoch, I assume training gets stuck while loading data for the next epoch in the 8-GPU or 5-GPU environment.
Epoch 54: 100%|██████████| 2921/2931 [43:38<00:08, 1.12it/s, loss=.., v_num=0]
Epoch 54: 100%|██████████| 2925/2931 [43:41<00:05, 1.12it/s, loss=.., v_num=0]
Validating: 99%|██████████| 280/282 [02:32<00:01, 1.59it/s]
Epoch 54: 100%|██████████| 2931/2931 [44:01<00:00, 1.11it/s, loss=.., v_num=0]
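For context, here is a simplified sketch of the kind of setup I am running (the LightningModule wrapper and the dummy dataset below are placeholders, not my exact code):

```python
# Simplified sketch of the setup (placeholder names and dummy data, not my exact code).
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import MT5ForConditionalGeneration, MT5Tokenizer


class ToyTextToTextDataset(Dataset):
    """Stand-in for my real tokenized text-to-text dataset."""

    def __init__(self, tokenizer, size=128, max_len=32):
        self.examples = []
        for i in range(size):
            src = tokenizer(f"translate: example {i}", max_length=max_len,
                            padding="max_length", truncation=True, return_tensors="pt")
            tgt = tokenizer(f"beispiel {i}", max_length=max_len,
                            padding="max_length", truncation=True, return_tensors="pt")
            self.examples.append({
                "input_ids": src.input_ids.squeeze(0),
                "attention_mask": src.attention_mask.squeeze(0),
                # In real training, pad token ids in labels should be replaced with -100.
                "labels": tgt.input_ids.squeeze(0),
            })

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


class MT5FineTuner(pl.LightningModule):
    def __init__(self, model_name="google/mt5-base", lr=1e-4):
        super().__init__()
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        return self.model(**batch).loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self.model(**batch).loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)


if __name__ == "__main__":
    tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
    train_loader = DataLoader(ToyTextToTextDataset(tokenizer), batch_size=16,
                              shuffle=True, num_workers=4)
    val_loader = DataLoader(ToyTextToTextDataset(tokenizer, size=32), batch_size=16,
                            num_workers=4)

    trainer = pl.Trainer(
        gpus=8,              # also ran with 1, 4, and 5 GPUs
        accelerator="ddp",   # strategy="ddp" in newer Lightning versions
        max_epochs=100,
    )
    trainer.fit(MT5FineTuner(), train_loader, val_loader)
```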
Any comment or suggestion would be appreciated.
Thank you.
(Note: I also posted this question to the PyTorch Forum. Since I am using PyTorch Lightning, I am posting it here as well.)