Hello,
I am using the PyTorch Lightning framework to train a text-to-text Transformer model (google/mt5-base).
I trained it in 1-, 4-, 5-, and 8-GPU environments using DDP.
However, all of the 8-GPU and 5-GPU training attempts got stuck and failed at the same point in the same epoch (54).
The issue occurred regardless of the num_workers setting in the DataLoader and with different batch sizes (32, 16).
Below is the last log before the hang. Since it appears to be the end of an epoch, I assume training gets stuck while loading data for the next epoch in the 8-GPU or 5-GPU environment.
Epoch 54: 100%|██████████| 2921/2931 [43:38<00:08, 1.12it/s, loss=.., v_num=0]
Epoch 54: 100%|██████████| 2925/2931 [43:41<00:05, 1.12it/s, loss=.., v_num=0]
Validating: 99%|██████████| 280/282 [02:32<00:01, 1.59it/s]
Epoch 54: 100%|██████████| 2931/2931 [44:01<00:00, 1.11it/s, loss=.., v_num=0]
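For context, here is a simplified sketch of the kind of setup I am running (the LightningModule wrapper and the dummy dataset below are placeholders, not my exact code):

```python
# Simplified sketch of the setup (placeholder names and dummy data, not my exact code).
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import MT5ForConditionalGeneration, MT5Tokenizer


class ToyTextToTextDataset(Dataset):
    """Stand-in for my real tokenized text-to-text dataset."""

    def __init__(self, tokenizer, size=128, max_len=32):
        self.examples = []
        for i in range(size):
            src = tokenizer(f"translate: example {i}", max_length=max_len,
                            padding="max_length", truncation=True, return_tensors="pt")
            tgt = tokenizer(f"beispiel {i}", max_length=max_len,
                            padding="max_length", truncation=True, return_tensors="pt")
            self.examples.append({
                "input_ids": src.input_ids.squeeze(0),
                "attention_mask": src.attention_mask.squeeze(0),
                # In real training, pad token ids in labels should be replaced with -100.
                "labels": tgt.input_ids.squeeze(0),
            })

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


class MT5FineTuner(pl.LightningModule):
    def __init__(self, model_name="google/mt5-base", lr=1e-4):
        super().__init__()
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        return self.model(**batch).loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self.model(**batch).loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)


if __name__ == "__main__":
    tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
    train_loader = DataLoader(ToyTextToTextDataset(tokenizer), batch_size=16,
                              shuffle=True, num_workers=4)
    val_loader = DataLoader(ToyTextToTextDataset(tokenizer, size=32), batch_size=16,
                            num_workers=4)

    trainer = pl.Trainer(
        gpus=8,              # also ran with 1, 4, and 5 GPUs
        accelerator="ddp",   # strategy="ddp" in newer Lightning versions
        max_epochs=100,
    )
    trainer.fit(MT5FineTuner(), train_loader, val_loader)
```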
Any comment or suggestion would be appreciated.
Thank you.
(Note: I also posted this question to the PyTorch Forum. Since I am using PyTorch Lightning, I am posting it here as well.)