Multi-GPU training crashes after some time due to NVLink error (Xid 74)

I want to train my model on a dual-GPU setup using Trainer(gpus=2, strategy='ddp'). To my understanding, Lightning sets up distributed training under the hood. Training starts as expected, but after a few iterations one of my GPUs crashes: nvidia-smi lists the GPU as "GPU is lost", and syslog shows Xid error 74, which according to the NVIDIA documentation indicates a fatal NVLink error on all four links. Shortly after, the GPU "has fallen off the bus" and only a hard reset restores my system. When using only one GPU, training does not crash. Is this a problem with Lightning or with my system?
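In case it helps to narrow things down, a stripped-down version of my setup looks roughly like this (the model and data here are placeholders, not my actual code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    train_loader = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    # Lightning launches one DDP process per GPU under the hood.
    trainer = pl.Trainer(gpus=2, strategy="ddp", max_epochs=1)
    trainer.fit(ToyModel(), train_loader)
```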

Thank you in advance

System:
2x RTX 3090 with NVLink bridge, 4 links with 14.062 GB/s bandwidth each (reported by nvidia-smi nvlink -s; see the snippet below)
Ubuntu 22.04, CUDA 11.7.99 with cuDNN 8.5.1, NCCL 2.14.3
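For completeness, this is roughly how I collect the NVLink status and the Xid entries from the kernel log (just a sketch; it assumes nvidia-smi is on the PATH, and reading the kernel log via dmesg may require sudo on some systems):

```python
import subprocess


def run(cmd: str) -> str:
    """Run a shell command and return its stdout as text."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Per-link NVLink status; the 14.062 GB/s per-link figure comes from here.
    print(run("nvidia-smi nvlink -s"))
    # NVIDIA driver Xid messages in the kernel log, e.g. "NVRM: Xid ...: 74, ...".
    print(run("dmesg | grep -i xid"))
```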

Hey @MathiesW

This is almost certainly not an issue with Lightning or PyTorch. I've seen this happen multiple times in the past, and it was always due to a faulty or aging GPU.

Hi @awaelchli, thank you for responding. I ran a CUDA memory test without any errors, so I was hoping it wasn't my valuable hardware :frowning:

Well, I will have another look into it and test my code on a similar system once I get the chance.