I want to train my model on a dual-GPU setup using `Trainer(gpus=2, strategy="ddp")`. To my understanding, Lightning sets up distributed training under the hood. Training starts as expected, but after a few iterations one of my GPUs crashes: `nvidia-smi` lists the GPU as "GPU is lost", and syslog shows Xid error 74, which according to the NVIDIA documentation indicates a fatal NVLink error on all four links. Shortly after, syslog reports "GPU has fallen off the bus", and only a hard reset restores my system. When training on a single GPU, the crash does not occur. Is this a problem with Lightning or with my system?
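For reference, here is a minimal sketch of how I launch training. `MyModel` and the random dataset are placeholders standing in for my actual module and data; the `Trainer` call is exactly what I use:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    """Placeholder module; my real model is larger but trained the same way."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Dummy data in place of my real dataset.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))

if __name__ == "__main__":  # guard required, since DDP relaunches the script per GPU
    trainer = pl.Trainer(gpus=2, strategy="ddp", max_epochs=1)
    trainer.fit(MyModel(), DataLoader(dataset, batch_size=64))
```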
Thank you in advance
System:
2x RTX 3090 with NVLink bridge, 4 links at 14.062 GB/s each (per `nvidia-smi nvlink -s`)
Ubuntu 22.04, CUDA 11.7.99 with cuDNN 8.5.1, NCCL 2.14.3
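In case it helps narrow things down, I can also retest with NCCL forced off the peer-to-peer (NVLink) path. A minimal sketch of what I would set before launching (`NCCL_P2P_DISABLE` and `NCCL_DEBUG` are standard NCCL environment variables; they must be set before NCCL initializes):

```python
import os

# Force NCCL to fall back to PCIe/shared memory instead of NVLink P2P.
os.environ["NCCL_P2P_DISABLE"] = "1"
# Log which transports NCCL actually selects, to confirm the fallback.
os.environ["NCCL_DEBUG"] = "INFO"
```

If the crash disappears with P2P disabled, that would point at the NVLink hardware or driver rather than Lightning.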