DDP strategy only uses the first GPU

I’m trying to train a model on 2 GPUs with the DDP strategy, but it seems like only the first GPU is being used. I have two RTX 8000s on a Slurm node that I’ve started in interactive mode, and CUDA_VISIBLE_DEVICES is [0,1]. When I run with devices=[0,1] set explicitly, my effective batch size doubles, but nvidia-smi indicates that only the first GPU is doing any work. Here’s what I see on the command line:
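
For reference, the Trainer is set up roughly like this (a stripped-down sketch of my run; the real model and data pipeline are more involved):

import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# dummy data just to make the sketch runnable
train_loader = DataLoader(TensorDataset(torch.randn(512, 32), torch.randn(512, 1)), batch_size=64)

trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],   # both RTX 8000s
    strategy="ddp",
    max_epochs=2,
)
trainer.fit(LitModel(), train_loader)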

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[2023-09-17 13:30:03,387][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[13:30:03] INFO     Added key: store_based_barrier_key:1 to store for rank: 0
[2023-09-17 13:30:03,392][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
           INFO     Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

libibverbs: Warning: couldn't load driver 'libqedr-rdmav34.so': libqedr-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libi40iw-rdmav34.so': libi40iw-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libipathverbs-rdmav34.so': libipathverbs-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libhns-rdmav34.so': libhns-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libcxgb4-rdmav34.so': libcxgb4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libocrdma-rdmav34.so': libocrdma-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libbnxt_re-rdmav34.so': libbnxt_re-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmthca-rdmav34.so': libmthca-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libhfi1verbs-rdmav34.so': libhfi1verbs-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libvmw_pvrdma-rdmav34.so': libvmw_pvrdma-rdmav34.so: cannot open shared object file: No such file or directory
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

And here’s what nvidia-smi outputs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                On  | 00000000:27:00.0 Off |                  Off |
| 41%   68C    P2             255W / 260W |  43447MiB / 49152MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                On  | 00000000:C3:00.0 Off |                  Off |
| 33%   24C    P8               4W / 260W |      3MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     41543      C   python                                    43426MiB |
+---------------------------------------------------------------------------------------+

My wandb logging shows the same thing: only the first GPU is being used. How do I resolve this?

In [3]: lightning.__version__
Out[3]: '2.0.6'
patrick.mineault@cn-c014:~$ python --version
Python 3.9.17

I think it’s a CUDA issue, i.e., there was a failure in setting up the GPUs and PyTorch can only pick up one GPU.
I usually add some environment variables to get a read on what’s happening under the hood:

export NCCL_DEBUG=INFO  
export CUDA_LAUNCH_BLOCKING=1

Try adding these before your run. Also, in your Trainer setup, you could set devices="auto" so Lightning uses whatever devices it finds.
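
Something like this, assuming a plain Trainer call (a rough sketch; adjust to your own setup):

import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices="auto",   # pick up every GPU that CUDA exposes
    strategy="ddp",
)

# with NCCL_DEBUG=INFO exported, the NCCL init logs should then show one entry per rank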
I’ve faced the same issue (training on SLURM with RTX 2080s); try restarting the run. If a CUDA error is thrown, it could be a version mismatch or the node may need a reboot.
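
A quick sanity check, independent of Lightning, is to ask PyTorch directly how many devices it can see on the node:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If that prints 1, the problem is below Lightning (CUDA, driver, or the Slurm allocation); if it prints 2, CUDA sees both GPUs and the issue is more likely in how the DDP processes get launched.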

@patrickmineault-1EZg Did you find a solution to this? I am facing the same issue.