0/1% GPU Utilization when using 1 GPU, but Higher GPU Utilization with 2+ GPUs

Hello,

I have a tricky bug where training on just 1 GPU gets little to no utilization, but using 2+ GPUs (with DDP) gives good utilization. However, I notice that it runs N-1 GPUs at high utilization for a while and only occasionally touches the last one, at low percentages (~5-10% compared to ~80% on the other N-1 GPUs). My CPUs are usually all maxed out while this is happening, and in fact in all of these scenarios.

When running with 1 GPU I am sure it is actually using the GPU, because I see "GPU available: True, used: True" in the logs, and when I inspect the tensors the data really is on the GPU. So the GPU is being used, but its utilization is very low.
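The kind of inspection I mean is roughly this (a minimal sketch; the function name and printed labels are just for illustration, not code from my project):

```python
import torch

def check_devices(model, batch):
    # Sanity check: confirm the batch and the model parameters actually
    # live on the GPU, not just that CUDA is available.
    x, _ = batch
    print("cuda available:", torch.cuda.is_available())
    print("batch device:  ", x.device)                         # -> cuda:0
    print("param device:  ", next(model.parameters()).device)  # -> cuda:0
```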

Also, on 2+ GPUs utilization drops to 0/1% during validation, which I don't think is the expected behavior (unless I am wrong). I also see that only about half of my CPUs are doing work during validation, versus all of them being maxed out while GPU utilization is high.

I’ve tried:

  • Changing the optimizer from Adam to SGD
  • Increasing prefetch_factor (this only affects 2+ GPUs; see the DataLoader sketch after this list)
  • Increasing the batch size (this also only affects 2+ GPUs)
  • Checking GPU memory usage, which wasn't huge (consistently around 900-1000 MiB per card, on RTX 2080 Tis with 11019 MiB available each)
  • Turning off all data transforms, to see whether they, or the CPU, were the bottleneck
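For reference, the DataLoader settings I have been varying look roughly like this. This is a minimal sketch: the dataset is a stand-in and the concrete numbers are placeholders, not my real configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for my custom Dataset; the numbers
# below are illustrative, not my real configuration.
dataset = TensorDataset(torch.randn(10_000, 32),
                        torch.randint(0, 2, (10_000, 1)).float())

train_loader = DataLoader(
    dataset,
    batch_size=256,           # tried increasing this (only helped with 2+ GPUs)
    num_workers=8,            # CPU worker processes feeding the GPU
    prefetch_factor=4,        # tried increasing this (only helped with 2+ GPUs)
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
)
```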

I am using pytorch-lightning=1.6.5, pytorch=1.12.1+cu102 (since my machine has CUDA 10.2), and python=3.9.0. My machine has 4 GPUs (each an NVIDIA RTX 2080 Ti with 11 GB of memory) and 32 CPU cores with 4 GB of RAM each, for a total of 128 GB of system memory.
My code uses a custom dataloader, dataset, and datamodule.
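For a rough idea of the structure (this is a simplified sketch with placeholder names, not my actual code), the datamodule wires the custom datasets into train/val loaders like this:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class MyDataModule(pl.LightningDataModule):  # placeholder name
    def __init__(self, train_ds: Dataset, val_ds: Dataset, batch_size: int = 256):
        super().__init__()
        self.train_ds, self.val_ds = train_ds, val_ds
        self.batch_size = batch_size

    def train_dataloader(self):
        # The custom DataLoader wiring lives here in my real code.
        return DataLoader(self.train_ds, batch_size=self.batch_size,
                          shuffle=True, num_workers=8, pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size,
                          num_workers=8, pin_memory=True)
```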
The model looks like this (you can ignore the moduledict):

  | Name      | Type              | Params
------------------------------------------------
0 | ewmetrics | ModuleDict        | 0     
1 | loss      | BCEWithLogitsLoss | 0     
2 | encoder   | ModuleList        | 47.6 K
3 | decoder   | ModuleList        | 47.7 K
------------------------------------------------

This is an important problem for me, as good utilization really speeds up my training. Using only CPUs makes each epoch rather slow, and with little to no GPU utilization I get pretty much the same performance. When I use 3-4 GPUs my epochs run at a more acceptable speed, but I cannot tune my model with multi-GPU training (for many reasons I don't want to get into here, but basically framework limitations across all of my options). I need to run hundreds of experiments, many of them with tuning.

I can provide code and some basic fake data here; however, note that the behavior on this toy dataset differs from the real dataset (it is somehow even slower and less computationally efficient than the real one). I am still working out why that is.