0/1% GPU Utilization when using 1 GPU, but Higher GPU Utilization with 2+ GPUs

Hello,

I have a tricky bug where training on just 1 GPU gets little to no utilization, but using 2+ GPUs (with DDP) gives good utilization. However, I notice that it runs N-1 GPUs at high utilization for a while and only occasionally touches the last one, at low percentages (~5-10% compared to ~80% on the other N-1 GPUs). My CPUs are usually all maxed out while this is happening, and in fact in all of these scenarios.

When running with 1 GPU I am sure it is actually using the GPU, because I see "GPU available: True, used: True" in the logs, and when I inspect the tensors the data really is on the GPU. So the GPU is being used, but its utilization is very low.
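The kind of inspection I mean is roughly this (a minimal sketch; the function name and printed labels are just for illustration, not code from my project):

```python
import torch

def check_devices(model, batch):
    # Sanity check: confirm the batch and the model parameters actually
    # live on the GPU, not just that CUDA is available.
    x, _ = batch
    print("cuda available:", torch.cuda.is_available())
    print("batch device:  ", x.device)                         # -> cuda:0
    print("param device:  ", next(model.parameters()).device)  # -> cuda:0
```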

Also, on 2+ GPUs utilization drops to 0/1% during validation, which I don't think is the expected behavior (unless I am wrong). I also see that only about half of my CPUs are doing work during validation, versus all of them being maxed out while GPU utilization is high.

I’ve tried:

  • Changing the optimizer from Adam to SGD
  • Increasing prefetch_factor (this only affects 2+ GPUs; see the DataLoader sketch after this list)
  • Increasing the batch size (this also only affects 2+ GPUs)
  • Checking GPU memory usage, which wasn't huge (consistently around 900-1000 MiB per card, on RTX 2080 Tis with 11019 MiB available each)
  • Turning off all data transforms, to see whether they, or the CPU, were the bottleneck
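For reference, the DataLoader settings I have been varying look roughly like this. This is a minimal sketch: the dataset is a stand-in and the concrete numbers are placeholders, not my real configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for my custom Dataset; the numbers
# below are illustrative, not my real configuration.
dataset = TensorDataset(torch.randn(10_000, 32),
                        torch.randint(0, 2, (10_000, 1)).float())

train_loader = DataLoader(
    dataset,
    batch_size=256,           # tried increasing this (only helped with 2+ GPUs)
    num_workers=8,            # CPU worker processes feeding the GPU
    prefetch_factor=4,        # tried increasing this (only helped with 2+ GPUs)
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
)
```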

I am using pytorch-lightning=1.6.5, pytorch=1.12.1+cu102 (since my machine has CUDA 10.2), and python=3.9.0. My machine has 4 GPUs (each an NVIDIA RTX 2080 Ti with 11 GB of memory) and 32 CPU cores with 4 GB of RAM each, for a total of 128 GB of system memory.
My code uses a custom dataloader, dataset, and datamodule.
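For a rough idea of the structure (this is a simplified sketch with placeholder names, not my actual code), the datamodule wires the custom datasets into train/val loaders like this:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class MyDataModule(pl.LightningDataModule):  # placeholder name
    def __init__(self, train_ds: Dataset, val_ds: Dataset, batch_size: int = 256):
        super().__init__()
        self.train_ds, self.val_ds = train_ds, val_ds
        self.batch_size = batch_size

    def train_dataloader(self):
        # The custom DataLoader wiring lives here in my real code.
        return DataLoader(self.train_ds, batch_size=self.batch_size,
                          shuffle=True, num_workers=8, pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size,
                          num_workers=8, pin_memory=True)
```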
The model looks like this (you can ignore the moduledict):

  | Name      | Type              | Params
------------------------------------------------
0 | ewmetrics | ModuleDict        | 0     
1 | loss      | BCEWithLogitsLoss | 0     
2 | encoder   | ModuleList        | 47.6 K
3 | decoder   | ModuleList        | 47.7 K
------------------------------------------------

This is an important problem for me, as good utilization really speeds up my training. Using only CPUs makes each epoch rather slow, and with little to no GPU utilization I get pretty much the same performance. When I use 3-4 GPUs my epochs run at a more acceptable speed, but I cannot tune my model with multi-GPU training (for many reasons I don't want to get into here, but basically framework limitations across all of my options). I need to run hundreds of experiments, many of them with tuning.

I can provide code and some basic fake data here; however, note that the behavior on this toy dataset differs from the real dataset (it is somehow even slower and less computationally efficient than the real one). I am still working out why that is.