I am using 7 RTX 3090s to train a ResNet-56 on ImageNet with PyTorch's native DDP. According to the data sheet, a single RTX 3090 has a peak FP32 throughput of about 35 TFLOP/s, and a single 224x224 ImageNet forward pass takes about 0.025 TFLOPs. When I run my PyTorch Lightning training script, I see roughly 570 images per second being processed by the trainer (that means forward-backward pass plus gradient update).

Given these numbers, the compute actually delivered by these 7 RTX 3090s is only about 28.5 TFLOP/s (570 images/s * 0.025 TFLOPs/image * 2 for forward+backward). According to my math, the equivalent number of GPUs I am using is barely one. I am not sure if this is the right way to check whether I am fully utilizing my GPUs. Any help on this issue is appreciated.
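For reference, this is the back-of-envelope calculation I am doing, written out as a small script (the 2x forward+backward multiplier is my rough assumption; I have seen ~3x used too):

```python
# My measured numbers and data-sheet figures.
PEAK_TFLOPS_PER_GPU = 35.0   # RTX 3090 FP32 peak (data sheet)
NUM_GPUS = 7
TFLOPS_PER_IMAGE = 0.025     # one 224x224 forward pass
IMAGES_PER_SEC = 570         # measured training throughput
FWD_BWD_FACTOR = 2           # assumption: backward ~ one extra forward

achieved = IMAGES_PER_SEC * TFLOPS_PER_IMAGE * FWD_BWD_FACTOR  # ~28.5 TFLOP/s
peak = PEAK_TFLOPS_PER_GPU * NUM_GPUS                          # 245 TFLOP/s

print(f"achieved: {achieved:.1f} TFLOP/s")
print(f"utilization of peak: {achieved / peak:.1%}")            # ~11.6%
print(f"GPU-equivalents: {achieved / PEAK_TFLOPS_PER_GPU:.2f}") # ~0.81
```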
A little more info:
nvidia-smi tells me that my GPU utilization stays close to 100% (97-99%) most of the time. num_workers is set to 28 (4 workers per GPU). I have 56 CPU cores on my system.
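I'm also wondering whether profiling a few training steps with torch.profiler would give a better picture than nvidia-smi, since as far as I understand nvidia-smi can report high utilization even when the kernels being run are small. A self-contained sketch of what I mean, using torchvision's resnet18 and random tensors as stand-ins for my actual model and data pipeline:

```python
import torch
from torch import nn, optim
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet18  # stand-in; my real model is a ResNet-56

device = "cuda"
model = resnet18(num_classes=1000).to(device)
opt = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Assumption: dummy batch of 64 to mimic my real input shape.
x = torch.randn(64, 3, 224, 224, device=device)
y = torch.randint(0, 1000, (64,), device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):  # a few steps so the trace is representative
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# If the table is dominated by conv/GEMM kernels, the GPU is doing real math;
# lots of tiny copy/memory ops would instead suggest an input-pipeline stall.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```

Even with a trace like this, I am not sure how to map the profiler output back to a FLOP/s figure, so any pointers there would also help.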