Hey, has anyone else observed odd performance when training models on multiple GPUs? I've written a script that trains on a toy dataset (in this case cats vs. dogs) using a ResNet or EfficientNet. The script works fine locally on a single GPU. However, when I move it to the cloud and train on multiple GPUs, strange things start to happen. On a single cloud GPU it trains fine, albeit slowly, since I was testing on an M60. But if I run the same script on 4x K80 with DDP, training is roughly 15% slower (which I'm guessing is down to the difference between a K80 and an M60).
I checked GPU usage and all four GPUs are being used. However, model performance seems slower/worse than with just one GPU. Any ideas why this could be?
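In case it helps, here is a stripped-down sketch of the kind of thing the multi-GPU path is doing, assuming plain PyTorch DistributedDataParallel with a torchvision ResNet. The data path, batch size, and hyperparameters are placeholders rather than my exact values:

```python
# Minimal DDP sketch -- placeholder paths/hyperparameters, not the exact script.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, models, transforms


def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    tfm = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    # "data/train" is a placeholder ImageFolder with cats/ and dogs/ subfolders
    train_ds = datasets.ImageFolder("data/train", transform=tfm)
    sampler = DistributedSampler(train_ds)         # shards the data across ranks
    loader = DataLoader(train_ds, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    model = models.resnet50(num_classes=2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for epoch in range(10):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        model.train()
        for images, labels in loader:
            images = images.cuda(local_rank, non_blocking=True)
            labels = labels.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                        # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On the 4x K80 box this would be launched with something like `torchrun --nproc_per_node=4 train.py`, and as a single process locally.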