Multi-GPU training is much slower than single GPU (due to additional processes?)

When I use a single GPU, the training time for one epoch is about 5 h. However, when I use two ranks, the time extends to about 12.5 h. Both the ddp and deepspeed strategies show this issue, so I suspect it is an intrinsic problem of PyTorch Lightning rather than of a particular strategy.
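
For context, here is a minimal sketch of how the runs are configured. `MyModel` and `MyDataModule` are placeholders for my actual LightningModule and LightningDataModule, and `max_epochs` is illustrative; the only thing I change between the single-GPU and 2-GPU runs is `devices`/`strategy`:

```python
import pytorch_lightning as pl

model = MyModel()            # placeholder for my LightningModule
datamodule = MyDataModule()  # placeholder for my LightningDataModule

# Single-GPU run (~5 h / epoch)
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=10)

# 2-GPU run (~12.5 h / epoch); the same slowdown appears with strategy="deepspeed"
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=10)

trainer.fit(model, datamodule=datamodule)
```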

Something strange: when I use a single GPU, nvidia-smi shows only one process, but for 2-GPU training there are two processes on each rank. My guess is that the extra process is used for information transfer between ranks.
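
To tell which of the two processes per GPU is the actual training process, I print the PID and rank at the start of training (a small diagnostic sketch; `on_train_start` is a standard LightningModule hook, and the print format is just my choice):

```python
import os
import pytorch_lightning as pl

class MyModel(pl.LightningModule):  # placeholder, same as above
    def on_train_start(self):
        # global_rank and world_size are attributes Lightning provides
        print(f"pid={os.getpid()} global_rank={self.global_rank} "
              f"world_size={self.trainer.world_size}")
```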
If I use a large batch size, a CUDA out-of-memory error is reported. For single-GPU training, the run simply terminates. For 2-GPU training, however, only one of the two processes on each rank exits while the other keeps running, and afterwards training continues normally at a much faster speed (about 2.5 h/epoch), which seems to be the expected speed (half of the single-GPU time).

It looks like the additional process is what limits the speed of multi-GPU training. Is its existence normal? If so, what is its function? Is there any way to avoid it? (Obviously I don't want to kill it by triggering a CUDA out-of-memory error manually.)