Multiple CPUs do not communicate under the DDP strategy.

andreyvlasenko2006 · September 29, 2023, 9:13pm

Dear all,

I am trying to parallelize my training on a single node with 48 CPUs using Lightning version 2.04. I create the trainer with trainer = pl.Trainer(strategy="ddp", devices=48, accelerator="cpu").

According to the DDP documentation, during training each CPU process receives part of a training batch (the batch is split into sub-batches, one per CPU), computes the corresponding gradient, and places it in a bucket. Once all CPUs have finished, the trainer takes the average of these gradients as the gradient of the batch.
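To make the averaging step I am describing concrete, here is a minimal pure-Python sketch. It is not Lightning's or PyTorch's actual implementation: it assumes a hypothetical one-parameter model y = w * x with a mean-squared-error loss and equal-sized shards, and the function names are made up for illustration. Under those assumptions, the average of the per-shard gradients equals the gradient on the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def averaged_shard_grad(w, xs, ys, world_size):
    """Split the batch into equal shards (one per worker) and average their gradients."""
    assert len(xs) % world_size == 0, "this sketch assumes equal-sized shards"
    shard = len(xs) // world_size
    grads = [
        mse_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    return sum(grads) / world_size


# Toy batch (hypothetical data generated with a true weight of 2.0).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                          # gradient on the whole batch
avg = averaged_shard_grad(w, xs, ys, world_size=4)  # mean of 4 per-worker gradients
print(full, avg)
```

The equality relies on the shards being the same size; with unequal shards a plain mean of the per-shard gradients would weight the samples differently.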

In my case, there is no averaging. Each CPU processes its part of the data independently of the others. The result of training is the same as if I had trained my model on a single CPU using a training set 48 times smaller.
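One way to tell the two situations apart is to compare the weights across processes after a few steps: with gradient averaging every replica should hold identical weights, whereas independent training lets them drift apart. The pure-Python simulation below illustrates the difference. It is only a sketch with a hypothetical one-parameter model, made-up data and helper names, and no Lightning code.

```python
def grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def train(shards, steps, lr, sync):
    """Run gradient descent with one replica per shard.

    sync=True  : every replica applies the mean gradient (the behaviour expected from DDP).
    sync=False : every replica applies only its own gradient (no communication).
    """
    weights = [0.0] * len(shards)
    for _ in range(steps):
        grads = [grad(w, xs, ys) for w, (xs, ys) in zip(weights, shards)]
        if sync:
            mean_g = sum(grads) / len(grads)
            grads = [mean_g] * len(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Two hypothetical shards whose best-fit weights differ slightly.
shards = [
    ([1.0, 2.0], [2.1, 3.9]),
    ([1.0, 2.0], [2.5, 5.2]),
]

synced = train(shards, steps=20, lr=0.05, sync=True)
independent = train(shards, steps=20, lr=0.05, sync=False)
print("synced:     ", synced)
print("independent:", independent)
```

In a real run, the analogous check would be to print a checksum of the model parameters together with the process rank on every process and see whether the values agree.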

Does anyone have an idea what I am doing wrong and how to parallelize training across CPUs efficiently?
