I am trying to parallelize training on a single node with 48 CPUs using Lightning version 2.04. I use
trainer = pl.Trainer(strategy="ddp", devices=48, accelerator="cpu")
According to the DDP documentation, during the training phase each CPU gets a part of every training batch (the batch is split into sub-batches, one per CPU), computes the corresponding gradient, and puts it into a bucket. After all CPUs finish computing, the trainer estimates the batch's gradient as the average of these per-CPU gradients.
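To make the behavior I expect concrete, here is a minimal pure-Python sketch of that averaging step. It is only an illustration under my own assumptions: a toy one-parameter model `y_hat = w * x` with a mean-squared-error loss, equal-size shards, and hypothetical helper names (`grad_mse`, `split_evenly`, `averaged_gradient`) that are not part of Lightning or PyTorch.

```python
# Toy illustration of gradient averaging across workers.
# Assumptions: one-parameter model y_hat = w * x, MSE loss, equal-size shards.
# All function names here are hypothetical helpers, not library API.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def split_evenly(items, num_workers):
    """Split a list into num_workers contiguous shards of equal size."""
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]

def averaged_gradient(w, xs, ys, num_workers):
    """Each worker computes a local gradient; the results are averaged."""
    x_shards = split_evenly(xs, num_workers)
    y_shards = split_evenly(ys, num_workers)
    local_grads = [grad_mse(w, xs_k, ys_k)
                   for xs_k, ys_k in zip(x_shards, y_shards)]
    return sum(local_grads) / num_workers

xs = [float(i) for i in range(1, 9)]   # 8 samples, divisible by 4 workers
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)                         # single-process gradient
avg = averaged_gradient(w, xs, ys, num_workers=4)  # averaged shard gradients
first_shard_only = grad_mse(w, xs[:2], ys[:2])     # one worker, no averaging
```

With equal-size shards, `avg` matches `full` up to floating-point error, whereas `first_shard_only` does not. The second case is what my training looks like: every worker appears to act on its own local gradient.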
In my case, there is no averaging. Each CPU processes its part of the data independently of the others. The training result is the same as if I had trained the model on a single CPU using a training set 48 times smaller.
Does anyone have an idea what I am doing wrong and how to parallelize training across CPUs efficiently?