Parallelizing batchsize-1 fully-convolutional training on multiple GPUs (one triplet per GPU)

I am training a fully convolutional siamese model, which runs a stack of 3D convolutions over some data and converts the result into an embedding with AdaptiveAvgPool3d at the end. The fully convolutional nature of it makes it impossible to use batch sizes greater than 1, but it still runs quite efficiently at 100% GPU utilisation.
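
Roughly, the model looks like this (layer sizes and channel counts are just placeholders, not my actual configuration):

import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        # stack of 3D convolutions (placeholder sizes)
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # global pooling makes the output independent of the input volume size
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.head = nn.Linear(64, embedding_dim)

    def forward(self, x):
        # x has shape (1, C, D, H, W) - batch size is fixed at 1
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.head(x)

For a triplet, the same encoder is called three times (anchor, positive, negative) and a triplet loss is computed on the embeddings.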

However, I do have multiple GPUs, so I am wondering: is it possible to split this kind of training across multiple GPUs? The straightforward way would be to have a copy of the model on each GPU, run one sample on each, average the gradients after backprop, and then take an optimizer step. Sadly, I don't quite understand how I would implement this in Lightning - which parallel strategy would be the one to use for this, if any?

Hey @OlfwayAdbayIgbay

Accumulating gradients like you describe is trivial in Lightning: just set Trainer(accumulate_grad_batches=...) to a number (Documentation). For splitting the model, there are actually several approaches, but before you go down that road, I would suggest looking into sharding the optimizer state. A large chunk of GPU memory is actually consumed by the optimizer, and a simple trick that often works here is DeepSpeed stage 2:

Trainer(strategy="deepspeed_stage_2")

Documentation

In many cases, this will free up a few GB on your GPU and then you can fit larger batch sizes. How does that sound?
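
For completeness, here is a rough sketch of what the replica-per-GPU setup you described could look like in the Trainer; the device count, the accumulation factor, and the model / train_loader names are placeholders for your own code:

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                   # one model replica per GPU, one triplet each
    strategy="ddp",              # or "deepspeed_stage_2" to additionally shard optimizer state
    accumulate_grad_batches=4,   # accumulate over 4 steps before each optimizer step
)
trainer.fit(model, train_dataloaders=train_loader)

With DDP, the gradients are averaged across the GPUs on every backward pass, and accumulate_grad_batches additionally accumulates them locally over several steps, so the effective batch size becomes devices * accumulate_grad_batches even though each forward pass still sees a single triplet.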

EDIT: Added links to docs