Parallelizing batchsize-1 fully-convolutional training on multiple GPUs (one triplet per GPU)

Hey @OlfwayAdbayIgbay

Accumulating gradients like you describe is trivial in Lightning: just set Trainer(accumulate_grad_batches=...) to the number of batches you want to accumulate before each optimizer step (Documentation). For splitting the model there are actually several approaches, but before you go down that road I would suggest sharding the optimizer state: a large chunk of GPU memory is consumed by the optimizer itself (Adam, for example, keeps two extra buffers per parameter). A simple trick that often works here is DeepSpeed ZeRO Stage 2:

Trainer(strategy="deepspeed_stage_2")

Documentation
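
Putting both together, here is a rough sketch of what the Trainer setup could look like. The GPU count, the accumulation factor, and the names MyTripletModule / train_loader are placeholders for your own setup, and the strategy needs DeepSpeed installed (pip install deepspeed):

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                     # one triplet per GPU
    strategy="deepspeed_stage_2",  # shard optimizer state across the GPUs
    accumulate_grad_batches=8,     # step the optimizer every 8 batches
)
trainer.fit(MyTripletModule(), train_dataloaders=train_loader)

Since gradient accumulation adds no activation memory, this gives you an effective batch size of 4 GPUs * 1 * 8 = 32 at the memory cost of batch size 1.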

In many cases, sharding alone frees up a few GB per GPU, so you can fit a larger batch size. How does that sound?

EDIT: Added links to docs