Accumulating gradients like you describe is trivial in Lightning: just set Trainer(accumulate_grad_batches=...)
to a number (Documentation). For splitting the model there are actually several approaches, but before you do that, I would suggest looking into sharding the optimizer state, since a large chunk of GPU memory is consumed by the optimizer. A simple trick that often works here is DeepSpeed stage 2:
Trainer(strategy="deepspeed_stage_2")
In many cases, this will free up a few GB on your GPU and then you can fit larger batch sizes. How does that sound?
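To make the accumulation part concrete, here is a minimal framework-free sketch of what accumulate_grad_batches does conceptually: gradients from k micro-batches are summed (scaled by 1/k) and the optimizer steps once per k batches, so you get the effective batch size of micro_batch_size * k without the memory cost of actually loading that batch. The toy loss, learning rate, and data below are made up for illustration, not Lightning internals:

```python
def grad_of_loss(w, batch):
    # Toy gradient for loss = mean((w*x - y)^2) over one micro-batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(w, micro_batches, lr=0.1, accumulate_grad_batches=4):
    accum, steps = 0.0, 0
    for i, batch in enumerate(micro_batches, start=1):
        # Average the micro-batch gradients instead of stepping immediately.
        accum += grad_of_loss(w, batch) / accumulate_grad_batches
        if i % accumulate_grad_batches == 0:
            # One optimizer step per k micro-batches, then reset the buffer.
            w -= lr * accum
            accum, steps = 0.0, steps + 1
    return w, steps

# Eight micro-batches of one (x, y) pair each, sampled from y = 2x:
data = [[(x, 2 * x)] for x in range(1, 9)]
w, steps = train(w=0.0, micro_batches=data, accumulate_grad_batches=4)
print(steps)  # 2 optimizer steps for 8 micro-batches
```

Lightning does the equivalent for you behind the scenes when you pass accumulate_grad_batches to the Trainer, so your training loop code doesn't change.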
EDIT: Added links to docs