Use DDP to train a single model on a single GPU with multiple processes

I am at a loss here. I would like to maximize the number of steps my model takes within a fixed block of time. Instead of raising the batch size until the VRAM is full, I want to run multiple processes for this single model and aggregate the results as if it were running on multiple devices, even though everything runs on a single GPU. I have more than enough VRAM to run 3-4 processes for this specific model, which could potentially give me a 3x speedup in steps taken. As I said, a larger batch size is out of the question, since it only reduces the number of steps in a given time frame.

How can I do this? I saw a suggestion to use DDP with the gloo backend and to list the same device twice, in the GitHub discussion "Emulating multiple devices with a single GPU" (Lightning-AI/pytorch-lightning, Discussion #8630), but Lightning complains when I add the same device twice. I really want to avoid writing anything from scratch to aggregate the gradients between processes, so out-of-the-box solutions are preferred.
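For reference, here is a minimal sketch of the setup being described, written in plain PyTorch rather than Lightning. It assumes the gloo backend accepts several ranks on one device (as far as I understand, nccl rejects duplicate GPUs), and it uses a placeholder `torch.nn.Linear` model, random data, and a hypothetical `worker` function and temporary rendezvous file that are not from any existing code. It shows only the gradient aggregation, not whether this is actually faster than a single process.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int, init_file: str, steps: int = 5) -> float:
    # Assumption: every rank targets the same physical device (GPU 0),
    # falling back to CPU when no GPU is present.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # gloo is used here on the assumption that it tolerates several ranks
    # sharing one device; the file:// rendezvous avoids picking a free port.
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )

    model = torch.nn.Linear(8, 1).to(device)  # placeholder model
    ddp_model = DDP(model)  # DDP broadcasts rank 0's weights to all ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    torch.manual_seed(rank + 1)  # each rank draws different (random) data
    loss_value = 0.0
    for _ in range(steps):
        x = torch.randn(16, 8, device=device)
        y = torch.randn(16, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()  # DDP averages gradients across ranks during backward
        optimizer.step()
        loss_value = loss.item()

    dist.destroy_process_group()
    return loss_value


if __name__ == "__main__":
    world_size = 3  # number of processes sharing the one device
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    mp.spawn(worker, args=(world_size, init_file), nprocs=world_size, join=True)
```

Note that because DDP averages gradients across ranks, each optimizer step here still behaves like one step on a batch `world_size` times larger, so what I am really asking is whether there is a ready-made way to get this kind of setup (or something better suited to my goal) working under Lightning.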