###############################
Efficient Gradient Accumulation
###############################

Gradient accumulation works the same way with Fabric as in PyTorch.
You are in control of which model accumulates and at what frequency:

.. code-block:: python

    for iteration, batch in enumerate(dataloader):

        # Accumulate gradient 8 batches at a time
        is_accumulating = iteration % 8 != 0

        output = model(input)
        loss = ...

        # .backward() accumulates when .zero_grad() wasn't called
        fabric.backward(loss)

        ...

        if not is_accumulating:
            # Step the optimizer after the accumulation phase is over
            optimizer.step()
            optimizer.zero_grad()

However, in a distributed setting, for example, when training across multiple GPUs or machines, doing it this way can significantly slow down your training loop.
To optimize this code, we should skip the synchronization in ``.backward()`` during the accumulation phase.
We only need to synchronize the gradients when the accumulation phase is over!
This can be achieved by adding the :meth:`~lightning.fabric.fabric.Fabric.no_backward_sync` context manager over the :meth:`~lightning.fabric.fabric.Fabric.backward` call:

.. code-block:: diff

      for iteration, batch in enumerate(dataloader):

          # Accumulate gradient 8 batches at a time
          is_accumulating = iteration % 8 != 0

    +     with fabric.no_backward_sync(model, enabled=is_accumulating):
              output = model(input)
              loss = ...

              # .backward() accumulates when .zero_grad() wasn't called
              fabric.backward(loss)

          ...

          if not is_accumulating:
              # Step the optimizer after the accumulation phase is over
              optimizer.step()
              optimizer.zero_grad()

Both the model's ``.forward()`` and the ``fabric.backward()`` call need to run under this context manager.
For strategies that don't support skipping the backward synchronization, a warning is emitted, and for single-device strategies it is a no-op.
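
Putting it together, below is a minimal, self-contained sketch of the optimized loop.
The Fabric calls (``Fabric``, ``launch``, ``setup``, ``setup_dataloaders``, ``no_backward_sync``, ``backward``) are the ones used above; the model, synthetic data, optimizer, and the choice of accumulating over 8 batches are only illustrative assumptions.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from lightning.fabric import Fabric

    # Illustrative model and synthetic data (assumptions for this sketch)
    model = torch.nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    dataloader = DataLoader(dataset, batch_size=8)

    fabric = Fabric(accelerator="cpu", devices=1)  # any accelerator/strategy works here
    fabric.launch()
    model, optimizer = fabric.setup(model, optimizer)
    dataloader = fabric.setup_dataloaders(dataloader)

    for iteration, batch in enumerate(dataloader):
        inputs, targets = batch

        # Accumulate gradient 8 batches at a time
        is_accumulating = iteration % 8 != 0

        # Skip the gradient synchronization during the accumulation phase
        with fabric.no_backward_sync(model, enabled=is_accumulating):
            output = model(inputs)
            loss = torch.nn.functional.cross_entropy(output, targets)

            # .backward() accumulates when .zero_grad() wasn't called
            fabric.backward(loss)

        if not is_accumulating:
            # Step the optimizer after the accumulation phase is over
            optimizer.step()
            optimizer.zero_grad()

Because ``no_backward_sync`` is a no-op on a single device, the same loop runs unchanged when you later switch to a multi-GPU strategy.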