Confusing # of optimizer steps when using gradient accumulation with DeepSpeed

I'm using profiler="simple" to profile training and validation of my model on 10000 samples each, with a batch size of 4 and accumulate_grad_batches=8. That gives 10000 / 4 = 2500 batches per epoch, and 2500 / 8 = 312.5, so the optimizer should only take 312 + 1 = 313 steps.
However, I’m seeing in the profiling results that Num calls to optimizer_step is 2500, while Num calls to optimizer_zero_grad is 313.
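For reference, here is the arithmetic behind those two numbers as a small sketch. It assumes a single device, a single epoch, and that the final partial accumulation window still triggers a step; the variable names are just for illustration.

```python
import math

num_samples = 10_000
batch_size = 4
accumulate_grad_batches = 8

# Batches seen in one epoch (this matches the 2500 optimizer_step calls).
num_batches = math.ceil(num_samples / batch_size)

# Steps expected if the optimizer only steps once per accumulation window,
# counting the last, incomplete window (this matches the 313 zero_grad calls).
expected_optimizer_steps = math.ceil(num_batches / accumulate_grad_batches)

print(num_batches, expected_optimizer_steps)  # 2500 313
```

So the optimizer_step count equals the number of batches, while the optimizer_zero_grad count equals the number of accumulation windows.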
I can see that DeepSpeed handles gradient accumulation internally (lightning/ at 56377d9b1f8dc4ebfeeaf18b81ce712c434d79e7 · Lightning-AI/lightning · GitHub), but I just need to be sure there isn’t a bug somewhere in the code.
If it’s normal for the number of optimizer_step calls to match batch counts even when gradients are accumulated, I believe we should add this information to the docs so that people aren’t confused in the future.
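To illustrate what I suspect is happening, here is a toy model (not DeepSpeed's or Lightning's actual code) of an engine whose step() is called on every batch but only applies an update at accumulation boundaries. ToyAccumulatingEngine and its attributes are hypothetical names, and updating on the last partial window is an assumption.

```python
class ToyAccumulatingEngine:
    """Hypothetical stand-in for an engine that accumulates gradients internally."""

    def __init__(self, accumulate_grad_batches, total_batches):
        self.accumulate_grad_batches = accumulate_grad_batches
        self.total_batches = total_batches
        self.step_calls = 0       # what a profiler wrapping step() would count
        self.weight_updates = 0   # how often the weights would actually change

    def step(self):
        self.step_calls += 1
        at_boundary = self.step_calls % self.accumulate_grad_batches == 0
        is_last_batch = self.step_calls == self.total_batches  # assumption: flush at the end
        if at_boundary or is_last_batch:
            self.weight_updates += 1


engine = ToyAccumulatingEngine(accumulate_grad_batches=8, total_batches=2500)
for _ in range(2500):
    engine.step()  # called once per batch, like the profiled optimizer_step

print(engine.step_calls, engine.weight_updates)  # 2500 313
```

If the real code path behaves like this, the profiler would report 2500 calls even though only 313 updates are applied.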
Here’s the optimizer that I’m using:

optimizer = FusedAdam(optimizer_grouped_params, betas=(0.9, 0.999),
                      eps=1e-7, weight_decay=self.hparams.weight_decay,
                      amsgrad=False,  # doesn't support it
                      adam_w_mode=True, set_grad_none=True)

I wonder if this line could be a culprit: lightning/src/lightning/pytorch/strategies/ at 56377d9b1f8dc4ebfeeaf18b81ce712c434d79e7 · Lightning-AI/lightning · GitHub