I use profiler=simple to profile training and validation of my model on 10000 samples (for training and validation each), with a batch size of 4 and accumulate_grad_batches=8. Based on this, the optimizer should only take 312 + 1 = 313 steps.
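For reference, here is the arithmetic behind the expected step count; this is just a sketch of my reasoning and assumes the final, partial accumulation window still triggers an optimizer step:

import math

num_samples = 10_000
batch_size = 4
accumulate_grad_batches = 8

num_batches = num_samples // batch_size                            # 2500 micro-batches
expected_steps = math.ceil(num_batches / accumulate_grad_batches)  # 312 full windows + 1 partial = 313
print(num_batches, expected_steps)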
However, I’m seeing in the profiling results that Num calls to optimizer_step is 2500, while Num calls to optimizer_zero_grad is 313.
I can see that DeepSpeed handles gradient accumulation internally (lightning/deepspeed.py at 56377d9b1f8dc4ebfeeaf18b81ce712c434d79e7 · Lightning-AI/lightning · GitHub), but I just need to be sure there isn’t a bug somewhere in the code.
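My current mental model, which may well be wrong, is that Lightning calls optimizer_step on every micro-batch and hands it to the DeepSpeed engine, which only applies the real parameter update on accumulation boundaries. A toy sketch of that idea (not the actual Lightning or DeepSpeed code; ToyDeepSpeedEngine is just an illustrative name):

class ToyDeepSpeedEngine:
    def __init__(self, optimizer, gradient_accumulation_steps):
        self.optimizer = optimizer
        self.gas = gradient_accumulation_steps
        self.micro_step = 0

    def step(self):
        # Called once per micro-batch, so a profiler counting these calls sees 2500 of them.
        self.micro_step += 1
        if self.micro_step % self.gas == 0:
            # The real parameter update happens only on accumulation boundaries.
            self.optimizer.step()
            self.optimizer.zero_grad(set_to_none=True)

If that is roughly what happens, the 2500 optimizer_step calls in the profiler would just be counting the per-micro-batch calls into the engine, not actual weight updates.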
If it’s normal for the number of optimizer_step calls to match the batch count even when gradients are accumulated, I believe we should add this information to the docs so that people aren’t confused in the future.
Here’s the optimizer that I’m using:
from deepspeed.ops.adam import FusedAdam  # assuming DeepSpeed's FusedAdam; apex's version has the same signature

optimizer = FusedAdam(optimizer_grouped_params, lr=self.hparams.lr, betas=(0.9, 0.999),
                      eps=1e-7, weight_decay=self.hparams.weight_decay,
                      amsgrad=False,  # FusedAdam doesn't support AMSGrad
                      adam_w_mode=True, set_grad_none=True)
I wonder if this line could be the culprit: lightning/src/lightning/pytorch/strategies/deepspeed.py at 56377d9b1f8dc4ebfeeaf18b81ce712c434d79e7 · Lightning-AI/lightning · GitHub
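In the meantime, this is how I’m thinking of sanity-checking whether the parameters really only change every 8 micro-batches, independently of what the profiler counts. UpdateCounter is just a hypothetical helper I’d write myself, and it assumes the parameters aren’t partitioned away (i.e. not ZeRO stage 3):

import torch
from lightning.pytorch.callbacks import Callback

class UpdateCounter(Callback):
    """Counts how often a reference parameter actually changes value."""

    def __init__(self):
        self.updates = 0
        self._ref = None

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        param = next(pl_module.parameters())
        current = param.detach().float().cpu().clone()
        if self._ref is not None and not torch.equal(current, self._ref):
            self.updates += 1  # the parameter changed, so a real optimizer step happened
        self._ref = current

    def on_train_end(self, trainer, pl_module):
        print(f"Observed {self.updates} real parameter updates")

If the counter ends up near 313 rather than 2500, the extra optimizer_step calls would just be a side effect of the strategy handling accumulation internally rather than actual weight updates.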