I use profiler=simple to profile training and validation of my model on 10000 samples (for training and validation each), with a batch size of 4 and accumulate_grad_batches=8. Based on this, the optimizer should only take 312 + 1 = 313 steps.
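For reference, here is the arithmetic behind the expected step count; this is just a sketch of my reasoning and assumes the final, partial accumulation window still triggers an optimizer step:

import math

num_samples = 10_000
batch_size = 4
accumulate_grad_batches = 8

num_batches = num_samples // batch_size                            # 2500 micro-batches
expected_steps = math.ceil(num_batches / accumulate_grad_batches)  # 312 full windows + 1 partial = 313
print(num_batches, expected_steps)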
However, I’m seeing in the profiling results that Num calls to optimizer_step is 2500, while Num calls to optimizer_zero_grad is 313.
I can see that DeepSpeed handles gradient accumulation internally (lightning/deepspeed.py at 56377d9b1f8dc4ebfeeaf18b81ce712c434d79e7 · Lightning-AI/lightning · GitHub), but I just need to be sure there isn’t a bug somewhere in the code.
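My current mental model, which may well be wrong, is that Lightning calls optimizer_step on every micro-batch and hands it to the DeepSpeed engine, which only applies the real parameter update on accumulation boundaries. A toy sketch of that idea (not the actual Lightning or DeepSpeed code; ToyDeepSpeedEngine is just an illustrative name):

class ToyDeepSpeedEngine:
    def __init__(self, optimizer, gradient_accumulation_steps):
        self.optimizer = optimizer
        self.gas = gradient_accumulation_steps
        self.micro_step = 0

    def step(self):
        # Called once per micro-batch, so a profiler counting these calls sees 2500 of them.
        self.micro_step += 1
        if self.micro_step % self.gas == 0:
            # The real parameter update happens only on accumulation boundaries.
            self.optimizer.step()
            self.optimizer.zero_grad(set_to_none=True)

If that is roughly what happens, the 2500 optimizer_step calls in the profiler would just be counting the per-micro-batch calls into the engine, not actual weight updates.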
If it’s normal for the number of optimizer_step calls to match the batch count even when gradients are accumulated, I believe we should add this information to the docs so that people aren’t confused in the future.
Here’s the optimizer that I’m using:
from deepspeed.ops.adam import FusedAdam  # assuming DeepSpeed's FusedAdam; apex's version has the same signature

optimizer = FusedAdam(optimizer_grouped_params, lr=self.hparams.lr, betas=(0.9, 0.999),
                      eps=1e-7, weight_decay=self.hparams.weight_decay,
                      amsgrad=False,  # FusedAdam doesn't support AMSGrad
                      adam_w_mode=True, set_grad_none=True)
I wonder if this line could be the culprit: lightning/src/lightning/pytorch/strategies/deepspeed.py at 56377d9b1f8dc4ebfeeaf18b81ce712c434d79e7 · Lightning-AI/lightning · GitHub
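In the meantime, this is how I’m thinking of sanity-checking whether the parameters really only change every 8 micro-batches, independently of what the profiler counts. UpdateCounter is just a hypothetical helper I’d write myself, and it assumes the parameters aren’t partitioned away (i.e. not ZeRO stage 3):

import torch
from lightning.pytorch.callbacks import Callback

class UpdateCounter(Callback):
    """Counts how often a reference parameter actually changes value."""

    def __init__(self):
        self.updates = 0
        self._ref = None

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        param = next(pl_module.parameters())
        current = param.detach().float().cpu().clone()
        if self._ref is not None and not torch.equal(current, self._ref):
            self.updates += 1  # the parameter changed, so a real optimizer step happened
        self._ref = current

    def on_train_end(self, trainer, pl_module):
        print(f"Observed {self.updates} real parameter updates")

If the counter ends up near 313 rather than 2500, the extra optimizer_step calls would just be a side effect of the strategy handling accumulation internally rather than actual weight updates.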