I am training a large model (>7B parameters) using DeepSpeed ZeRO Stage 3 with optimizer and parameter offloading on a single-node, multi-GPU setup.
After every iteration (step), I get the following warning on the console, suggesting that I should flush the cache across all ranks:
Epoch 0: : 316it [3:45:31, 42.82s/it, v_num=0]
[2023-05-23 19:23:57,584] [WARNING] [stage3.py:1826:step] 2 pytorch allocator cache flushes since last step.
this happens when there is high memory pressure and is detrimental to performance.
if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache()
calls in your training loop to ensure that all ranks flush their caches at the same time
I assume that calling get_accelerator().empty_cache() in the on_train_batch_end hook of a Callback should help with this issue, since every rank would then flush its cache at the same point in each step.
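To make the question concrete, here is a minimal sketch of the callback I have in mind. It assumes PyTorch Lightning's Callback base class and the Lightning 2.x signature of on_train_batch_end; EmptyCacheCallback, every_n_steps, and flush_fn are names I made up for illustration, not part of DeepSpeed or Lightning. As far as I understand, this would only synchronize the flushes across ranks and would not reduce the memory pressure itself.

```python
from typing import Callable, Optional

# Import Lightning's Callback base class. The fallback to `object` is only
# there so this sketch can be exercised without Lightning installed.
try:
    from lightning.pytorch.callbacks import Callback  # Lightning >= 2.0
except ImportError:
    try:
        from pytorch_lightning.callbacks import Callback  # older package name
    except ImportError:
        Callback = object


def _deepspeed_empty_cache() -> None:
    """Release cached allocator blocks via DeepSpeed's accelerator abstraction."""
    # Imported lazily so this module can be loaded without DeepSpeed.
    from deepspeed.accelerator import get_accelerator

    get_accelerator().empty_cache()


class EmptyCacheCallback(Callback):
    """Hypothetical callback: flush the allocator cache every `every_n_steps` batches."""

    def __init__(
        self,
        every_n_steps: int = 1,
        flush_fn: Optional[Callable[[], None]] = None,
    ) -> None:
        super().__init__()
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be >= 1")
        self.every_n_steps = every_n_steps
        # flush_fn is an assumption of this sketch: it lets the flush be
        # swapped out (e.g. for testing); the default calls DeepSpeed.
        self.flush_fn = flush_fn or _deepspeed_empty_cache

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Every rank runs this hook for the same batch_idx, so all ranks
        # flush at the same point in the training loop.
        if (batch_idx + 1) % self.every_n_steps == 0:
            self.flush_fn()
```

I would then register it with Trainer(callbacks=[EmptyCacheCallback()]). The every_n_steps argument is there because I expect empty_cache() to add some overhead of its own, so flushing less often might be a reasonable trade-off.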
Could someone please let me know if my understanding is correct, or am I missing something else?