How to ensure all ranks flush their caches during training using DeepSpeed Stage 3

I am training a large model (>7B) using DeepSpeed Stage 3 with optimizer and parameter offloading on a single-node, multi-GPU setup.

After every iteration (step), I get the following warning on the console telling me that I should flush the cache across all ranks:

Epoch 0: : 316it [3:45:31, 42.82s/it, v_num=0][2023-05-23 19:23:57,584] 
[WARNING] [stage3.py:1826:step] 2 pytorch allocator cache flushes since last step. 
this happens when there is high memory pressure and is detrimental to performance.
if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache()
calls in your training loop to ensure that all ranks flush their caches at the same time

I assume that adding a get_accelerator().empty_cache() call in an on_train_batch_end Callback should help with this issue.
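For reference, this is roughly what I have in mind. It is just a minimal sketch: the FlushCacheCallback name is made up, and I'm assuming the Lightning 2.x Callback.on_train_batch_end hook signature and DeepSpeed's get_accelerator() helper, so imports and signatures may need adjusting for other versions:

```python
from deepspeed.accelerator import get_accelerator
from lightning.pytorch.callbacks import Callback


class FlushCacheCallback(Callback):
    # Hypothetical callback: empties the allocator cache on every rank
    # after each training batch, as the DeepSpeed warning suggests.
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Frees cached allocator blocks so all ranks flush at the same
        # point in the training loop.
        get_accelerator().empty_cache()


# Usage: trainer = Trainer(callbacks=[FlushCacheCallback()], ...)
```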

Could someone please let me know if my understanding is correct? Or am I missing something else?

Hi @sandeepchittilla

High memory pressure could indicate that your batch size is too high. You could simply reduce the batch size by 1 or 2 and see whether that improves the behavior of the caching allocator. I would investigate that first before calling empty_cache(), which you should do only as a last resort.

If you do need it, then yes, that hook is the right place to call torch.cuda.empty_cache(). But keep in mind that this requires synchronization and could impact performance.

Hi @awaelchli

Thanks for your response. That makes sense. Will try it out.