DeepSpeed stage 3 partition_activations brings no benefit

Hi, I am trying to run a GPT-2 model with a block size of 2048, and I cannot use a batch size larger than 16 because the activation memory becomes too large.

To reduce activation memory I already use DeepSpeed activation checkpointing on each transformer block, plus AMP.
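For context, the checkpointing itself is wired roughly like this (a minimal sketch of my setup; GPT2Trunk and the block list are placeholders, not my exact code):

    import torch
    import deepspeed

    class GPT2Trunk(torch.nn.Module):
        # Placeholder for the real model: a stack of GPT-2 transformer blocks.
        def __init__(self, blocks):
            super().__init__()
            self.blocks = torch.nn.ModuleList(blocks)

        def forward(self, x):
            for block in self.blocks:
                # Each block goes through DeepSpeed's activation checkpointing,
                # so only the block inputs are saved and the rest is recomputed
                # during the backward pass.
                x = deepspeed.checkpointing.checkpoint(block, x)
            return x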

I saw there is also an option to partition / shard activations, advertised by Megatron. But when I try it I see no effect at all.
I tried partition_activations + cpu_checkpointing + DeepSpeed activation checkpointing. No effect.
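For reference, I enable the flags through the Lightning strategy, roughly like this (a sketch; partition_activations and cpu_checkpointing are the argument names I believe the 1.9.x DeepSpeedStrategy exposes):

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import DeepSpeedStrategy

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        precision=16,  # AMP
        strategy=DeepSpeedStrategy(
            stage=3,
            partition_activations=True,  # supposed to shard checkpointed activations
            cpu_checkpointing=True,      # supposed to offload checkpointed activations to CPU
        ),
    )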

Can anyone please help, or show some proof that this feature actually works? I am using Lightning 1.9.4. Is there a particular version of DeepSpeed I should use?

Hi, I tried out Sean Naren's repo (deepspeed branch) and upgraded to PL 1.9.4 and DeepSpeed 0.9.3, but the partition_activations flag within activation checkpointing produces no memory reduction.
I see some fishy behavior in deepspeed/runtime/activation_checkpointing/checkpointing.py, in the get_partition_size() method: the partition size on rank 0 is the same as the size of the whole layer, and mp_size is 1 despite running on 2 GPUs.
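The logic there is basically this (paraphrased from the 0.9.x source, not copied verbatim):

    # deepspeed/runtime/activation_checkpointing/checkpointing.py, paraphrased
    mp_size = 1  # module-level default; only changes if an mpu is configured

    def get_partition_size(item):
        size = item.numel()
        # Split each checkpointed tensor across the model-parallel ranks.
        partition_size = size / mp_size
        return int(partition_size)

So with mp_size == 1, partitioning is a no-op.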
I think it is because of this line in /home/local/CORP/aiani/anaconda3/envs/mingpt_sean_naren/lib/python3.7/site-packages/pytorch_lightning/strategies/deepspeed.py:
    deepspeed.checkpointing.configure(
        mpu_=None,
        ...
    )
Why None?
From the DeepSpeed docs, mpu is “an object that implements the following methods: get_model_parallel_rank/group/world_size and get_data_parallel_rank/group/world_size”.

See the Megatron implementation; it passes an mpu object to deepspeed.checkpointing.configure() instead of None.
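For illustration only, this is the interface shape I understand that mpu object needs (my own sketch built on torch.distributed, not the Megatron code; which groups would actually be correct for a pure data-parallel ZeRO stage 3 run is exactly what I don't know):

    import torch.distributed as dist

    class MyMPU:
        # Sketch of the interface deepspeed.checkpointing.configure() expects.
        # There is no tensor parallelism here, so the "model parallel" group is
        # just this rank and the data-parallel group is the whole world.
        def __init__(self):
            # new_group() is a collective: every rank must create every group,
            # and each rank keeps the single-member group it belongs to.
            for r in range(dist.get_world_size()):
                group = dist.new_group(ranks=[r])
                if r == dist.get_rank():
                    self._mp_group = group

        def get_model_parallel_rank(self):
            return 0

        def get_model_parallel_group(self):
            return self._mp_group

        def get_model_parallel_world_size(self):
            return 1

        def get_data_parallel_rank(self):
            return dist.get_rank()

        def get_data_parallel_group(self):
            return dist.group.WORLD

        def get_data_parallel_world_size(self):
            return dist.get_world_size()

    # and then, instead of mpu_=None:
    # deepspeed.checkpointing.configure(mpu_=MyMPU(), ...)

But with a model-parallel world size of 1, get_partition_size() above would still return the full layer size, so maybe this flag only ever helps when there is real tensor/model parallelism? That is part of what I'd like to confirm.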

I’d really appreciate some help here, guys! Thanks!