Hi, I am trying to run a GPT-2 model with block size 2048, and I cannot use a batch size larger than 16 because activation memory becomes too large.
To reduce activation memory I already use DeepSpeed activation checkpointing on each transformer block, plus AMP.
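Roughly, the wrapping looks like this (a simplified sketch with made-up names, not my exact training code):

import torch
import deepspeed

class CheckpointedBlock(torch.nn.Module):
    """Wraps one transformer block so its activations are recomputed in backward."""

    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, hidden_states):
        # deepspeed.checkpointing.checkpoint recomputes self.block during the backward pass,
        # so only the block input (not its intermediate activations) stays in GPU memory
        return deepspeed.checkpointing.checkpoint(self.block, hidden_states)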
I saw there is also an option to partition / shard activations, advertised by Megatron. But when I try it I see no effect at all.
I tried partition_activations + CPU offloading of checkpointed activations + DeepSpeed activation checkpointing. No effect.
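Concretely, this is roughly how I enable it through the Lightning strategy (a sketch only; argument names as I read them in the PL DeepSpeedStrategy docs, where the CPU offload flag is called cpu_checkpointing, and my optimizer/data setup is omitted):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy

strategy = DeepSpeedStrategy(
    stage=2,
    partition_activations=True,   # supposed to shard checkpointed activations across ranks
    cpu_checkpointing=True,       # supposed to offload checkpointed activations to CPU
)

trainer = Trainer(
    accelerator="gpu",
    devices=2,
    precision=16,   # AMP
    strategy=strategy,
)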
Can anyone help, please, or show some proof that this feature actually works? I am using Lightning 1.9.4. Is there a particular version of DeepSpeed I should use?
Hi, I tried out Sean Naren’s repo (deepspeed branch), upgraded to PL 1.9.4 and DeepSpeed 0.9.3, but the partitioned-activations flag within activation checkpointing produces no memory reduction.
I also see some fishy behavior in deepspeed/runtime/activation_checkpointing/checkpointing.py, in the get_partition_size() method: the partition size on rank 0 is the same as the size of the whole layer, and mp_size is 1 despite running on 2 GPUs.
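Paraphrasing from memory (not the exact DeepSpeed source), get_partition_size() boils down to something like this, which is why mp_size == 1 means no saving:

import torch

def get_partition_size(item: torch.Tensor, mp_size: int) -> int:
    # each model-parallel rank is supposed to keep only numel / mp_size elements
    return item.numel() // mp_size

# with mp_size == 1 (what I see on both GPUs), every rank keeps the whole activation,
# so partitioning saves nothing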
I think it is because of this call in pytorch_lightning/strategies/deepspeed.py (in my env: /home/local/CORP/aiani/anaconda3/envs/mingpt_sean_naren/lib/python3.7/site-packages/pytorch_lightning/strategies/deepspeed.py):
deepspeed.checkpointing.configure(
    mpu_=None,
Why None?
mpu: “An object that implements the following methods get_model_parallel_rank/group/world_size, and get_data_parallel_rank/group/world_size”
See the Megatron implementation; it passes a proper mpu object there.
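For illustration only, here is my own minimal sketch of an object with that interface (this is not the Megatron code; the class name and the process groups are made up):

import torch.distributed as dist

class SimpleMPU:
    # mp_group / dp_group would be process groups created with dist.new_group(...)
    def __init__(self, mp_group, dp_group):
        self._mp_group = mp_group
        self._dp_group = dp_group

    def get_model_parallel_rank(self):
        return dist.get_rank(group=self._mp_group)

    def get_model_parallel_group(self):
        return self._mp_group

    def get_model_parallel_world_size(self):
        return dist.get_world_size(group=self._mp_group)

    def get_data_parallel_rank(self):
        return dist.get_rank(group=self._dp_group)

    def get_data_parallel_group(self):
        return self._dp_group

    def get_data_parallel_world_size(self):
        return dist.get_world_size(group=self._dp_group)

# something along the lines of
#   deepspeed.checkpointing.configure(mpu_=mpu, partition_activations=True)
# is what I would expect, rather than mpu_=None.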
I’d really appreciate some help here, guys! Thanks!