Hi, I tried out Sean Naren’s repo (deepspeed branch), upgraded to PL 1.9.4 and DeepSpeed 0.9.3, but the partitioned activations flag within activation checkpointing produces no memory reduction.
I see some fishy behavior in the get_partition_size() method in deepspeed/runtime/activation_checkpointing/checkpointing.py: the partition size on rank 0 is the same as the size of the whole layer, and mp_size is 1 despite running on 2 GPUs.
I think it is because of this line: /home/local/CORP/aiani/anaconda3/envs/mingpt_sean_naren/lib/python3.7/site-packages/pytorch_lightning/strategies/deepspeed.py
mpu: “An object that implements the following methods get_model_parallel_rank/group/world_size, and get_data_parallel_rank/group/world_size”
I think there is a bug here somewhere.
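To make the suspicion concrete, here is a rough sketch of what an object with those six methods could look like, and why mp_size matters for the partition size. Everything in it is hypothetical: `ToyMPU` and `partition_size` are names made up for illustration, the "groups" are plain lists of ranks rather than real torch.distributed process groups, and the even split by mp_size is an assumption about what get_partition_size() does, based on the behavior described above.

```python
class ToyMPU:
    """Hypothetical stand-in for an mpu object (not DeepSpeed or Megatron code).

    A real mpu would return torch.distributed process groups from the
    *_group() methods; lists of ranks are used here to keep this runnable.
    """

    def __init__(self, rank, world_size, model_parallel_size):
        if world_size % model_parallel_size != 0:
            raise ValueError("world_size must be divisible by model_parallel_size")
        self.rank = rank
        self.world_size = world_size
        self.mp_size = model_parallel_size
        self.dp_size = world_size // model_parallel_size

    # Model-parallel side: consecutive ranks share one model replica.
    def get_model_parallel_world_size(self):
        return self.mp_size

    def get_model_parallel_rank(self):
        return self.rank % self.mp_size

    def get_model_parallel_group(self):
        start = (self.rank // self.mp_size) * self.mp_size
        return list(range(start, start + self.mp_size))

    # Data-parallel side: ranks holding the same model shard.
    def get_data_parallel_world_size(self):
        return self.dp_size

    def get_data_parallel_rank(self):
        return self.rank // self.mp_size

    def get_data_parallel_group(self):
        return list(range(self.rank % self.mp_size, self.world_size, self.mp_size))


def partition_size(numel, mp_size):
    # Assumption: activations are split evenly across model-parallel ranks.
    return numel // mp_size


# With no mpu, mp_size stays 1 and each rank keeps the full activation.
no_mp = ToyMPU(rank=0, world_size=2, model_parallel_size=1)
with_mp = ToyMPU(rank=0, world_size=2, model_parallel_size=2)
print(partition_size(1024, no_mp.get_model_parallel_world_size()))
print(partition_size(1024, with_mp.get_model_parallel_world_size()))
```

If that assumption holds, partitioning can only save memory when the model-parallel world size is greater than 1, which would match what I observe on 2 GPUs.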
See the Megatron implementation, which passes an mpu object: