Hi, I am trying to run a GPT-2 model with block size 2048, and I cannot use a batch size larger than 16 because activation memory becomes too large.
To reduce activation memory I already use DeepSpeed activation checkpointing on each transformer block, plus AMP.
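Roughly, the wrapping looks like this (a simplified sketch with made-up names, not my exact training code):

import torch
import deepspeed

class CheckpointedBlock(torch.nn.Module):
    """Wraps one transformer block so its activations are recomputed in backward."""

    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, hidden_states):
        # deepspeed.checkpointing.checkpoint recomputes self.block during the backward pass,
        # so only the block input (not its intermediate activations) stays in GPU memory
        return deepspeed.checkpointing.checkpoint(self.block, hidden_states)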
I saw there is also an option to partition / shard activations, advertised by Megatron. But when I try it I see no effect at all.
I tried partition_activations + CPU offloading of checkpointed activations + DeepSpeed activation checkpointing. No effect.
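Concretely, this is roughly how I enable it through the Lightning strategy (a sketch only; argument names as I read them in the PL DeepSpeedStrategy docs, where the CPU offload flag is called cpu_checkpointing, and my optimizer/data setup is omitted):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy

strategy = DeepSpeedStrategy(
    stage=2,
    partition_activations=True,   # supposed to shard checkpointed activations across ranks
    cpu_checkpointing=True,       # supposed to offload checkpointed activations to CPU
)

trainer = Trainer(
    accelerator="gpu",
    devices=2,
    precision=16,   # AMP
    strategy=strategy,
)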
Can anyone help, please, or show some proof that this feature actually works? I am using Lightning 1.9.4. Is there a particular version of DeepSpeed I should use?
Hi, I tried out Sean Naren’s repo (deepspeed branch), upgraded to PL 1.9.4 and DeepSpeed 0.9.3, but the partitioned-activations flag within activation checkpointing produces no memory reduction.
I also see some fishy behavior in deepspeed/runtime/activation_checkpointing/checkpointing.py, in the get_partition_size() method: the partition size on rank 0 is the same as the size of the whole layer, and mp_size is 1 despite running on 2 GPUs.
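Paraphrasing from memory (not the exact DeepSpeed source), get_partition_size() boils down to something like this, which is why mp_size == 1 means no saving:

import torch

def get_partition_size(item: torch.Tensor, mp_size: int) -> int:
    # each model-parallel rank is supposed to keep only numel / mp_size elements
    return item.numel() // mp_size

# with mp_size == 1 (what I see on both GPUs), every rank keeps the whole activation,
# so partitioning saves nothing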
I think it is because of this call in pytorch_lightning/strategies/deepspeed.py (in my env: /home/local/CORP/aiani/anaconda3/envs/mingpt_sean_naren/lib/python3.7/site-packages/pytorch_lightning/strategies/deepspeed.py):
deepspeed.checkpointing.configure(
    mpu_=None,
Why None?
mpu: “An object that implements the following methods get_model_parallel_rank/group/world_size, and get_data_parallel_rank/group/world_size”
See the Megatron implementation; it passes a proper mpu object there.
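For illustration only, here is my own minimal sketch of an object with that interface (this is not the Megatron code; the class name and the process groups are made up):

import torch.distributed as dist

class SimpleMPU:
    # mp_group / dp_group would be process groups created with dist.new_group(...)
    def __init__(self, mp_group, dp_group):
        self._mp_group = mp_group
        self._dp_group = dp_group

    def get_model_parallel_rank(self):
        return dist.get_rank(group=self._mp_group)

    def get_model_parallel_group(self):
        return self._mp_group

    def get_model_parallel_world_size(self):
        return dist.get_world_size(group=self._mp_group)

    def get_data_parallel_rank(self):
        return dist.get_rank(group=self._dp_group)

    def get_data_parallel_group(self):
        return self._dp_group

    def get_data_parallel_world_size(self):
        return dist.get_world_size(group=self._dp_group)

# something along the lines of
#   deepspeed.checkpointing.configure(mpu_=mpu, partition_activations=True)
# is what I would expect, rather than mpu_=None.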
I’d really appreciate some help here, guys! Thanks!