DeepSpeed stage 3 partition_activations brings no benefit

Hi, I tried out Sean Naren’s repo (deepspeed branch), upgraded to PL 1.9.4 and DeepSpeed 0.9.3, but the partition_activations flag within activation checkpointing produces no memory reduction.
I see some fishy behavior in deepspeed/runtime/activation_checkpointing/checkpointing.py, in the get_partition_size() method: the partition size on rank 0 is the same as the size of the whole layer, and mp_size is 1 despite running on 2 GPUs.
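For reference, that method essentially just divides the tensor size by mp_size (paraphrased below from checkpointing.py; the exact code may differ between versions), so with mp_size == 1 the "partition" is the whole tensor:

```python
# Paraphrased from deepspeed/runtime/activation_checkpointing/checkpointing.py
# (may differ slightly between versions). mp_size is derived from the mpu
# object handed to deepspeed.checkpointing.configure(); with mpu_=None it
# falls back to 1, so nothing is actually partitioned.
def get_partition_size(item):
    size = item.numel()
    partition_size = size / mp_size  # mp_size == 1 -> partition == whole layer
    return int(partition_size)
```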
I think it is because of this line in /home/local/CORP/aiani/anaconda3/envs/mingpt_sean_naren/lib/python3.7/site-packages/pytorch_lightning/strategies/deepspeed.py:
```python
deepspeed.checkpointing.configure(
    mpu_=None,
```
Why None?
The DeepSpeed docs describe mpu as: “An object that implements the following methods: get_model_parallel_rank/group/world_size, and get_data_parallel_rank/group/world_size”
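If I read that contract right, even a thin wrapper around torch.distributed process groups should satisfy it. A minimal sketch (SimpleMPU and its constructor arguments are my own hypothetical names, not from any library):

```python
import torch.distributed as dist

class SimpleMPU:
    """Hypothetical minimal object satisfying the mpu contract quoted above."""

    def __init__(self, model_parallel_group, data_parallel_group):
        self._mp_group = model_parallel_group
        self._dp_group = data_parallel_group

    # model-parallel side
    def get_model_parallel_rank(self):
        return dist.get_rank(group=self._mp_group)

    def get_model_parallel_group(self):
        return self._mp_group

    def get_model_parallel_world_size(self):
        return dist.get_world_size(group=self._mp_group)

    # data-parallel side
    def get_data_parallel_rank(self):
        return dist.get_rank(group=self._dp_group)

    def get_data_parallel_group(self):
        return self._dp_group

    def get_data_parallel_world_size(self):
        return dist.get_world_size(group=self._dp_group)
```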

See the Megatron implementation; it passes a real mpu object:
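From memory (argument names may vary across Megatron/DeepSpeed versions), their setup hands the mpu module straight to configure, which is what makes mp_size > 1:

```python
import deepspeed
from megatron import mpu  # Megatron's model-parallel utility module

# mpu_ is the first argument of deepspeed.checkpointing.configure();
# passing a real mpu is what gives mp_size > 1 so activations get split.
deepspeed.checkpointing.configure(
    mpu,
    partition_activations=True,
    contiguous_checkpointing=True,
    checkpoint_in_cpu=False,
    profile=False,
)
```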

I’d really appreciate some help here, guys! Thanks!