FSDP not reducing memory for non-trainable submodule

Hi, I have a LightningModule with two models (a simplified sketch follows the list):

  1. Teacher with requires_grad=False: a Hugging Face pretrained GPT-2 XL model, placed in .eval() mode and run under torch.no_grad() in the forward pass (~1.5B params).
  2. Trainable student (~124M params).
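
Here is a simplified sketch of the module. The class name, the KL-divergence placeholder loss, and the optimizer settings are illustrative; my real code differs in the details:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from transformers import AutoModelForCausalLM


class DistillationModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Frozen teacher (~1.5B params): kept in eval mode, never updated
        self.teacher = AutoModelForCausalLM.from_pretrained("gpt2-xl")
        self.teacher.eval()
        for p in self.teacher.parameters():
            p.requires_grad = False

        # Trainable student (~124M params)
        self.student = AutoModelForCausalLM.from_pretrained("gpt2")

    def training_step(self, batch, batch_idx):
        # Teacher forward pass without building a graph
        with torch.no_grad():
            teacher_logits = self.teacher(**batch).logits
        student_logits = self.student(**batch).logits

        # Placeholder distillation loss (my actual loss is more involved)
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        return loss

    def configure_optimizers(self):
        # Only the student's parameters are optimized
        return torch.optim.AdamW(self.student.parameters(), lr=1e-4)
```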

I am trying to leverage FSDP to partition both the teacher and the student weights across 6 GPUs. I am particularly interested in reducing the memory footprint of the teacher, but I am targeting both.
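
The Trainer is configured roughly like this (the strategy string and precision value shown are just what I have been experimenting with, not my exact config):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=6,
    strategy="fsdp",        # also tried the DeepSpeed ZeRO-3 strategy
    precision="16-mixed",
)
# trainer.fit(DistillationModule(), train_dataloaders=train_loader)  # train_loader assumed defined
```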

However, my runs with FSDP/deepspeed_zero_3 don't show any improvement in memory usage compared to ddp_sharded/deepspeed_zero_2.

Is there something fundamentally wrong with my approach? Why doesn't FSDP reduce the memory usage of my teacher?