Is it ok to manually wrap both the teacher and student models?
If so, will each model be split into equal weight chunks and partitioned across the 6 GPUs?
The problem is that I am not seeing any memory reduction with automatic wrapping.
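For context, here is roughly what I mean by manually wrapping each model at the top level instead of using an auto_wrap_policy. This is just a minimal sketch: the `nn.Sequential` modules are placeholders standing in for my actual teacher/student models, and it assumes a 6-process `torchrun` launch with one process per GPU.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torchrun launched 6 processes, one per GPU
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder modules standing in for my real teacher/student models
teacher = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
student = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1024)).cuda()

# Teacher is frozen (non-trainable)
for p in teacher.parameters():
    p.requires_grad = False

# Manually wrap each model separately, without an auto_wrap_policy
teacher = FSDP(teacher, device_id=torch.cuda.current_device())
student = FSDP(student, device_id=torch.cuda.current_device())
```

Is this the right way to do it, and would both wrapped models have their parameters sharded across all 6 ranks?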
Really appreciate your guidance, thanks in advance.
I also created this issue where I explain my models in more detail: FSDP not reducing memory for non-trainable submodule