FSDP for both pretrained teacher and trainable student

Hi,
When using FSDP for distillation, should I include the pretrained (requires_grad=False) teacher and the student in the same model?
I'd like to partition the teacher weights as well (my teacher is much bigger than the student). Is there a way to achieve this?

Hey

I think you can include both of them under the same LightningModule and then decide yourself which submodules to wrap with FSDP. This can be done by manual wrapping in the configure_sharded_model hook:

https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html#manual-wrapping

There is an example in the docs:

    def configure_sharded_model(self):
        # wrap any of your layers
        # you can wrap all of them or just the ones you need
        self.linear_layer = wrap(self.linear_layer)
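
For the distillation setup specifically, a minimal sketch could look like the one below. This is just an illustration, not the official recipe: the class and attribute names (DistillationModule, self.teacher, self.student) and the MSE distillation loss are placeholders I made up, and I'm assuming wrap from torch.distributed.fsdp.wrap and the configure_sharded_model hook from the linked docs.

    import torch
    import lightning.pytorch as pl
    from torch.distributed.fsdp.wrap import wrap


    class DistillationModule(pl.LightningModule):
        def __init__(self, teacher: torch.nn.Module, student: torch.nn.Module):
            super().__init__()
            self.teacher = teacher
            self.student = student
            # Freeze the teacher; it is only used for inference.
            self.teacher.requires_grad_(False)

        def configure_sharded_model(self):
            # Manually wrap both submodules so each is sharded by FSDP,
            # including the large frozen teacher.
            self.teacher = wrap(self.teacher)
            self.student = wrap(self.student)

        def training_step(self, batch, batch_idx):
            x = batch
            with torch.no_grad():
                teacher_out = self.teacher(x)
            student_out = self.student(x)
            # Placeholder distillation loss: match student to teacher outputs.
            loss = torch.nn.functional.mse_loss(student_out, teacher_out)
            return loss

        def configure_optimizers(self):
            # Only the student parameters are trained.
            return torch.optim.Adam(self.student.parameters(), lr=1e-4)

The idea is that both submodules get their own FSDP wrapping, so the teacher's weights are also sharded across ranks even though they receive no gradients.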

That is very useful, thank you.
However, if I don't do manual wrapping, I understand auto wrapping is the default behavior.
Is there a problem if part of my model, or a child submodule of my model (the teacher), has requires_grad=False or is called in the forward method under torch.no_grad?
Will FSDP work on such submodules of the model that require no grad?

Is it OK to manually wrap both the teacher and the student models?
If so, will each model be split into equal weight chunks and partitioned across the 6 GPUs?
The problem is that I am not seeing any memory reduction with automatic wrapping.
I really appreciate your guidance, thanks in advance.
I also created this issue where I explain my models in more detail: FSDP not reducing memory for non-trainable submodule

The automatic wrapping only applies to modules with >= 10k parameters by default in FSDP. This can be controlled by an auto-wrap policy: FullyShardedDataParallel — PyTorch 2.0 documentation

So you can either manually wrap the layers, or pass an auto-wrap policy that conditionally defines when a module gets wrapped.
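
For example, a sketch of the second option, assuming Lightning's FSDPStrategy and PyTorch's size_based_auto_wrap_policy (the min_num_params threshold here is just an illustrative value, not a recommendation):

    import functools
    from lightning.pytorch.strategies import FSDPStrategy
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    # Wrap every submodule that has at least 1M parameters (illustrative threshold).
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

    strategy = FSDPStrategy(auto_wrap_policy=policy)
    # trainer = pl.Trainer(accelerator="gpu", devices=6, strategy=strategy)

Lowering the threshold (or wrapping manually) is what actually splits the large submodules into separately sharded FSDP units, which is where the memory savings come from.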
