FSDP for both pretrained teacher and trainable student

Hi,
When using FSDP for distillation, should I include the pretrained (requires_grad=False) teacher and the student in the same model?
I'd like to partition the teacher weights as well (my teacher is much bigger than the student). Is there a way to achieve this?

Hey

I think you can include both of them under the same LightningModule and then decide yourself which submodules to wrap with FSDP. This can be done by manual wrapping in the configure_sharded_model hook:

https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html#manual-wrapping

There is an example in the docs:

    def configure_sharded_model(self):
        # wrap any of your layers
        # you can wrap all of them or just the ones you need
        self.linear_layer = wrap(self.linear_layer)
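
For the distillation setup specifically, a minimal sketch could look like the one below. This is just an illustration, not the official recipe: the class and attribute names (DistillationModule, self.teacher, self.student) and the MSE distillation loss are placeholders I made up, and I'm assuming wrap from torch.distributed.fsdp.wrap and the configure_sharded_model hook from the linked docs.

    import torch
    import lightning.pytorch as pl
    from torch.distributed.fsdp.wrap import wrap


    class DistillationModule(pl.LightningModule):
        def __init__(self, teacher: torch.nn.Module, student: torch.nn.Module):
            super().__init__()
            self.teacher = teacher
            self.student = student
            # Freeze the teacher; it is only used for inference.
            self.teacher.requires_grad_(False)

        def configure_sharded_model(self):
            # Manually wrap both submodules so each is sharded by FSDP,
            # including the large frozen teacher.
            self.teacher = wrap(self.teacher)
            self.student = wrap(self.student)

        def training_step(self, batch, batch_idx):
            x = batch
            with torch.no_grad():
                teacher_out = self.teacher(x)
            student_out = self.student(x)
            # Placeholder distillation loss: match student to teacher outputs.
            loss = torch.nn.functional.mse_loss(student_out, teacher_out)
            return loss

        def configure_optimizers(self):
            # Only the student parameters are trained.
            return torch.optim.Adam(self.student.parameters(), lr=1e-4)

The idea is that both submodules get their own FSDP wrapping, so the teacher's weights are also sharded across ranks even though they receive no gradients.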

That is very useful, thank you.
However, if I don't do manual wrapping, I understand auto wrapping is the default behavior.
Is there a problem if part of my model, or a child submodule of my model (the teacher), has requires_grad=False or is called in the forward method under torch.no_grad?
Will FSDP work on such submodules of the model that require no grad?

Is it OK to manually wrap both the teacher and the student models?
If so, will each model be split into equal weight chunks and partitioned across the 6 GPUs?
The problem is that I am not seeing any memory reduction with automatic wrapping.
I really appreciate your guidance, thanks in advance.
I also created this issue where I explain my models in more detail: FSDP not reducing memory for non-trainable submodule

The automatic wrapping only applies to modules with >= 10k parameters by default in FSDP. This can be controlled by an auto-wrap policy: FullyShardedDataParallel — PyTorch 2.0 documentation

So you can either manually wrap the layers, or pass an auto-wrap policy that conditionally defines when a module gets wrapped.
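
For example, a sketch of the second option, assuming Lightning's FSDPStrategy and PyTorch's size_based_auto_wrap_policy (the min_num_params threshold here is just an illustrative value, not a recommendation):

    import functools
    from lightning.pytorch.strategies import FSDPStrategy
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    # Wrap every submodule that has at least 1M parameters (illustrative threshold).
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

    strategy = FSDPStrategy(auto_wrap_policy=policy)
    # trainer = pl.Trainer(accelerator="gpu", devices=6, strategy=strategy)

Lowering the threshold (or wrapping manually) is what actually splits the large submodules into separately sharded FSDP units, which is where the memory savings come from.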
