I have a knowledge distillation scenario where I'd like to run FSDP on the teacher only, but I couldn't find a way to avoid wrapping the student as well.
My model = student + teacher. FSDP is used with an auto_wrap_policy that matches only my teacher's transformer blocks, and that part works: all teacher blocks show up as wrapped when I print the model. But I also see that the whole model is wrapped in a root FSDP wrapper.
I guess this is expected behavior, and it also means the student is sharded as well…
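For reference, here's roughly what my current setup looks like (I'm on Lightning 2.x with the `FSDPStrategy`; `TeacherBlock` is a placeholder for my real transformer block class, and the Trainer line is abbreviated):

```python
import functools

import torch.nn as nn
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class TeacherBlock(nn.Module):
    """Placeholder for the teacher's actual transformer block class."""
    ...


# The policy only matches the teacher's transformer blocks...
policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TeacherBlock},
)

# ...yet the root LightningModule (teacher + student) still ends up
# inside a top-level FSDP wrapper.
strategy = FSDPStrategy(auto_wrap_policy=policy)
# trainer = L.Trainer(strategy=strategy, accelerator="gpu", devices=...)
```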
However, I'd like to run the student in DDP. Is there a way to mix the two?
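Conceptually, in plain PyTorch terms, this is the split I'm after (just to illustrate the intent; the helper name is made up, and I'm assuming the process group is already initialized):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_distillation(teacher: nn.Module, student: nn.Module, auto_wrap_policy):
    """Hypothetical helper: shard only the teacher, replicate the student.

    Assumes torch.distributed is already initialized and both modules
    are already on the correct device for this rank.
    """
    teacher = FSDP(teacher, auto_wrap_policy=auto_wrap_policy)  # large teacher: sharded
    student = DDP(student)                                      # smaller student: replicated
    return teacher, student
```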
The same question applies to DeepSpeed: the Lightning documentation is not clear on whether it's possible to wrap only part of the model.
Thanks!