Is it possible to run part of the model in DeepSpeed/FSDP and the rest in DDP?

I have a knowledge distillation scenario where I’d like to run FSDP on the teacher only, but I couldn’t find a way to avoid wrapping the student as well.
My model = student + teacher. I use FSDP with an auto_wrap_policy that matches only the transformer blocks of my teacher, and that part works: when I print the model, I see all teacher blocks wrapped. However, I also see that the whole model is wrapped in a top-level FSDP wrapper.
I guess this is expected behavior, and it also means the student is sharded as well…
However, I’d like to run the student in DDP. Is there a way to mix the two?
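
For reference, my setup looks roughly like the snippet below (a minimal sketch rather than my actual code; `TeacherBlock` is a stand-in for my teacher’s transformer block class, and the imports assume Lightning 2.x):

```python
import functools

import torch.nn as nn
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class TeacherBlock(nn.Module):
    """Stand-in for the teacher's transformer block class."""

    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(128, 128)

    def forward(self, x):
        return self.ff(x)


# Only modules of type TeacherBlock are wrapped into their own FSDP units;
# the student's layers are not matched by this policy.
policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={TeacherBlock})

trainer = Trainer(strategy=FSDPStrategy(auto_wrap_policy=policy), accelerator="gpu", devices=4)
```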

The same question applies to DeepSpeed: the Lightning documentation is not clear on whether it is possible to wrap only part of the model.
Thanks!

@andrasiani Yes, in FSDP there is a possibility to wrap modules manually. Have you tried this?
https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html#manual-wrapping
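
A minimal sketch of what that could look like in your case (untested, assuming Lightning 2.x; `teacher`, `student`, and the `.blocks` attribute are placeholders for your own modules):

```python
import lightning.pytorch as pl
from torch.distributed.fsdp.wrap import wrap


class DistillModule(pl.LightningModule):
    def __init__(self, teacher, student):
        super().__init__()
        # teacher and student are your own nn.Modules
        self.teacher = teacher
        self.student = student

    def configure_model(self):
        # With FSDPStrategy, Lightning calls this hook inside the FSDP wrapping
        # context, so `wrap()` turns each teacher block into its own FSDP unit.
        for i, block in enumerate(self.teacher.blocks):
            self.teacher.blocks[i] = wrap(block)
        # The student is deliberately left unwrapped here. Note that the strategy
        # still wraps the top-level LightningModule in FSDP (see below).


# trainer = pl.Trainer(strategy="fsdp", accelerator="gpu", devices=4)
```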

Using the auto_wrap_policy, the same should be possible too. And yes, as far as I know it is normal that the top-level module is wrapped with FSDP.
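
If I remember correctly, recent versions of `FSDPStrategy` also accept a set of layer classes directly as the `auto_wrap_policy`, which may be the simplest way to restrict wrapping to the teacher blocks (again just a sketch, with `TeacherBlock` as a placeholder for your teacher’s block class):

```python
from lightning.pytorch.strategies import FSDPStrategy

# Only modules of type TeacherBlock become their own FSDP units; everything
# else stays in the top-level FSDP wrapper around the LightningModule.
strategy = FSDPStrategy(auto_wrap_policy={TeacherBlock})
```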

For DeepSpeed, I don’t think it is possible to control that, but I haven’t checked in detail.