I have a knowledge distillation scenario where I'd like to run FSDP on the teacher only, but I couldn't find a way to avoid wrapping the student as well.
My model = student + teacher. FSDP is used with an auto_wrap_policy that matches only my teacher's transformer blocks, and that part works: all teacher blocks show up as wrapped when I print the model. But I also see that the whole model is wrapped in a root FSDP wrapper.
I guess this is expected behavior, and it also means the student is sharded as well…
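For reference, here's roughly what my current setup looks like (I'm on Lightning 2.x with the `FSDPStrategy`; `TeacherBlock` is a placeholder for my real transformer block class, and the Trainer line is abbreviated):

```python
import functools

import torch.nn as nn
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class TeacherBlock(nn.Module):
    """Placeholder for the teacher's actual transformer block class."""
    ...


# The policy only matches the teacher's transformer blocks...
policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TeacherBlock},
)

# ...yet the root LightningModule (teacher + student) still ends up
# inside a top-level FSDP wrapper.
strategy = FSDPStrategy(auto_wrap_policy=policy)
# trainer = L.Trainer(strategy=strategy, accelerator="gpu", devices=...)
```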
However, I'd like to run the student in DDP. Is there a way to mix the two?
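Conceptually, in plain PyTorch terms, this is the split I'm after (just to illustrate the intent; the helper name is made up, and I'm assuming the process group is already initialized):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_distillation(teacher: nn.Module, student: nn.Module, auto_wrap_policy):
    """Hypothetical helper: shard only the teacher, replicate the student.

    Assumes torch.distributed is already initialized and both modules
    are already on the correct device for this rank.
    """
    teacher = FSDP(teacher, auto_wrap_policy=auto_wrap_policy)  # large teacher: sharded
    student = DDP(student)                                      # smaller student: replicated
    return teacher, student
```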
The same question applies to DeepSpeed: the Lightning documentation is not clear on whether it's possible to wrap only part of the model.
Thanks!