Is it ok to manually wrap both the teacher and student models?
If so, will each model be split into equal weight chunks and partitioned across the 6 GPUs?
The problem is that I am not seeing any memory reduction with automatic wrapping.
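For context, here is roughly what I mean by manually wrapping each model at the top level instead of using an auto_wrap_policy. This is just a minimal sketch: the `nn.Sequential` modules are placeholders standing in for my actual teacher/student models, and it assumes a 6-process `torchrun` launch with one process per GPU.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torchrun launched 6 processes, one per GPU
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder modules standing in for my real teacher/student models
teacher = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
student = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 1024)).cuda()

# Teacher is frozen (non-trainable)
for p in teacher.parameters():
    p.requires_grad = False

# Manually wrap each model separately, without an auto_wrap_policy
teacher = FSDP(teacher, device_id=torch.cuda.current_device())
student = FSDP(student, device_id=torch.cuda.current_device())
```

Is this the right way to do it, and would both wrapped models have their parameters sharded across all 6 ranks?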
Really appreciate your guidance, thanks in advance.
I also created this issue where I explain my models in more detail: FSDP not reducing memory for non-trainable submodule