FullyShardedDataParallel no memory decrease


The model summary has not yet been updated to account for sharded strategies such as FSDP, so the size shown there can be misleading. A separate “params per device” column is planned, similar to the one the summary shows when using DeepSpeed.
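
As a rough illustration of what a per-device figure would look like, here is a minimal sketch that assumes full sharding splits the parameters evenly across ranks. The helper names `params_per_device` and `param_memory_mib` are made up for this example and are not part of Lightning or PyTorch. The estimate ignores padding, any modules left unwrapped, and gradients, optimizer state and activations.

```python
import math


def params_per_device(total_params: int, world_size: int) -> int:
    """Approximate parameter count held by each rank under full sharding.

    Hypothetical helper: assumes an even split, rounded up, and ignores
    padding and modules that are not wrapped.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return math.ceil(total_params / world_size)


def param_memory_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory in MiB for the parameters alone (4 bytes each for fp32)."""
    return num_params * bytes_per_param / 2**20


if __name__ == "__main__":
    total = 1_000_000_000  # example value, not taken from this thread
    for world_size in (1, 2, 8):
        local = params_per_device(total, world_size)
        print(world_size, local, round(param_memory_mib(local), 1))
```

Peak memory during training will be higher than this estimate, because FSDP temporarily gathers the full parameters of each wrapped unit during forward and backward. If the whole model ends up in a single wrapped unit, that gather covers every parameter at once, which is one reason memory may not appear to decrease.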

Can you show us how you wrapped the model and applied the policy?