Finetuning an 11B HF LLM on 8x GPUs with 32GB each

Hi there,
I'm trying to figure out how to finetune FLAN-T5-XXL (11B parameters) or another model of similar size on a node with 8 GPUs, each with 32GB of memory. I believe the model size itself should not be a problem, since this repo (GitHub - SeanNaren/minGPT: A minimal PyTorch Lightning OpenAI GPT w DeepSpeed Training!) shows how to train a 45B LLM. I also saw a Hugging Face + DeepSpeed setup for finetuning exactly this model: Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers.

I have already tried every solution I could think of, but with no luck. I use the deepspeed_stage_3_offload strategy with 16-bit precision and load the HF model with this command:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map='auto', torch_dtype=torch.float16)
After this line is executed, the sharded weights are loaded onto the 8 GPUs without problems. But when Trainer.fit is executed, new processes are created and they all seem to start loading the weights again:
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/8
At this point one of the GPUs runs out of memory.
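
In case it helps, here is a minimal sketch of what my training script roughly looks like (the LightningModule, optimizer settings, and dataloader below are simplified placeholders, not my exact code):

```python
import torch
import pytorch_lightning as pl
from transformers import AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xxl"

class Seq2SeqFinetuner(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # This is the load that shards the weights across the 8 GPUs.
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            model_name, device_map="auto", torch_dtype=torch.float16
        )

    def training_step(self, batch, batch_idx):
        # Placeholder step: the batch is assumed to already contain
        # input_ids, attention_mask, and labels.
        return self.model(**batch).loss

    def configure_optimizers(self):
        # Placeholder optimizer; hyperparameters are not my exact ones.
        return torch.optim.AdamW(self.parameters(), lr=1e-5)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    precision=16,
    strategy="deepspeed_stage_3_offload",
)
# Dataloader construction omitted; the OOM happens on this call:
# trainer.fit(Seq2SeqFinetuner(), train_dataloaders=...)
```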

Does anybody know what the cause of the problem could be, and a possible solution? I have started to believe that it is the combination of the HF model loading and the DeepSpeed integration in PL, because, as mentioned above, it is possible to finetune this model with DeepSpeed alone.
I also see that the error occurs in the accelerate package:
File "…/site-packages/accelerate/utils/modeling.py", line 167, in set_module_tensor_to_device
    new_value = value.to(device)

Thanks a lot for any suggestions.