Finetuning 11B HF LLM on 8x GPU with 32GB RAM

janhula21 · June 24, 2023, 6:18am

Hi there,
I’m trying to figure out how to finetune FLAN-T5-XXL, or another model of similar size, on a node with 8 GPUs, each with 32GB of memory. I believe the size of the model should not be a problem, since this repo shows how to train a 45B LLM: GitHub - SeanNaren/minGPT: A minimal PyTorch Lightning OpenAI GPT w DeepSpeed Training! I also saw a Hugging Face + DeepSpeed setup for finetuning exactly this model: Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers

I have already tried every solution I could think of, but with no luck. I use the deepspeed_stage_3_offload strategy with 16-bit precision and load the HF model with this command:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map='auto', torch_dtype=torch.float16)
After this line is executed, the sharded weights are loaded onto the 8 GPUs without problems. But when the Trainer.fit method is executed, new processes are created, and it seems they all start to load the weights again:
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/8
At this point one of the GPUs goes out of memory.
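
A rough, weights-only estimate may help show why repeated loading could exhaust a 32GB GPU. This is only a sketch under stated assumptions: about 11e9 parameters, 2 bytes per parameter in fp16, an even split of weights across GPUs, and (unverified) that each of the 8 spawned processes places its own sharded copy on all 8 GPUs. Gradients, optimizer state, activations, and CUDA context overhead are ignored, so real usage would be higher.

```python
# Back-of-the-envelope GPU memory estimate (weights only).
# All constants are assumptions for illustration, not measured values.
NUM_PARAMS = 11e9            # approximate parameter count of an 11B model
BYTES_PER_PARAM_FP16 = 2     # fp16 stores each parameter in 2 bytes
NUM_GPUS = 8
NUM_PROCESSES = 8            # one process per GPU after Trainer.fit starts
GPU_MEMORY_GB = 32


def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# One full fp16 copy of the model.
full_copy_gb = weights_gb(NUM_PARAMS, BYTES_PER_PARAM_FP16)

# A single process spreading one copy evenly over all GPUs.
per_gpu_one_process_gb = full_copy_gb / NUM_GPUS

# Assumed failure mode: every process spreads its own copy over all GPUs,
# so each GPU ends up holding one shard per process.
per_gpu_all_processes_gb = per_gpu_one_process_gb * NUM_PROCESSES

print(f"one fp16 copy:              {full_copy_gb:.2f} GB")
print(f"per GPU, single process:    {per_gpu_one_process_gb:.2f} GB")
print(f"per GPU, all processes:     {per_gpu_all_processes_gb:.2f} GB")
print(f"headroom on a {GPU_MEMORY_GB} GB GPU: "
      f"{GPU_MEMORY_GB - per_gpu_all_processes_gb:.2f} GB")
```

Under these assumptions the weights alone would take most of each GPU once every process has loaded its copy, leaving little room for anything else. If instead the processes all place weights on the same device, that GPU would fill up even sooner.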

Does anybody know what could be causing the problem, and what a possible solution might be? I have started to believe that it is the combination of the HF model and the DeepSpeed integration in PL because, as I mentioned, it is possible to finetune this model with DeepSpeed.
I also see that the error occurs in the accelerate package:
File "…/site-packages/accelerate/utils/modeling.py", line 167, in set_module_tensor_to_device
new_value = value.to(device)

Thanks a lot for any suggestion.
