Converting DeepSpeed checkpoints to an fp32 checkpoint

Hi @awaelchli, that worked. Thanks a lot! Can you explain why this worked while the Lightning method didn't?
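
For anyone who finds this later, here's a minimal sketch of the two conversion routes I'm comparing (paths are placeholders, and I'm assuming these are the relevant helpers from the DeepSpeed and Lightning docs):

```python
# DeepSpeed's own consolidation utility: merges the sharded ZeRO
# optimizer/parameter states into a single fp32 state dict on disk.
from deepspeed.utils.zero_to_fp32 import (
    convert_zero_checkpoint_to_fp32_state_dict as ds_convert,
)

ds_convert("outputs/last.ckpt", "outputs/last_fp32.pt")

# Lightning's helper, which wraps the same consolidation logic:
from pytorch_lightning.utilities.deepspeed import (
    convert_zero_checkpoint_to_fp32_state_dict as pl_convert,
)

pl_convert("outputs/last.ckpt", "outputs/last_fp32.pt")
```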
I also wanted to understand how checkpoint saving and loading actually work with DeepSpeed. Since I am fine-tuning the model on 6 GPUs, shouldn't I have 6 optimizer state files? In the picture above there's only one. Is that correct, or is my understanding of how this works wrong?
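
For reference, a quick way to inspect what's actually inside the checkpoint directory (the path is a placeholder for wherever my Lightning run saves checkpoints):

```python
from pathlib import Path

# Walk the DeepSpeed checkpoint directory and print its contents.
ckpt_dir = Path("outputs/last.ckpt")
for f in sorted(ckpt_dir.rglob("*")):
    print(f.relative_to(ckpt_dir))

# With ZeRO on 6 GPUs I would have expected one optimizer shard per rank:
#   zero_pp_rank_0_mp_rank_00_optim_states.pt
#   ...
#   zero_pp_rank_5_mp_rank_00_optim_states.pt
```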

I was going through the DeepSpeed docs on saving the model and found this note:

Important: all processes must call this method and not just the process with rank 0. It is because each process needs to save its master weights and scheduler+optimizer states. This method will hang waiting to synchronize with other processes if it’s called just for the process with rank 0.

Does this call happen automatically, or is there something we need to do on our end? I am using SLURM to fine-tune these models, so I don't get direct access to the GPU machines once the job is submitted. Is my model only saving the optimizer states from the first GPU?
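
Just to be explicit about how I read that note, here's a sketch of the pattern I think it's describing (`save_all_ranks` is a hypothetical helper of mine; `engine` would be the DeepSpeedEngine returned by `deepspeed.initialize`, and I'm assuming Lightning normally issues this call itself):

```python
import torch.distributed as dist

def save_all_ranks(engine, save_dir: str, step: int) -> None:
    """Sketch: every rank must call save_checkpoint, per the DeepSpeed note."""
    # WRONG: guarding by rank would deadlock, because save_checkpoint runs
    # collectives that all ranks must enter together:
    # if dist.get_rank() == 0:
    #     engine.save_checkpoint(save_dir, tag=f"step-{step}")

    # RIGHT: all ranks call it; each one writes its own
    # zero_pp_rank_<rank>_mp_rank_00_optim_states.pt shard.
    engine.save_checkpoint(save_dir, tag=f"step-{step}")
```

Is that roughly what should be happening under the hood?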

Thank you