Converting DeepSpeed checkpoints to an fp32 checkpoint

Hi,
I fine-tuned flan-t5-xl on 6 A100 GPUs using DeepSpeed stage 2. The checkpoint folder has two files, one for the optimizer states and one for the model states:

[image: screenshot of the checkpoint folder contents]

When I try to use the conversion function given in the docs (Train 1 trillion+ parameter models — PyTorch Lightning 2.0.1.post0 documentation) to convert this checkpoint to fp32, I get the following error:

2023-04-19 11:26:14.947199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic ...
collating deepspeed ckpt to single file...
Processing zero checkpoint './checkpoints/google/flan-t5-xl-a100_80gb-ds2-bs2/epoch=03-val_loss=0.00.ckpt/checkpoint'
Detected checkpoint of type zero stage 2, world_size: 1
Parsing checkpoint created by deepspeed==0.8.3
Reconstructed fp32 state dict with 558 params 2849757184 elements
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(save_path, output_path)
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/lightning/pytorch/utilities/deepspeed.py", line 96, in $    optim_state = torch.load(optim_files[0], map_location=CPU_DEVICE)
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/pickle.py", line 1210, in load
    dispatch[key[0]](self)
  File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/pickle.py", line 1251, in load_binpersid
    self.append(self.persistent_load(pid))
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 1079, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage).storage().untyped()
OSError: [Errno 14] Bad address
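
For context, test.py is essentially just the call from the Lightning docs. The paths below are reconstructed from the traceback; the output path is only a placeholder:

```python
# Roughly what test.py does (paths reconstructed from the traceback; output path is a placeholder).
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

save_path = "./checkpoints/google/flan-t5-xl-a100_80gb-ds2-bs2/epoch=03-val_loss=0.00.ckpt"
output_path = "./checkpoints/flan-t5-xl-fp32.ckpt"  # where the single fp32 file should end up
convert_zero_checkpoint_to_fp32_state_dict(save_path, output_path)
```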

I am not able to understand or debug this issue. Can anyone explain what’s going on here?
Also, do the checkpoint files shown above make sense? Since I fine-tuned the model on 6 GPUs, I was expecting more sharded files for either the model states or the optimizer states. Is this understanding correct?

Thank you

If you go up one directory, you should find a file called zero_to_fp32.py. Could you run that? It is also possible that your checkpoint is corrupted. Check that you can manually torch.load() the files in the checkpoint folder without errors.
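
Something like this quick check (using the directory from your traceback, and globbing so we don't assume the exact file names) should tell you whether the raw files are readable:

```python
# Sanity check: try to torch.load() every .pt file inside the DeepSpeed checkpoint folder.
# The directory is taken from the traceback above; zero_to_fp32.py sits one level up from it.
import glob
import torch

ckpt_dir = "./checkpoints/google/flan-t5-xl-a100_80gb-ds2-bs2/epoch=03-val_loss=0.00.ckpt/checkpoint"
for path in sorted(glob.glob(f"{ckpt_dir}/*.pt")):
    state = torch.load(path, map_location="cpu")
    print(f"{path}: OK ({type(state).__name__})")
```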

Hi @awaelchli, that worked. Thanks a lot! Can you explain why that worked but the Lightning method did not?
I also wanted to understand how checkpoint saving and loading actually work with DeepSpeed. Since I am fine-tuning the model on 6 GPUs, shouldn't I have 6 optimizer-state files? In the screenshot above there is only one optimizer-state file. Is that correct, or is my understanding of how this works wrong?

I was going through the DeepSpeed docs on saving the model and found this note:

Important: all processes must call this method and not just the process with rank 0. It is because each process needs to save its master weights and scheduler+optimizer states. This method will hang waiting to synchronize with other processes if it’s called just for the process with rank 0.
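
If I understand correctly, this note refers to DeepSpeed's own engine.save_checkpoint(). As a rough sketch of raw DeepSpeed usage (not my Lightning code; the model and config here are just placeholders), every rank would execute something like:

```python
# Sketch of raw DeepSpeed checkpointing (placeholder model/config); launch with the
# `deepspeed` launcher so that every rank runs this script.
import torch
import deepspeed

model = torch.nn.Linear(10, 10)  # placeholder model
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# ... training steps ...

# Every rank must call this; it synchronizes internally and each rank writes its own
# optimizer-state shard into the checkpoint directory.
engine.save_checkpoint("checkpoints/", tag="example")
```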

Does this call happen automatically, or is there something we need to do on our end? I am using Slurm to fine-tune these models, so I don't get direct access to the GPU machines once the job is submitted. Is my model only saving the optimizer states from the first GPU?

Thank you