Hi,
I fine-tuned flan-t5-xl on 6 A100 GPUs using DeepSpeed stage 2. The resulting checkpoint folder contains just two files, one for the model states and one for the optimizer states.
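From memory, the layout looks roughly like this (I'm reproducing DeepSpeed's usual file naming here, so treat the exact names as approximate):

```
epoch=03-val_loss=0.00.ckpt/
└── checkpoint/
    ├── mp_rank_00_model_states.pt
    └── zero_pp_rank_0_mp_rank_00_optim_states.pt
```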
When I try to convert this checkpoint to fp32 using the conversion function given in the docs (Train 1 trillion+ parameter models — PyTorch Lightning 2.0.1.post0 documentation), I hit an error.
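For reference, test.py is essentially the snippet from the docs with my paths filled in (the output file name is just a placeholder here):

```python
# test.py: collate the sharded DeepSpeed checkpoint into a single fp32 state dict
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

# Checkpoint directory written by the Lightning Trainer during fine-tuning
save_path = "./checkpoints/google/flan-t5-xl-a100_80gb-ds2-bs2/epoch=03-val_loss=0.00.ckpt"
# Where the consolidated fp32 checkpoint should be written (placeholder name)
output_path = "./flan-t5-xl-fp32.ckpt"

convert_zero_checkpoint_to_fp32_state_dict(save_path, output_path)
```

Running this produces: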
```
2023-04-19 11:26:14.947199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic$
collating deepspeed ckpt to single file...
Processing zero checkpoint './checkpoints/google/flan-t5-xl-a100_80gb-ds2-bs2/epoch=03-val_loss=0.00.ckpt/checkpoint'
Detected checkpoint of type zero stage 2, world_size: 1
Parsing checkpoint created by deepspeed==0.8.3
Reconstructed fp32 state dict with 558 params 2849757184 elements
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(save_path, output_path)
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/lightning/pytorch/utilities/deepspeed.py", line 96, in $
    optim_state = torch.load(optim_files[0], map_location=CPU_DEVICE)
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/pickle.py", line 1210, in load
    dispatch[key[0]](self)
  File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/pickle.py", line 1251, in load_binpersid
    self.append(self.persistent_load(pid))
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/cluster/home/kujain/.local/lib/python3.8/site-packages/torch/serialization.py", line 1079, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage).storage().untyped()
OSError: [Errno 14] Bad address
```
I am not able to understand or debug this issue. Can anyone explain what’s going on here?
Also, do the checkpoint files above make sense? Since I fine-tuned the model on 6 GPUs, I was expecting more sharded files (e.g., one optimizer-state file per rank), yet the conversion log reports world_size: 1. Is this understanding correct?
Thank you