Lack of documentation on deepspeed / fsdp

kchoi · April 24, 2023, 6:49pm

Hi,
First of all, thanks for the great product. I’d like to ask for some help / improvement on the documentation.

Here’s what I’ve been through.

I started a training run with DeepSpeed stage 2, multiple GPUs, and 16-bit precision, based on this documentation. But the training/validation loss didn’t decrease at all.
While debugging the model (in a slightly different but very similar environment: py3.10, torch2, cudnn 11.7, and 8 GPUs), I set up the Trainer like this.

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy=DeepSpeedStrategy(
        stage=2,
        logging_batch_size_per_gpu=batch_size,
        pin_memory=True,
    ),
    precision="16-mixed",
)

Unfortunately, by this point I was already guessing what to do based on 1-2 year old Q&A threads and code examples found by googling, because what is probably the most exhaustive doc is from two years ago: Accessible Multi-Billion Parameter Model Training with PyTorch Lightning + DeepSpeed | by PyTorch Lightning team | PyTorch Lightning Developer Blog.

Anyway, it seemed to work until it tried to save a checkpoint. I got a pickling error, so I removed my ModelCheckpoint callback from the Trainer.
But I still had the same error. So I started to believe (by guessing) that the deepspeed plugin probably has its own logic that saves checkpoints no matter what (but why…?), and it still fails, as in issue #17369.
So I thought something was probably off, either in Lightning or DeepSpeed. I finally tried fsdp, but I ran into this error.

lightning.fabric.utilities.exceptions.MisconfigurationException: `gradient_clip_algorithm='norm'` is currently not supported for `FSDPMixedPrecisionPlugin`
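A possible workaround, as a rough sketch only: the message rejects the 'norm' algorithm specifically, so one option is to drop gradient clipping, and another is to switch to 'value'. The helper name `fsdp_trainer_kwargs` is hypothetical, and it is an assumption (not confirmed here) that value-based clipping is accepted by the FSDP precision plugin in this Lightning version.

```python
from typing import Any, Dict, Optional


def fsdp_trainer_kwargs(clip_val: Optional[float] = None) -> Dict[str, Any]:
    """Build Trainer keyword arguments that avoid norm-based clipping under FSDP."""
    kwargs: Dict[str, Any] = {
        "accelerator": "gpu",
        "devices": 8,
        "strategy": "fsdp",
        "precision": "16-mixed",
    }
    if clip_val is not None:
        # Assumption: "value" clipping is accepted where "norm" is rejected.
        kwargs["gradient_clip_val"] = clip_val
        kwargs["gradient_clip_algorithm"] = "value"
    return kwargs


# Usage (needs lightning and GPUs, so it is left commented out):
# import lightning.pytorch as pl
# trainer = pl.Trainer(**fsdp_trainer_kwargs(clip_val=0.5))
```

Note that clipping by value is not equivalent to clipping by norm, so the threshold may need retuning if the model relied on norm clipping.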

Though I’ve written this up briefly, the whole process took about 2-3 hours, since I have to wait a while to see each error, and googling/searching on GitHub can take a long time without bearing any fruit.
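For what it's worth, a pickling error like the one mentioned above can sometimes be narrowed down without waiting for a run to reach the checkpoint step. Below is a minimal sketch using only the standard library. `find_unpicklable` and the `Dummy` class are hypothetical names for illustration, and this is not a description of what Lightning or DeepSpeed does internally when saving.

```python
import pickle
import threading
from typing import Any, Dict


def find_unpicklable(obj: Any) -> Dict[str, str]:
    """Return {attribute name: error message} for attributes that fail to pickle."""
    failures: Dict[str, str] = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle can raise several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class Dummy:
    """Hypothetical stand-in for a module or callback that holds state."""

    def __init__(self) -> None:
        self.learning_rate = 1e-3
        self.lock = threading.Lock()  # lock objects cannot be pickled


# Print the names of the attributes that cannot be pickled.
print(sorted(find_unpicklable(Dummy())))
```

Running the same function on the LightningModule or on a callback instance, before calling `trainer.fit`, might point at the offending attribute much sooner than a full training run would.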

All in all, I think the documentation is missing some parts on how to use the deepspeed plugins. The API doc only has brief explanations. Of course it’s cool that PL almost works with just one or a few line changes, but in reality, users like me seem to need to change some more code, which is still super cool as long as I can understand what I’m supposed to do and what I can expect. Thanks again, and I’d be happy to provide more context.
