Lack of documentation on DeepSpeed / FSDP

Hi,
First of all, thanks for the great product. I’d like to ask for some help / improvement on the documentation.

Here’s what I’ve been through.

  • I started a training run with DeepSpeed stage 2, multiple GPUs, and 16-bit precision, based on this documentation. But the training/validation loss didn’t decrease at all.

  • To debug the model (in a slightly different but very similar environment: py3.10, torch2, CUDA 11.7, and 8 GPUs), I set up the Trainer something like this.

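import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

# batch_size is defined elsewhere in my script
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy=DeepSpeedStrategy(stage=2, logging_batch_size_per_gpu=batch_size, pin_memory=True),
    precision="16-mixed",
)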

Unfortunately, by this point I was already guessing what to do, based on 1-2 year old Q&A threads and code examples found by googling, because what is probably the most exhaustive doc is from two years ago: Accessible Multi-Billion Parameter Model Training with PyTorch Lightning + DeepSpeed | by PyTorch Lightning team | PyTorch Lightning Developer Blog.

  • Anyway, it seemed to work until it tried to save a checkpoint. I got a pickling error, so I removed my ModelCheckpoint callback from the Trainer.
    But I still got the same error. So I started to believe (by guessing) that the DeepSpeed plugin somehow has its own logic that saves checkpoints no matter what (but why…?), and it still fails, as in issue #17369.
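
One possible explanation for checkpoints still being written after the callback was removed, offered as a guess rather than a confirmed diagnosis: Trainer has an enable_checkpointing argument that defaults to True and adds a default ModelCheckpoint when none is passed. A minimal sketch of switching it off is below. The helper names trainer_kwargs_without_checkpointing and build_trainer are hypothetical, the other settings just mirror the setup described above, and whether this avoids the pickling error from #17369 is untested.

```python
from typing import Any, Dict


def trainer_kwargs_without_checkpointing(**overrides: Any) -> Dict[str, Any]:
    """Return Trainer keyword arguments with automatic checkpointing disabled.

    Hypothetical helper for illustration; the defaults mirror the setup
    described in this post and can be overridden by the caller.
    """
    kwargs: Dict[str, Any] = {
        "accelerator": "gpu",
        "devices": 8,
        "precision": "16-mixed",
        # Without this flag, Trainer configures a default ModelCheckpoint
        # callback even when none is passed in `callbacks`.
        "enable_checkpointing": False,
    }
    kwargs.update(overrides)
    return kwargs


def build_trainer(**overrides: Any):
    """Build a DeepSpeed stage-2 Trainer that does not save checkpoints."""
    # Imported lazily so the kwargs helper can be inspected without Lightning.
    import lightning.pytorch as pl
    from lightning.pytorch.strategies import DeepSpeedStrategy

    return pl.Trainer(
        strategy=DeepSpeedStrategy(stage=2),
        **trainer_kwargs_without_checkpointing(**overrides),
    )
```

With this, build_trainer(devices=2) would create a Trainer on two GPUs with no checkpoint callback attached. This only helps for narrowing down where the error comes from, since a real run eventually needs checkpoints.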

  • So I thought something was probably off, in either Lightning or DeepSpeed. I finally tried FSDP, but I ran into this error.

lightning.fabric.utilities.exceptions.MisconfigurationException: `gradient_clip_algorithm='norm'` is currently not supported for `FSDPMixedPrecisionPlugin`
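
The message above points at the Trainer's gradient_clip_algorithm setting, which accepts "norm" or "value". A minimal sketch of one way around it, assuming value clipping is accepted under FSDP mixed precision in the installed version (not verified here), is to choose the algorithm from the strategy name. The helper clipping_kwargs is hypothetical, and value clipping is not numerically equivalent to norm clipping, so the threshold may need retuning.

```python
from typing import Any, Dict, Optional


def clipping_kwargs(strategy: str, clip_val: Optional[float]) -> Dict[str, Any]:
    """Pick Trainer gradient-clipping arguments for a given strategy name.

    Hypothetical helper for illustration only.
    """
    if clip_val is None:
        # No clipping requested: leave the Trainer defaults untouched.
        return {}
    # Assumption: norm clipping is rejected only for FSDP strategies,
    # as the error message in this post suggests.
    algorithm = "value" if strategy.startswith("fsdp") else "norm"
    return {"gradient_clip_val": clip_val, "gradient_clip_algorithm": algorithm}
```

The result can be unpacked into the Trainer call, for example pl.Trainer(strategy="fsdp", precision="16-mixed", **clipping_kwargs("fsdp", 0.5)).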

Though I’ve written this up briefly, the whole process took around 2-3 hours, since I had to wait a while to see each error, and googling or searching on GitHub can take a long time without yielding anything.


All in all, I think some parts are missing from the documentation on how to use the DeepSpeed plugins. The API doc only has brief explanations. Of course it’s cool that PL almost works with just one or a few line changes, but in reality, users like me seem to need to change some more code, which is still super cool as long as I can understand what I’m supposed to do and what I can expect. Thanks again, and I’d be happy to provide more context.