Cloud-based checkpoints (advanced)

Cloud checkpoints

Lightning works with local filesystems and with several cloud storage providers, such as S3 on AWS, GCS on Google Cloud, and ADL on Azure.

PyTorch Lightning uses fsspec internally to handle all filesystem operations.
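To illustrate what routing file operations through fsspec means in practice, here is a minimal sketch that writes and reads a file through a URL. It assumes fsspec is installed and uses fsspec's built-in in-memory filesystem ("memory://") as a stand-in for a cloud protocol so that it runs without credentials; the bucket and file names are made up for the example, and this is not Lightning's own checkpoint code.

```python
import fsspec
from fsspec.core import url_to_fs

# "memory://" stands in for "s3://" or "gcs://" so the sketch runs offline.
# The bucket and file names below are illustrative only.
url = "memory://my_bucket/ckpts/example.ckpt"

# The same open() call works for any protocol fsspec knows about.
with fsspec.open(url, "wb") as f:
    f.write(b"checkpoint bytes")

with fsspec.open(url, "rb") as f:
    data = f.read()

# url_to_fs resolves the protocol prefix to a filesystem object
# plus the path inside that filesystem.
fs, path = url_to_fs(url)
print(type(fs).__name__, fs.exists(path))
```

Swapping the prefix for a real cloud protocol should leave the rest of the code unchanged, provided the matching fsspec backend (for example s3fs for S3) is installed and credentials are configured.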


Save a cloud checkpoint

To save to a remote filesystem, prepend a protocol such as “s3://” to the default_root_dir used for writing and reading model data.

# `default_root_dir` is the default path used for logs and checkpoints
trainer = Trainer(default_root_dir="s3://my_bucket/data/")
trainer.fit(model)
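The protocol prefix is what tells the filesystem layer which backend to use, so a malformed prefix is worth catching before training starts. The sketch below uses only the Python standard library to split a root directory into protocol, bucket, and key prefix. `split_root_dir` is a hypothetical helper written for this illustration; it is not part of Lightning or fsspec and only approximates how a URL-style path is interpreted.

```python
from urllib.parse import urlsplit


def split_root_dir(root_dir: str) -> tuple[str, str, str]:
    """Return (protocol, bucket, key prefix) for a checkpoint root directory.

    Hypothetical helper for illustration only. Paths without a protocol
    are reported as local ("file") paths.
    """
    parts = urlsplit(root_dir)
    if parts.scheme in ("", "file"):
        return "file", "", parts.path
    if not parts.netloc:
        # A single slash after the colon leaves no bucket name to parse.
        raise ValueError(f"malformed remote path: {root_dir!r}")
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


print(split_root_dir("s3://my_bucket/data/"))  # ('s3', 'my_bucket', 'data/')
print(split_root_dir("lightning_logs/"))       # ('file', '', 'lightning_logs/')
```

A path written with a single slash after the colon raises `ValueError` in this sketch, because no bucket name can be parsed from it.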

Resume training from a cloud checkpoint

To resume training from a cloud checkpoint, pass its URL as ckpt_path to trainer.fit.

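# `tmpdir` is a placeholder for the directory where new logs and checkpoints are written
trainer = Trainer(default_root_dir=tmpdir, max_steps=3)
# `ckpt_path` is the checkpoint to resume from; it can be a remote URL
trainer.fit(model, ckpt_path="s3://my_bucket/ckpts/classifier.ckpt")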
