Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[2.0.1] - 2023-03-30

[2.0.1] - Changed

  • Generalized Optimizer validation to accommodate both FSDP 1.x and 2.x (#16733)

[2.0.0] - 2023-03-15

[2.0.0] - Added

  • Added Fabric.all_reduce (#16459)

  • Added support for saving and loading DeepSpeed checkpoints through Fabric.save/load() (#16452)

  • Added support for automatically calling set_epoch on the dataloader.batch_sampler.sampler (#16841)

  • Added support for writing logs to remote file systems with the CSVLogger (#16880)

  • Added support for frozen dataclasses in the optimizer state (#16656)

  • Added lightning.fabric.is_wrapped to check whether a module, optimizer, or dataloader was already wrapped by Fabric (#16953)

[2.0.0] - Changed

  • Fabric now chooses accelerator="auto", strategy="auto", devices="auto" as defaults (#16842)

  • Checkpoint saving and loading redesign (#16434)

    • Changed the method signature of Fabric.save and Fabric.load

    • Changed the method signature of Strategy.save_checkpoint and Fabric.load_checkpoint

    • Fabric.save accepts a state that can contain model and optimizer references

    • Fabric.load can now load state in-place onto models and optimizers

    • Fabric.load returns a dictionary of objects that weren’t loaded into the state

    • Strategy.save_checkpoint and Fabric.load_checkpoint are now responsible for accessing the state of the model and optimizers

  • DataParallelStrategy.get_module_state_dict() and DDPStrategy.get_module_state_dict() now correctly extract the state dict without keys prefixed with ‘module’ (#16487)

  • “Native” suffix removal (#16490)

    • strategy="fsdp_full_shard_offload" is now strategy="fsdp_cpu_offload"

    • lightning.fabric.plugins.precision.native_amp is now lightning.fabric.plugins.precision.amp

  • Enabled all shorthand strategy names that can be supported in the CLI (#16485)

  • Renamed strategy='tpu_spawn' to strategy='xla' and strategy='tpu_spawn_debug' to strategy='xla_debug' (#16781)

  • Changed arguments for precision settings (from [64|32|16|bf16] to ["64-true"|"32-true"|"16-mixed"|"bf16-mixed"]) (#16767)

  • The selection Fabric(strategy="ddp_spawn", ...) no longer falls back to “ddp” when a cluster environment gets detected (#16780)

  • Renamed setup_dataloaders(replace_sampler=...) to setup_dataloaders(use_distributed_sampler=...) (#16829)
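
The value renames in this list can be applied mechanically when porting pre-2.0 code. The sketch below is a hypothetical helper (migrate_fabric_kwargs and the two mapping tables are not part of Lightning) that rewrites old precision and strategy values using only the mappings listed above. Treat it as an illustration under those assumptions, not an official migration tool.

```python
# Hypothetical migration helper; not part of Lightning.
# The mappings are copied from the 2.0.0 "Changed" entries above.
PRECISION_RENAMES = {
    "64": "64-true",
    "32": "32-true",
    "16": "16-mixed",
    "bf16": "bf16-mixed",
}
STRATEGY_RENAMES = {
    "tpu_spawn": "xla",
    "tpu_spawn_debug": "xla_debug",
    "fsdp_full_shard_offload": "fsdp_cpu_offload",
}


def migrate_fabric_kwargs(kwargs):
    """Return a copy of ``kwargs`` with pre-2.0 values replaced by their 2.0 names."""
    migrated = dict(kwargs)
    if "precision" in migrated:
        # Old code could pass ints (16) or strings ("bf16"); compare as strings.
        old = str(migrated["precision"])
        migrated["precision"] = PRECISION_RENAMES.get(old, migrated["precision"])
    if isinstance(migrated.get("strategy"), str):
        # Strategy objects are left untouched; only string shorthands are renamed.
        old = migrated["strategy"]
        migrated["strategy"] = STRATEGY_RENAMES.get(old, old)
    return migrated


if __name__ == "__main__":
    print(migrate_fabric_kwargs({"precision": 16, "strategy": "tpu_spawn"}))
```

Values that already use the new names, or that are not covered by the tables, pass through unchanged, so the helper is safe to run on code that has been partially migrated.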

[2.0.0] - Removed

  • Removed support for PyTorch 1.10 (#16492)

  • Removed support for Python 3.7 (#16579)

[2.0.0] - Fixed

  • Fixed issue where the wrapped dataloader iter() would be called twice (#16841)

  • Improved the error message for installing tensorboard or tensorboardx (#17053)

[1.9.4] - 2023-03-01

[1.9.4] - Added

  • Added Fabric(strategy="auto") support (#16916)

[1.9.4] - Fixed

  • Fixed edge cases in parsing device ids using NVML (#16795)

  • Fixed DDP spawn hang on TPU Pods (#16844)

  • Fixed an error when passing find_usable_cuda_devices(num_devices=-1) (#16866)

[1.9.3] - 2023-02-21

[1.9.3] - Fixed

  • Fixed an issue causing a wrong environment plugin to be selected when accelerator=tpu and devices > 1 (#16806)

  • Fixed parsing of defaults for --accelerator and --precision in Fabric CLI when accelerator and precision are set to non-default values in the code (#16818)

[1.9.2] - 2023-02-15

[1.9.2] - Fixed

  • Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer (#16693)

[1.9.1] - 2023-02-10

[1.9.1] - Fixed

  • Fixed error handling for accelerator="mps" and ddp strategy pairing (#16455)

  • Fixed strict availability check for torch_xla requirement (#16476)

  • Fixed an issue where PL would wrap DataLoaders with XLA’s MpDeviceLoader more than once (#16571)

  • Fixed the batch_sampler reference for DataLoaders wrapped with XLA’s MpDeviceLoader (#16571)

  • Fixed an import error when torch.distributed is not available (#16658)

[1.9.0] - 2023-01-17

[1.9.0] - Added

  • Added Fabric.launch() to programmatically launch processes (e.g. in Jupyter notebook) (#14992)

  • Added the option to launch Fabric scripts from the CLI, without the need to wrap the code into the run method (#14992)

  • Added Fabric.setup_module() and Fabric.setup_optimizers() to support strategies that need to set up the model before an optimizer can be created (#15185)

  • Added support for Fully Sharded Data Parallel (FSDP) training in Lightning Lite (#14967)

  • Added lightning.fabric.accelerators.find_usable_cuda_devices utility function (#16147)

  • Added basic support for LightningModules (#16048)

  • Added support for managing callbacks via Fabric(callbacks=...) and emitting events through Fabric.call() (#16074)

  • Added Logger support (#16121)

    • Added Fabric(loggers=...) to support different Logger frameworks in Fabric

    • Added Fabric.log for logging scalars using multiple loggers

    • Added Fabric.log_dict for logging a dictionary of multiple metrics at once

    • Added Fabric.loggers and Fabric.logger attributes to access the individual logger instances

    • Added support for calling self.log and self.log_dict in a LightningModule when using Fabric

    • Added access to self.logger and self.loggers in a LightningModule when using Fabric

  • Added lightning.fabric.loggers.TensorBoardLogger (#16121)

  • Added lightning.fabric.loggers.CSVLogger (#16346)

  • Added support for a consistent .zero_grad(set_to_none=...) on the wrapped optimizer regardless of which strategy is used (#16275)
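
The callback entry above describes emitting events by name to a list of callbacks. A library-free sketch of that kind of name-based dispatch is shown below; ToyDispatcher, PrintingCallback, and the hook name on_epoch_end are hypothetical stand-ins chosen for illustration, and this is not Fabric's actual implementation.

```python
class ToyDispatcher:
    """Hypothetical stand-in that illustrates name-based event dispatch."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def call(self, hook_name, *args, **kwargs):
        # Invoke the hook on every callback that defines it; skip the others.
        for callback in self.callbacks:
            method = getattr(callback, hook_name, None)
            if callable(method):
                method(*args, **kwargs)


class PrintingCallback:
    """Hypothetical callback that records the events it receives."""

    def __init__(self):
        self.seen = []

    def on_epoch_end(self, epoch, loss):
        self.seen.append((epoch, loss))


class UnrelatedCallback:
    """Defines no hooks, so the dispatcher ignores it."""


if __name__ == "__main__":
    recorder = PrintingCallback()
    dispatcher = ToyDispatcher(callbacks=[recorder, UnrelatedCallback()])
    dispatcher.call("on_epoch_end", epoch=0, loss=0.5)
    print(recorder.seen)
```

Looking hooks up by name keeps the dispatcher independent of any fixed callback interface: a callback only implements the events it cares about.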

[1.9.0] - Changed

  • Renamed the class LightningLite to Fabric (#15932, #15938)

  • The Fabric.run() method is no longer abstract (#14992)

  • The XLAStrategy now inherits from ParallelStrategy instead of DDPSpawnStrategy (#15838)

  • Merged the implementation of DDPSpawnStrategy into DDPStrategy and removed DDPSpawnStrategy (#14952)

  • The dataloader wrapper returned from .setup_dataloaders() now calls .set_epoch() on the distributed sampler if one is used (#16101)

  • Renamed Strategy.reduce to Strategy.all_reduce in all strategies (#16370)

  • When using multiple devices, the strategy now defaults to “ddp” instead of “ddp_spawn” when none is set (#16388)
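
The set_epoch entry above matters because a seeded shuffling sampler reproduces the same order every epoch unless it is told the epoch number. The sketch below shows the pattern with a toy, library-free sampler; ToyShufflingSampler and iterate_epochs are hypothetical names, and the loop only approximates what a dataloader wrapper might do.

```python
import random


class ToyShufflingSampler:
    """Hypothetical stand-in for a distributed sampler; not a PyTorch class."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # The shuffle is seeded with (seed + epoch), so the order depends on the epoch.
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.num_samples))
        random.Random(self.seed + self.epoch).shuffle(indices)
        return iter(indices)


def iterate_epochs(sampler, num_epochs):
    """Yield the index order for each epoch, calling set_epoch when it exists."""
    for epoch in range(num_epochs):
        set_epoch = getattr(sampler, "set_epoch", None)
        if callable(set_epoch):
            set_epoch(epoch)
        yield list(sampler)


if __name__ == "__main__":
    for order in iterate_epochs(ToyShufflingSampler(8), num_epochs=2):
        print(order)
```

Samplers without a set_epoch method, such as a plain list of indices, are iterated unchanged.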

[1.9.0] - Removed

  • Removed support for FairScale’s sharded training (strategy='ddp_sharded'|'ddp_sharded_spawn'). Use Fully-Sharded Data Parallel instead (strategy='fsdp') (#16329)

[1.9.0] - Fixed

  • Restored sampling parity between PyTorch and Fabric dataloaders when using the DistributedSampler (#16101)

  • Fixed an issue where the error message wouldn’t tell the user the real value that was passed through the CLI (#16334)

[1.8.6] - 2022-12-21

  • minor cleaning

[1.8.5] - 2022-12-15

  • minor cleaning

[1.8.4] - 2022-12-08

[1.8.4] - Fixed

  • Fixed shuffle=False having no effect when using DDP/DistributedSampler (#15931)

[1.8.3] - 2022-11-22

[1.8.3] - Changed

  • Temporarily removed support for Hydra multi-run (#15737)

[1.8.2] - 2022-11-17

[1.8.2] - Fixed

  • Fixed the automatic fallback from LightningLite(strategy="ddp_spawn", ...) to LightningLite(strategy="ddp", ...) when on an LSF cluster (#15103)

[1.8.1] - 2022-11-10

[1.8.1] - Fixed

  • Fixed an issue with the SLURM srun detection causing permission errors (#15485)

  • Fixed the import of lightning_lite causing a warning ‘Redirects are currently not supported in Windows or MacOs’ (#15610)