PyTorch Lightning 1.6 Now Available
The PyTorch Lightning team has released version 1.6, featuring support for Intel's Habana accelerators, a new efficient DDP strategy (Bagua), manual fault-tolerance, and a number of other stability and reliability improvements.
⚡Visit the release page on GitHub to download.⚡
- Lightning Highlights
- New Hooks
- New Properties
- Experimental Features
- Backward Incompatible Changes
- Full Lightning Changelog
Lightning Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. Here are some highlights:
Introducing Intel’s Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) and a configurable Matrix Math engine, along with the associated development tools and libraries.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu") # single Gaudi training trainer = pl.Trainer(accelerator="hpu", devices=1) # distributed training with 8 Gaudi trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
trainer = pl.Trainer(strategy="bagua") # or to choose a custom algorithm trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce") # default
Towards stable Accelerator, Strategy, and Plugin APIs
The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They’re where all the distributed boilerplate lives, and we’re constantly working to improve both them and the overall PyTorch Lightning platform experience.
In this release, we’ve made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:
- All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change aligns with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer.

  # Before
  from pytorch_lightning.plugins import DDPPlugin

  # New
  from pytorch_lightning.strategies import DDPStrategy

- The Accelerator and PrecisionPlugin have moved into Strategy. All strategies now take optional accelerator and precision_plugin parameters (#11022, #10570).
- Custom Accelerator implementations must now implement two new abstract methods: is_available() (#11797) and auto_device_count() (#10222). The latter determines how many devices get used by default when specifying Trainer(accelerator=..., devices="auto").
- We redesigned the process creation for spawn-based strategies such as DDPSpawnStrategy and TPUSpawnStrategy (#10896). All spawn-based strategies now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}, which means the prepare_data, setup, configure_sharded_model and teardown hooks/callbacks all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as DDPStrategy).
We’ve also exposed the process group backend for use. For example, you can now easily enable fairring like this:
# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)
In a similar fashion, if you are using torch>=1.11, you can enable DDP static graph to apply special runtime optimizations:
trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
LightningCLI improvements
In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
from pytorch_lightning.utilities.cli import LightningCLI

LightningCLI(auto_registry=True)
We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:
$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track
If you need to customize the learning rate scheduler configuration, you can do so by overriding:
class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler, ...}}

Finally, loggers are also now configurable with shorthand notation:
$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"
Control SLURM’s re-queueing
We’ve added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))
Fault-tolerance improvements
Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.
An additional feature we’re excited to announce is support for consecutive trainer.fit() calls.
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)
Loop customization improvements
The Loop's state is now included in the checkpoints saved by the library. This enables finer restoration of custom loops.
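As a rough illustration, a custom loop can piggyback its own attributes onto that saved state. Below is a minimal sketch, assuming the on_save_checkpoint/on_load_checkpoint hooks exposed by the Loop base class; the batches_seen counter is purely illustrative and not part of Lightning itself:

import pytorch_lightning as pl


class CountingEpochLoop(pl.loops.TrainingEpochLoop):
    """Epoch loop that remembers how many batches it has processed across restarts."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.batches_seen = 0  # custom state we want restored on resume

    def advance(self, *args, **kwargs):
        super().advance(*args, **kwargs)
        self.batches_seen += 1

    def on_save_checkpoint(self):
        # merged into the trainer checkpoint alongside Lightning's own loop state
        return {"batches_seen": self.batches_seen}

    def on_load_checkpoint(self, state_dict):
        self.batches_seen = state_dict["batches_seen"]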
We’ve also made it easier to replace Lightning’s loops with your own. For example:
class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...


trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)
# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)

Data-Loading improvements
In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:
class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)


trainer.fit(model, train_dataloaders=MyDataLoader())

As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:
class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7


trainer = pl.Trainer(...)
trainer.fit_loop.replace(fit_loop=MyCustomLoop)

New Hooks
LightningModule.lr_scheduler_step
Lightning now allows the use of custom learning rate schedulers that aren’t natively available in PyTorch. A great example of this is Timm Schedulers.
When using custom learning rate schedulers relying on an API other than PyTorch’s, you can now define the LightningModule.lr_scheduler_step with your desired logic.
from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = ...
        scheduler = TanhLRScheduler(optimizer, ...)
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"}}

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        scheduler.step(epoch=self.current_epoch)  # timm's scheduler needs the epoch value

A new stateful API
This release introduces new hooks to standardize all stateful components to use state_dict and load_state_dict, mimicking the PyTorch API. The new hooks receive their own component’s state and replace most usages of the previous on_save_checkpoint and on_load_checkpoint hooks.
class MyCallback(pl.Callback):
-   def on_save_checkpoint(self, trainer, pl_module, checkpoint):
-       return {'x': self.x}

-   def on_load_checkpoint(self, trainer, pl_module, checkpoint):
-       self.x = checkpoint['x']

+   def state_dict(self):
+       return {'x': self.x}

+   def load_state_dict(self, state_dict):
+       self.x = state_dict['x']

New Properties
Trainer.estimated_stepping_batches
You can use the built-in Trainer.estimated_stepping_batches property to compute the total number of stepping batches needed for the complete training run.
The property takes the gradient accumulation factor and the distributed setting into consideration when performing this computation, so you don't have to derive it manually:
class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = ...
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler}

Trainer.num_devices and Trainer.device_ids
In the past, retrieving the number of devices used, or their IDs, posed a considerable challenge. Additionally, doing so required knowing which property to access based on the current Trainer configuration.
To simplify this process, we've deprecated the per-accelerator properties in favor of accelerator-agnostic ones. For example:
- num_devices = max(1, trainer.num_gpus, trainer.num_processes)
- if trainer.tpu_cores:
-     num_devices = max(num_devices, trainer.tpu_cores)
+ num_devices = trainer.num_devices
Experimental Features
Manual Fault-tolerance
Fault-tolerant training has limitations that require specific information about your data-loading structure.
It is now possible to resolve those limitations by enabling manual fault tolerance, where you write your own logic and specify exactly how to checkpoint your datasets and samplers. You can do so using this environment variable:
$ PL_FAULT_TOLERANT_TRAINING=MANUAL python script.py
Check out this video for a dive into the internals of this flag.
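As a rough illustration of what that custom logic might look like, here is a hedged sketch of an iterable dataset that exposes its read position through state_dict/load_state_dict, the method pair the new stateful protocol in this release is built around. The class and attribute names below are illustrative, not a fixed Lightning API:

import torch
from torch.utils.data import IterableDataset


class ResumableRangeDataset(IterableDataset):
    """Toy dataset that can report and restore its own read position."""

    def __init__(self, size: int):
        self.size = size
        self.index = 0  # position to resume from after a failure

    def __iter__(self):
        while self.index < self.size:
            yield torch.tensor(self.index, dtype=torch.float32)
            self.index += 1

    # With PL_FAULT_TOLERANT_TRAINING=MANUAL, Lightning can checkpoint and
    # restore this state (assuming the stateful protocol sketched here).
    def state_dict(self):
        return {"index": self.index}

    def load_state_dict(self, state_dict):
        self.index = state_dict["index"]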
Customizing the layer synchronization
We introduced a new plugin class for wrapping layers of a model with synchronization logic for multiprocessing.
class MyLayerSync(pl.plugins.LayerSync):
    ...


layer_sync = MyLayerSync(...)
trainer = Trainer(sync_batchnorm=True, plugins=layer_sync, strategy="ddp")

Registering Custom Accelerators
There has been much progress in the field of ML Accelerators, and the list of accelerators is constantly expanding.
We’ve made it easier for users to try out new accelerators by enabling support for registering custom Accelerator classes in Lightning.
from pytorch_lightning.accelerators import Accelerator, AcceleratorRegistry
class SOTAAccelerator(Accelerator):
    def __init__(self, x):
        ...


AcceleratorRegistry.register("sota_accelerator", SOTAAccelerator, x=123)

# the following works now:
trainer = Trainer(accelerator="sota_accelerator")

Backward Incompatible Changes
Here is a selection of notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.
Drop PyTorch 1.7 support
In line with our policy of supporting the four most recent PyTorch releases, this release supports PyTorch 1.8 to 1.11. Support for PyTorch 1.7 has been removed.
Drop Python 3.6 support
Following Python’s end-of-life, support for Python 3.6 has been removed.
AcceleratorConnector rewrite
To support new accelerator and strategy features, we completely rewrote our internal AcceleratorConnector class. No backward compatibility was maintained, so code that relied on this class is likely to break.
Re-define the current_epoch boundary
To resolve fault-tolerance issues, we changed where the current epoch value gets increased.
trainer.current_epoch is now increased by 1 in on_train_end. This means that if a model is run for 3 epochs (0, 1, 2), trainer.current_epoch will now return 3 instead of 2 after trainer.fit(). This can also impact custom callbacks that access this property inside this hook.
This also impacts checkpoints saved during an epoch (e.g. on_train_epoch_end). For example, a Trainer(max_epochs=1, limit_train_batches=1) instance that saves a checkpoint will have the current_epoch=0 value saved instead of current_epoch=1.
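In code, the new boundary looks like this (the values in the comments restate the release notes rather than executed output, and model stands for any LightningModule):

import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=3)
trainer.fit(model)  # trains epochs 0, 1 and 2

print(trainer.current_epoch)
# Lightning 1.5: 2
# Lightning 1.6: 3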
Re-define the global_step boundary
To resolve fault-tolerance issues, we changed where the global step value gets increased.
Access to trainer.global_step during an intra-training validation hook will now correctly return the number of optimizer steps taken already. In pseudocode:
  training_step()
+ global_step += 1
  validation_if_necessary()
- global_step += 1
Saved checkpoints that use the global step value as part of the filename are now increased by 1 for the same reason. A checkpoint saved after 1 step will now be named step=1.ckpt instead of step=0.ckpt.
The trainer.global_step value will now account for TBPTT or multiple optimizers. Users setting Trainer({min,max}_steps=...) under these circumstances will need to adjust their values.
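For example, here is a hedged sketch of relying on the new value inside an intra-training validation hook (the print is only there to illustrate the counter):

import pytorch_lightning as pl


class MyLightningModule(pl.LightningModule):
    def on_validation_epoch_start(self):
        # 1.6: the number of optimizer steps already taken when a mid-epoch
        # validation run starts (previously the counter lagged behind by one).
        print(self.trainer.global_step)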
Removed automatic reduction of outputs in training_step when using DataParallel
When using Trainer(strategy="dp"), all the tensors returned by training_step were previously reduced to a scalar (#11594). This behavior was especially confusing when outputs needed to be collected into the training_epoch_end hook.
From now on, outputs are no longer reduced except for the loss tensor, unless you implement training_step_end, in which case the loss won’t get reduced either.
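If you depended on the previous automatic reduction, here is a hedged sketch of restoring it yourself by overriding training_step_end; the module below is illustrative and not taken from the release itself:

import torch
import pytorch_lightning as pl


class MyDPModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # With strategy="dp", this dict reaches training_step_end un-reduced,
        # carrying the per-GPU results.
        return {"loss": loss}

    def training_step_end(self, step_output):
        # Reduce manually: once training_step_end is implemented, Lightning
        # no longer reduces the loss for you.
        return step_output["loss"].mean()


# trainer = pl.Trainer(strategy="dp", accelerator="gpu", devices=2)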
No longer fallback to CPU with no devices
Previous versions were lenient in that the absence of GPU devices silently fell back to running on the CPU. This meant that users' code could run much slower without them ever noticing that it was running on the CPU.
We suggest passing Trainer(accelerator="auto") when this leniency is desired.
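For example, to opt in to that behavior explicitly:

# Uses the GPU (or another available accelerator) when present, otherwise falls back to the CPU
trainer = pl.Trainer(accelerator="auto", devices="auto")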
Full Lightning Changelog
Added
- Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
- Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
- Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
- Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
  - Add _Stateful protocol to detect if classes are stateful (#10646)
  - Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
  - Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
  - Add stateful workers (#10674)
  - Add a utility to collect the states across processes (#10639)
  - Add logic to reload the states across data loading components (#10699)
  - Cleanup some fault tolerant utilities (#10703)
  - Enable Fault Tolerant Manual Training (#10707)
  - Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
- Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the Loop's state by default in the checkpoint (#10784)
- Added Loop.replace to easily switch one loop for another (#10324)
- Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
- Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
- Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
- Added a warning that shows when max_epochs in the Trainer is not set (#10700)
- Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
- Added console_kwargs for RichProgressBar to initialize the inner Console (#10875)
- Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
- Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
- Added an info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
- Added a PrecisionPlugin.teardown method (#10990)
- Added LightningModule.lr_scheduler_step (#10249)
- Added support for no pre-fetching to DataFetcher (#11606)
- Added support for optimizer step progress tracking with manual optimization (#11848)
- Return the output of optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
- Teardown the active loop and strategy on exception (#11620)
- Added a MisconfigurationException if the user-provided opt_idx in the scheduler config doesn't match the actual optimizer index of its respective optimizer (#11247)
- Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683)
- Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
- Added support for DDP when using a CombinedLoader for the training data (#11648)
- Added a warning when using DistributedSampler during validation/testing (#11479)
- Added support for the Bagua training strategy (#11146)
- Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
- Added rank_zero module to centralize utilities (#11747)
- Added _Stateful support for LightningDataModule (#11637)
- Added _Stateful support for PrecisionPlugin (#11638)
- Added Accelerator.is_available to check device availability (#11797)
- Enabled static type-checking on the signature of Trainer (#11888)
- Added utility functions for moving optimizers to devices (#11758)
- Added a warning when saving an instance of nn.Module with save_hyperparameters() (#12068)
- Added estimated_stepping_batches property to Trainer (#11599)
- Added support for pluggable Accelerators (#12030)
- Added profiling for on_load_checkpoint/on_save_checkpoint callback and LightningModule hooks (#12149)
- Added LayerSync and NativeSyncBatchNorm plugins (#11754)
- Added optional storage_options argument to Trainer.save_checkpoint() to pass to custom CheckpointIO implementations (#11891)
- Added support to explicitly specify the process group backend for parallel strategies (#11745)
- Added device_ids and num_devices property to Trainer (#12151)
- Added Callback.state_dict() and Callback.load_state_dict() methods (#12232)
- Added AcceleratorRegistry (#12180)
- Added support for Habana Accelerator (HPU) (#11808)
- Added support for dataclasses in apply_to_collections (#11889)
Changed
- Drop PyTorch 1.7 support (#12191), (#12432)
- Make benchmark flag optional and set its value based on the deterministic flag (#11944)
- Implemented a new native and rich format in the _print_results method of the EvaluationLoop (#11332)
- Do not print an empty table at the end of the EvaluationLoop (#12427)
- Set the prog_bar flag to False in LightningModule.log_grad_norm (#11472)
- Raised exception in init_dist_connection() when torch distributed is not available (#10418)
- The monitor argument in the EarlyStopping callback is no longer optional (#10328)
- Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
- Raised MisconfigurationException when enable_progress_bar=False and a progress bar instance has been passed in the callback list (#10520)
- Moved trainer.connectors.env_vars_connector._defaults_from_env_vars to utilities.argsparse._defaults_from_env_vars (#10501)
- Changes in LightningCLI required for the new major release of jsonargparse v4.0.0 (#10426)
- Renamed refresh_rate_per_second parameter to refresh_rate for RichProgressBar signature (#10497)
- Moved ownership of the PrecisionPlugin into TrainingTypePlugin and updated all references (#10570)
- Fault Tolerant relies on signal.SIGTERM to gracefully exit instead of signal.SIGUSR1 (#10605)
- Loop.restarting=... now sets the value recursively for all subloops (#11442)
- Raised an error if the batch_size cannot be inferred from the current batch if it contained a string or was a custom batch object (#10541)
- The validation loop is now disabled when overfit_batches > 0 is set in the Trainer (#9709)
- Moved optimizer related logic from Accelerator to TrainingTypePlugin (#10596)
- Moved ownership of the lightning optimizers from the Trainer to the Strategy (#11444)
- Moved ownership of the data fetchers from the DataConnector to the Loops (#11621)
- Moved batch_to_device method from Accelerator to TrainingTypePlugin (#10649)
- The DDPSpawnPlugin no longer overrides the post_dispatch plugin hook (#10034)
- Integrate the progress bar implementation with progress tracking (#11213)
- The LightningModule.{add_to_queue,get_from_queue} hooks no longer get a torch.multiprocessing.SimpleQueue and instead receive a list based queue (#10034)
- Changed training_step, validation_step, test_step and predict_step method signatures in Accelerator and updated input from caller side (#10908)
- Changed the name of the temporary checkpoint that the DDPSpawnPlugin and related plugins save (#10934)
- LoggerCollection returns only unique logger names and versions (#10976)
- Redesigned process creation for spawn-based plugins (DDPSpawnPlugin, TPUSpawnPlugin, etc.) (#10896)
  - All spawn-based plugins now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}
  - The hooks/callbacks prepare_data, setup, configure_sharded_model and teardown now run under initialized process group for spawn-based plugins just like their non-spawn counterparts
  - Some configuration errors that were previously raised as MisconfigurationExceptions will now be raised as ProcessRaisedException (torch>=1.8) or as Exception (torch<1.8)
  - Removed the TrainingTypePlugin.pre_dispatch() method and merged it with TrainingTypePlugin.setup() (#11137)
- Changed profiler to index and display the names of the hooks with a new pattern [] (#11026)
- Changed batch_to_device entry in profiling from stage-specific to generic, to match profiling of other hooks (#11031)
- Changed the info message for finalizing ddp-spawn worker processes to a debug-level message (#10864)
- Removed duplicated file extension when uploading model checkpoints with NeptuneLogger (#11015)
- Removed __getstate__ and __setstate__ of RichProgressBar (#11100)
- The DDPPlugin and DDPSpawnPlugin and their subclasses now remove the SyncBatchNorm wrappers in teardown() to enable proper support at inference after fitting (#11078)
- Moved ownership of the Accelerator instance to the TrainingTypePlugin; all training-type plugins now take an optional parameter accelerator (#11022)
- Renamed the TrainingTypePlugin to Strategy (#11120)
  - Renamed the ParallelPlugin to ParallelStrategy (#11123)
  - Renamed the DataParallelPlugin to DataParallelStrategy (#11183)
  - Renamed the DDPPlugin to DDPStrategy (#11142)
  - Renamed the DDP2Plugin to DDP2Strategy (#11185)
  - Renamed the DDPShardedPlugin to DDPShardedStrategy (#11186)
  - Renamed the DDPFullyShardedPlugin to DDPFullyShardedStrategy (#11143)
  - Renamed the DDPSpawnPlugin to DDPSpawnStrategy (#11145)
  - Renamed the DDPSpawnShardedPlugin to DDPSpawnShardedStrategy (#11210)
  - Renamed the DeepSpeedPlugin to DeepSpeedStrategy (#11194)
  - Renamed the HorovodPlugin to HorovodStrategy (#11195)
  - Renamed the TPUSpawnPlugin to TPUSpawnStrategy (#11190)
  - Renamed the IPUPlugin to IPUStrategy (#11193)
  - Renamed the SingleDevicePlugin to SingleDeviceStrategy (#11182)
  - Renamed the SingleTPUPlugin to SingleTPUStrategy (#11182)
  - Renamed the TrainingTypePluginsRegistry to StrategyRegistry (#11233)
- Marked the ResultCollection, ResultMetric, and ResultMetricCollection classes as protected (#11130)
- Marked trainer.checkpoint_connector as protected (#11550)
- The epoch start/end hooks are now called by the FitLoop instead of the TrainingEpochLoop (#11201)
- DeepSpeed does not require lightning module zero 3 partitioning (#10655)
- Moved Strategy classes to the strategies directory (#11226)
- Renamed training_type_plugin file to strategy (#11239)
- Changed DeviceStatsMonitor to group metrics based on the logger's group_separator (#11254)
- Raised UserWarning if evaluation is triggered with best ckpt and trainer is configured with multiple checkpoint callbacks (#11274)
- Trainer.logged_metrics now always contains scalar tensors, even when a Python scalar was logged (#11270)
- The tuner now uses the checkpoint connector to copy and restore its state (#11518)
- Changed MisconfigurationException to ModuleNotFoundError when rich isn't available (#11360)
- The trainer.current_epoch value is now increased by 1 during and after on_train_end (#8578)
- The trainer.global_step value now accounts for multiple optimizers and TBPTT splits (#11805)
- The trainer.global_step value is now increased right after the optimizer.step() call, which will impact users who access it during an intra-training validation hook (#11805)
- The filename of checkpoints created with ModelCheckpoint(filename='{step}') is different compared to previous versions. A checkpoint saved after 1 step will be named step=1.ckpt instead of step=0.ckpt (#11805)
- Inherit from ABC for Accelerator: users need to implement auto_device_count (#11521)
- Changed parallel_devices property in ParallelStrategy to be lazy initialized (#11572)
- Updated TQDMProgressBar to run a separate progress bar for each eval dataloader (#11657)
- Sorted SimpleProfiler(extended=False) summary based on mean duration for each hook (#11671)
- Avoid enforcing shuffle=False for eval dataloaders (#11575)
- When using DP (data-parallel), Lightning will no longer automatically reduce all tensors returned in training_step; it will only reduce the loss unless training_step_end is overridden (#11594)
- When using DP (data-parallel), the training_epoch_end hook will no longer receive reduced outputs from training_step and instead get the full tensor of results from all GPUs (#11594)
- Changed default logger name to lightning_logs for consistency (#11762)
- Rewrote accelerator_connector (#11448)
- When manual optimization is used with DDP, we no longer force find_unused_parameters=True (#12425)
- Disable loading dataloaders if the corresponding limit_batches=0 (#11576)
- Removed is_global_zero check in training_epoch_loop before logger.save. If you have a custom logger that implements save, the Trainer will now call save on all ranks by default. To change this behavior add @rank_zero_only to your save implementation (#12134)
- Disabled tuner with distributed strategies (#12179)
- Marked trainer.logger_connector as protected (#12195)
- Move Strategy.process_dataloader function call from fit/evaluation/predict_loop.py to data_connector.py (#12251)
- ModelCheckpoint(save_last=True, every_n_epochs=N) now saves a "last" checkpoint every epoch (disregarding every_n_epochs) instead of only once at the end of training (#12418)
- The strategies that support sync_batchnorm now only apply it when fitting (#11919)
- Avoided fallback on CPU if no devices are provided for other accelerators (#12410)
- Modified supporters.py so that the accumulator element (for loss) is created directly on the device (#12430)
- Removed EarlyStopping.on_save_checkpoint and EarlyStopping.on_load_checkpoint in favor of EarlyStopping.state_dict and EarlyStopping.load_state_dict (#11887)
- Removed BaseFinetuning.on_save_checkpoint and BaseFinetuning.on_load_checkpoint in favor of BaseFinetuning.state_dict and BaseFinetuning.load_state_dict (#11887)
- Removed BackboneFinetuning.on_save_checkpoint and BackboneFinetuning.on_load_checkpoint in favor of BackboneFinetuning.state_dict and BackboneFinetuning.load_state_dict (#11887)
- Removed ModelCheckpoint.on_save_checkpoint and ModelCheckpoint.on_load_checkpoint in favor of ModelCheckpoint.state_dict and ModelCheckpoint.load_state_dict (#11887)
- Removed Timer.on_save_checkpoint and Timer.on_load_checkpoint in favor of Timer.state_dict and Timer.load_state_dict (#11887)
- Replaced PostLocalSGDOptimizer with a dedicated model averaging component (#12378)
Deprecated
- Deprecated training_type_plugin property in favor of strategy in Trainer and updated the references (#11141)
- Deprecated Trainer.{validated,tested,predicted}_ckpt_path and replaced with read-only property Trainer.ckpt_path set when checkpoints are loaded via Trainer.{fit,validate,test,predict} (#11696)
- Deprecated ClusterEnvironment.master_{address,port} in favor of ClusterEnvironment.main_{address,port} (#10103)
- Deprecated DistributedType in favor of _StrategyType (#10505)
- Deprecated the precision_plugin constructor argument from Accelerator (#10570)
- Deprecated DeviceType in favor of _AcceleratorType (#10503)
- Deprecated the property Trainer.slurm_job_id in favor of the new SLURMEnvironment.job_id() method (#10622)
- Deprecated the access to the attribute IndexBatchSamplerWrapper.batch_indices in favor of IndexBatchSamplerWrapper.seen_batch_indices (#10870)
- Deprecated on_init_start and on_init_end callback hooks (#10940)
- Deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#10979)
- Deprecated TrainingTypePlugin.post_dispatch in favor of TrainingTypePlugin.teardown (#10939)
- Deprecated ModelIO.on_hpc_{save/load} in favor of CheckpointHooks.on_{save/load}_checkpoint (#10911)
- Deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#11000)
- Deprecated Trainer.lr_schedulers in favor of Trainer.lr_scheduler_configs which returns a list of dataclasses instead of dictionaries (#11443)
- Deprecated Trainer.verbose_evaluate in favor of EvaluationLoop(verbose=...) (#10931)
- Deprecated Trainer.should_rank_save_checkpoint Trainer property (#11068)
- Deprecated Trainer.lightning_optimizers (#11444)
- Deprecated TrainerOptimizersMixin and moved functionality to core/optimizer.py (#11155)
- Deprecated the on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#12182)
- Deprecated the training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#12182)
- Deprecated TrainerCallbackHookMixin (#11148)
- Deprecated TrainerDataLoadingMixin and moved functionality to Trainer and DataConnector (#11282)
- Deprecated function pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys (#11254)
- Deprecated Callback.on_epoch_start hook in favour of Callback.on_{train/val/test}_epoch_start (#11578)
- Deprecated Callback.on_epoch_end hook in favour of Callback.on_{train/val/test}_epoch_end (#11578)
- Deprecated LightningModule.on_epoch_start hook in favor of LightningModule.on_{train/val/test}_epoch_start (#11578)
- Deprecated LightningModule.on_epoch_end hook in favor of LightningModule.on_{train/val/test}_epoch_end (#11578)
- Deprecated on_before_accelerator_backend_setup callback hook in favour of setup (#11568)
- Deprecated on_batch_start and on_batch_end callback hooks in favor of on_train_batch_start and on_train_batch_end (#11577)
- Deprecated on_configure_sharded_model callback hook in favor of setup (#11627)
- Deprecated pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only (#11747)
- Deprecated pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug (#11747)
- Deprecated pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info (#11747)
- Deprecated pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn (#11747)
- Deprecated pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation (#11747)
- Deprecated pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning
- Deprecated on_pretrain_routine_start and on_pretrain_routine_end callback hooks in favor of on_fit_start (#11794)
- Deprecated LightningModule.on_pretrain_routine_start and LightningModule.on_pretrain_routine_end hooks in favor of on_fit_start (#12122)
- Deprecated agg_key_funcs and agg_default_func parameters from LightningLoggerBase (#11871)
- Deprecated LightningLoggerBase.update_agg_funcs (#11871)
- Deprecated LightningLoggerBase.agg_and_log_metrics in favor of LightningLoggerBase.log_metrics (#11832)
- Deprecated passing weights_save_path to the Trainer constructor in favor of adding the ModelCheckpoint callback with dirpath directly to the list of callbacks (#12084)
- Deprecated pytorch_lightning.profiler.AbstractProfiler in favor of pytorch_lightning.profiler.Profiler (#12106)
- Deprecated pytorch_lightning.profiler.BaseProfiler in favor of pytorch_lightning.profiler.Profiler (#12150)
- Deprecated BaseProfiler.profile_iterable (#12102)
- Deprecated LoggerCollection in favor of trainer.loggers (#12147)
- Deprecated PrecisionPlugin.on_{save,load}_checkpoint in favor of PrecisionPlugin.{state_dict,load_state_dict} (#11978)
- Deprecated LightningDataModule.on_save/load_checkpoint in favor of state_dict/load_state_dict (#11893)
- Deprecated Trainer.use_amp in favor of Trainer.amp_backend (#12312)
- Deprecated LightningModule.use_amp in favor of Trainer.amp_backend (#12315)
- Deprecated specifying the process group backend through the environment variable PL_TORCH_DISTRIBUTED_BACKEND (#11745)
- Deprecated ParallelPlugin.torch_distributed_backend in favor of DDPStrategy.process_group_backend property (#11745)
- Deprecated ModelCheckpoint.save_checkpoint in favor of Trainer.save_checkpoint (#12456)
- Deprecated Trainer.devices in favor of Trainer.num_devices and Trainer.device_ids (#12151)
- Deprecated Trainer.root_gpu in favor of Trainer.strategy.root_device.index when GPU is used (#12262)
- Deprecated Trainer.num_gpus in favor of Trainer.num_devices when GPU is used (#12384)
- Deprecated Trainer.ipus in favor of Trainer.num_devices when IPU is used (#12386)
- Deprecated Trainer.num_processes in favor of Trainer.num_devices (#12388)
- Deprecated Trainer.data_parallel_device_ids in favor of Trainer.device_ids (#12072)
- Deprecated returning state from Callback.on_save_checkpoint in favor of returning state in Callback.state_dict for checkpointing (#11887)
- Deprecated passing only the callback state to Callback.on_load_checkpoint(callback_state) in favor of passing the callback state to Callback.load_state_dict and, in 1.8, passing the entire checkpoint dictionary to Callback.on_load_checkpoint(checkpoint) (#11887)
- Deprecated Trainer.gpus in favor of Trainer.device_ids or Trainer.num_devices (#12436)
- Deprecated Trainer.tpu_cores in favor of Trainer.num_devices (#12437)
Removed
- Removed deprecated parameter method in pytorch_lightning.utilities.model_helpers.is_overridden (#10507)
- Remove deprecated method ClusterEnvironment.creates_children (#10339)
- Removed deprecated TrainerModelHooksMixin.is_function_implemented and TrainerModelHooksMixin.has_arg (#10322)
- Removed deprecated pytorch_lightning.utilities.device_dtype_mixin.DeviceDtypeModuleMixin in favor of pytorch_lightning.core.mixins.device_dtype_mixin.DeviceDtypeModuleMixin (#10442)
- Removed deprecated LightningModule.loaded_optimizer_states_dict property (#10346)
- Removed deprecated Trainer.fit(train_dataloader=), Trainer.validate(val_dataloaders=), and Trainer.test(test_dataloader=) (#10325)
- Removed deprecated every_n_val_epochs parameter of ModelCheckpoint (#10366)
- Removed deprecated import pytorch_lightning.profiler.profilers in favor of import pytorch_lightning.profiler (#10443)
- Removed deprecated property configure_slurm_dpp from accelerator connector (#10370)
- Removed deprecated arguments num_nodes and sync_batchnorm from DDPPlugin, DDPSpawnPlugin, DeepSpeedPlugin (#10357)
- Removed deprecated property is_slurm_managing_tasks from AcceleratorConnector (#10353)
- Removed deprecated LightningModule.log(tbptt_reduce_fx, tbptt_reduce_token, sync_dist_op) (#10423)
- Removed deprecated Plugin.task_idx (#10441)
- Removed deprecated method master_params from PrecisionPlugin (#10372)
- Removed the automatic detachment of "extras" returned from training_step. For example, return {'loss': ..., 'foo': foo.detach()} will now be necessary if foo has gradients which you do not want to store (#10424)
- Removed deprecated passthrough methods and properties from the Accelerator base class
- Removed deprecated signature for transfer_batch_to_device hook. The new argument dataloader_idx is now required (#10480)
- Removed deprecated utilities.distributed.rank_zero_{warn/deprecation} (#10451)
- Removed deprecated mode argument from ModelSummary class (#10449)
- Removed deprecated Trainer.train_loop property in favor of Trainer.fit_loop (#10482)
- Removed deprecated disable_validation property from Trainer (#10450)
- Removed deprecated CheckpointConnector.hpc_load property in favor of CheckpointConnector.restore (#10525)
- Removed deprecated reload_dataloaders_every_epoch from Trainer in favour of reload_dataloaders_every_n_epochs (#10481)
- Removed the precision_plugin attribute from Accelerator in favor of its equivalent attribute precision_plugin in the TrainingTypePlugin (#10570)
- Removed DeepSpeedPlugin.{precision,amp_type,amp_level} properties (#10657)
- Removed patching of on_before_batch_transfer, transfer_batch_to_device and on_after_batch_transfer hooks in LightningModule (#10603)
- Removed argument return_result from the DDPSpawnPlugin.spawn() method (#10867)
- Removed the property TrainingTypePlugin.results and corresponding properties in subclasses (#10034)
- Removed the mp_queue attribute from DDPSpawnPlugin and TPUSpawnPlugin (#10034)
- Removed unnecessary _move_optimizer_state method overrides from TPUSpawnPlugin and SingleTPUPlugin (#10849)
- Removed should_rank_save_checkpoint property from TrainingTypePlugin (#11070)
- Removed model_sharded_context method from Accelerator (#10886)
- Removed method pre_dispatch from the PrecisionPlugin (#10887)
- Removed method setup_optimizers_in_pre_dispatch from the strategies and achieve the same logic in setup and pre_dispatch methods (#10906)
- Removed methods pre_dispatch, dispatch and post_dispatch from the Accelerator (#10885)
- Removed method training_step, test_step, validation_step and predict_step from the Accelerator (#10890)
- Removed TrainingTypePlugin.start_{training,evaluating,predicting} hooks and the same in all subclasses (#10989, #10896)
- Removed Accelerator.on_train_start (#10999)
- Removed support for Python 3.6 (#11117)
- Removed Strategy.init_optimizers in favor of Strategy.setup_optimizers (#11236)
- Removed profile("training_step_and_backward") in Closure class since we already profile calls training_step and backward (#11222)
- Removed Strategy.optimizer_zero_grad (#11246)
- Removed Strategy.on_gpu (#11537)
- Removed Strategy.on_tpu property (#11536)
- Removed the abstract property LightningLoggerBase.experiment (#11603)
- Removed FitLoop.current_epoch getter and setter (#11562)
- Removed access to _short_id in NeptuneLogger (#11517)
- Removed log_text and log_image from the LightningLoggerBase API (#11857)
- Removed calls to profile("model_forward") in favor of profiling training_step (#12032)
- Removed get_mp_spawn_kwargs from DDPSpawnStrategy and TPUSpawnStrategy in favor of configuration in the _SpawnLauncher (#11966)
- Removed _aggregate_metrics, _reduce_agg_metrics, and _finalize_agg_metrics from LightningLoggerBase (#12053)
- Removed the AcceleratorConnector.device_type property (#12081)
- Removed AcceleratorConnector.num_nodes (#12107)
- Removed AcceleratorConnector.has_ipu property (#12111)
- Removed AcceleratorConnector.use_ipu property (#12110)
- Removed AcceleratorConnector.has_tpu property (#12109)
- Removed AcceleratorConnector.use_dp property (#12112)
- Removed configure_sync_batchnorm from ParallelStrategy and all other strategies that inherit from it (#11754)
- Removed public attribute sync_batchnorm from strategies (#11754)
- Removed AcceleratorConnector.root_gpu property (#12262)
- Removed AcceleratorConnector.tpu_id property (#12387)
- Removed AcceleratorConnector.num_gpus property (#12384)
- Removed AcceleratorConnector.num_ipus property (#12386)
- Removed AcceleratorConnector.num_processes property (#12388)
- Removed AcceleratorConnector.parallel_device_ids property (#12072)
- Removed AcceleratorConnector.devices property (#12435)
- Removed AcceleratorConnector.parallel_devices property (#12075)
- Removed AcceleratorConnector.tpu_cores property (#12437)
Fixed
- Fixed an issue where ModelCheckpoint could delete last checkpoint from the old directory when dirpath has changed during resumed training (#12225)
- Fixed an issue where ModelCheckpoint could delete older checkpoints when dirpath has changed during resumed training (#12045)
- Fixed an issue where HorovodStrategy.teardown() did not complete gracefully if an exception was thrown during callback setup (#11752)
- Fixed security vulnerabilities CVE-2020-1747 and CVE-2020-14343 caused by the PyYAML dependency (#11099)
- Fixed security vulnerability "CWE-94: Improper Control of Generation of Code (Code Injection)" (#12212)
- Fixed logging on {test,validation}_epoch_end with multiple dataloaders (#11132)
- Reset the validation progress tracking state after sanity checking (#11218)
- Fixed double evaluation bug with fault-tolerance enabled where the second call was completely skipped (#11119)
- Fixed an issue with the TPUSpawnPlugin handling the XLA_USE_BF16 environment variable incorrectly (#10990)
- Fixed wrong typehint for Trainer.lightning_optimizers (#11155)
- Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy (#11307)
- Fixed bug that forced overriding configure_optimizers with the CLI (#11672)
- Fixed type promotion when tensors of higher category than float are logged (#11401)
- Fixed SimpleProfiler summary (#11414)
- No longer set a DistributedSampler to the poptorch.DataLoader when IPUs are used (#12114)
- Fixed bug where progress bar was not being disabled when not in rank zero during predict (#11377)
- Fixed the mid-epoch warning call while resuming training (#11556)
- Fixed LightningModule.{un,}toggle_model when only 1 optimizer is used (#12088)
- Fixed an issue in RichProgressbar to display the metrics logged only on main progress bar (#11690)
- Fixed RichProgressBar progress when refresh rate does not evenly divide the total counter (#11668)
- Fixed RichProgressBar progress validation bar total when using multiple validation runs within a single training epoch (#11668)
- Configure native Deepspeed schedulers with interval='step' (#11788), (#12031)
- Update RichProgressBarTheme styles after detecting light theme on colab (#10993)
- Fixed passing _ddp_params_and_buffers_to_ignore (#11949)
- Fixed an AttributeError when calling save_hyperparameters and no parameters need saving (#11827)
- Fixed environment variable priority for global rank determination (#11406)
- Fixed an issue that caused the Trainer to produce identical results on subsequent runs without explicit re-seeding (#11870)
- Fixed an issue that caused the Tuner to affect the random state (#11870)
- Fixed to avoid common hook warning if no hook is overridden (#12131)
- Fixed deepspeed keeping old sub-folders in same ckpt path (#12194)
- Fixed returning logged metrics instead of callback metrics during evaluation (#12224)
- Fixed the case where logger=None is passed to the Trainer (#12249)
- Fixed bug where the global step tracked by ModelCheckpoint was still set even if no checkpoint was saved (#12418)
- Fixed bug where ModelCheckpoint was overriding the epoch and step logged values (#12418)
- Fixed bug where monitoring the default epoch and step values with ModelCheckpoint would fail (#12418)
- Fixed initializing optimizers unnecessarily in DDPFullyShardedStrategy (#12267)
- Fixed check for horovod module (#12377)
- Fixed logging to loggers with multiple eval dataloaders (#12454)
- Fixed an issue with resuming from a checkpoint trained with QAT (#11346)

