PyTorch Lightning 1.6 Now Available
The PyTorch Lightning team has released version 1.6, featuring support for Intel's Habana accelerators, a new efficient DDP strategy (Bagua), manual fault-tolerance, and a number of other stability and reliability improvements.
⚡Visit the release page on GitHub to download.⚡
- Lightning Highlights
- New Hooks
- New Properties
- Experimental Features
- Backward Incompatible Changes
- Full Lightning Changelog
Lightning Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. Here are some highlights:
Introducing Intel’s Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) and a configurable Matrix Math engine, along with the associated development tools and libraries.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu") # single Gaudi training trainer = pl.Trainer(accelerator="hpu", devices=1) # distributed training with 8 Gaudi trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
trainer = pl.Trainer(strategy="bagua") # or to choose a custom algorithm trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce") # default
Towards stable Accelerator, Strategy, and Plugin APIs
The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They’re where all the distributed boilerplate lives, and we’re constantly working to improve both them and the overall PyTorch Lightning platform experience.
In this release, we’ve made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:
- All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change aligns with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer.

  # Before
  from pytorch_lightning.plugins import DDPPlugin

  # New
  from pytorch_lightning.strategies import DDPStrategy

- The Accelerator and PrecisionPlugin have moved into Strategy. All strategies now take optional accelerator and precision_plugin parameters (#11022, #10570).
- Custom Accelerator implementations must now implement two new abstract methods: is_available() (#11797) and auto_device_count() (#10222). The latter determines how many devices get used by default when specifying Trainer(accelerator=..., devices="auto").
- We redesigned the process creation for spawn-based strategies such as DDPSpawnStrategy and TPUSpawnStrategy (#10896). All spawn-based strategies now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}, which means the prepare_data, setup, configure_sharded_model and teardown hooks/callbacks all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as DDPStrategy).
We’ve also exposed the process group backend for use. For example, you can now easily enable fairring like this:
# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)
In a similar fashion, if you are using torch>=1.11, you can enable DDP static graph to apply special runtime optimizations:
trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
LightningCLI improvements
In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
from pytorch_lightning.utilities.cli import LightningCLI

LightningCLI(auto_registry=True)
We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:
$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track
If you need to customize the learning rate scheduler configuration, you can do so by overriding:
class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler, ...}}

Finally, loggers are also now configurable with shorthand notation:
$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"
Control SLURM’s re-queueing
We’ve added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))
Fault-tolerance improvements
Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.
An additional feature we’re excited to announce is support for consecutive trainer.fit() calls.
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)
Loop customization improvements
The Loop's state is now included in the checkpoints saved by the library. This enables finer restoration of custom loops.
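As a rough illustration, a custom loop can piggyback its own attributes onto that saved state. Below is a minimal sketch, assuming the on_save_checkpoint/on_load_checkpoint hooks exposed by the Loop base class; the batches_seen counter is purely illustrative and not part of Lightning itself:

import pytorch_lightning as pl


class CountingEpochLoop(pl.loops.TrainingEpochLoop):
    """Epoch loop that remembers how many batches it has processed across restarts."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.batches_seen = 0  # custom state we want restored on resume

    def advance(self, *args, **kwargs):
        super().advance(*args, **kwargs)
        self.batches_seen += 1

    def on_save_checkpoint(self):
        # merged into the trainer checkpoint alongside Lightning's own loop state
        return {"batches_seen": self.batches_seen}

    def on_load_checkpoint(self, state_dict):
        self.batches_seen = state_dict["batches_seen"]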
We’ve also made it easier to replace Lightning’s loops with your own. For example:
class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...


trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)
# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)

Data-Loading improvements
In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:
class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)


trainer.fit(model, train_dataloaders=MyDataLoader())

As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:
class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7


trainer = pl.Trainer(...)
trainer.fit_loop.replace(fit_loop=MyCustomLoop)

New Hooks
LightningModule.lr_scheduler_step
Lightning now allows the use of custom learning rate schedulers that aren’t natively available in PyTorch. A great example of this is Timm Schedulers.
When using custom learning rate schedulers relying on an API other than PyTorch’s, you can now define the LightningModule.lr_scheduler_step with your desired logic.
from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = ...
        scheduler = TanhLRScheduler(optimizer, ...)
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"}}

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        scheduler.step(epoch=self.current_epoch)  # timm's scheduler needs the epoch value

A new stateful API
This release introduces new hooks to standardize all stateful components to use state_dict and load_state_dict, mimicking the PyTorch API. The new hooks receive their own component’s state and replace most usages of the previous on_save_checkpoint and on_load_checkpoint hooks.
class MyCallback(pl.Callback):
-   def on_save_checkpoint(self, trainer, pl_module, checkpoint):
-       return {'x': self.x}

-   def on_load_checkpoint(self, trainer, pl_module, checkpoint):
-       self.x = checkpoint['x']

+   def state_dict(self):
+       return {'x': self.x}

+   def load_state_dict(self, state_dict):
+       self.x = state_dict['x']

New Properties
Trainer.estimated_stepping_batches
You can use the built-in Trainer.estimated_stepping_batches property to compute the total number of stepping batches needed for the complete training run.
The property takes the gradient accumulation factor and the distributed setting into consideration when performing this computation, so you don't have to derive it manually:
class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = ...
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler}

Trainer.num_devices and Trainer.device_ids
In the past, retrieving the number of devices used, or their IDs, posed a considerable challenge. Additionally, doing so required knowing which property to access based on the current Trainer configuration.
To simplify this process, we've deprecated the per-accelerator properties in favor of accelerator-agnostic ones. For example:
- num_devices = max(1, trainer.num_gpus, trainer.num_processes)
- if trainer.tpu_cores:
-     num_devices = max(num_devices, trainer.tpu_cores)
+ num_devices = trainer.num_devices
Experimental Features
Manual Fault-tolerance
Fault-tolerant training has limitations that require specific information about your data-loading structure.
It is now possible to resolve those limitations by enabling manual fault tolerance, where you write your own logic and specify exactly how to checkpoint your datasets and samplers. You can do so using this environment variable:
$ PL_FAULT_TOLERANT_TRAINING=MANUAL python script.py
Check out this video for a dive into the internals of this flag.
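As a rough illustration of what that custom logic might look like, here is a hedged sketch of an iterable dataset that exposes its read position through state_dict/load_state_dict, the method pair the new stateful protocol in this release is built around. The class and attribute names below are illustrative, not a fixed Lightning API:

import torch
from torch.utils.data import IterableDataset


class ResumableRangeDataset(IterableDataset):
    """Toy dataset that can report and restore its own read position."""

    def __init__(self, size: int):
        self.size = size
        self.index = 0  # position to resume from after a failure

    def __iter__(self):
        while self.index < self.size:
            yield torch.tensor(self.index, dtype=torch.float32)
            self.index += 1

    # With PL_FAULT_TOLERANT_TRAINING=MANUAL, Lightning can checkpoint and
    # restore this state (assuming the stateful protocol sketched here).
    def state_dict(self):
        return {"index": self.index}

    def load_state_dict(self, state_dict):
        self.index = state_dict["index"]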
Customizing the layer synchronization
We introduced a new plugin class for wrapping layers of a model with synchronization logic for multiprocessing.
class MyLayerSync(pl.plugins.LayerSync):
    ...


layer_sync = MyLayerSync(...)
trainer = Trainer(sync_batchnorm=True, plugins=layer_sync, strategy="ddp")

Registering Custom Accelerators
There has been much progress in the field of ML Accelerators, and the list of accelerators is constantly expanding.
We’ve made it easier for users to try out new accelerators by enabling support for registering custom Accelerator classes in Lightning.
from pytorch_lightning.accelerators import Accelerator, AcceleratorRegistry
class SOTAAccelerator(Accelerator):
    def __init__(self, x):
        ...


AcceleratorRegistry.register("sota_accelerator", SOTAAccelerator, x=123)

# the following works now:
trainer = Trainer(accelerator="sota_accelerator")

Backward Incompatible Changes
Here is a selection of notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.
Drop PyTorch 1.7 support
In line with our policy of supporting the four most recent PyTorch releases, this release supports PyTorch 1.8 to 1.11. Support for PyTorch 1.7 has been removed.
Drop Python 3.6 support
Following Python’s end-of-life, support for Python 3.6 has been removed.
AcceleratorConnector rewrite
To support new accelerator and strategy features, we completely rewrote our internal AcceleratorConnector class. No backward compatibility was maintained, so code that relied on this class is likely to break.
Re-define the current_epoch boundary
To resolve fault-tolerance issues, we changed where the current epoch value gets increased.
trainer.current_epoch is now increased by 1 in on_train_end. This means that if a model is run for 3 epochs (0, 1, 2), trainer.current_epoch will now return 3 instead of 2 after trainer.fit(). This can also impact custom callbacks that access this property inside this hook.
This also impacts checkpoints saved during an epoch (e.g. on_train_epoch_end). For example, a Trainer(max_epochs=1, limit_train_batches=1) instance that saves a checkpoint will have the current_epoch=0 value saved instead of current_epoch=1.
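In code, the new boundary looks like this (the values in the comments restate the release notes rather than executed output, and model stands for any LightningModule):

import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=3)
trainer.fit(model)  # trains epochs 0, 1 and 2

print(trainer.current_epoch)
# Lightning 1.5: 2
# Lightning 1.6: 3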
Re-define the global_step boundary
To resolve fault-tolerance issues, we changed where the global step value gets increased.
Access to trainer.global_step during an intra-training validation hook will now correctly return the number of optimizer steps taken already. In pseudocode:
  training_step()
+ global_step += 1
  validation_if_necessary()
- global_step += 1
Saved checkpoints that use the global step value as part of the filename are now increased by 1 for the same reason. A checkpoint saved after 1 step will now be named step=1.ckpt instead of step=0.ckpt.
The trainer.global_step value will now account for TBPTT or multiple optimizers. Users setting Trainer({min,max}_steps=...) under these circumstances will need to adjust their values.
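For example, here is a hedged sketch of relying on the new value inside an intra-training validation hook (the print is only there to illustrate the counter):

import pytorch_lightning as pl


class MyLightningModule(pl.LightningModule):
    def on_validation_epoch_start(self):
        # 1.6: the number of optimizer steps already taken when a mid-epoch
        # validation run starts (previously the counter lagged behind by one).
        print(self.trainer.global_step)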
Removed automatic reduction of outputs in training_step when using DataParallel
When using Trainer(strategy="dp"), all the tensors returned by training_step were previously reduced to a scalar (#11594). This behavior was especially confusing when outputs needed to be collected into the training_epoch_end hook.
From now on, outputs are no longer reduced except for the loss tensor, unless you implement training_step_end, in which case the loss won’t get reduced either.
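If you depended on the previous automatic reduction, here is a hedged sketch of restoring it yourself by overriding training_step_end; the module below is illustrative and not taken from the release itself:

import torch
import pytorch_lightning as pl


class MyDPModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # With strategy="dp", this dict reaches training_step_end un-reduced,
        # carrying the per-GPU results.
        return {"loss": loss}

    def training_step_end(self, step_output):
        # Reduce manually: once training_step_end is implemented, Lightning
        # no longer reduces the loss for you.
        return step_output["loss"].mean()


# trainer = pl.Trainer(strategy="dp", accelerator="gpu", devices=2)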
No longer fallback to CPU with no devices
Previous versions were lenient in that the absence of GPU devices silently fell back to running on the CPU. This meant that users' code could run much slower without them ever noticing that it was running on the CPU.
We suggest passing Trainer(accelerator="auto") when this leniency is desired.
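For example, to opt in to that behavior explicitly:

# Uses the GPU (or another available accelerator) when present, otherwise falls back to the CPU
trainer = pl.Trainer(accelerator="auto", devices="auto")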
Full Lightning Changelog
Added
- Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
- Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
- Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
- Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
  - Add _Stateful protocol to detect if classes are stateful (#10646)
  - Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
  - Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
  - Add stateful workers (#10674)
  - Add a utility to collect the states across processes (#10639)
  - Add logic to reload the states across data loading components (#10699)
  - Cleanup some fault tolerant utilities (#10703)
  - Enable Fault Tolerant Manual Training (#10707)
  - Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
- Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the Loop's state by default in the checkpoint (#10784)
- Added Loop.replace to easily switch one loop for another (#10324)
- Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
- Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
- Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
- Added a warning that shows when max_epochs in the Trainer is not set (#10700)
- Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
- Added console_kwargs for RichProgressBar to initialize the inner Console (#10875)
- Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
- Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
- Added an info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
- Added a PrecisionPlugin.teardown method (#10990)
- Added LightningModule.lr_scheduler_step (#10249)
- Added support for no pre-fetching to DataFetcher (#11606)
- Added support for optimizer step progress tracking with manual optimization (#11848)
- Return the output of optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
- Teardown the active loop and strategy on exception (#11620)
- Added a MisconfigurationException if the user-provided opt_idx in the scheduler config doesn't match the actual optimizer index of its respective optimizer (#11247)
- Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683)
- Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
- Added support for DDP when using a CombinedLoader for the training data (#11648)
- Added a warning when using DistributedSampler during validation/testing (#11479)
- Added support for the Bagua training strategy (#11146)
- Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
- Added rank_zero module to centralize utilities (#11747)
- Added _Stateful support for LightningDataModule (#11637)
- Added _Stateful support for PrecisionPlugin (#11638)
- Added Accelerator.is_available to check device availability (#11797)
- Enabled static type-checking on the signature of Trainer (#11888)
- Added utility functions for moving optimizers to devices (#11758)
- Added a warning when saving an instance of nn.Module with save_hyperparameters() (#12068)
- Added estimated_stepping_batches property to Trainer (#11599)
- Added support for pluggable Accelerators (#12030)
- Added profiling for on_load_checkpoint/on_save_checkpoint callback and LightningModule hooks (#12149)
- Added LayerSync and NativeSyncBatchNorm plugins (#11754)
- Added optional storage_options argument to Trainer.save_checkpoint() to pass to custom CheckpointIO implementations (#11891)
- Added support to explicitly specify the process group backend for parallel strategies (#11745)
- Added device_ids and num_devices property to Trainer (#12151)
- Added Callback.state_dict() and Callback.load_state_dict() methods (#12232)
- Added AcceleratorRegistry (#12180)
- Added support for Habana Accelerator (HPU) (#11808)
- Added support for dataclasses in apply_to_collections (#11889)
Changed
- Drop PyTorch 1.7 support (#12191), (#12432)
- Make benchmark flag optional and set its value based on the deterministic flag (#11944)
- Implemented a new native and rich format in the _print_results method of the EvaluationLoop (#11332)
- Do not print an empty table at the end of the EvaluationLoop (#12427)
- Set the prog_bar flag to False in LightningModule.log_grad_norm (#11472)
- Raised exception in init_dist_connection() when torch distributed is not available (#10418)
- The monitor argument in the EarlyStopping callback is no longer optional (#10328)
- Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
- Raised MisconfigurationException when enable_progress_bar=False and a progress bar instance has been passed in the callback list (#10520)
- Moved trainer.connectors.env_vars_connector._defaults_from_env_vars to utilities.argsparse._defaults_from_env_vars (#10501)
- Changes in LightningCLI required for the new major release of jsonargparse v4.0.0 (#10426)
- Renamed refresh_rate_per_second parameter to refresh_rate for RichProgressBar signature (#10497)
- Moved ownership of the PrecisionPlugin into TrainingTypePlugin and updated all references (#10570)
- Fault Tolerant relies on signal.SIGTERM to gracefully exit instead of signal.SIGUSR1 (#10605)
- Loop.restarting=... now sets the value recursively for all subloops (#11442)
- Raised an error if the batch_size cannot be inferred from the current batch if it contained a string or was a custom batch object (#10541)
- The validation loop is now disabled when overfit_batches > 0 is set in the Trainer (#9709)
- Moved optimizer related logic from Accelerator to TrainingTypePlugin (#10596)
- Moved ownership of the lightning optimizers from the Trainer to the Strategy (#11444)
- Moved ownership of the data fetchers from the DataConnector to the Loops (#11621)
- Moved batch_to_device method from Accelerator to TrainingTypePlugin (#10649)
- The DDPSpawnPlugin no longer overrides the post_dispatch plugin hook (#10034)
- Integrate the progress bar implementation with progress tracking (#11213)
- The LightningModule.{add_to_queue,get_from_queue} hooks no longer get a torch.multiprocessing.SimpleQueue and instead receive a list based queue (#10034)
- Changed training_step, validation_step, test_step and predict_step method signatures in Accelerator and updated input from caller side (#10908)
- Changed the name of the temporary checkpoint that the DDPSpawnPlugin and related plugins save (#10934)
- LoggerCollection returns only unique logger names and versions (#10976)
- Redesigned process creation for spawn-based plugins (DDPSpawnPlugin, TPUSpawnPlugin, etc.) (#10896)
  - All spawn-based plugins now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}
  - The hooks/callbacks prepare_data, setup, configure_sharded_model and teardown now run under initialized process group for spawn-based plugins just like their non-spawn counterparts
  - Some configuration errors that were previously raised as MisconfigurationExceptions will now be raised as ProcessRaisedException (torch>=1.8) or as Exception (torch<1.8)
  - Removed the TrainingTypePlugin.pre_dispatch() method and merged it with TrainingTypePlugin.setup() (#11137)
- Changed profiler to index and display the names of the hooks with a new pattern [] (#11026)
- Changed batch_to_device entry in profiling from stage-specific to generic, to match profiling of other hooks (#11031)
- Changed the info message for finalizing ddp-spawn worker processes to a debug-level message (#10864)
- Removed duplicated file extension when uploading model checkpoints with NeptuneLogger (#11015)
- Removed __getstate__ and __setstate__ of RichProgressBar (#11100)
- The DDPPlugin and DDPSpawnPlugin and their subclasses now remove the SyncBatchNorm wrappers in teardown() to enable proper support at inference after fitting (#11078)
- Moved ownership of the Accelerator instance to the TrainingTypePlugin; all training-type plugins now take an optional parameter accelerator (#11022)
- Renamed the TrainingTypePlugin to Strategy (#11120)
  - Renamed the ParallelPlugin to ParallelStrategy (#11123)
  - Renamed the DataParallelPlugin to DataParallelStrategy (#11183)
  - Renamed the DDPPlugin to DDPStrategy (#11142)
  - Renamed the DDP2Plugin to DDP2Strategy (#11185)
  - Renamed the DDPShardedPlugin to DDPShardedStrategy (#11186)
  - Renamed the DDPFullyShardedPlugin to DDPFullyShardedStrategy (#11143)
  - Renamed the DDPSpawnPlugin to DDPSpawnStrategy (#11145)
  - Renamed the DDPSpawnShardedPlugin to DDPSpawnShardedStrategy (#11210)
  - Renamed the DeepSpeedPlugin to DeepSpeedStrategy (#11194)
  - Renamed the HorovodPlugin to HorovodStrategy (#11195)
  - Renamed the TPUSpawnPlugin to TPUSpawnStrategy (#11190)
  - Renamed the IPUPlugin to IPUStrategy (#11193)
  - Renamed the SingleDevicePlugin to SingleDeviceStrategy (#11182)
  - Renamed the SingleTPUPlugin to SingleTPUStrategy (#11182)
  - Renamed the TrainingTypePluginsRegistry to StrategyRegistry (#11233)
- Marked the ResultCollection, ResultMetric, and ResultMetricCollection classes as protected (#11130)
- Marked trainer.checkpoint_connector as protected (#11550)
- The epoch start/end hooks are now called by the FitLoop instead of the TrainingEpochLoop (#11201)
- DeepSpeed does not require lightning module zero 3 partitioning (#10655)
- Moved Strategy classes to the strategies directory (#11226)
- Renamed training_type_plugin file to strategy (#11239)
- Changed DeviceStatsMonitor to group metrics based on the logger's group_separator (#11254)
- Raised UserWarning if evaluation is triggered with best ckpt and trainer is configured with multiple checkpoint callbacks (#11274)
- Trainer.logged_metrics now always contains scalar tensors, even when a Python scalar was logged (#11270)
- The tuner now uses the checkpoint connector to copy and restore its state (#11518)
- Changed MisconfigurationException to ModuleNotFoundError when rich isn't available (#11360)
- The trainer.current_epoch value is now increased by 1 during and after on_train_end (#8578)
- The trainer.global_step value now accounts for multiple optimizers and TBPTT splits (#11805)
- The trainer.global_step value is now increased right after the optimizer.step() call, which will impact users who access it during an intra-training validation hook (#11805)
- The filename of checkpoints created with ModelCheckpoint(filename='{step}') is different compared to previous versions. A checkpoint saved after 1 step will be named step=1.ckpt instead of step=0.ckpt (#11805)
- Inherit from ABC for Accelerator: users need to implement auto_device_count (#11521)
- Changed parallel_devices property in ParallelStrategy to be lazy initialized (#11572)
- Updated TQDMProgressBar to run a separate progress bar for each eval dataloader (#11657)
- Sorted SimpleProfiler(extended=False) summary based on mean duration for each hook (#11671)
- Avoid enforcing shuffle=False for eval dataloaders (#11575)
- When using DP (data-parallel), Lightning will no longer automatically reduce all tensors returned in training_step; it will only reduce the loss unless training_step_end is overridden (#11594)
- When using DP (data-parallel), the training_epoch_end hook will no longer receive reduced outputs from training_step and instead get the full tensor of results from all GPUs (#11594)
- Changed default logger name to lightning_logs for consistency (#11762)
- Rewrote accelerator_connector (#11448)
- When manual optimization is used with DDP, we no longer force find_unused_parameters=True (#12425)
- Disable loading dataloaders if the corresponding limit_batches=0 (#11576)
- Removed is_global_zero check in training_epoch_loop before logger.save. If you have a custom logger that implements save, the Trainer will now call save on all ranks by default. To change this behavior add @rank_zero_only to your save implementation (#12134)
- Disabled tuner with distributed strategies (#12179)
- Marked trainer.logger_connector as protected (#12195)
- Move Strategy.process_dataloader function call from fit/evaluation/predict_loop.py to data_connector.py (#12251)
- ModelCheckpoint(save_last=True, every_n_epochs=N) now saves a "last" checkpoint every epoch (disregarding every_n_epochs) instead of only once at the end of training (#12418)
- The strategies that support sync_batchnorm now only apply it when fitting (#11919)
- Avoided fallback on CPU if no devices are provided for other accelerators (#12410)
- Modified supporters.py so that the accumulator element (for loss) is created directly on the device (#12430)
- Removed EarlyStopping.on_save_checkpoint and EarlyStopping.on_load_checkpoint in favor of EarlyStopping.state_dict and EarlyStopping.load_state_dict (#11887)
- Removed BaseFinetuning.on_save_checkpoint and BaseFinetuning.on_load_checkpoint in favor of BaseFinetuning.state_dict and BaseFinetuning.load_state_dict (#11887)
- Removed BackboneFinetuning.on_save_checkpoint and BackboneFinetuning.on_load_checkpoint in favor of BackboneFinetuning.state_dict and BackboneFinetuning.load_state_dict (#11887)
- Removed ModelCheckpoint.on_save_checkpoint and ModelCheckpoint.on_load_checkpoint in favor of ModelCheckpoint.state_dict and ModelCheckpoint.load_state_dict (#11887)
- Removed Timer.on_save_checkpoint and Timer.on_load_checkpoint in favor of Timer.state_dict and Timer.load_state_dict (#11887)
- Replaced PostLocalSGDOptimizer with a dedicated model averaging component (#12378)
Deprecated
- Deprecated training_type_plugin property in favor of strategy in Trainer and updated the references (#11141)
- Deprecated Trainer.{validated,tested,predicted}_ckpt_path and replaced with read-only property Trainer.ckpt_path set when checkpoints are loaded via Trainer.{fit,validate,test,predict} (#11696)
- Deprecated ClusterEnvironment.master_{address,port} in favor of ClusterEnvironment.main_{address,port} (#10103)
- Deprecated DistributedType in favor of _StrategyType (#10505)
- Deprecated the precision_plugin constructor argument from Accelerator (#10570)
- Deprecated DeviceType in favor of _AcceleratorType (#10503)
- Deprecated the property Trainer.slurm_job_id in favor of the new SLURMEnvironment.job_id() method (#10622)
- Deprecated the access to the attribute IndexBatchSamplerWrapper.batch_indices in favor of IndexBatchSamplerWrapper.seen_batch_indices (#10870)
- Deprecated on_init_start and on_init_end callback hooks (#10940)
- Deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#10979)
- Deprecated TrainingTypePlugin.post_dispatch in favor of TrainingTypePlugin.teardown (#10939)
- Deprecated ModelIO.on_hpc_{save/load} in favor of CheckpointHooks.on_{save/load}_checkpoint (#10911)
- Deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#11000)
- Deprecated Trainer.lr_schedulers in favor of Trainer.lr_scheduler_configs which returns a list of dataclasses instead of dictionaries (#11443)
- Deprecated Trainer.verbose_evaluate in favor of EvaluationLoop(verbose=...) (#10931)
- Deprecated Trainer.should_rank_save_checkpoint Trainer property (#11068)
- Deprecated Trainer.lightning_optimizers (#11444)
- Deprecated TrainerOptimizersMixin and moved functionality to core/optimizer.py (#11155)
- Deprecated the on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#12182)
- Deprecated the training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#12182)
- Deprecated TrainerCallbackHookMixin (#11148)
- Deprecated TrainerDataLoadingMixin and moved functionality to Trainer and DataConnector (#11282)
- Deprecated function pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys (#11254)
- Deprecated Callback.on_epoch_start hook in favour of Callback.on_{train/val/test}_epoch_start (#11578)
- Deprecated Callback.on_epoch_end hook in favour of Callback.on_{train/val/test}_epoch_end (#11578)
- Deprecated LightningModule.on_epoch_start hook in favor of LightningModule.on_{train/val/test}_epoch_start (#11578)
- Deprecated LightningModule.on_epoch_end hook in favor of LightningModule.on_{train/val/test}_epoch_end (#11578)
- Deprecated on_before_accelerator_backend_setup callback hook in favour of setup (#11568)
- Deprecated on_batch_start and on_batch_end callback hooks in favor of on_train_batch_start and on_train_batch_end (#11577)
- Deprecated on_configure_sharded_model callback hook in favor of setup (#11627)
- Deprecated pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only (#11747)
- Deprecated pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug (#11747)
- Deprecated pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info (#11747)
- Deprecated pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn (#11747)
- Deprecated pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation (#11747)
- Deprecated pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning
- Deprecated on_pretrain_routine_start and on_pretrain_routine_end callback hooks in favor of on_fit_start (#11794)
- Deprecated LightningModule.on_pretrain_routine_start and LightningModule.on_pretrain_routine_end hooks in favor of on_fit_start (#12122)
- Deprecated agg_key_funcs and agg_default_func parameters from LightningLoggerBase (#11871)
- Deprecated LightningLoggerBase.update_agg_funcs (#11871)
- Deprecated LightningLoggerBase.agg_and_log_metrics in favor of LightningLoggerBase.log_metrics (#11832)
- Deprecated passing weights_save_path to the Trainer constructor in favor of adding the ModelCheckpoint callback with dirpath directly to the list of callbacks (#12084)
- Deprecated pytorch_lightning.profiler.AbstractProfiler in favor of pytorch_lightning.profiler.Profiler (#12106)
- Deprecated pytorch_lightning.profiler.BaseProfiler in favor of pytorch_lightning.profiler.Profiler (#12150)
- Deprecated BaseProfiler.profile_iterable (#12102)
- Deprecated LoggerCollection in favor of trainer.loggers (#12147)
- Deprecated PrecisionPlugin.on_{save,load}_checkpoint in favor of PrecisionPlugin.{state_dict,load_state_dict} (#11978)
- Deprecated LightningDataModule.on_save/load_checkpoint in favor of state_dict/load_state_dict (#11893)
- Deprecated Trainer.use_amp in favor of Trainer.amp_backend (#12312)
- Deprecated LightningModule.use_amp in favor of Trainer.amp_backend (#12315)
- Deprecated specifying the process group backend through the environment variable PL_TORCH_DISTRIBUTED_BACKEND (#11745)
- Deprecated ParallelPlugin.torch_distributed_backend in favor of DDPStrategy.process_group_backend property (#11745)
- Deprecated ModelCheckpoint.save_checkpoint in favor of Trainer.save_checkpoint (#12456)
- Deprecated Trainer.devices in favor of Trainer.num_devices and Trainer.device_ids (#12151)
- Deprecated Trainer.root_gpu in favor of Trainer.strategy.root_device.index when GPU is used (#12262)
- Deprecated Trainer.num_gpus in favor of Trainer.num_devices when GPU is used (#12384)
- Deprecated Trainer.ipus in favor of Trainer.num_devices when IPU is used (#12386)
- Deprecated Trainer.num_processes in favor of Trainer.num_devices (#12388)
- Deprecated Trainer.data_parallel_device_ids in favor of Trainer.device_ids (#12072)
- Deprecated returning state from Callback.on_save_checkpoint in favor of returning state in Callback.state_dict for checkpointing (#11887)
- Deprecated passing only the callback state to Callback.on_load_checkpoint(callback_state) in favor of passing the callback state to Callback.load_state_dict and, in 1.8, passing the entire checkpoint dictionary to Callback.on_load_checkpoint(checkpoint) (#11887)
- Deprecated Trainer.gpus in favor of Trainer.device_ids or Trainer.num_devices (#12436)
- Deprecated Trainer.tpu_cores in favor of Trainer.num_devices (#12437)
Removed
- Removed deprecated parameter method in pytorch_lightning.utilities.model_helpers.is_overridden (#10507)
- Remove deprecated method ClusterEnvironment.creates_children (#10339)
- Removed deprecated TrainerModelHooksMixin.is_function_implemented and TrainerModelHooksMixin.has_arg (#10322)
- Removed deprecated pytorch_lightning.utilities.device_dtype_mixin.DeviceDtypeModuleMixin in favor of pytorch_lightning.core.mixins.device_dtype_mixin.DeviceDtypeModuleMixin (#10442)
- Removed deprecated LightningModule.loaded_optimizer_states_dict property (#10346)
- Removed deprecated Trainer.fit(train_dataloader=), Trainer.validate(val_dataloaders=), and Trainer.test(test_dataloader=) (#10325)
- Removed deprecated every_n_val_epochs parameter of ModelCheckpoint (#10366)
- Removed deprecated import pytorch_lightning.profiler.profilers in favor of import pytorch_lightning.profiler (#10443)
- Removed deprecated property configure_slurm_dpp from accelerator connector (#10370)
- Removed deprecated arguments num_nodes and sync_batchnorm from DDPPlugin, DDPSpawnPlugin, DeepSpeedPlugin (#10357)
- Removed deprecated property is_slurm_managing_tasks from AcceleratorConnector (#10353)
- Removed deprecated LightningModule.log(tbptt_reduce_fx, tbptt_reduce_token, sync_dist_op) (#10423)
- Removed deprecated Plugin.task_idx (#10441)
- Removed deprecated method master_params from PrecisionPlugin (#10372)
- Removed the automatic detachment of "extras" returned from training_step. For example, return {'loss': ..., 'foo': foo.detach()} will now be necessary if foo has gradients which you do not want to store (#10424)
- Removed deprecated passthrough methods and properties from the Accelerator base class
- Removed deprecated signature for transfer_batch_to_device hook. The new argument dataloader_idx is now required (#10480)
- Removed deprecated utilities.distributed.rank_zero_{warn/deprecation} (#10451)
- Removed deprecated mode argument from ModelSummary class (#10449)
- Removed deprecated Trainer.train_loop property in favor of Trainer.fit_loop (#10482)
- Removed deprecated disable_validation property from Trainer (#10450)
- Removed deprecated CheckpointConnector.hpc_load property in favor of CheckpointConnector.restore (#10525)
- Removed deprecated reload_dataloaders_every_epoch from Trainer in favour of reload_dataloaders_every_n_epochs (#10481)
- Removed the precision_plugin attribute from Accelerator in favor of its equivalent attribute precision_plugin in the TrainingTypePlugin (#10570)
- Removed DeepSpeedPlugin.{precision,amp_type,amp_level} properties (#10657)
- Removed patching of on_before_batch_transfer, transfer_batch_to_device and on_after_batch_transfer hooks in LightningModule (#10603)
- Removed argument return_result from the DDPSpawnPlugin.spawn() method (#10867)
- Removed the property TrainingTypePlugin.results and corresponding properties in subclasses (#10034)
- Removed the mp_queue attribute from DDPSpawnPlugin and TPUSpawnPlugin (#10034)
- Removed unnecessary _move_optimizer_state method overrides from TPUSpawnPlugin and SingleTPUPlugin (#10849)
- Removed should_rank_save_checkpoint property from TrainingTypePlugin (#11070)
- Removed model_sharded_context method from Accelerator (#10886)
- Removed method pre_dispatch from the PrecisionPlugin (#10887)
- Removed method setup_optimizers_in_pre_dispatch from the strategies and achieve the same logic in setup and pre_dispatch methods (#10906)
- Removed methods pre_dispatch, dispatch and post_dispatch from the Accelerator (#10885)
- Removed method training_step, test_step, validation_step and predict_step from the Accelerator (#10890)
- Removed TrainingTypePlugin.start_{training,evaluating,predicting} hooks and the same in all subclasses (#10989, #10896)
- Removed Accelerator.on_train_start (#10999)
- Removed support for Python 3.6 (#11117)
- Removed Strategy.init_optimizers in favor of Strategy.setup_optimizers (#11236)
- Removed profile("training_step_and_backward") in Closure class since we already profile calls training_step and backward (#11222)
- Removed Strategy.optimizer_zero_grad (#11246)
- Removed Strategy.on_gpu (#11537)
- Removed Strategy.on_tpu property (#11536)
- Removed the abstract property LightningLoggerBase.experiment (#11603)
- Removed FitLoop.current_epoch getter and setter (#11562)
- Removed access to _short_id in NeptuneLogger (#11517)
- Removed log_text and log_image from the LightningLoggerBase API (#11857)
- Removed calls to profile("model_forward") in favor of profiling training_step (#12032)
- Removed get_mp_spawn_kwargs from DDPSpawnStrategy and TPUSpawnStrategy in favor of configuration in the _SpawnLauncher (#11966)
- Removed _aggregate_metrics, _reduce_agg_metrics, and _finalize_agg_metrics from LightningLoggerBase (#12053)
- Removed the AcceleratorConnector.device_type property (#12081)
- Removed AcceleratorConnector.num_nodes (#12107)
- Removed AcceleratorConnector.has_ipu property (#12111)
- Removed AcceleratorConnector.use_ipu property (#12110)
- Removed AcceleratorConnector.has_tpu property (#12109)
- Removed AcceleratorConnector.use_dp property (#12112)
- Removed configure_sync_batchnorm from ParallelStrategy and all other strategies that inherit from it (#11754)
- Removed public attribute sync_batchnorm from strategies (#11754)
- Removed AcceleratorConnector.root_gpu property (#12262)
- Removed AcceleratorConnector.tpu_id property (#12387)
- Removed AcceleratorConnector.num_gpus property (#12384)
- Removed AcceleratorConnector.num_ipus property (#12386)
- Removed AcceleratorConnector.num_processes property (#12388)
- Removed AcceleratorConnector.parallel_device_ids property (#12072)
- Removed AcceleratorConnector.devices property (#12435)
- Removed AcceleratorConnector.parallel_devices property (#12075)
- Removed AcceleratorConnector.tpu_cores property (#12437)
Fixed
- Fixed an issue where ModelCheckpoint could delete last checkpoint from the old directory when dirpath has changed during resumed training (#12225)
- Fixed an issue where ModelCheckpoint could delete older checkpoints when dirpath has changed during resumed training (#12045)
- Fixed an issue where HorovodStrategy.teardown() did not complete gracefully if an exception was thrown during callback setup (#11752)
- Fixed security vulnerabilities CVE-2020-1747 and CVE-2020-14343 caused by the PyYAML dependency (#11099)
- Fixed security vulnerability "CWE-94: Improper Control of Generation of Code (Code Injection)" (#12212)
- Fixed logging on {test,validation}_epoch_end with multiple dataloaders (#11132)
- Reset the validation progress tracking state after sanity checking (#11218)
- Fixed double evaluation bug with fault-tolerance enabled where the second call was completely skipped (#11119)
- Fixed an issue with the TPUSpawnPlugin handling the XLA_USE_BF16 environment variable incorrectly (#10990)
- Fixed wrong typehint for Trainer.lightning_optimizers (#11155)
- Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy (#11307)
- Fixed bug that forced overriding configure_optimizers with the CLI (#11672)
- Fixed type promotion when tensors of higher category than float are logged (#11401)
- Fixed SimpleProfiler summary (#11414)
- No longer set a DistributedSampler to the poptorch.DataLoader when IPUs are used (#12114)
- Fixed bug where progress bar was not being disabled when not in rank zero during predict (#11377)
- Fixed the mid-epoch warning call while resuming training (#11556)
- Fixed LightningModule.{un,}toggle_model when only 1 optimizer is used (#12088)
- Fixed an issue in RichProgressbar to display the metrics logged only on main progress bar (#11690)
- Fixed RichProgressBar progress when refresh rate does not evenly divide the total counter (#11668)
- Fixed RichProgressBar progress validation bar total when using multiple validation runs within a single training epoch (#11668)
- Configure native Deepspeed schedulers with interval='step' (#11788), (#12031)
- Update RichProgressBarTheme styles after detecting light theme on colab (#10993)
- Fixed passing _ddp_params_and_buffers_to_ignore (#11949)
- Fixed an AttributeError when calling save_hyperparameters and no parameters need saving (#11827)
- Fixed environment variable priority for global rank determination (#11406)
- Fixed an issue that caused the Trainer to produce identical results on subsequent runs without explicit re-seeding (#11870)
- Fixed an issue that caused the Tuner to affect the random state (#11870)
- Fixed to avoid common hook warning if no hook is overridden (#12131)
- Fixed deepspeed keeping old sub-folders in same ckpt path (#12194)
- Fixed returning logged metrics instead of callback metrics during evaluation (#12224)
- Fixed the case where logger=None is passed to the Trainer (#12249)
- Fixed bug where the global step tracked by ModelCheckpoint was still set even if no checkpoint was saved (#12418)
- Fixed bug where ModelCheckpoint was overriding the epoch and step logged values (#12418)
- Fixed bug where monitoring the default epoch and step values with ModelCheckpoint would fail (#12418)
- Fixed initializing optimizers unnecessarily in DDPFullyShardedStrategy (#12267)
- Fixed check for horovod module (#12377)
- Fixed logging to loggers with multiple eval dataloaders (#12454)
- Fixed an issue with resuming from a checkpoint trained with QAT (#11346)

