Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[1.8.6] - 2022-12-21¶
minor cleaning
[1.8.5] - 2022-12-15¶
Add function to remove checkpoint to allow override for extended classes (#16067)
[1.8.4] - 2022-12-08¶
[1.8.4] - Changed¶
[1.8.4] - Fixed¶
Fixed issue with unsupported torch.inference_mode() on hpu backends (#15918)
Fixed LRScheduler import for PyTorch 2.0 (#15940)
Fixed fit_loop.restarting to be False for lr finder (#15620)
Fixed torch.jit.script-ing a LightningModule causing an unintended error message about deprecated use_amp property (#15947)
Fixed the XLAProfiler not recording anything due to mismatching of action names (#15885)
[1.8.3] - 2022-11-22¶
[1.8.3] - Changed¶
[1.8.2] - 2022-11-17¶
[1.8.2] - Fixed¶
[1.8.1] - 2022-11-10¶
[1.8.1] - Fixed¶
Fixed TensorBoardLogger not validating the input array type when logging the model graph (#15323)
Fixed an attribute error in ColossalAIStrategy at import time when torch.distributed is not available (#15535)
Fixed an issue when calling fs.listdir with file URI instead of path in CheckpointConnector (#15413)
Fixed an issue with the BaseFinetuning callback not setting the track_running_stats attribute for batch normalization layers (#15063)
Fixed an issue with WandbLogger(log_model=True|'all') raising an error and not being able to serialize tensors in the metadata (#15544)
Fixed the gradient unscaling logic when using Trainer(precision=16) and fused optimizers such as Adam(..., fused=True) (#15544)
Fixed model state transfer in multiprocessing launcher when running multi-node (#15567)
Fixed manual optimization raising AttributeError with Bagua Strategy (#12534)
Fixed the import of pytorch_lightning causing a warning ‘Redirects are currently not supported in Windows or MacOs’ (#15610)
[1.8.0] - 2022-11-01¶
[1.8.0] - Added¶
Added support for requeueing slurm array jobs (#15040)
Added native AMP support for ddp_fork (and associated alias strategies) with CUDA GPUs (#14983)
Added BatchSizeFinder callback (#11089)
Added LearningRateFinder callback (#13802)
Tuner now supports a new method argument which will determine when to run the BatchSizeFinder: one of fit, validate, test or predict (#11089)
Added prefix to log message in seed_everything with rank info (#14031)
Added support for auto wrapping for DDPFullyShardedNativeStrategy (#14252)
Added support for passing extra init-parameters to the LightningDataModule.from_datasets (#14185)
Added support for saving sharded optimizer state dict outside of DDPShardedStrategy (#14208)
Added support for auto wrapping for DDPFullyShardedStrategy (#14383)
Integrate the lightning_utilities package (#14475, #14537, #14556, #14558, #14575, #14620)
Added args parameter to LightningCLI to ease running from within Python (#14596)
Added WandbLogger.download_artifact and WandbLogger.use_artifact for managing artifacts with Weights and Biases (#14551)
Added an option to configure the signal SLURM sends when a job is preempted or requeued (#14626)
Added a warning when the model passed to LightningLite.setup() does not have all parameters on the same device (#14822)
The CometLogger now flags the Comet Experiments as being created from Lightning for analytics purposes (#14906)
Introduce ckpt_path="hpc" keyword for checkpoint loading (#14911)
Added a more descriptive error message when attempting to fork processes with pre-initialized CUDA context (#14709)
Added support for custom parameters in subclasses of SaveConfigCallback (#14998)
Added inference_mode flag to Trainer to let users enable/disable inference mode during evaluation (#15034)
Added LightningLite.no_backward_sync for control over efficient gradient accumulation with distributed strategies (#14966)
Added a sanity check that scripts are executed with the srun command in SLURM and that environment variables are not conflicting (#15011)
Added an error message when attempting to launch processes with python -i and an interactive-incompatible strategy (#15293)
[1.8.0] - Changed¶
The Trainer.{fit,validate,test,predict,tune} methods now raise a useful error message if the input is not a LightningModule (#13892)
Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)
Replaced the unwrapping logic in strategies with direct access to unwrapped LightningModule (#13738)
Enabled on_before_batch_transfer for DPStrategy and IPUAccelerator (#14023)
When resuming training with Apex enabled, the Trainer will now raise an error (#14341)
Included torch.cuda rng state to the aggregate _collect_rng_states() and _set_rng_states() (#14384)
Changed trainer.should_stop to not stop in between an epoch and run until min_steps/min_epochs only (#13890)
The pyDeprecate dependency is no longer installed (#14472)
When using multiple loggers, by default checkpoints and profiler output now get saved to the log dir of the first logger in the list (#14325)
In Lightning Lite, state-dict access to the module wrapper now gets passed through to the original module reference (#14629)
Removed fall-back to LightningEnvironment when number of SLURM tasks does not correspond to number of processes in Trainer (#14300)
Aligned DDP and DDPSpawn strategies in setting up the environment (#11073)
Integrated the Lite Precision plugins into the PL Precision plugins - the base class in PL now extends the lightning_lite.precision.Precision base class (#14798)
    The PrecisionPlugin.backward signature changed: The closure_loss argument was renamed to tensor
    The PrecisionPlugin.{pre_,post_}backward signature changed: The closure_loss argument was renamed to tensor and moved as the first argument
    The PrecisionPlugin.optimizer_step signature changed: The model, optimizer_idx and closure arguments need to be passed as keyword arguments now
Trainer queries the CUDA devices through NVML if available to avoid initializing CUDA before forking, which eliminates the need for the PL_DISABLE_FORK environment variable introduced in v1.7.4 (#14631)
The MLFlowLogger.finalize() now sets the status to FAILED when an exception occurred in Trainer, and sets the status to FINISHED on successful completion (#12292)
It is no longer needed to call model.double() when using precision=64 in Lightning Lite (#14827)
HPC checkpoints are now loaded automatically only in slurm environment when no specific value for ckpt_path has been set (#14911)
The Callback.on_load_checkpoint now gets the full checkpoint dictionary and the callback_state argument was renamed checkpoint (#14835)
Moved the warning about saving nn.Module in save_hyperparameters() to before the deepcopy (#15132)
To avoid issues with forking processes, from PyTorch 1.13 and higher, Lightning will directly use the PyTorch NVML-based check for torch.cuda.device_count and from PyTorch 1.14 and higher, Lightning will configure PyTorch to use an NVML-based check for torch.cuda.is_available. (#15110, #15133)
The NeptuneLogger now uses neptune.init_run instead of the deprecated neptune.init to initialize a run (#15393)
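The Callback.on_load_checkpoint change listed above can be illustrated with a small callback. This is a minimal sketch, not code from the Lightning repository: the BestScoreTracker class and its attributes are hypothetical, and a plain stand-in base class is used when pytorch_lightning is not installed so that the snippet runs on its own. It assumes the 1.8 hook signatures state_dict(), load_state_dict(state_dict) and on_load_checkpoint(trainer, pl_module, checkpoint).

```python
try:
    from pytorch_lightning.callbacks import Callback
except ImportError:
    # Stand-in base class so the sketch runs without Lightning installed.
    class Callback:
        pass


class BestScoreTracker(Callback):
    """Hypothetical callback that keeps one number across restarts."""

    def __init__(self):
        self.best_score = None
        self.loaded_epoch = None

    def state_dict(self):
        # Callback-specific state goes here instead of being returned
        # from on_save_checkpoint.
        return {"best_score": self.best_score}

    def load_state_dict(self, state_dict):
        self.best_score = state_dict["best_score"]

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        # `checkpoint` is the full checkpoint dictionary, not only the
        # state previously saved by this callback.
        self.loaded_epoch = checkpoint.get("epoch")


# Simulate a save/restore round trip without a Trainer.
tracker = BestScoreTracker()
tracker.best_score = 0.91
saved = tracker.state_dict()

restored = BestScoreTracker()
restored.load_state_dict(saved)
restored.on_load_checkpoint(
    trainer=None, pl_module=None, checkpoint={"epoch": 3, "callbacks": {}}
)
```

In a real run the Trainer would call these hooks itself; the manual calls at the end only stand in for that.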
[1.8.0] - Deprecated¶
Deprecated LightningDeepSpeedModule (#14000)
Deprecated amp_level from Trainer in favour of passing it explicitly via precision plugin (#13898)
Deprecated the calls to pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
Deprecated the unwrap_lightning_module and unwrap_lightning_module_sharded utility functions in favor of accessing the unwrapped LightningModule on the strategy directly (#13738)
Deprecated the pl_module argument in LightningParallelModule, LightningDistributedModule, LightningShardedDataParallel, LightningBaguaModule and LightningDeepSpeedModule wrapper classes (#13738)
Deprecated the on_colab_kaggle function (#14247)
Deprecated the internal pl.core.mixins.DeviceDtypeModuleMixin class (#14511, #14548)
Deprecated all functions in pytorch_lightning.utilities.xla_device (#14514, #14550)
    Deprecated the internal inner_f function
    Deprecated the internal pl_multi_process function
    Deprecated the internal XLADeviceUtils.xla_available staticmethod
    Deprecated the XLADeviceUtils.tpu_device_exists staticmethod in favor of pytorch_lightning.accelerators.TPUAccelerator.is_available()
Deprecated pytorch_lightning.utilities.distributed.tpu_distributed in favor of lightning_lite.accelerators.tpu.tpu_distributed (#14550)
Deprecated all functions in pytorch_lightning.utilities.cloud_io in favor of lightning_lite.utilities.cloud_io (#14515)
Deprecated the functions in pytorch_lightning.utilities.apply_func in favor of lightning_utilities.core.apply_func (#14516, #14537)
Deprecated all functions in pytorch_lightning.utilities.device_parser (#14492, #14753)
    Deprecated the pytorch_lightning.utilities.device_parser.determine_root_gpu_device in favor of lightning_lite.utilities.device_parser.determine_root_gpu_device
    Deprecated the pytorch_lightning.utilities.device_parser.parse_gpu_ids in favor of lightning_lite.utilities.device_parser.parse_gpu_ids
    Deprecated the pytorch_lightning.utilities.device_parser.is_cuda_available in favor of lightning_lite.accelerators.cuda.is_cuda_available
    Deprecated the pytorch_lightning.utilities.device_parser.num_cuda_devices in favor of lightning_lite.accelerators.cuda.num_cuda_devices
    Deprecated the pytorch_lightning.utilities.device_parser.parse_cpu_cores in favor of lightning_lite.accelerators.cpu.parse_cpu_cores
    Deprecated the pytorch_lightning.utilities.device_parser.parse_tpu_cores in favor of lightning_lite.accelerators.tpu.parse_tpu_cores
    Deprecated the pytorch_lightning.utilities.device_parser.parse_hpus in favor of pytorch_lightning.accelerators.hpu.parse_hpus
Deprecated duplicate SaveConfigCallback parameters in LightningCLI.__init__: save_config_kwargs, save_config_overwrite and save_config_multifile. New save_config_kwargs parameter should be used instead (#14998)
Deprecated TrainerFn.TUNING, RunningStage.TUNING and trainer.tuning property (#15100)
Deprecated custom pl.utilities.distributed.AllGatherGrad implementation in favor of PyTorch’s (#15364)
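For codebases that still import from pytorch_lightning.utilities.device_parser, the relocations listed above can be kept in a lookup table. The sketch below is illustrative only: the table is transcribed from the entries in this section, while replacement_for and rewrite_imports are hypothetical helpers (not part of Lightning) that handle only single-name "from ... import name" statements.

```python
# Old dotted path -> replacement, transcribed from the deprecation entries above.
DEVICE_PARSER_MOVES = {
    "pytorch_lightning.utilities.device_parser.determine_root_gpu_device":
        "lightning_lite.utilities.device_parser.determine_root_gpu_device",
    "pytorch_lightning.utilities.device_parser.parse_gpu_ids":
        "lightning_lite.utilities.device_parser.parse_gpu_ids",
    "pytorch_lightning.utilities.device_parser.is_cuda_available":
        "lightning_lite.accelerators.cuda.is_cuda_available",
    "pytorch_lightning.utilities.device_parser.num_cuda_devices":
        "lightning_lite.accelerators.cuda.num_cuda_devices",
    "pytorch_lightning.utilities.device_parser.parse_cpu_cores":
        "lightning_lite.accelerators.cpu.parse_cpu_cores",
    "pytorch_lightning.utilities.device_parser.parse_tpu_cores":
        "lightning_lite.accelerators.tpu.parse_tpu_cores",
    "pytorch_lightning.utilities.device_parser.parse_hpus":
        "pytorch_lightning.accelerators.hpu.parse_hpus",
}


def replacement_for(dotted_path):
    """Return the new dotted path, or None if the name is not in the table."""
    return DEVICE_PARSER_MOVES.get(dotted_path)


def rewrite_imports(source):
    """Rewrite single-name `from <module> import <name>` statements.

    Hypothetical helper: plain string replacement, so multi-name or aliased
    imports are left untouched.
    """
    for old, new in DEVICE_PARSER_MOVES.items():
        old_module, old_name = old.rsplit(".", 1)
        new_module, new_name = new.rsplit(".", 1)
        source = source.replace(
            f"from {old_module} import {old_name}",
            f"from {new_module} import {new_name}",
        )
    return source


legacy = "from pytorch_lightning.utilities.device_parser import parse_gpu_ids\n"
print(rewrite_imports(legacy), end="")
```

A table like this is mainly useful as a checklist; a real migration would more likely rely on the deprecation warnings emitted at runtime.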
[1.8.0] - Removed¶
Removed the deprecated Trainer.training_type_plugin property in favor of Trainer.strategy (#14011)
Removed all deprecated training type plugins (#14011)
Removed the deprecated DDP2Strategy (#14026)
Removed the deprecated DistributedType and DeviceType enum classes (#14045)
Removed deprecated support for passing the rank_zero_warn warning category positionally (#14470)
Removed the legacy and unused Trainer.get_deprecated_arg_names() (#14415)
Removed the deprecated on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)
Removed the deprecated training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)
Removed the experimental pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
Removed the deprecated LoggerCollection; Trainer.logger and LightningModule.logger now return the first logger when more than one gets passed to the Trainer (#14283)
Removed the deprecated trainer.lr_schedulers (#14408)
Removed the deprecated LightningModule.{on_hpc_load,on_hpc_save} hooks in favor of the general purpose hooks LightningModule.{on_load_checkpoint,on_save_checkpoint} (#14315)
Removed deprecated support for old torchtext versions (#14375)
Removed deprecated support for the old neptune-client API in the NeptuneLogger (#14727)
Removed the deprecated weights_save_path Trainer argument and Trainer.weights_save_path property (#14424)
Removed the deprecated (#14471):
    pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only
    pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug
    pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info
    pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn
    pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation
    pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning
Removed deprecated Trainer.num_processes attribute in favour of Trainer.num_devices (#14423)
Removed the deprecated Trainer.data_parallel_device_ids hook in favour of Trainer.device_ids (#14422)
Removed the deprecated class TrainerCallbackHookMixin (#14401)
Removed the deprecated BaseProfiler and AbstractProfiler classes (#14404)
Removed the deprecated way to set the distributed backend via the environment variable PL_TORCH_DISTRIBUTED_BACKEND, in favor of setting the process_group_backend in the strategy constructor (#14693)
Removed deprecated callback hooks (#14834):
    Callback.on_configure_sharded_model in favor of Callback.setup
    Callback.on_before_accelerator_backend_setup in favor of Callback.setup
    Callback.on_batch_start in favor of Callback.on_train_batch_start
    Callback.on_batch_end in favor of Callback.on_train_batch_end
    Callback.on_epoch_start in favor of Callback.on_{train,validation,test}_epoch_start
    Callback.on_epoch_end in favor of Callback.on_{train,validation,test}_epoch_end
    Callback.on_pretrain_routine_{start,end} in favor of Callback.on_fit_start
Removed the deprecated device attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores} in favor of the accelerator-agnostic Trainer.num_devices (#14829)
Removed the deprecated LightningIPUModule (#14830)
Removed the deprecated Logger.agg_and_log_metrics hook in favour of Logger.log_metrics and the agg_key_funcs and agg_default_func arguments. (#14840)
Removed the deprecated precision plugin checkpoint hooks PrecisionPlugin.on_load_checkpoint and PrecisionPlugin.on_save_checkpoint (#14833)
Removed the deprecated Trainer.root_gpu attribute in favor of Trainer.strategy.root_device (#14829)
Removed the deprecated Trainer.use_amp and LightningModule.use_amp attributes (#14832)
Removed the deprecated callback hooks Callback.on_init_start and Callback.on_init_end (#14867)
Removed the deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#14870)
Removed the deprecated SimpleProfiler.profile_iterable and AdvancedProfiler.profile_iterable attributes (#14864)
Removed the deprecated Trainer.verbose_evaluate (#14884)
Removed the deprecated Trainer.should_rank_save_checkpoint (#14885)
Removed the deprecated TrainerOptimizersMixin (#14887)
Removed the deprecated Trainer.lightning_optimizers (#14889)
Removed the deprecated TrainerDataLoadingMixin (#14888)
Removed the deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#14869)
Removed the deprecated Trainer.{validated,tested,predicted}_ckpt_path (#14897)
Removed the deprecated device_stats_monitor_prefix_metric_keys (#14890)
Removed the deprecated LightningDataModule.on_save/load_checkpoint hooks (#14909)
Removed support for returning a value in Callback.on_save_checkpoint in favor of implementing Callback.state_dict (#14835)
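When upgrading custom callbacks, it can help to scan them for the removed hooks listed above. The following is a rough sketch under stated assumptions: the mapping is transcribed from the entries in this section, find_removed_hooks and LegacyCallback are hypothetical names introduced for illustration, and only methods defined directly on the inspected class are considered.

```python
# Removed hook -> replacement hooks, transcribed from the entries above.
# An empty list means the changelog names no replacement.
REMOVED_CALLBACK_HOOKS = {
    "on_configure_sharded_model": ["setup"],
    "on_before_accelerator_backend_setup": ["setup"],
    "on_batch_start": ["on_train_batch_start"],
    "on_batch_end": ["on_train_batch_end"],
    "on_epoch_start": [
        "on_train_epoch_start",
        "on_validation_epoch_start",
        "on_test_epoch_start",
    ],
    "on_epoch_end": [
        "on_train_epoch_end",
        "on_validation_epoch_end",
        "on_test_epoch_end",
    ],
    "on_pretrain_routine_start": ["on_fit_start"],
    "on_pretrain_routine_end": ["on_fit_start"],
    "on_init_start": [],
    "on_init_end": [],
}


def find_removed_hooks(callback_cls):
    """Map each removed hook defined directly on `callback_cls` to its replacements."""
    defined = vars(callback_cls)
    return {
        name: replacements
        for name, replacements in REMOVED_CALLBACK_HOOKS.items()
        if name in defined
    }


class LegacyCallback:
    """Hypothetical pre-1.8 callback used only to exercise the helper."""

    def on_batch_start(self, trainer, pl_module):
        pass

    def on_epoch_end(self, trainer, pl_module):
        pass

    def on_train_batch_end(self, *args):
        pass  # still a valid hook, so it is not reported


for hook, replacements in find_removed_hooks(LegacyCallback).items():
    print(f"{hook} -> {', '.join(replacements) or 'no listed replacement'}")
```

Because on_epoch_start and on_epoch_end each split into three stage-specific hooks, their logic usually has to be moved by hand rather than renamed.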
[1.8.0] - Fixed¶
Fixed an issue with LightningLite.setup() not setting the .device attribute correctly on the returned wrapper (#14822)
Fixed an attribute error when running the tuner together with the StochasticWeightAveraging callback (#14836)
Fixed MissingFieldException in offline mode for the NeptuneLogger() (#14919)
Fixed wandb save_dir being overridden by None dir when using CLI (#14878)
Fixed a missing call to LightningDataModule.load_state_dict hook while restoring checkpoint using LightningDataModule.load_from_checkpoint (#14883)
Fixed torchscript error with containers of LightningModules (#14904)
Fixed reloading of the last checkpoint on run restart (#14907)
SaveConfigCallback instances should only save the config once to allow having the overwrite=False safeguard when using LightningCLI(..., run=False) (#14927)
Fixed an issue with terminating the trainer profiler when a StopIteration exception is raised while using an IterableDataset (#14940)
Do not update on-plateau schedulers when reloading from an end-of-epoch checkpoint (#14702)
Fixed Trainer support for PyTorch built without distributed support (#14971)
Fixed batch normalization statistics calculation in StochasticWeightAveraging callback (#14866)
Avoided initializing optimizers during deepspeed inference (#14944)
Fixed LightningCLI parse_env and description in subcommands (#15138)
Fixed an exception that would occur when creating a multiprocessing.Pool after importing Lightning (#15292)
Fixed a pickling error when using RichProgressBar together with checkpointing (#15319)
Fixed the RichProgressBar crashing when used with distributed strategies (#15376)
Fixed an issue with RichProgressBar not resetting the internal state for the sanity check progress (#15377)
Fixed an issue with DataLoader re-instantiation when the attribute is an array and the default value of the corresponding argument changed (#15409)
[1.7.7] - 2022-09-22¶
[1.7.7] - Fixed¶
Fixed the availability check for the neptune-client package (#14714)
Break HPU Graphs into two parts (forward + backward as one and optimizer as another) for better performance (#14656)
Fixed torchscript error with ensembles of LightningModules (#14657, #14724)
Fixed an issue with TensorBoardLogger.finalize creating a new experiment when none was created during the Trainer’s execution (#14762)
Fixed TypeError on import when torch.distributed is not available (#14809)
[1.7.6] - 2022-09-13¶
[1.7.6] - Changed¶
Improved the error messaging when passing Trainer.method(model, x_dataloader=None) with no module-method implementations available (#14614)
[1.7.6] - Fixed¶
Reset the dataloaders on OOM failure in batch size finder to use the last successful batch size (#14372)
Fixed an issue to keep downscaling the batch size in case there hasn’t been even a single successful optimal batch size with mode="power" (#14372)
Fixed an issue where self.log-ing a tensor would create a user warning from PyTorch about cloning tensors (#14599)
Fixed compatibility when torch.distributed is not available (#14454)
[1.7.5] - 2022-09-06¶
[1.7.5] - Fixed¶
[1.7.4] - 2022-08-31¶
[1.7.4] - Added¶
Added an environment variable PL_DISABLE_FORK that can be used to disable all forking in the Trainer (#14319)
[1.7.4] - Fixed¶
[1.7.3] - 2022-08-25¶
[1.7.3] - Fixed¶
Fixed an assertion error when using a ReduceOnPlateau scheduler with the Horovod strategy (#14215)
Fixed an AttributeError when accessing LightningModule.logger and the Trainer has multiple loggers (#14234)
Added back support for logging in the configure_gradient_clipping hook after unintended removal in v1.7.2 (#14298)
Fixed wrong num padding for RichProgressBar (#14296)
Fixed an issue to avoid the impact of sanity check on reload_dataloaders_every_n_epochs for validation (#13964)
[1.7.2] - 2022-08-17¶
[1.7.2] - Added¶
[1.7.2] - Changed¶
[1.7.2] - Fixed¶
Fixed a bug that caused spurious AttributeError when multiple DataLoader classes are imported (#14117)
Fixed epoch-end logging results not being reset after the end of the epoch (#14061)
Fixed resuming from a checkpoint when using Stochastic Weight Averaging (SWA) (#9938)
Fixed the device placement when LightningModule.cuda() gets called without specifying a device index and the current cuda device was not 0 (#14128)
Avoided false positive warning about using sync_dist when using torchmetrics (#14143)
Avoid metadata.entry_points deprecation warning on Python 3.10 (#14052)
Avoid raising the sampler warning if num_replicas=1 (#14097)
Fixed saving hyperparameters in a composition where the parent class is not a LightningModule or LightningDataModule (#14151)
Avoided requiring the FairScale package to use precision with the fsdp native strategy (#14092)
Fixed an issue in which the default name for a run in WandbLogger would be set to the project name instead of a randomly generated string (#14145)
Fixed not preserving set attributes on DataLoader and BatchSampler when instantiated inside *_dataloader hooks (#14212)
[1.7.1] - 2022-08-09¶
[1.7.1] - Fixed¶
Casted only floating point tensors to fp16 with IPUs (#13983)
Casted tensors to fp16 before moving them to device with DeepSpeedStrategy (#14000)
Fixed the NeptuneLogger dependency being unrecognized (#13988)
Fixed an issue where users would be warned about unset max_epochs even when fast_dev_run was set (#13262)
Fixed MPS device being unrecognized (#13992)
Fixed incorrect precision="mixed" being used with DeepSpeedStrategy and IPUStrategy (#14041)
Fixed dtype inference during gradient norm computation (#14051)
Fixed a bug that caused ddp_find_unused_parameters to be set False, whereas the intended default is True (#14095)
[1.7.0] - 2022-08-02¶
[1.7.0] - Added¶
Added ServableModule and its associated callback called ServableModuleValidator to ensure the model can be served (#13614)
Converted validation loop config warnings to PossibleUserWarning (#13377)
Added a flag named log_rank_zero_only to EarlyStopping to disable logging to non-zero rank processes (#13233)
Added support for reloading the last checkpoint saved by passing ckpt_path="last" (#12816)
Added LightningDataModule.load_from_checkpoint to support loading datamodules directly from checkpoint (#12550)
Added a friendly error message when attempting to call Trainer.save_checkpoint() without a model attached (#12772)
Added a friendly error message when attempting to use DeepSpeedStrategy on unsupported accelerators (#12699)
Enabled torch.inference_mode for evaluation and prediction (#12715)
Added support for setting val_check_interval to a value higher than the amount of training batches when check_val_every_n_epoch=None (#11993)
Include the pytorch_lightning version as a header in the CLI config files (#12532)
Added support for Callback registration through entry points (#12739)
Added support for Trainer(deterministic="warn") to warn instead of fail when a non-deterministic operation is encountered (#12588)
Added profiling to the loops’ dataloader __next__ calls (#12124)
Hivemind Strategy
Include a version suffix for new “last” checkpoints of later runs in the same directory (#12902)
Show a better error message when a Metric that does not return a Tensor is logged (#13164)
Added missing predict_dataset argument in LightningDataModule.from_datasets to create predict dataloaders (#12942)
Added class name prefix to metrics logged by DeviceStatsMonitor (#12228)
Automatically wrap custom samplers under a distributed environment by using DistributedSamplerWrapper (#12959)
Added profiling of LightningDataModule hooks (#12971)
Added Native FSDP Strategy (#12447)
Added breaking of lazy graph across training, validation, test and predict steps when training with habana accelerators to ensure better performance (#12938)
Added Checkpoint class to inherit from (#13024)
Added CPU metric tracking to DeviceStatsMonitor (#11795)
Added teardown() method to Accelerator (#11935)
Added support for using custom Trainers that don’t include callbacks using the CLI (#13138)
Added a timeout argument to DDPStrategy and DDPSpawnStrategy. (#13244, #13383)
Added XLAEnvironment cluster environment plugin (#11330)
Added logging messages to notify when FitLoop stopping conditions are met (#9749)
Added support for calling unknown methods with DummyLogger (#13224)
Added support for recursively setting the Trainer reference for ensembles of LightningModules (#13638)
Added Apple Silicon Support via MPSAccelerator (#13123)
Added support for DDP Fork (#13405)
Added support for async checkpointing (#13658)
Added support for HPU Device stats monitor (#13819)
[1.7.0] - Changed¶
accelerator="gpu" now automatically selects an available GPU backend (CUDA and MPS currently) (#13642)
Enable validation during overfitting (#12527)
Added dataclass support to extract_batch_size (#12573)
Changed checkpoints save path in the case of one logger and user-provided weights_save_path from weights_save_path/name/version/checkpoints to weights_save_path/checkpoints (#12372)
Changed checkpoints save path in the case of multiple loggers and user-provided weights_save_path from weights_save_path/name1_name2/version1_version2/checkpoints to weights_save_path/checkpoints (#12372)
Marked swa_lrs argument in StochasticWeightAveraging callback as required (#12556)
LightningCLI’s shorthand notation changed to use jsonargparse native feature (#12614)
LightningCLI changed to use jsonargparse native support for list append (#13129)
Changed seed_everything_default argument in the LightningCLI to type Union[bool, int]. If set to True a seed is automatically generated for the parser argument --seed_everything. (#12822, #13110)
Make positional arguments required for classes passed into the add_argparse_args function. (#12504)
Raise an error if there are insufficient training batches when using a float value of limit_train_batches (#12885)
DataLoader instantiated inside a *_dataloader hook will not set the passed arguments as attributes anymore (#12981)
When a multi-element tensor is logged, an error is now raised instead of silently taking the mean of all elements (#13164)
The WandbLogger will now use the run name in the logs folder if it is provided, and otherwise the project name (#12604)
Enabled using any Sampler in distributed environment in Lite (#13646)
Raised a warning instead of forcing sync_dist=True on epoch end (#13364)
Updated val_check_interval (int) to consider total train batches processed instead of _batches_that_stepped for validation check during training (#12832)
Updated Habana Accelerator’s auto_device_count, is_available & get_device_name methods based on the latest torch habana package (#13423)
Disallowed using BatchSampler when running on multiple IPUs (#13854)
[1.7.0] - Deprecated¶
Deprecated pytorch_lightning.accelerators.gpu.GPUAccelerator in favor of pytorch_lightning.accelerators.cuda.CUDAAccelerator (#13636)
Deprecated pytorch_lightning.loggers.base.LightningLoggerBase in favor of pytorch_lightning.loggers.logger.Logger, and deprecated pytorch_lightning.loggers.base in favor of pytorch_lightning.loggers.logger (#120148)
Deprecated pytorch_lightning.callbacks.base.Callback in favor of pytorch_lightning.callbacks.callback.Callback (#13031)
Deprecated num_processes, gpus, tpu_cores, and ipus from the Trainer constructor in favor of using the accelerator and devices arguments (#11040)
Deprecated setting LightningCLI(seed_everything_default=None) in favor of False (#12804).
Deprecated pytorch_lightning.core.lightning.LightningModule in favor of pytorch_lightning.core.module.LightningModule (#12740)
Deprecated pytorch_lightning.loops.base.Loop in favor of pytorch_lightning.loops.loop.Loop (#13043)
Deprecated Trainer.reset_train_val_dataloaders() in favor of Trainer.reset_{train,val}_dataloader (#12184)
Deprecated LightningCLI’s registries in favor of importing the respective package (#13221)
Deprecated public utilities in pytorch_lightning.utilities.cli.LightningCLI in favor of equivalent copies in pytorch_lightning.cli.LightningCLI (#13767)
Deprecated pytorch_lightning.profiler in favor of pytorch_lightning.profilers (#12308)
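The deprecation of num_processes, gpus, tpu_cores and ipus in favor of accelerator and devices can be pictured as a keyword translation. The helper below is a hypothetical sketch, not a Lightning utility: migrate_trainer_kwargs and the LEGACY_DEVICE_ARGS table are names introduced here, and mapping num_processes to accelerator="cpu" is an assumption about the intended equivalent. It deliberately refuses ambiguous inputs instead of guessing.

```python
# Legacy Trainer argument -> accelerator name (assumed equivalents).
LEGACY_DEVICE_ARGS = {
    "gpus": "gpu",
    "tpu_cores": "tpu",
    "ipus": "ipu",
    "num_processes": "cpu",
}


def migrate_trainer_kwargs(kwargs):
    """Return a copy of `kwargs` using `accelerator`/`devices` instead of legacy args."""
    migrated = {k: v for k, v in kwargs.items() if k not in LEGACY_DEVICE_ARGS}
    used = [name for name in LEGACY_DEVICE_ARGS if kwargs.get(name) is not None]

    if len(used) > 1:
        raise ValueError(f"Ambiguous legacy device arguments: {used}")
    if used:
        if "accelerator" in kwargs or "devices" in kwargs:
            raise ValueError("Cannot mix legacy device arguments with accelerator/devices")
        legacy_name = used[0]
        migrated["accelerator"] = LEGACY_DEVICE_ARGS[legacy_name]
        migrated["devices"] = kwargs[legacy_name]
    return migrated


old_style = {"gpus": 2, "max_epochs": 5}
new_style = migrate_trainer_kwargs(old_style)
print(new_style)
```

The resulting dictionary could then be passed as Trainer(**new_style); keyword arguments unrelated to device selection pass through unchanged.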
[1.7.0] - Removed¶
Removed deprecated IndexBatchSamplerWrapper.batch_indices (#13565)
Removed the deprecated LightningModule.add_to_queue and LightningModule.get_from_queue methods (#13600)
Removed deprecated pytorch_lightning.core.decorators.parameter_validation from decorators (#13514)
Removed the deprecated Logger.close method (#13149)
Removed the deprecated weights_summary argument from the Trainer constructor (#13070)
Removed the deprecated flush_logs_every_n_steps argument from the Trainer constructor (#13074)
Removed the deprecated process_position argument from the Trainer constructor (#13071)
Removed the deprecated checkpoint_callback argument from the Trainer constructor (#13027)
Removed the deprecated on_{train,val,test,predict}_dataloader hooks from the LightningModule and LightningDataModule (#13033)
Removed the deprecated TestTubeLogger (#12859)
Removed the deprecated pytorch_lightning.core.memory.LayerSummary and pytorch_lightning.core.memory.ModelSummary (#12593)
Removed the deprecated summarize method from the LightningModule (#12559)
Removed the deprecated model_size property from the LightningModule class (#12641)
Removed the deprecated stochastic_weight_avg argument from the Trainer constructor (#12535)
Removed the deprecated progress_bar_refresh_rate argument from the Trainer constructor (#12514)
Removed the deprecated prepare_data_per_node argument from the Trainer constructor (#12536)
Removed the deprecated pytorch_lightning.core.memory.{get_gpu_memory_map,get_memory_profile} (#12659)
Removed the deprecated terminate_on_nan argument from the Trainer constructor (#12553)
Removed the deprecated XLAStatsMonitor callback (#12688)
Removed deprecated pytorch_lightning.callbacks.progress.progress (#12658)
Removed the deprecated dim and size arguments from the LightningDataModule constructor (#12780)
Removed the deprecated train_transforms argument from the LightningDataModule constructor (#12662)
Removed the deprecated log_gpu_memory argument from the Trainer constructor (#12657)
Removed the deprecated automatic logging of GPU stats by the logger connector (#12657)
Removed deprecated GPUStatsMonitor callback (#12554)
Removed support for passing strategy names or strategy instances to the accelerator Trainer argument (#12696)
Removed support for passing strategy names or strategy instances to the plugins Trainer argument (#12700)
Removed the deprecated val_transforms argument from the LightningDataModule constructor (#12763)
Removed the deprecated test_transforms argument from the LightningDataModule constructor (#12773)
Removed deprecated Trainer(max_steps=None) (#13591)
Removed deprecated dataloader_idx argument from on_train_batch_start/end hooks Callback and LightningModule (#12769, #12977)
Removed deprecated get_progress_bar_dict property from LightningModule (#12839)
Removed sanity check for multi-optimizer support with habana backends (#13217)
Removed the need to explicitly load habana module (#13338)
Removed the deprecated Strategy.post_dispatch() hook (#13461)
Removed deprecated pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor.lr_sch_names (#13353)
Removed deprecated Trainer.slurm_job_id in favor of SLURMEnvironment.job_id (#13459)
Removed support for the DDP2Strategy (#12705)
Removed deprecated LightningDistributed (#13549)
Removed deprecated ClusterEnvironment properties master_address and master_port in favor of main_address and main_port (#13458)
Removed deprecated ClusterEnvironment methods KubeflowEnvironment.is_using_kubeflow(), LSFEnvironment.is_using_lsf() and TorchElasticEnvironment.is_using_torchelastic() in favor of the detect() method (#13458)
Removed deprecated Callback.on_keyboard_interrupt (#13438)
Removed deprecated LightningModule.on_post_move_to_device (#13548)
Removed TPUSpawnStrategy.{tpu_local_core_rank,tpu_global_core_rank} attributes in favor of TPUSpawnStrategy.{local_rank,global_rank} (#11163)
Removed SingleTPUStrategy.{tpu_local_core_rank,tpu_global_core_rank} attributes in favor of SingleTPUStrategy.{local_rank,global_rank} (#11163)
[1.7.0] - Fixed¶
Improved support for custom DataLoaders when instantiated in *_dataloader hook (#12981)
Allowed custom BatchSamplers when instantiated in *_dataloader hook (#13640)
Fixed an issue with unsupported torch.inference_mode() on hpu backends by making it use no_grad (#13014)
The model wrapper returned by LightningLite.setup() now properly supports pass-through when looking up attributes (#12597)
Fixed issue where the CLI fails with certain torch objects (#13153)
Fixed LightningCLI signature parameter resolving for some lightning classes (#13283)
Fixed Model Summary when using DeepSpeed Stage 3 (#13427)
Fixed pytorch_lightning.utilities.distributed.gather_all_tensors to handle tensors of different dimensions (#12630)
Fixed the input validation for the accelerator Trainer argument when passed as a string (#13417)
Fixed Trainer.predict(return_predictions=False) to track prediction’s batch_indices (#13629)
Fixed an issue that prevented setting a custom CheckpointIO plugin with strategies (#13785)
Fixed main progress bar counter when val_check_interval=int and check_val_every_n_epoch=None (#12832)
Improved support for custom ReduceLROnPlateau scheduler if reduce_on_plateau is set by the user in scheduler config (#13838)
Used global_step while restoring logging step for old checkpoints (#13645)
When training with precision=16 on IPU, the cast has been moved off the IPU onto the host, making the copies from host to IPU cheaper (#13880)
Fixed error handling in learning rate finder when not enough data points are available to give a good suggestion (#13845)
Fixed an issue that caused the learning rate finder to set the model’s learning rate to None when no suggestion was possible (#13845)
Fixed an issue causing deterministic algorithms and other globals to get reset in spawned processes (#13921)
Fixed default amp_level for DeepSpeedPrecisionPlugin to O2 (#13897)
Fixed Python 3.10 compatibility for truncated back-propagation through time (TBPTT) (#13973)
Fixed TQDMProgressBar reset and update to show correct time estimation (2/2) (#13962)
[1.6.5] - 2022-07-13¶
[1.6.5] - Fixed¶
Fixed estimated_stepping_batches requiring distributed comms in configure_optimizers for the DeepSpeedStrategy (#13350)
Fixed bug with Python version check that prevented use with development versions of Python (#13420)
The loops now call .set_epoch() also on batch samplers if the dataloader has one wrapped in a distributed sampler (#13396)
Fixed the restoration of log step during restart (#13467)
[1.6.4] - 2022-06-01¶
[1.6.4] - Added¶
Added all DDP params to be exposed through hpu parallel strategy (#13067)
[1.6.4] - Changed¶
[1.6.4] - Fixed¶
Fixed an issue causing zero-division error for empty dataloaders (#12885)
Fixed mismatching default values for the types of some arguments in the DeepSpeed and Fully-Sharded strategies which made the CLI unable to use them (#12989)
Avoid redundant callback restore warning while tuning (#13026)
Fixed Trainer(precision=64) during evaluation which now uses the wrapped precision module (#12983)
Fixed an issue to use wrapped LightningModule for evaluation during trainer.fit for BaguaStrategy (#12983)
Fixed an issue wrt unnecessary usage of habana mixed precision package for fp32 types (#13028)
Fixed the number of references of LightningModule so it can be deleted (#12897)
Fixed materialize_module setting a module’s child recursively (#12870)
Fixed issue where the CLI could not pass a Profiler to the Trainer (#13084)
Fixed torchelastic detection with non-distributed installations (#13142)
Fixed logging’s step values when multiple dataloaders are used during evaluation (#12184)
Fixed epoch logging on train epoch end (#13025)
Fixed DDPStrategy and DDPSpawnStrategy to initialize optimizers only after moving the module to the device (#11952)
[1.6.3] - 2022-05-03¶
[1.6.3] - Fixed¶
Use only a single instance of rich.console.Console throughout codebase (#12886)
Fixed an issue to ensure all the checkpoint states are saved in a common filepath with DeepSpeedStrategy (#12887)
Fixed trainer.logger deprecation message (#12671)
Fixed an issue where sharded grad scaler is passed in when using BF16 with the ShardedStrategy (#12915)
Fixed an issue wrt recursive invocation of DDP configuration in hpu parallel plugin (#12912)
Fixed printing of ragged dictionaries in Trainer.validate and Trainer.test (#12857)
Fixed threading support for legacy loading of checkpoints (#12814)
Fixed pickling of KFoldLoop (#12441)
Stopped optimizer_zero_grad from being called after IPU execution (#12913)
Fixed fuse_modules to be qat-aware for torch>=1.11 (#12891)
Enforced eval shuffle warning only for default samplers in DataLoader (#12653)
Enable mixed precision in DDPFullyShardedStrategy when precision=16 (#12965)
Fixed TQDMProgressBar reset and update to show correct time estimation (1/2) (#12889)
Fixed fit loop restart logic to enable resume using the checkpoint (#12821)
[1.6.2] - 2022-04-27¶
[1.6.2] - Fixed¶
Fixed ImportError when torch.distributed is not available. (#12794)
When using custom DataLoaders in LightningDataModule, multiple inheritance is resolved properly (#12716)
Fixed encoding issues on terminals that do not support unicode characters (#12828)
Fixed support for ModelCheckpoint monitors with dots (#12783)
[1.6.1] - 2022-04-13¶
[1.6.1] - Changed¶
Support strategy argument being case insensitive (#12528)
[1.6.1] - Fixed¶
Run main progress bar updates independent of val progress bar updates in TQDMProgressBar (#12563)
Avoid calling average_parameters multiple times per optimizer step (#12452)
Properly pass some Logger’s parent’s arguments to super().__init__() (#12609)
Fixed an issue where incorrect type warnings appear when the overridden LightningLite.run method accepts user-defined arguments (#12629)
Fixed rank_zero_only decorator in LSF environments (#12587)
Don’t raise a warning when nn.Module is not saved under hparams (#12669)
Raise MisconfigurationException when the accelerator is available but the user passes invalid ([]/0/"0") values to the devices flag (#12708)
Support auto_select_gpus with the accelerator and devices API (#12608)
[1.6.0] - 2022-03-29¶
[1.6.0] - Added¶
Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
Enable gradient accumulation using Horovod’s backward_passes_per_step (#11911)
Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
Fault Tolerant Manual
    Add _Stateful protocol to detect if classes are stateful (#10646)
    Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
    Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
    Add stateful workers (#10674)
    Add a utility to collect the states across processes (#10639)
    Add logic to reload the states across data loading components (#10699)
    Cleanup some fault tolerant utilities (#10703)
    Enable Fault Tolerant Manual Training (#10707)
    Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
Added a function to validate if fault tolerant training is supported. (#10465)
Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
Show a better error message when frozen dataclass is used as a batch (#10927)
Save the Loop’s state by default in the checkpoint (#10784)
Added Loop.replace to easily switch one loop for another (#10324)
Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
Added a warning that shows when max_epochs in the Trainer is not set (#10700)
Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
Added console_kwargs for RichProgressBar to initialize inner Console (#10875)
Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
Added info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
Added a PrecisionPlugin.teardown method (#10990)
Added LightningModule.lr_scheduler_step (#10249)
Added support for no pre-fetching to DataFetcher (#11606)
Added support for optimizer step progress tracking with manual optimization (#11848)
Return the output of the optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
Teardown the active loop and strategy on exception (#11620)
Added a MisconfigurationException if user provided opt_idx in scheduler config doesn’t match with actual optimizer index of its respective optimizer (#11247)
Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683)
Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
Added support for DDP when using a CombinedLoader for the training data (#11648)
Added a warning when using DistributedSampler during validation/testing (#11479)
Added support for Bagua training strategy (#11146)
Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
Added rank_zero module to centralize utilities (#11747)
Added a _Stateful support for LightningDataModule (#11637)
Added _Stateful support for PrecisionPlugin (#11638)
Added Accelerator.is_available to check device availability (#11797)
Enabled static type-checking on the signature of Trainer (#11888)
Added utility functions for moving optimizers to devices (#11758)
Added a warning when saving an instance of nn.Module with save_hyperparameters() (#12068)
Added estimated_stepping_batches property to Trainer (#11599)
Added support for pluggable Accelerators (#12030)
Added profiling for on_load_checkpoint/on_save_checkpoint callback and LightningModule hooks (#12149)
Added LayerSync and NativeSyncBatchNorm plugins (#11754)
Added optional storage_options argument to Trainer.save_checkpoint() to pass to custom CheckpointIO implementations (#11891)
Added support to explicitly specify the process group backend for parallel strategies (#11745)
Added device_ids and num_devices property to Trainer (#12151)
Added Callback.state_dict() and Callback.load_state_dict() methods (#12232)
Added AcceleratorRegistry (#12180)
Added support for Habana Accelerator (HPU) (#11808)
Added support for dataclasses in apply_to_collections (#11889)
[1.6.0] - Changed¶
Make benchmark flag optional and set its value based on the deterministic flag (#11944)
Implemented a new native and rich format in _print_results method of the EvaluationLoop (#11332)
Do not print an empty table at the end of the EvaluationLoop (#12427)
Set the prog_bar flag to False in LightningModule.log_grad_norm (#11472)
Raised exception in init_dist_connection() when torch distributed is not available (#10418)
The monitor argument in the EarlyStopping callback is no longer optional (#10328)
Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
Raised MisconfigurationException when enable_progress_bar=False and a progress bar instance has been passed in the callback list (#10520)
Moved trainer.connectors.env_vars_connector._defaults_from_env_vars to utilities.argsparse._defaults_from_env_vars (#10501)
Changes in LightningCLI required for the new major release of jsonargparse v4.0.0 (#10426)
Renamed refresh_rate_per_second parameter to refresh_rate for RichProgressBar signature (#10497)
Moved ownership of the PrecisionPlugin into TrainingTypePlugin and updated all references (#10570)
Fault Tolerant relies on signal.SIGTERM to gracefully exit instead of signal.SIGUSR1 (#10605)
Loop.restarting=... now sets the value recursively for all subloops (#11442)
Raised an error if the batch_size cannot be inferred from the current batch if it contained a string or was a custom batch object (#10541)
The validation loop is now disabled when overfit_batches > 0 is set in the Trainer (#9709)
Moved optimizer related logics from Accelerator to TrainingTypePlugin (#10596)
Moved ownership of the lightning optimizers from the Trainer to the Strategy (#11444)
Moved ownership of the data fetchers from the DataConnector to the Loops (#11621)
Moved batch_to_device method from Accelerator to TrainingTypePlugin (#10649)
The DDPSpawnPlugin no longer overrides the post_dispatch plugin hook (#10034)
Integrate the progress bar implementation with progress tracking (#11213)
The LightningModule.{add_to_queue,get_from_queue} hooks no longer get a torch.multiprocessing.SimpleQueue and instead receive a list based queue (#10034)
Changed training_step, validation_step, test_step and predict_step method signatures in Accelerator and updated input from caller side (#10908)
Changed the name of the temporary checkpoint that the DDPSpawnPlugin and related plugins save (#10934)
LoggerCollection returns only unique logger names and versions (#10976)
Redesigned process creation for spawn-based plugins (DDPSpawnPlugin, TPUSpawnPlugin, etc.) (#10896)
    All spawn-based plugins now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}
    The hooks/callbacks prepare_data, setup, configure_sharded_model and teardown now run under initialized process group for spawn-based plugins just like their non-spawn counterparts
    Some configuration errors that were previously raised as MisconfigurationExceptions will now be raised as ProcessRaisedException (torch>=1.8) or as Exception (torch<1.8)
    Removed the TrainingTypePlugin.pre_dispatch() method and merged it with TrainingTypePlugin.setup() (#11137)
Changed profiler to index and display the names of the hooks with a new pattern (#11026)
Changed batch_to_device entry in profiling from stage-specific to generic, to match profiling of other hooks (#11031)
Changed the info message for finalizing ddp-spawn worker processes to a debug-level message (#10864)
Removed duplicated file extension when uploading model checkpoints with NeptuneLogger (#11015)
Removed __getstate__ and __setstate__ of RichProgressBar (#11100)
The DDPPlugin and DDPSpawnPlugin and their subclasses now remove the SyncBatchNorm wrappers in teardown() to enable proper support at inference after fitting (#11078)
Moved ownership of the Accelerator instance to the TrainingTypePlugin; all training-type plugins now take an optional parameter accelerator (#11022)
Renamed the TrainingTypePlugin to Strategy (#11120)
Renamed the ParallelPlugin to ParallelStrategy (#11123)
Renamed the DataParallelPlugin to DataParallelStrategy (#11183)
Renamed the DDPPlugin to DDPStrategy (#11142)
Renamed the DDP2Plugin to DDP2Strategy (#11185)
Renamed the DDPShardedPlugin to DDPShardedStrategy (#11186)
Renamed the DDPFullyShardedPlugin to DDPFullyShardedStrategy (#11143)
Renamed the DDPSpawnPlugin to DDPSpawnStrategy (#11145)
Renamed the DDPSpawnShardedPlugin to DDPSpawnShardedStrategy (#11210)
Renamed the DeepSpeedPlugin to DeepSpeedStrategy (#11194)
Renamed the HorovodPlugin to HorovodStrategy (#11195)
Renamed the TPUSpawnPlugin to TPUSpawnStrategy (#11190)
Renamed the IPUPlugin to IPUStrategy (#11193)
Renamed the SingleDevicePlugin to SingleDeviceStrategy (#11182)
Renamed the SingleTPUPlugin to SingleTPUStrategy (#11182)
Renamed the TrainingTypePluginsRegistry to StrategyRegistry (#11233)
Marked the ResultCollection, ResultMetric, and ResultMetricCollection classes as protected (#11130)
Marked trainer.checkpoint_connector as protected (#11550)
The epoch start/end hooks are now called by the FitLoop instead of the TrainingEpochLoop (#11201)
DeepSpeed does not require lightning module zero 3 partitioning (#10655)
Moved Strategy classes to the strategies directory (#11226)
Renamed training_type_plugin file to strategy (#11239)
Changed DeviceStatsMonitor to group metrics based on the logger’s group_separator (#11254)
Raised UserWarning if evaluation is triggered with best ckpt and trainer is configured with multiple checkpoint callbacks (#11274)
Trainer.logged_metrics now always contains scalar tensors, even when a Python scalar was logged (#11270)
The tuner now uses the checkpoint connector to copy and restore its state (#11518)
Changed MisconfigurationException to ModuleNotFoundError when rich isn’t available (#11360)
The trainer.current_epoch value is now increased by 1 during and after on_train_end (#8578)
The trainer.global_step value now accounts for multiple optimizers and TBPTT splits (#11805)
The trainer.global_step value is now increased right after the optimizer.step() call which will impact users who access it during an intra-training validation hook (#11805)
The filename of checkpoints created with ModelCheckpoint(filename='{step}') is different compared to previous versions. A checkpoint saved after 1 step will be named step=1.ckpt instead of step=0.ckpt (#11805)
Inherit from ABC for Accelerator: Users need to implement auto_device_count (#11521)
Changed parallel_devices property in ParallelStrategy to be lazy initialized (#11572)
Updated TQDMProgressBar to run a separate progress bar for each eval dataloader (#11657)
Sorted SimpleProfiler(extended=False) summary based on mean duration for each hook (#11671)
Avoid enforcing shuffle=False for eval dataloaders (#11575)
When using DP (data-parallel), Lightning will no longer automatically reduce all tensors returned in training_step; it will only reduce the loss unless training_step_end is overridden (#11594)
When using DP (data-parallel), the training_epoch_end hook will no longer receive reduced outputs from training_step and instead get the full tensor of results from all GPUs (#11594)
Changed default logger name to lightning_logs for consistency (#11762)
Rewrote accelerator_connector (#11448)
When manual optimization is used with DDP, we no longer force find_unused_parameters=True (#12425)
Disable loading dataloaders if corresponding limit_batches=0 (#11576)
Removed is_global_zero check in training_epoch_loop before logger.save. If you have a custom logger that implements save, the Trainer will now call save on all ranks by default. To change this behavior, add @rank_zero_only to your save implementation (#12134)
Disabled tuner with distributed strategies (#12179)
Marked trainer.logger_connector as protected (#12195)
Move Strategy.process_dataloader function call from fit/evaluation/predict_loop.py to data_connector.py (#12251)
ModelCheckpoint(save_last=True, every_n_epochs=N) now saves a “last” checkpoint every epoch (disregarding every_n_epochs) instead of only once at the end of training (#12418)
The strategies that support sync_batchnorm now only apply it when fitting (#11919)
Avoided fallback on CPU if no devices are provided for other accelerators (#12410)
Modified supporters.py so that the accumulator element (for loss) is created directly on the device (#12430)
Removed EarlyStopping.on_save_checkpoint and EarlyStopping.on_load_checkpoint in favor of EarlyStopping.state_dict and EarlyStopping.load_state_dict (#11887)
Removed BaseFinetuning.on_save_checkpoint and BaseFinetuning.on_load_checkpoint in favor of BaseFinetuning.state_dict and BaseFinetuning.load_state_dict (#11887)
Removed BackboneFinetuning.on_save_checkpoint and BackboneFinetuning.on_load_checkpoint in favor of BackboneFinetuning.state_dict and BackboneFinetuning.load_state_dict (#11887)
Removed ModelCheckpoint.on_save_checkpoint and ModelCheckpoint.on_load_checkpoint in favor of ModelCheckpoint.state_dict and ModelCheckpoint.load_state_dict (#11887)
Removed Timer.on_save_checkpoint and Timer.on_load_checkpoint in favor of Timer.state_dict and Timer.load_state_dict (#11887)
Replaced PostLocalSGDOptimizer with a dedicated model averaging component (#12378)
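The Plugin-to-Strategy renames listed in this section lend themselves to a lookup table when auditing older code. The sketch below is only an illustration: the class names are copied from the rename entries above, while PLUGIN_TO_STRATEGY and renamed() are hypothetical helper names that are not part of Lightning.

```python
# Old class names mapped to their 1.6.0 names, copied from the rename
# entries in this section. The dict and helper are illustrative only and
# are NOT part of the Lightning API.
PLUGIN_TO_STRATEGY = {
    "TrainingTypePlugin": "Strategy",
    "ParallelPlugin": "ParallelStrategy",
    "DataParallelPlugin": "DataParallelStrategy",
    "DDPPlugin": "DDPStrategy",
    "DDP2Plugin": "DDP2Strategy",
    "DDPShardedPlugin": "DDPShardedStrategy",
    "DDPFullyShardedPlugin": "DDPFullyShardedStrategy",
    "DDPSpawnPlugin": "DDPSpawnStrategy",
    "DDPSpawnShardedPlugin": "DDPSpawnShardedStrategy",
    "DeepSpeedPlugin": "DeepSpeedStrategy",
    "HorovodPlugin": "HorovodStrategy",
    "TPUSpawnPlugin": "TPUSpawnStrategy",
    "IPUPlugin": "IPUStrategy",
    "SingleDevicePlugin": "SingleDeviceStrategy",
    "SingleTPUPlugin": "SingleTPUStrategy",
    "TrainingTypePluginsRegistry": "StrategyRegistry",
}


def renamed(old_name: str) -> str:
    """Return the 1.6.0 name for a renamed class, or the input unchanged."""
    return PLUGIN_TO_STRATEGY.get(old_name, old_name)


print(renamed("DDPPlugin"))
```

Names that were not renamed pass through unchanged, so the helper can be applied to every identifier found in a codebase without special-casing.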
[1.6.0] - Deprecated¶
Deprecated training_type_plugin property in favor of strategy in Trainer and updated the references (#11141)
Deprecated Trainer.{validated,tested,predicted}_ckpt_path and replaced with read-only property Trainer.ckpt_path set when checkpoints loaded via Trainer.{fit,validate,test,predict} (#11696)
Deprecated ClusterEnvironment.master_{address,port} in favor of ClusterEnvironment.main_{address,port} (#10103)
Deprecated DistributedType in favor of _StrategyType (#10505)
Deprecated the precision_plugin constructor argument from Accelerator (#10570)
Deprecated DeviceType in favor of _AcceleratorType (#10503)
Deprecated the property Trainer.slurm_job_id in favor of the new SLURMEnvironment.job_id() method (#10622)
Deprecated the access to the attribute IndexBatchSamplerWrapper.batch_indices in favor of IndexBatchSamplerWrapper.seen_batch_indices (#10870)
Deprecated on_init_start and on_init_end callback hooks (#10940)
Deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#10979)
Deprecated TrainingTypePlugin.post_dispatch in favor of TrainingTypePlugin.teardown (#10939)
Deprecated ModelIO.on_hpc_{save/load} in favor of CheckpointHooks.on_{save/load}_checkpoint (#10911)
Deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#11000)
Deprecated Trainer.lr_schedulers in favor of Trainer.lr_scheduler_configs which returns a list of dataclasses instead of dictionaries (#11443)
Deprecated Trainer.verbose_evaluate in favor of EvaluationLoop(verbose=...) (#10931)
Deprecated Trainer.should_rank_save_checkpoint Trainer property (#11068)
Deprecated Trainer.lightning_optimizers (#11444)
Deprecated TrainerOptimizersMixin and moved functionality to core/optimizer.py (#11155)
Deprecated the on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#12182)
Deprecated the training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#12182)
Deprecated TrainerCallbackHookMixin (#11148)
Deprecated TrainerDataLoadingMixin and moved functionality to Trainer and DataConnector (#11282)
Deprecated function pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys (#11254)
Deprecated Callback.on_epoch_start hook in favor of Callback.on_{train/val/test}_epoch_start (#11578)
Deprecated Callback.on_epoch_end hook in favor of Callback.on_{train/val/test}_epoch_end (#11578)
Deprecated LightningModule.on_epoch_start hook in favor of LightningModule.on_{train/val/test}_epoch_start (#11578)
Deprecated LightningModule.on_epoch_end hook in favor of LightningModule.on_{train/val/test}_epoch_end (#11578)
Deprecated on_before_accelerator_backend_setup callback hook in favor of setup (#11568)
Deprecated on_batch_start and on_batch_end callback hooks in favor of on_train_batch_start and on_train_batch_end (#11577)
Deprecated on_configure_sharded_model callback hook in favor of setup (#11627)
Deprecated pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only (#11747)
Deprecated pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug (#11747)
Deprecated pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info (#11747)
Deprecated pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn (#11747)
Deprecated pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation (#11747)
Deprecated pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning (#11747)
Deprecated on_pretrain_routine_start and on_pretrain_routine_end callback hooks in favor of on_fit_start (#11794)
Deprecated LightningModule.on_pretrain_routine_start and LightningModule.on_pretrain_routine_end hooks in favor of on_fit_start (#12122)
Deprecated agg_key_funcs and agg_default_func parameters from LightningLoggerBase (#11871)
Deprecated LightningLoggerBase.update_agg_funcs (#11871)
Deprecated LightningLoggerBase.agg_and_log_metrics in favor of LightningLoggerBase.log_metrics (#11832)
Deprecated passing weights_save_path to the Trainer constructor in favor of adding the ModelCheckpoint callback with dirpath directly to the list of callbacks (#12084)
Deprecated pytorch_lightning.profiler.AbstractProfiler in favor of pytorch_lightning.profiler.Profiler (#12106)
Deprecated pytorch_lightning.profiler.BaseProfiler in favor of pytorch_lightning.profiler.Profiler (#12150)
Deprecated BaseProfiler.profile_iterable (#12102)
Deprecated LoggerCollection in favor of trainer.loggers (#12147)
Deprecated PrecisionPlugin.on_{save,load}_checkpoint in favor of PrecisionPlugin.{state_dict,load_state_dict} (#11978)
Deprecated LightningDataModule.on_save/load_checkpoint in favor of state_dict/load_state_dict (#11893)
Deprecated Trainer.use_amp in favor of Trainer.amp_backend (#12312)
Deprecated LightningModule.use_amp in favor of Trainer.amp_backend (#12315)
Deprecated specifying the process group backend through the environment variable PL_TORCH_DISTRIBUTED_BACKEND (#11745)
Deprecated ParallelPlugin.torch_distributed_backend in favor of DDPStrategy.process_group_backend property (#11745)
Deprecated ModelCheckpoint.save_checkpoint in favor of Trainer.save_checkpoint (#12456)
Deprecated Trainer.devices in favor of Trainer.num_devices and Trainer.device_ids (#12151)
Deprecated Trainer.root_gpu in favor of Trainer.strategy.root_device.index when GPU is used (#12262)
Deprecated Trainer.num_gpus in favor of Trainer.num_devices when GPU is used (#12384)
Deprecated Trainer.ipus in favor of Trainer.num_devices when IPU is used (#12386)
Deprecated Trainer.num_processes in favor of Trainer.num_devices (#12388)
Deprecated Trainer.data_parallel_device_ids in favor of Trainer.device_ids (#12072)
Deprecated returning state from Callback.on_save_checkpoint in favor of returning state in Callback.state_dict for checkpointing (#11887)
Deprecated passing only the callback state to Callback.on_load_checkpoint(callback_state) in favor of passing the callback state to Callback.load_state_dict and in 1.8, passing the entire checkpoint dictionary to Callback.on_load_checkpoint(checkpoint) (#11887)
Deprecated Trainer.gpus in favor of Trainer.device_ids or Trainer.num_devices (#12436)
Deprecated Trainer.tpu_cores in favor of Trainer.num_devices (#12437)
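The move from returning state in on_save_checkpoint to the state_dict/load_state_dict pair, described in the entries above, might look like the following for a stateful callback. This is a minimal sketch, assuming a plain Python class stands in for a Callback subclass so that it runs without Lightning installed; CounterCallback and batches_seen are hypothetical names chosen for illustration.

```python
from typing import Any, Dict


class CounterCallback:
    """Stand-in for a Callback subclass; it deliberately does not import Lightning."""

    def __init__(self) -> None:
        self.batches_seen = 0

    def state_dict(self) -> Dict[str, Any]:
        # Takes over the role of returning state from on_save_checkpoint.
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Takes over the role of on_load_checkpoint(callback_state).
        self.batches_seen = state_dict["batches_seen"]


# Round-trip the state the way a save followed by a restore would.
original = CounterCallback()
original.batches_seen = 42
restored = CounterCallback()
restored.load_state_dict(original.state_dict())
print(restored.batches_seen)
```

Keeping the state in a plain dictionary means the same two methods cover both saving and restoring, and the checkpoint hooks stay free for logic that needs the whole checkpoint.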
[1.6.0] - Removed¶
Removed deprecated parameter method in pytorch_lightning.utilities.model_helpers.is_overridden (#10507)
Remove deprecated method ClusterEnvironment.creates_children (#10339)
Removed deprecated TrainerModelHooksMixin.is_function_implemented and TrainerModelHooksMixin.has_arg (#10322)
Removed deprecated pytorch_lightning.utilities.device_dtype_mixin.DeviceDtypeModuleMixin in favor of pytorch_lightning.core.mixins.device_dtype_mixin.DeviceDtypeModuleMixin (#10442)
Removed deprecated LightningModule.loaded_optimizer_states_dict property (#10346)
Removed deprecated Trainer.fit(train_dataloader=), Trainer.validate(val_dataloaders=), and Trainer.test(test_dataloader=) (#10325)
Removed deprecated has_prepared_data, has_setup_fit, has_setup_validate, has_setup_test, has_setup_predict, has_teardown_fit, has_teardown_validate, has_teardown_test and has_teardown_predict datamodule lifecycle properties (#10350)
Removed deprecated every_n_val_epochs parameter of ModelCheckpoint (#10366)
Removed deprecated import pytorch_lightning.profiler.profilers in favor of import pytorch_lightning.profiler (#10443)
Removed deprecated property configure_slurm_dpp from accelerator connector (#10370)
Removed deprecated arguments num_nodes and sync_batchnorm from DDPPlugin, DDPSpawnPlugin, DeepSpeedPlugin (#10357)
Removed deprecated property is_slurm_managing_tasks from AcceleratorConnector (#10353)
Removed deprecated LightningModule.log(tbptt_reduce_fx, tbptt_reduce_token, sync_dist_op) (#10423)
Removed deprecated Plugin.task_idx (#10441)
Removed deprecated method master_params from PrecisionPlugin (#10372)
Removed the automatic detachment of “extras” returned from training_step. For example, return {'loss': ..., 'foo': foo.detach()} will now be necessary if foo has gradients which you do not want to store (#10424)
Removed deprecated passthrough methods and properties from Accelerator base class:
Removed deprecated signature for transfer_batch_to_device hook. The new argument dataloader_idx is now required (#10480)
Removed deprecated utilities.distributed.rank_zero_{warn/deprecation} (#10451)
Removed deprecated mode argument from ModelSummary class (#10449)
Removed deprecated Trainer.train_loop property in favor of Trainer.fit_loop (#10482)
Removed deprecated disable_validation property from Trainer (#10450)
Removed deprecated CheckpointConnector.hpc_load property in favor of CheckpointConnector.restore (#10525)
Removed deprecated reload_dataloaders_every_epoch from Trainer in favour of reload_dataloaders_every_n_epochs (#10481)
Removed the precision_plugin attribute from Accelerator in favor of its equivalent attribute precision_plugin in the TrainingTypePlugin (#10570)
Removed DeepSpeedPlugin.{precision,amp_type,amp_level} properties (#10657)
Removed patching of on_before_batch_transfer, transfer_batch_to_device and on_after_batch_transfer hooks in LightningModule (#10603)
Removed argument return_result from the DDPSpawnPlugin.spawn() method (#10867)
Removed the property TrainingTypePlugin.results and corresponding properties in subclasses (#10034)
Removed the mp_queue attribute from DDPSpawnPlugin and TPUSpawnPlugin (#10034)
Removed unnecessary _move_optimizer_state method overrides from TPUSpawnPlugin and SingleTPUPlugin (#10849)
Removed should_rank_save_checkpoint property from TrainingTypePlugin (#11070)
Removed model_sharded_context method from Accelerator (#10886)
Removed method pre_dispatch from the PrecisionPlugin (#10887)
Removed method setup_optimizers_in_pre_dispatch from the strategies and achieve the same logic in setup and pre_dispatch methods (#10906)
Removed methods pre_dispatch, dispatch and post_dispatch from the Accelerator (#10885)
Removed method training_step, test_step, validation_step and predict_step from the Accelerator (#10890)
Removed TrainingTypePlugin.start_{training,evaluating,predicting} hooks and the same in all subclasses (#10989, #10896)
Removed Accelerator.on_train_start (#10999)
Removed support for Python 3.6 (#11117)
Removed Strategy.init_optimizers in favor of Strategy.setup_optimizers (#11236)
Removed profile("training_step_and_backward") in Closure class since we already profile calls training_step and backward (#11222)
Removed Strategy.optimizer_zero_grad (#11246)
Removed Strategy.on_gpu (#11537)
Removed Strategy.on_tpu property (#11536)
Removed the abstract property LightningLoggerBase.experiment (#11603)
Removed FitLoop.current_epoch getter and setter (#11562)
Removed access to _short_id in NeptuneLogger (#11517)
Removed log_text and log_image from the LightningLoggerBase API (#11857)
Removed calls to profile("model_forward") in favor of profiling training_step (#12032)
Removed get_mp_spawn_kwargs from DDPSpawnStrategy and TPUSpawnStrategy in favor of configuration in the _SpawnLauncher (#11966)
Removed _aggregate_metrics, _reduce_agg_metrics, and _finalize_agg_metrics from LightningLoggerBase (#12053)
Removed the AcceleratorConnector.device_type property (#12081)
Removed AcceleratorConnector.num_nodes (#12107)
Removed AcceleratorConnector.has_ipu property (#12111)
Removed AcceleratorConnector.use_ipu property (#12110)
Removed AcceleratorConnector.has_tpu property (#12109)
Removed AcceleratorConnector.use_dp property (#12112)
Removed configure_sync_batchnorm from ParallelStrategy and all other strategies that inherit from it (#11754)
Removed public attribute sync_batchnorm from strategies (#11754)
Removed AcceleratorConnector.root_gpu property (#12262)
Removed AcceleratorConnector.tpu_id property (#12387)
Removed AcceleratorConnector.num_gpus property (#12384)
Removed AcceleratorConnector.num_ipus property (#12386)
Removed AcceleratorConnector.num_processes property (#12388)
Removed AcceleratorConnector.parallel_device_ids property (#12072)
Removed AcceleratorConnector.devices property (#12435)
Removed AcceleratorConnector.parallel_devices property (#12075)
Removed AcceleratorConnector.tpu_cores property (#12437)
[1.6.0] - Fixed¶
Fixed an issue where ModelCheckpoint could delete last checkpoint from the old directory when dirpath has changed during resumed training (#12225)
Fixed an issue where ModelCheckpoint could delete older checkpoints when dirpath has changed during resumed training (#12045)
Fixed an issue where HorovodStrategy.teardown() did not complete gracefully if an exception was thrown during callback setup (#11752)
Fixed security vulnerabilities CVE-2020-1747 and CVE-2020-14343 caused by the PyYAML dependency (#11099)
Fixed security vulnerability “CWE-94: Improper Control of Generation of Code (Code Injection)” (#12212)
Fixed logging on
{test,validation}_epoch_end
with multiple dataloaders (#11132)Reset the validation progress tracking state after sanity checking (#11218)
Fixed double evaluation bug with fault-tolerance enabled where the second call was completely skipped (#11119)
Fixed an issue with the
TPUSpawnPlugin
handling theXLA_USE_BF16
environment variable incorrectly (#10990)Fixed wrong typehint for
Trainer.lightning_optimizers
(#11155)Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy (#11307)
Fixed bug that forced overriding
configure_optimizers
with the CLI (#11672)Fixed type promotion when tensors of higher category than float are logged (#11401)
Fixed
SimpleProfiler
summary (#11414)No longer set a
DistributedSampler
to thepoptorch.DataLoader
when IPUs are used (#12114)Fixed bug where progress bar was not being disabled when not in rank zero during predict (#11377)
Fixed the mid-epoch warning call while resuming training (#11556)
Fixed
LightningModule.{un,}toggle_model
when only 1 optimizer is used (#12088)Fixed an issue in
RichProgressbar
to display the metrics logged only on main progress bar (#11690)Fixed
RichProgressBar
progress when refresh rate does not evenly divide the total counter (#11668)Fixed
RichProgressBar
progress validation bar total when using multiple validation runs within a single training epoch (#11668)Configure native Deepspeed schedulers with interval=’step’ (#11788), (#12031)
Update
RichProgressBarTheme
styles after detecting light theme on colab (#10993)Fixed passing
_ddp_params_and_buffers_to_ignore
(#11949)Fixed an
AttributeError
when callingsave_hyperparameters
and no parameters need saving (#11827)Fixed environment variable priority for global rank determination (#11406)
Fixed an issue that caused the Trainer to produce identical results on subsequent runs without explicit re-seeding (#11870)
Fixed an issue that caused the Tuner to affect the random state (#11870)
Fixed to avoid common hook warning if no hook is overridden (#12131)
Fixed deepspeed keeping old sub-folders in same ckpt path (#12194)
Fixed returning logged metrics instead of callback metrics during evaluation (#12224)
Fixed the case where
logger=None
is passed to the Trainer (#12249)Fixed bug where the global step tracked by
ModelCheckpoint
was still set even if no checkpoint was saved (#12418)Fixed bug where
ModelCheckpoint
was overriding theepoch
andstep
logged values (#12418)Fixed bug where monitoring the default
epoch
andstep
values withModelCheckpoint
would fail (#12418)Fixed initializing optimizers unnecessarily in
DDPFullyShardedStrategy
(#12267)Fixed check for horovod module (#12377)
Fixed logging to loggers with multiple eval dataloaders (#12454)
Fixed an issue with resuming from a checkpoint trained with QAT (#11346)
[1.5.10] - 2022-02-08
[1.5.10] - Fixed
Fixed an issue where the validation loop would run on restart (#11552)
The RichProgressBar now correctly shows the on_epoch logged values on train epoch end (#11689)
Fixed an issue so that the step argument in WandbLogger.log_image works (#11716)
Fixed restore_optimizers for mapping states (#11757)
With DPStrategy, the batch is not explicitly moved to the device (#11780)
Fixed an issue where the validation bar disappeared after trainer.validate() (#11700)
Fixed supporting remote filesystems with Trainer.weights_save_path for fault-tolerant training (#11776)
Fixed check for available modules (#11526)
Fixed bug where the path for “last” checkpoints was not getting saved correctly which caused newer runs to not remove the previous “last” checkpoint (#11481)
Fixed bug where the path for best checkpoints was not getting saved correctly when no metric was monitored which caused newer runs to not use the best checkpoint (#11481)
[1.5.9] - 2022-01-20
[1.5.9] - Fixed
Pinned sphinx-autodoc-typehints with <v1.15 (#11400)
Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu (#11217)
Fixed type promotion when tensors of higher category than float are logged (#11401)
Fixed the format of the configuration saved automatically by the CLI’s SaveConfigCallback (#11532)
[1.5.9] - Changed
[1.5.8] - 2022-01-05
[1.5.8] - Fixed
Fixed LightningCLI race condition while saving the config (#11199)
Fixed the default value used with log(reduce_fx=min|max) (#11310)
Fixed data fetcher selection (#11294)
Fixed a race condition that could result in incorrect (zero) values being observed in prediction writer callbacks (#11288)
Fixed dataloaders not getting reloaded the correct number of times when setting reload_dataloaders_every_n_epochs and check_val_every_n_epoch (#10948)
Fixed deepspeed strategy not restoring the lr-scheduler states when lr-scheduler(s) are configured through LightningModule.configure_optimizers (#11322)
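As an illustration of the log(reduce_fx=min|max) entry above (#11310): a running min or max needs a starting value that can never win the comparison. The sketch below is plain Python, not Lightning's implementation. RunningReduction and reduce_all are hypothetical names, and the choice of +inf for min and -inf for max is an assumption about what a sensible default looks like.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class RunningReduction:
    """Hypothetical accumulator for illustration only (not a Lightning class)."""

    reduce_fx: Callable[[float, float], float]
    value: float = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: start from the identity element of the reduction so that
        # the first logged value always replaces it.
        if self.reduce_fx is min:
            self.value = math.inf
        elif self.reduce_fx is max:
            self.value = -math.inf
        else:
            raise ValueError("this sketch only covers min and max")

    def update(self, new_value: float) -> None:
        # Fold the new value into the running result.
        self.value = self.reduce_fx(self.value, new_value)


def reduce_all(reduce_fx: Callable[[float, float], float], values: Iterable[float]) -> float:
    """Feed every value through a fresh accumulator and return the result."""
    acc = RunningReduction(reduce_fx)
    for v in values:
        acc.update(v)
    return acc.value
```

With a zero default instead, the minimum of strictly positive values or the maximum of strictly negative values would incorrectly come out as 0.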
[1.5.7] - 2021-12-21
[1.5.7] - Fixed
Fixed NeptuneLogger when using DDP (#11030)
Fixed a bug to disable logging hyperparameters in logger if there are no hparams (#11105)
Avoid the deprecated onnx.export(example_outputs=...) in torch 1.10 (#11116)
Fixed an issue when torch-scripting a LightningModule after training with Trainer(sync_batchnorm=True) (#11078)
Fixed an AttributeError occurring when using a CombinedLoader (multiple dataloaders) for prediction (#11111)
Fixed bug where Trainer(track_grad_norm=..., logger=False) would fail (#11114)
Fixed an incorrect warning being produced by the model summary when using bf16 precision on CPU (#11161)
[1.5.7] - Changed
[1.5.6] - 2021-12-15
[1.5.6] - Fixed
Fixed a bug where the DeepSpeedPlugin arguments cpu_checkpointing and contiguous_memory_optimization were not being forwarded to deepspeed correctly (#10874)
Fixed an issue with NeptuneLogger causing checkpoints to be uploaded with a duplicated file extension (#11015)
Fixed support for logging within callbacks returned from LightningModule (#10991)
Fixed running sanity check with RichProgressBar (#10913)
Fixed support for CombinedLoader while checking for warning raised with eval dataloaders (#10994)
The TQDM progress bar now correctly shows the on_epoch logged values on train epoch end (#11069)
Fixed bug where the TQDM updated the training progress bar during trainer.validate (#11069)
[1.5.5] - 2021-12-07
[1.5.5] - Fixed
Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally (#10815)
Fixed an issue with SignalConnector not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611)
Fixed SignalConnector._has_already_handler check for callable type (#10483)
Fixed an issue to return the results for each dataloader separately instead of duplicating them for each (#10810)
Improved exception message if rich version is less than 10.2.2 (#10839)
Fixed uploading best model checkpoint in NeptuneLogger (#10369)
Fixed early schedule reset logic in PyTorch profiler that was causing data leak (#10837)
Fixed a bug that caused incorrect batch indices to be passed to the BasePredictionWriter hooks when using a dataloader with num_workers > 0 (#10870)
Fixed an issue with item assignment on the logger on rank > 0 for those who support it (#10917)
Fixed importing torch_xla.debug for torch-xla<1.8 (#10836)
Fixed an issue with DDPSpawnPlugin and related plugins leaving a temporary checkpoint behind (#10934)
Fixed a TypeError occurring in the SignalConnector.teardown() method (#10961)
[1.5.4] - 2021-11-30
[1.5.4] - Fixed
Fixed support for --key.help=class with the LightningCLI (#10767)
Fixed _compare_version for python packages (#10762)
Fixed TensorBoardLogger SummaryWriter not closing before spawning the processes (#10777)
Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
Fixed the default logging level for batch hooks associated with training from on_step=False, on_epoch=True to on_step=True, on_epoch=False (#10756)
[1.5.4] - Removed
[1.5.3] - 2021-11-24
[1.5.3] - Fixed
Fixed ShardedTensor state dict hook registration to check if torch distributed is available (#10621)
Fixed an issue with self.log not respecting a tensor’s dtype when applying computations (#10076)
Fixed LightningLite _wrap_init popping non-existent keys from DataLoader signature parameters (#10613)
Fixed signals being registered within threads (#10610)
Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in LightningModule.log (#10408)
Fixed Trainer(move_metrics_to_cpu=True) not moving the evaluation logged results to CPU (#10631)
Fixed the {validation,test}_step outputs getting moved to CPU with Trainer(move_metrics_to_cpu=True) (#10631)
Fixed an issue with collecting logged test results with multiple dataloaders (#10522)
[1.5.2] - 2021-11-16
[1.5.2] - Fixed
Fixed CombinedLoader and max_size_cycle not receiving a DistributedSampler (#10374)
Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in utilities.apply_to_collection (#9702)
Fixed isinstance not working with init_meta_context, materialized model not being moved to the device (#10493)
Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to failure (#10463)
Squeeze the early stopping monitor to remove empty tensor dimensions (#10461)
Fixed sampler replacement logic with overfit_batches to only replace the sampler when SequentialSampler is not used (#10486)
Fixed scripting causing false positive deprecation warnings (#10470, #10555)
Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from DeviceDtypeModuleMixin (#10559)
[1.5.1] - 2021-11-09
[1.5.1] - Fixed
Fixed apply_to_collection(defaultdict) (#10316)
Fixed failure when DataLoader(batch_size=None) is passed (#10345)
Fixed interception of __init__ arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
Fixed issue with pickling CSVLogger after a call to CSVLogger.save (#10388)
Fixed an import error being caused by PostLocalSGD when torch.distributed is not available (#10359)
Fixed the logging with on_step=True in epoch-level hooks causing unintended side-effects. Logging with on_step=True in epoch-level hooks will now correctly raise an error (#10409)
Fixed deadlocks for distributed training with RichProgressBar (#10428)
Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
Fixed dataloader workers with persistent_workers being deleted on every iteration (#10434)
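The apply_to_collection(defaultdict) entry above (#10316) concerns a general pitfall: a defaultdict cannot be rebuilt the way a plain dict can, because its constructor takes the default factory as the first positional argument. The sketch below shows one way a recursive walker can handle this. apply_to_mapping_sketch is a hypothetical helper written for this note, not the Lightning utility, and it only covers mappings.

```python
from collections import defaultdict
from typing import Any, Callable


def apply_to_mapping_sketch(data: Any, function: Callable[[Any], Any]) -> Any:
    """Hypothetical helper: apply `function` to every leaf of nested dicts."""
    if isinstance(data, defaultdict):
        # defaultdict(...) expects the default factory first, so it cannot be
        # rebuilt generically with type(data)(items); pass the factory along.
        return defaultdict(
            data.default_factory,
            {k: apply_to_mapping_sketch(v, function) for k, v in data.items()},
        )
    if isinstance(data, dict):
        # Plain dict and OrderedDict accept an iterable of (key, value) pairs.
        return type(data)((k, apply_to_mapping_sketch(v, function)) for k, v in data.items())
    # Anything else is treated as a leaf.
    return function(data)
```

The returned defaultdict keeps its factory, so missing keys still produce default values after the transformation.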
[1.5.0] - 2021-11-02
[1.5.0] - Added
Added support for monitoring the learning rate without schedulers in LearningRateMonitor (#9786)
Added registration of ShardedTensor state dict hooks in LightningModule.__init__ if the PyTorch version supports ShardedTensor (#8944)
Added error handling including calling of on_keyboard_interrupt() and on_exception() for all entrypoints (fit, validate, test, predict) (#8819)
Added a flavor of training_step that takes dataloader_iter as an argument (#8807)
Added a state_key property to the Callback base class (#6886)
Added progress tracking to loops:
  Integrated TrainingEpochLoop.total_batch_idx (#8598)
  Added BatchProgress and integrated TrainingEpochLoop.is_last_batch (#9657)
  Avoid optional Tracker attributes (#9320)
  Reset current progress counters when restarting an epoch loop that had already finished (#9371)
  Call reset_on_restart in the loop’s reset hook instead of when loading a checkpoint (#9561)
  Use completed over processed in reset_on_restart (#9656)
  Renamed reset_on_epoch to reset_on_run (#9658)
Added batch_size and rank_zero_only arguments for log_dict to match log (#8628)
Added a check for unique GPU ids (#8666)
Added ResultCollection state_dict to the Loop state_dict and added support for distributed reload (#8641)
Added DeepSpeed collate checkpoint utility function (#8701)
Added a handles_accumulate_grad_batches property to the training type plugins (#8856)
Added a warning to WandbLogger when reusing a wandb run (#8714)
Added log_graph argument for watch method of WandbLogger (#8662)
LightningCLI additions:
  Added LightningCLI(run=False|True) to choose whether to run a Trainer subcommand (#8751)
  Added support to call any trainer function from the LightningCLI via subcommands (#7508)
  Allow easy trainer re-instantiation (#7508)
  Automatically register all optimizers and learning rate schedulers (#9565)
  Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
  Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
  Support passing lists of callbacks via command line (#8815)
  Support shorthand notation to instantiate models (#9588)
  Support shorthand notation to instantiate datamodules (#10011)
  Added multifile option to LightningCLI to enable/disable config saving to preserve multiple files structure (#9073)
Fault-tolerant training:
  Added FastForwardSampler and CaptureIterableDataset injection to data loading utilities (#8366)
  Added DataFetcher to control fetching flow (#8890)
  Added SharedCycleIteratorState to prevent infinite loop (#8889)
  Added CaptureMapDataset for state management in map-style datasets (#8891)
  Added Fault Tolerant Training to DataFetcher (#8891)
  Replaced old prefetch iterator with new DataFetcher in training loop (#8953)
  Added partial support for global random state fault-tolerance in map-style datasets (#8950)
  Converted state to tuple explicitly when setting Python random state (#9401)
  Added support for restarting an optimizer loop (multiple optimizers) (#9537)
  Added support for restarting within Evaluation Loop (#9563)
  Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
  Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
  Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
Checkpoint saving and loading extensibility:
  Added CheckpointIO plugin to expose checkpoint IO from training type plugin (#8743)
  Refactored CheckpointConnector to offload validation logic to the CheckpointIO plugin (#9045)
  Added remove_checkpoint to CheckpointIO plugin by moving the responsibility out of the ModelCheckpoint callback (#9373)
  Added XLACheckpointIO plugin (#9972)
Loop customization:
  Added Closure and AbstractClosure classes (#8642)
  Refactored TrainingBatchLoop and extracted OptimizerLoop, splitting off automatic optimization into its own loop (#9191)
  Removed TrainingBatchLoop.backward(); manual optimization now calls directly into Accelerator.backward() and automatic optimization handles backward in new OptimizerLoop (#9265)
  Extracted ManualOptimization logic from TrainingBatchLoop into its own separate loop class (#9266)
  Marked OptimizerLoop.backward as protected (#9514)
  Marked FitLoop.should_accumulate as protected (#9515)
  Marked several methods in PredictionLoop as protected: on_predict_start, on_predict_epoch_end, on_predict_end, on_predict_model_eval (#9516)
  Marked several methods in EvaluationLoop as protected: get_max_batches, on_evaluation_model_eval, on_evaluation_model_train, on_evaluation_start, on_evaluation_epoch_start, on_evaluation_epoch_end, on_evaluation_end, reload_evaluation_dataloaders (#9516)
  Marked several methods in EvaluationEpochLoop as protected: on_evaluation_batch_start, evaluation_step, evaluation_step_end (#9516)
  Added yielding_training_step example (#9983)
Added support for saving and loading state of multiple callbacks of the same type (#7187)
Added DeepSpeed Stage 1 support (#8974)
Added Python dataclass support for LightningDataModule (#8272)
Added sanitization of tensors when they get logged as hyperparameters in TensorBoardLogger (#9031)
Added InterBatchParallelDataFetcher (#9020)
Added DataLoaderIterDataFetcher (#9020)
Added DataFetcher within Fit / Evaluation Loop (#9047)
Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
Added Rich integration:
Added input validation logic for precision (#9080)
Added support for CPU AMP autocast (#9084)
Added on_exception callback hook (#9183)
Added a warning to DeepSpeed when inferring batch size (#9221)
Added ModelSummary callback (#9344)
Added log_images, log_text and log_table to WandbLogger (#9545)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Added get_device_stats to the Accelerator interface and added its implementation for GPU and TPU (#9586)
Added a warning when an unknown key is encountered in the optimizer configuration, and when OneCycleLR is used with "interval": "epoch" (#9666)
Added DeviceStatsMonitor callback (#9712)
Added enable_progress_bar to the Trainer constructor (#9664)
Added pl_legacy_patch load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166)
Added support for torch.use_deterministic_algorithms (#9121)
Added automatic parameters tying for TPUs (#9525)
Added support for torch.autograd.set_detect_anomaly through Trainer constructor argument detect_anomaly (#9848)
Added enable_model_summary flag to Trainer (#9699)
Added strategy argument to Trainer (#8597)
Added init_meta_context, materialize_module utilities (#9920)
Added TPUPrecisionPlugin (#10020)
Added torch.bfloat16 support:
Added kfold example for loop customization (#9965)
LightningLite:
  Added PrecisionPlugin.forward_context, making it the default implementation for all {train,val,test,predict}_step_context() methods (#9988)
  Added DDPSpawnPlugin.spawn() for spawning new processes of a given function (#10018, #10022)
  Added TrainingTypePlugin.{_setup_model, _setup_optimizer} methods (#9994, #10064)
  Implemented DataParallelPlugin._setup_model (#10010)
  Implemented DeepSpeedPlugin._setup_model_and_optimizers (#10009, #10064)
  Implemented {DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers (#10028, #10064)
  Added optional model argument to the optimizer_step methods in accelerators and plugins (#10023)
  Updated precision attributes in DeepSpeedPlugin (#10164)
  Added the ability to return a result from rank 0 in DDPSpawnPlugin.spawn (#10162)
  Added pytorch_lightning.lite package (#10175)
  Added LightningLite documentation (#10043)
  Added LightningLite examples (#9987)
  Make the _LiteDataLoader an iterator and add support for custom dataloaders (#10279)
Added use_omegaconf argument to save_hparams_to_yaml plugin (#9170)
Added ckpt_path argument for Trainer.fit() (#10061)
Added auto_device_count method to Accelerators (#10222)
Added support for devices="auto" (#10264)
Added a filename argument in ModelCheckpoint.format_checkpoint_name (#9818)
Added support for empty gpus list to run on CPU (#10246)
Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
[1.5.0] - Changed
Trainer now raises a MisconfigurationException when its methods are called with ckpt_path="best" but a checkpoint callback isn’t configured (#9841)
Setting Trainer(accelerator="ddp_cpu") now does not spawn a subprocess if num_processes is kept 1 along with num_nodes > 1 (#9603)
Module imports are now catching ModuleNotFoundError instead of ImportError (#9867)
pytorch_lightning.loggers.neptune.NeptuneLogger is now consistent with the new neptune-client API; the old neptune-client API is supported by NeptuneClient from the neptune-contrib repo (#6867)
Parsing of enums type hyperparameters to be saved in the hparams.yaml file by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170)
Parsing of the gpus Trainer argument has changed: gpus="n" (str) no longer selects the GPU index n and instead selects the first n devices (#8770)
iteration_count and other index attributes in the loops have been replaced with progress dataclasses (#8477)
The trainer.lightning_module reference is now properly set at the very beginning of a run (#8536)
The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
The model argument of the Trainer functions reset_{train,val,test,predict}_dataloader, reset_train_val_dataloaders, and request_dataloader is now optional (#8536)
Saved checkpoints will no longer use the type of a Callback as the key to avoid issues with unpickling (#6886)
Improved string conversion for ResultCollection (#8622)
LightningCLI changes:
  LightningCLI.init_parser now returns the parser instance (#8721)
  LightningCLI.add_core_arguments_to_parser, LightningCLI.parse_arguments now take a parser argument (#8721)
  LightningCLI.instantiate_trainer now takes a config and a list of callbacks (#8721)
  Split LightningCLI.add_core_arguments_to_parser into LightningCLI.add_default_arguments_to_parser + LightningCLI.add_core_arguments_to_parser (#8721)
The accelerator and training type plugin setup hooks no longer have a model argument (#8536)
The accelerator and training type plugin update_global_step hook has been removed (#8856)
The coverage of self.log-ing in any LightningModule or Callback hook has been improved (#8498)
self.log-ing without a Trainer reference now raises a warning instead of an exception (#9733)
Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
Trainer.request_dataloader now takes a RunningStage enum instance (#8858)
Changed rank_zero_warn to NotImplementedError in the {train, val, test, predict}_dataloader hooks that Lightning(Data)Module uses (#9161)
Moved block_ddp_sync_behaviour out of TrainingBatchLoop to loop utilities (#9192)
Executing the optimizer_closure is now required when overriding the optimizer_step hook (#9360)
Changed logging of LightningModule and LightningDataModule hyperparameters to raise an exception only if there are colliding keys with different values (#9496)
seed_everything now fails when an invalid seed value is passed instead of selecting a random seed (#8787)
The Trainer now calls TrainingTypePlugin collective APIs directly instead of going through the Accelerator reference (#9677, #9901)
The tuner now uses a unique filename to save a temporary checkpoint (#9682)
Changed HorovodPlugin.all_gather to return a torch.Tensor instead of a list (#9696)
Changed Trainer connectors to be protected attributes:
  Configuration Validator (#9779)
The current_epoch and global_step attributes now get restored irrespective of the Trainer task (#9413)
Trainer now raises an exception when requesting amp_level with native amp_backend (#9755)
Update the logic to check for accumulation steps with deepspeed (#9826)
pytorch_lightning.utilities.grads.grad_norm now raises an exception if parameter norm_type <= 0 (#9765)
Updated error message for interactive incompatible plugins (#9896)
Moved the optimizer_step and clip_gradients hook from the Accelerator and TrainingTypePlugin into the PrecisionPlugin (#10143, #10029)
NativeMixedPrecisionPlugin and its subclasses now take an optional GradScaler instance (#10055)
Trainer is now raising a MisconfigurationException instead of a warning if Trainer.{validate/test} is missing required methods (#10016)
Changed default value of the max_steps Trainer argument from None to -1 (#9460)
LightningModule now raises an error when calling log(on_step=False, on_epoch=False) (#10227)
Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
Raise a MisconfigurationException when the total length of dataloader across ranks is zero, and give a warning when the total length is non-zero but only the local rank length is zero (#9827)
Changed the model size calculation using ByteCounter (#10123)
Enabled on_load_checkpoint for LightningDataModule for all trainer_fn (#10238)
Allowed separate config files for parameters with class type when LightningCLI is in subclass_mode=False (#10286)
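The gpus parsing change listed above (#8770) can be pictured with a small stand-alone sketch. This is not Lightning's parser: parse_gpus_sketch and its available parameter are hypothetical, and the handling of comma-separated strings and of -1 is an assumption made for illustration. Only the rule that a bare "n" string selects the first n devices comes from the entry itself.

```python
from typing import List, Union


def parse_gpus_sketch(gpus: Union[int, str, List[int]], available: int) -> List[int]:
    """Hypothetical parser: turn a gpus value into a list of device indices."""
    if isinstance(gpus, list):
        # An explicit list of indices is used unchanged.
        return list(gpus)
    if isinstance(gpus, str):
        text = gpus.strip()
        if "," in text:
            # Assumption: a comma-separated string still names explicit indices,
            # so "3," means device index 3.
            return [int(part) for part in text.split(",") if part.strip()]
        # A bare "n" is treated like the int n: the first n devices.
        gpus = int(text)
    if gpus == -1:
        # Assumption: -1 means "all available devices".
        return list(range(available))
    if gpus < 0 or gpus > available:
        raise ValueError(f"requested {gpus} devices, but only {available} are available")
    return list(range(gpus))
```

Under these assumptions, parse_gpus_sketch("3", 8) and parse_gpus_sketch(3, 8) give the same result, while an index is selected only through a list or a comma-separated string.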
[1.5.0] - Deprecated
Deprecated Trainer argument terminate_on_nan in favor of detect_anomaly (#9175)
Deprecated Trainer.terminate_on_nan public attribute access (#9849)
Deprecated LightningModule.summarize() in favor of pytorch_lightning.utilities.model_summary.summarize() (#8513)
Deprecated LightningModule.model_size (#8343)
Deprecated DataModule properties: train_transforms, val_transforms, test_transforms, size, dims (#8851)
Deprecated add_to_queue, get_from_queue from LightningModule in favor of corresponding methods in the DDPSpawnPlugin (#9118)
Deprecated LightningModule.get_progress_bar_dict and Trainer.progress_bar_dict in favor of pytorch_lightning.callbacks.progress.base.get_standard_metrics and ProgressBarBase.get_metrics (#8985)
Deprecated prepare_data_per_node flag on Trainer and set it as a property of DataHooks, accessible in the LightningModule and LightningDataModule (#8958)
Deprecated the TestTubeLogger (#9065)
Deprecated on_{train/val/test/predict}_dataloader() from LightningModule and LightningDataModule (#9098)
Deprecated on_keyboard_interrupt callback hook in favor of new on_exception hook (#9260)
Deprecated passing process_position to the Trainer constructor in favor of adding the ProgressBar callback with process_position directly to the list of callbacks (#9222)
Deprecated passing flush_logs_every_n_steps as a Trainer argument, instead pass it to the logger init if supported (#9366)
Deprecated LightningLoggerBase.close, LoggerCollection.close in favor of LightningLoggerBase.finalize, LoggerCollection.finalize (#9422)
Deprecated passing progress_bar_refresh_rate to the Trainer constructor in favor of adding the ProgressBar callback with refresh_rate directly to the list of callbacks, or passing enable_progress_bar=False to disable the progress bar (#9616)
Deprecated LightningDistributed and moved the broadcast logic to DDPPlugin and DDPSpawnPlugin directly (#9691)
Deprecated passing stochastic_weight_avg to the Trainer constructor in favor of adding the StochasticWeightAveraging callback directly to the list of callbacks (#8989)
Deprecated Accelerator collective API barrier, broadcast, and all_gather in favor of calling the TrainingTypePlugin collective API directly (#9677)
Deprecated checkpoint_callback from the Trainer constructor in favor of enable_checkpointing (#9754)
Deprecated the LightningModule.on_post_move_to_device method (#9525)
Deprecated pytorch_lightning.core.decorators.parameter_validation in favor of pytorch_lightning.utilities.parameter_tying.set_shared_parameters (#9525)
Deprecated passing weights_summary to the Trainer constructor in favor of adding the ModelSummary callback with max_depth directly to the list of callbacks (#9699)
Deprecated log_gpu_memory, gpu_metrics, and util funcs in favor of DeviceStatsMonitor callback (#9921)
Deprecated GPUStatsMonitor and XLAStatsMonitor in favor of DeviceStatsMonitor callback (#9924)
Deprecated setting Trainer(max_steps=None); to turn off the limit, set Trainer(max_steps=-1) (default) (#9460)
Deprecated access to the AcceleratorConnector.is_slurm_managing_tasks attribute and marked it as protected (#10101)
Deprecated access to the AcceleratorConnector.configure_slurm_ddp method and marked it as protected (#10101)
Deprecated passing resume_from_checkpoint to the Trainer constructor in favor of trainer.fit(ckpt_path=) (#10061)
Deprecated ClusterEnvironment.creates_children() in favor of ClusterEnvironment.creates_processes_externally (property) (#10106)
Deprecated PrecisionPlugin.master_params() in favor of PrecisionPlugin.main_params() (#10105)
Deprecated lr_sch_names from LearningRateMonitor (#10066)
Deprecated ProgressBar callback in favor of TQDMProgressBar (#10134)
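When migrating code across these deprecations, a small lookup can flag Trainer keyword arguments that the entries above mark as deprecated. The sketch below is a hypothetical lint helper, not part of Lightning. The mapping covers only a handful of the entries, and the replacement hints are copied from the changelog text rather than checked against the library.

```python
from typing import Any, Dict

# Replacement hints copied from the 1.5.0 deprecation entries above.
# This table is illustrative and deliberately incomplete.
DEPRECATED_TRAINER_ARGS: Dict[str, str] = {
    "terminate_on_nan": "detect_anomaly",
    "checkpoint_callback": "enable_checkpointing",
    "resume_from_checkpoint": "trainer.fit(ckpt_path=...)",
    "weights_summary": "the ModelSummary callback with max_depth",
    "stochastic_weight_avg": "the StochasticWeightAveraging callback",
}


def find_deprecated_trainer_args(kwargs: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: map each deprecated kwarg found to its suggested replacement."""
    return {
        name: DEPRECATED_TRAINER_ARGS[name]
        for name in kwargs
        if name in DEPRECATED_TRAINER_ARGS
    }
```

For example, find_deprecated_trainer_args({"max_epochs": 3, "checkpoint_callback": False}) reports only checkpoint_callback, with enable_checkpointing as the suggested replacement.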
[1.5.0] - Removed
Removed deprecated metrics (#8586)
Removed the deprecated outputs argument in both the LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#8587)
Removed the deprecated TrainerLoggingMixin class (#8609)
Removed the deprecated TrainerTrainingTricksMixin class (#8679)
Removed the deprecated optimizer_idx from training_step as an accepted argument in manual optimization (#8576)
Removed support for the deprecated on_save_checkpoint signature. The hook now takes a checkpoint positional parameter (#8697)
Removed support for the deprecated on_load_checkpoint signature. The hook now takes a pl_module positional parameter (#8697)
Removed the deprecated save_function property in ModelCheckpoint (#8680)
Removed the deprecated model argument from ModelCheckpoint.save_checkpoint (#8688)
Removed the deprecated sync_step argument from WandbLogger (#8763)
Removed the deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#8826)
Removed LightningModule.write_predictions and LightningModule.write_predictions_dict (#8850)
Removed on_reset_*_dataloader hooks in TrainingType Plugins and Accelerators (#8858)
Removed deprecated GradInformation module in favor of pytorch_lightning.utilities.grads (#8831)
Removed TrainingTypePlugin.on_save and Accelerator.on_save (#9023)
Removed {Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step (#9746)
Removed deprecated connect_precision_plugin and connect_training_type_plugin from Accelerator (#9019)
Removed on_train_epoch_end from Accelerator (#9035)
Removed InterBatchProcessor in favor of DataLoaderIterDataFetcher (#9052)
Removed Plugin in base_plugin.py in favor of accessing TrainingTypePlugin and PrecisionPlugin directly (#9066)
Removed teardown from ParallelPlugin (#8943)
Removed deprecated profiled_functions argument from PyTorchProfiler (#9178)
Removed deprecated pytorch_lightning.utilities.argparse_utils module (#9166)
Removed deprecated property Trainer.running_sanity_check in favor of Trainer.sanity_checking (#9209)
Removed the deprecated output_filename argument from BaseProfiler and its descendants in favor of dirpath and filename (#9214)
Removed deprecated property ModelCheckpoint.period in favor of ModelCheckpoint.every_n_epochs (#9213)
Removed deprecated auto_move_data decorator (#9231)
Removed deprecated property LightningModule.datamodule in favor of Trainer.datamodule (#9233)
Removed deprecated properties DeepSpeedPlugin.cpu_offload* in favor of offload_optimizer, offload_parameters and pin_memory (#9244)
Removed deprecated property AcceleratorConnector.is_using_torchelastic in favor of TorchElasticEnvironment.is_using_torchelastic() (#9729)
Removed pytorch_lightning.utilities.debugging.InternalDebugger (#9680)
Removed call_configure_sharded_model_hook property from Accelerator and TrainingTypePlugin (#9612)
Removed TrainerProperties mixin and moved property definitions directly into Trainer (#9495)
Removed a redundant warning with ModelCheckpoint(monitor=None) callback (#9875)
Removed epoch from trainer.logged_metrics (#9904)
Removed deprecated distributed_backend from Trainer (#10017)
Removed process_idx from the {DDPSpawnPlugin,TPUSpawnPlugin}.new_process methods (#10022)
Removed automatic patching of {train,val,test,predict}_dataloader() on the LightningModule (#9764)
Removed pytorch_lightning.trainer.connectors.OptimizerConnector (#10120)
[1.5.0] - Fixed
Fixed ImageNet evaluation in example (#10179)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
Fixed move_metrics_to_cpu moving the loss to CPU while training on device (#9308)
Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
Fixed an issue with freeing memory of datafetchers during teardown (#9387)
Fixed a bug where the training step output needed to be deepcopy-ed (#9349)
Fixed an issue with freeing memory allocated by the data iterators in Loop.on_run_end (#9386, #9915)
Fixed BasePredictionWriter not returning the batch indices in a non-distributed setting (#9432)
Fixed an error when running in XLA environments with no TPU attached (#9572)
Fixed check on torchmetrics logged whose compute() output is a multielement tensor (#9582)
Fixed gradient accumulation for DDPShardedPlugin (#9122)
Fixed missing DeepSpeed distributed call (#9540)
Fixed an issue with wrapped LightningModule during evaluation; The LightningModule no longer gets wrapped with data-parallel modules when not fitting in
DDPPlugin
,DDPSpawnPlugin
,DDPShardedPlugin
,DDPSpawnShardedPlugin
(#9096)Fixed
trainer.accumulate_grad_batches
to be an int on init. The default value for it is nowNone
inside Trainer (#9652)Fixed
broadcast
inDDPPlugin
andDDPSpawnPlugin
to respect thesrc
input (#9691)Fixed
self.log(on_epoch=True, reduce_fx=sum))
for theon_batch_start
andon_train_batch_start
hooks (#9791)Fixed
self.log(on_epoch=True)
for theon_batch_start
andon_train_batch_start
hooks (#9780)Fixed restoring training state during
Trainer.fit
only (#9413)Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
Fixed DeepSpeed GPU device IDs (#9847)
Reset
val_dataloader
intuner/batch_size_scaling
(#9857)Fixed use of
LightningCLI
in computer_vision_fine_tuning.py example (#9934)Fixed issue with non-init dataclass fields in
apply_to_collection
(#9963)Reset
val_dataloader
intuner/batch_size_scaling
for binsearch (#9975)Fixed logic to check for spawn in dataloader
TrainerDataLoadingMixin._worker_check
(#9902)Fixed
train_dataloader
getting loaded twice when resuming from a checkpoint duringTrainer.fit()
(#9671)Fixed
LearningRateMonitor
logging with multiple param groups optimizer with no scheduler (#10044)Fixed undesired side effects being caused by
Trainer
patching dataloader methods on theLightningModule
(#9764)Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
Fixed
on_before_optimizer_step
getting called before the optimizer closure (including backward) has run (#10167)Fixed monitor value in
ModelCheckpoint
getting moved to the wrong device in a special case where it becomes NaN (#10118)Fixed creation of
dirpath
inBaseProfiler
if it doesn’t exist (#10073)Fixed incorrect handling of sigterm (#10189)
Fixed bug where
log(on_step=True, on_epoch=True, sync_dist=True)
wouldn’t reduce the value on step (#10227)Fixed an issue with
pl.utilities.seed.reset_seed
converting thePL_SEED_WORKERS
environment variable tobool
(#10099)Fixed iterating over a logger collection when
fast_dev_run > 0
(#10232)Fixed
batch_size
inResultCollection
not being reset to 1 on epoch end (#10242)Fixed
distrib_type
not being set when training plugin instances are being passed to the Trainer (#10251)
[1.4.9] - 2021-09-30
[1.4.8] - 2021-09-22
Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Fixed add_argparse_args raising TypeError when args are typed as typing.Generic in Python 3.6 (#9554)
Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)
[1.4.7] - 2021-09-14
[1.4.6] - 2021-09-07
Fixed an issue with export to ONNX format when a model has multiple inputs (#8800)
Removed deprecation warnings being called for on_{task}_dataloader (#9279)
Fixed save/load/resume from checkpoint for DeepSpeed Plugin (#8397, #8644, #8627)
Fixed EarlyStopping running on train epoch end when check_val_every_n_epoch>1 is set (#9156)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8333)
Fixed the Apex and DeepSpeed plugin closure running after the on_before_optimizer_step hook (#9288)
Fixed the Native AMP plugin closure not running with manual optimization (#9288)
Fixed bug where data-loading functions were not getting the correct running stage passed (#8858)
Fixed intra-epoch evaluation outputs staying in memory when the respective *_epoch_end hook wasn’t overridden (#9261)
Fixed error handling in DDP process reconciliation when _sync_dir was not initialized (#9267)
Fixed PyTorch Profiler not enabled for manual optimization (#9316)
Fixed inspection of other args when a container is specified in save_hyperparameters (#9125)
Fixed signature of Timer.on_train_epoch_end and StochasticWeightAveraging.on_train_epoch_end to prevent unwanted deprecation warnings (#9347)
[1.4.5] - 2021-08-31
Fixed reduction using self.log(sync_dist=True, reduce_fx={mean,max}) (#9142)
Fixed not setting a default value for max_epochs if max_time was specified on the Trainer constructor (#9072)
Fixed the CometLogger so that it no longer modifies the metrics in place; it now creates a copy of the metrics before performing any operations (#9150)
Fixed DDP “CUDA error: initialization error” due to a copy instead of deepcopy on ResultCollection (#9239)
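The last entry above attributes the error to a copy being used where a deepcopy was needed. The following sketch shows only the general difference between the two in the Python standard library. The Results class is a hypothetical stand-in, not the library's ResultCollection, and the sketch does not reproduce the CUDA error itself.

```python
import copy


class Results:
    """Hypothetical stand-in container; not the library's ResultCollection."""

    def __init__(self):
        self.metrics = {"loss": [0.5]}


original = Results()
shallow = copy.copy(original)    # new outer object, nested dict is shared
deep = copy.deepcopy(original)   # nested dict and list are duplicated

# Mutating the original is visible through the shallow copy only.
original.metrics["loss"].append(0.4)

shared = shallow.metrics is original.metrics
independent = deep.metrics is not original.metrics
```

With a shallow copy, any state held in nested objects stays shared between the original and the copy, whereas a deep copy is fully independent.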
[1.4.4] - 2021-08-24
[1.4.3] - 2021-08-17
Fixed plateau scheduler stepping on incomplete epoch (#8861)
Fixed infinite loop with CycleIterator and multiple loaders (#8889)
Fixed StochasticWeightAveraging with a list of learning rates not applying them to each param group (#8747)
Restore original loaders if replaced by entrypoint (#8885)
Fixed lost reference to _Metadata object in ResultMetricCollection (#8932)
Ensure the existence of DDPPlugin._sync_dir in reconciliate_processes (#8939)
[1.4.2] - 2021-08-10
Fixed recursive call for apply_to_collection(include_none=False) (#8719)
Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer (#8804)
Fixed comments and exception message for metrics_to_scalars (#8782)
Fixed typo error in LightningLoggerBase.after_save_checkpoint docstring (#8737)
[1.4.1] - 2021-08-03
Fixed trainer.fit_loop.split_idx always returning None (#8601)
Fixed references for ResultCollection.extra (#8622)
Fixed reference issues during epoch end result collection (#8621)
Fixed horovod auto-detection when horovod is not installed and the launcher is mpirun (#8610)
Fixed an issue with training_step outputs not getting collected correctly for training_epoch_end (#8613)
Fixed distributed types support for CPUs (#8667)
Fixed a deadlock issue with DDP and torchelastic (#8655)
Fixed accelerator=ddp choice for CPU (#8645)
[1.4.0] - 2021-07-27
[1.4.0] - Added
Added extract_batch_size utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
Added support for named parameter groups in LearningRateMonitor (#7987)
Added dataclass support for pytorch_lightning.utilities.apply_to_collection (#7935)
Added support to LightningModule.to_torchscript for saving to custom filesystems with fsspec (#7617)
Added KubeflowEnvironment for use with the PyTorchJob operator in Kubeflow
Added LightningCLI support for config files on object stores (#7521)
Added ModelPruning(prune_on_train_epoch_end=True|False) to choose when to apply pruning (#7704)
Added support for checkpointing based on a provided time interval during training (#7515)
Progress tracking
Added support for passing a LightningDataModule positionally as the second argument to trainer.{validate,test,predict} (#7431)
Added argument trainer.predict(ckpt_path) (#7430)
Added clip_grad_by_value support for TPUs (#7025)
Added support for passing any class to is_overridden (#7918)
Added sub_dir parameter to TensorBoardLogger (#6195)
Added correct dataloader_idx to batch transfer hooks (#6241)
Added include_none=bool argument to apply_to_collection (#7769)
Added apply_to_collections to apply a function to two zipped collections (#7769)
Added ddp_fully_sharded support (#7487)
Added should_rank_save_checkpoint property to Training Plugins (#7684)
Added log_grad_norm hook to LightningModule to customize the logging of gradient norms (#7873)
Added save_config_filename init argument to LightningCLI to ease resolving name conflicts (#7741)
Added save_config_overwrite init argument to LightningCLI to ease overwriting existing config files (#8059)
Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
Added trainer stage hooks for Training Plugins and Accelerators (#7864)
Added the on_before_optimizer_step hook (#8048)
Added IPU Accelerator (#7867)
Fault-tolerant training
    Added {,load_}state_dict to ResultCollection (#7948)
    Added {,load_}state_dict to Loops (#8197)
    Added FastForwardSampler and CaptureIterableDataset (#8307)
    Set Loop.restarting=False at the end of the first iteration (#8362)
    Save the loops state with the checkpoint (opt-in) (#8362)
    Save a checkpoint to restore the state on exception (opt-in) (#8362)
    Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
Added rank_zero_only to LightningModule.log function (#7966)
Added metric_attribute to LightningModule.log function (#7966)
Added a warning if Trainer(log_every_n_steps) is a value too high for the training dataloader (#7734)
Added LightningCLI support for argument links applied on instantiation (#7895)
Added LightningCLI support for configurable callbacks that should always be present (#7964)
Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
Added support for torch.nn.UninitializedParameter in ModelSummary (#7642)
Added support for LightningModule.save_hyperparameters when LightningModule is a dataclass (#7992)
Added support for overriding optimizer_zero_grad and optimizer_step when using accumulate_grad_batches (#7980)
Added logger boolean flag to save_hyperparameters (#7960)
Added support for calling scripts using the module syntax (python -m package.script) (#8073)
Added support for optimizers and learning rate schedulers to LightningCLI (#8093)
Added XLA Profiler (#8014)
Added PrecisionPlugin.{pre,post}_backward (#8328)
Added on_load_checkpoint and on_save_checkpoint hooks to the PrecisionPlugin base class (#7831)
Added max_depth parameter in ModelSummary (#8062)
Added XLAStatsMonitor callback (#8235)
Added restore function and restarting attribute to base Loop (#8247)
Added support for save_hyperparameters in LightningDataModule (#3792)
Added the ModelCheckpoint(save_on_train_epoch_end) to choose when to run the saving logic (#8389)
Added LSFEnvironment for distributed training with the LSF resource manager jsrun (#5102)
Added support for accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
Added tpu_spawn_debug to plugin registry (#7933)
Enabled traditional/manual launching of DDP processes through LOCAL_RANK and NODE_RANK environment variable assignments (#7480)
Added quantize_on_fit_end argument to QuantizationAwareTraining (#8464)
Added experimental support for loop specialization (#8226)
Added support for devices flag to Trainer (#8440)
Added private prevent_trainer_and_dataloaders_deepcopy context manager on the LightningModule (#8472)
Added support for providing callables to the Lightning CLI instead of types (#8400)
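To illustrate the include_none argument of apply_to_collection listed in this section, here is a simplified sketch in plain Python. It is an assumption-laden stand-in, not the library's implementation: the name apply_sketch is hypothetical, only dicts and list-like sequences are handled, and the real utility supports more collection types and arguments.

```python
from collections.abc import Mapping, Sequence
from typing import Any, Callable


def apply_sketch(data: Any, dtype: type, function: Callable, include_none: bool = True) -> Any:
    """Recursively apply `function` to every element of type `dtype`.

    Simplified illustration only; sequences are returned as lists.
    """
    if isinstance(data, dtype):
        return function(data)
    if isinstance(data, Mapping):
        items = ((k, apply_sketch(v, dtype, function, include_none)) for k, v in data.items())
        # Drop entries whose result is None unless include_none is set.
        return {k: v for k, v in items if include_none or v is not None}
    if isinstance(data, Sequence) and not isinstance(data, str):
        values = (apply_sketch(v, dtype, function, include_none) for v in data)
        return [v for v in values if include_none or v is not None]
    return data  # other types are returned unchanged


def double_if_positive(x: int):
    """Return twice the input for positive values, otherwise None."""
    return x * 2 if x > 0 else None


batch = {"a": 1, "b": [2, -3], "c": "text"}
kept = apply_sketch(batch, int, double_if_positive)
dropped = apply_sketch(batch, int, double_if_positive, include_none=False)
```

With include_none=True the None produced for -3 stays in the nested list; with include_none=False it is filtered out.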
[1.4.0] - Changed
Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
Changed the Trainer’s checkpoint_callback argument to allow only boolean values (#7539)
Log epoch metrics before the on_evaluation_end hook (#7272)
Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
Changed metrics_to_scalars to work with any collection or value (#7888)
Changed clip_grad_norm to use torch.nn.utils.clip_grad_norm_ (#7025)
Validation is now always run inside the training epoch scope (#7357)
ModelCheckpoint now runs at the end of the training epoch by default (#8389)
EarlyStopping now runs at the end of the training epoch by default (#8286)
Refactored Loops
    Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
    Refactored result handling in training loop (#7506)
    Moved attributes hiddens and split_idx to TrainLoop (#7507)
    Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
    Simplified “should run validation” logic (#7682)
    Simplified logic for updating the learning rate for schedulers (#7682)
    Removed the on_epoch guard from the “should stop” validation check (#7701)
    Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
    Removed pytorch_lightning/trainer/training_loop.py (#7985)
    Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
    Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
    Restricted public access to several internal functions (#8024)
    Refactored trainer _run_* functions and separate evaluation loops (#8065)
    Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
    Removed pytorch_lightning/trainer/predict_loop.py (#8094)
    Moved result teardown to the loops (#8245)
    Improve Loop API to better handle children state_dict and progress (#8334)
Refactored logging
    Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
    Dramatically simplify the LoggerConnector (#7882)
    trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
    Completely overhaul the Result object in favor of ResultMetric (#7882)
    Improve epoch-level reduction time and overall memory usage (#7882)
    Allow passing self.log(batch_size=...) (#7891)
    Each of the training loops now keeps its own results collection (#7891)
    Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
    Remove MetricsHolder (#7909)
Moved ignore_scalar_return_in_dp warning suppression to the DataParallelPlugin class (#7421)
Changed the behaviour when logging evaluation step metrics to no longer append /epoch_* to the metric name (#7351)
Raised ValueError when a None value is self.log-ed (#7771)
Changed resolve_training_type_plugins to allow setting num_nodes and sync_batchnorm from Trainer setting (#7026)
Default seed_everything(workers=True) in the LightningCLI (#7504)
Changed model.state_dict() in CheckpointConnector to allow training_type_plugin to customize the model’s state_dict() (#7474)
MLFlowLogger now uses the env variable MLFLOW_TRACKING_URI as default tracking URI (#7457)
Changed Trainer arg and functionality from reload_dataloaders_every_epoch to reload_dataloaders_every_n_epochs (#5043)
Changed WandbLogger(log_model={True/'all'}) to log models as artifacts (#6231)
MLFlowLogger now accepts run_name as a constructor argument (#7622)
Changed teardown() in Accelerator to allow training_type_plugin to customize teardown logic (#7579)
Trainer.fit now raises an error when using manual optimization with unsupported features such as gradient_clip_val or accumulate_grad_batches (#7788)
Accelerator hooks are called regardless of whether LightningModule overrides the same hooks (#7826)
Moved profilers to their own file (#7822)
The on_after_backward hook is now called on accumulating iterations. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
The mixed precision loss is no longer unscaled before the on_after_backward hook. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
The TrainingTypePlugin.{pre,post}_backward hooks no longer take the optimizer, opt_idx, should_accumulate arguments (#8328)
The PrecisionPlugin.backward hook no longer returns a value (#8328)
The PrecisionPlugin.backward hook no longer takes a should_accumulate argument (#8328)
Added the on_before_backward hook (#7865)
LightningCLI now aborts with a clearer message if config already exists and disables save config during fast_dev_run (#7963)
Saved the LightningCLI config on setup and only on the main process (#8017)
Dropped the LightningCLI ArgumentParser when pickling (#8017)
Skip broadcast if distributed not initialized for the spawn plugins (#8017)
Trainer(resume_from_checkpoint=...) now restores the model directly after LightningModule.setup(), which is before LightningModule.configure_sharded_model() (#7652)
Moved torch.cuda.set_device() to enable collective calls earlier in setup (#8312)
Used XLA utility API to move data to CPU (Single TPU core) (#8078)
Improved error messages in replace_sampler when the DataLoader attributes are not included in the signature or the signature is missing optional arguments (#8519)
Moved DeviceDtypeModuleMixin and HyperparametersMixin mixin to core (#8396)
Return the default_root_dir as the log_dir when the logger is a LoggerCollection (#8187)
[1.4.0] - Deprecated
Deprecated LightningModule.loaded_optimizer_states_dict (#8229)
Standardized the dataloaders arguments of trainer.{fit,validate,test,tune} (#7431)
Deprecated DataModule properties: has_prepared_data, has_setup_fit, has_setup_validate, has_setup_test, has_setup_predict, has_teardown_fit, has_teardown_validate, has_teardown_test, has_teardown_predict (#7657)
Deprecated TrainerModelHooksMixin in favor of pytorch_lightning.utilities.signature_utils (#7422)
Deprecated num_nodes and sync_batchnorm arguments in DDPPlugin and DDPSpawnPlugin (#7026)
Deprecated self.log(sync_dist_op) in favor of self.log(reduce_fx) (#7891)
Deprecated is_overridden(model=...) in favor of is_overridden(instance=...) (#7918)
Deprecated automatically detaching returned extras with grads (#7994)
Deprecated default value of monitor argument in EarlyStopping callback to enforce monitor as a required argument (#7907)
Deprecated importing rank_zero_{warn,deprecation} directly from pytorch_lightning.utilities.distributed (#8085)
Deprecated the use of CheckpointConnector.hpc_load() in favor of CheckpointConnector.restore() (#7652)
Deprecated ModelCheckpoint(every_n_val_epochs) in favor of ModelCheckpoint(every_n_epochs) (#8383)
Deprecated DDPPlugin.task_idx in favor of DDPPlugin.local_rank (#8203)
Deprecated the Trainer.train_loop property in favor of Trainer.fit_loop (#8025)
Deprecated the Trainer.disable_validation property in favor of not Trainer.enable_validation (#8291)
Deprecated mode parameter in ModelSummary in favor of max_depth (#8062)
Deprecated reload_dataloaders_every_epoch argument of Trainer in favor of reload_dataloaders_every_n_epochs (#5043)
Deprecated distributed_backend argument for Trainer (#8575)
[1.4.0] - Removed
Dropped official support/testing for PyTorch <1.6 (#8288)
Removed ProfilerConnector (#7654)
Pruned deprecated classification metrics from pytorch_lightning.metrics.functional.classification (#7499)
Removed deprecated data parallel classes LightningDataParallel and LightningDistributedDataParallel from pytorch_lightning.overrides.data_parallel (#7510)
Removed deprecated trainer attributes: get_model and accelerator_backend (#7502)
Removed support for automatically monitoring the val_loss key with ModelCheckpoint. Pass your monitor of choice to the ModelCheckpoint instance instead (#8293)
Removed support for self.log(tbptt_reduce_fx) and self.log(tbptt_pad_token). Please open a discussion explaining your use case if you relied on these (#7644)
Removed deprecated utils modules model_utils, warning_utils, xla_device_utils and partially argparse_utils (#7503)
Removed RPCPlugin and RPCSequentialPlugin. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
Removed deprecated trainer attributes: on_cpu, on_tpu, use_tpu, on_gpu, use_dp, use_ddp, use_ddp2, use_horovod, use_single_gpu (#7501)
Removed deprecated optimizer argument in LightningModule.manual_backward(). Toggling optimizers in manual optimization should be done using LightningModule.{un}toggle_optimizer() (#8287)
Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
Removed environment variable PL_EXP_VERSION from DDP subprocesses (#7403)
[1.4.0] - Fixed
Fixed the GPUStatsMonitor callbacks to use the correct GPU IDs if CUDA_VISIBLE_DEVICES is set (#8260)
Fixed lr_scheduler checkpointed state by calling update_lr_schedulers before saving checkpoints (#7877)
Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
Fixed None loss keys getting added in training_epoch_end when using manual optimization and not returning a loss (#7772)
Fixed a bug where precision=64 with accelerator='ddp_spawn' would throw a pickle error (#6924)
Do not override the existing epoch value in logged_metrics when already logged by the user (#7982)
Support for manual optimization with DeepSpeed (#7970)
Fixed dataloader_idx argument value when predicting with only one DataLoader (#7941)
Fixed passing the stage argument of Callback.{setup,teardown} as a keyword (#7973)
Fixed metrics generated during validation sanity checking so that they are cleaned on end (#8171)
Fixed log_gpu_memory metrics not being added to logging when nothing else is logged (#8174)
Fixed a bug where calling log with a Metric instance would raise an error if it was a nested attribute of the model (#8181)
Fixed a bug where using precision=64 would cause buffers with complex dtype to be cast to real (#8208)
Fixed is_overridden returning true for wrapped functions with no changes (#8296)
Fixed a bug where truncated_bptt_steps would throw an AttributeError when the target RNN has multiple hidden states (#8145)
Fixed self.optimizers() not returning a single optimizer if it had been wrapped (#8326)
Fixed the on_after_backward hook not getting called when using manual optimization and no plugins (#8328)
Fixed the LightningModule.backward hook only getting called with the apex plugin when using manual optimization (#8328)
Fixed moving batch to device before sending it to the on_*_batch_start/on_*_batch_end callbacks and model hooks (#7378)
Fixed passing a custom DDPPlugin when choosing accelerator="ddp_cpu" for the accelerator (#6208)
Fixed missing call to LightningModule.untoggle_optimizer in training loop when running gradient accumulation with multiple optimizers (#8284)
Fixed hash of LightningEnum to work with value instead of name (#8421)
Fixed a bug where an extra checkpoint was saved at the end of training if the val_check_interval did not align with the number of training batches (#7724)
Fixed move_data_to_device to return the batch if the object’s to() function didn’t return self (#8433)
Fixed progress bar updates for Pod Training (#8258)
Fixed clearing dataloader references before attaching new dataloaders in consecutive Trainer.{fit,validate,test,predict} runs (#8442)
Fixed memory leaks on GPU by moving optimizer_states, ResultCollection.extra, ResultMetric attributes, and LoggerConnector metrics to cpu. Also, delete the DDP wrapper on teardown (#8490)
Fixed SWA callback using LightningModule prevent_trainer_and_dataloaders_deepcopy to avoid OOM (#8472)
Fixed ModelPruning callback on_save_checkpoint to avoid making a deepcopy potentially leading to OOM (#8472)
Fixed the sampler replacement logic for DataLoaders which do not define all DataLoader attributes as __init__ parameters (#8519)
Fixed DeepSpeed Windows support (#8488)
Fixed DeepSpeed not properly setting the trainer lr_schedulers attribute (#8527)
Fixed experiment version and log-dir divergence in DDP when using multiple Trainer instances in sequence (#7403)
Enabled manual optimization for TPUs (#8458)
Fixed accumulate_grad_batches not being recomputed during model reload (#5334)
Fixed a TypeError when wrapping optimizers in the HorovodPlugin and running Trainer.test (#7840)
Fixed BackboneFinetuning restoration (#8501)
Fixed lr_scheduler with metric (e.g. torch.optim.lr_scheduler.ReduceLROnPlateau) when using automatic_optimization = False (#7643)
Fixed DeepSpeed breaking with no schedulers (#8580)
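The LightningEnum entry in this section says the hash now works with the value instead of the name. The sketch below shows why that matters for a string enum with case-insensitive equality, using only the standard library. The classes StrEnumSketch and Backend are hypothetical stand-ins, and the equality rule is an assumption made for illustration rather than a statement of the library's exact behaviour.

```python
from enum import Enum


class StrEnumSketch(str, Enum):
    """Hypothetical string enum that compares case-insensitively by value."""

    def __eq__(self, other: object) -> bool:
        other_value = other.value if isinstance(other, Enum) else str(other)
        return self.value.lower() == other_value.lower()

    def __hash__(self) -> int:
        # Hash the value (not the member name) so that objects which
        # compare equal also hash equally, as dict lookups require.
        return hash(self.value.lower())


class Backend(StrEnumSketch):
    DDP_SPAWN = "ddp_spawn"


lookup = {Backend.DDP_SPAWN: "spawn-based DDP"}
found = lookup["ddp_spawn"]  # works because both keys share one hash
```

If the hash were derived from the member name ("DDP_SPAWN") while equality used the value, the plain-string lookup would typically miss even though the two keys compare equal.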
[1.3.8] - 2021-07-01
[1.3.8] - Fixed
Fixed a sync deadlock when checkpointing a LightningModule that uses a torchmetrics 0.4 Metric (#8218)
Fixed compatibility with TorchMetrics v0.4 (#8206)
Added torchelastic check when sanitizing GPUs (#8095)
Fixed a DDP info message that was never shown (#8111)
Fixed metrics deprecation message at module import level (#8163)
Fixed a bug where an infinite recursion would be triggered when using the BaseFinetuning callback on a model that contains a ModuleDict (#8170)
Added a mechanism to detect deadlock for DDP when only 1 process triggers an Exception. The mechanism will kill the processes when it happens (#8167)
Fixed NCCL error when selecting non-consecutive device ids (#8165)
Fixed SWA to also work with IterableDataset (#8172)
[1.3.7] - 2021-06-22
[1.3.7] - Fixed
Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error (#7975)
Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
Fixed setting a DistributedSampler when using a distributed plugin in a custom accelerator (#7814)
Improved PyTorchProfiler chrome traces names (#8009)
Fixed moving the best score to device in EarlyStopping callback for TPU devices (#7959)
Fixed access to callback_metrics in ddp_spawn (#7916)
[1.3.6] - 2021-06-15
[1.3.6] - Fixed
Fixed logs overwriting issue for remote filesystems (#7889)
Fixed an issue where DataModule.prepare_data could only be called on the global rank 0 process (#7945)
Fixed setting worker_init_fn to seed dataloaders correctly when using DDP (#7942)
Fixed BaseFinetuning callback to properly handle parent modules with parameters (#7931)
[1.3.5] - 2021-06-08
[1.3.5] - Added
Added warning to Training Step output (#7779)
[1.3.5] - Fixed
[1.3.5] - Changed
Move training_output validation to after train_step_end (#7868)
[1.3.4] - 2021-06-01
[1.3.4] - Fixed
[1.3.3] - 2021-05-27
[1.3.3] - Changed
Changed calling of untoggle_optimizer(opt_idx) out of the closure function (#7563)
[1.3.3] - Fixed
Fixed ProgressBar pickling after calling trainer.predict (#7608)
Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
Fixed dataloaders not being reset when tuning the model (#7566)
Fixed print errors in ProgressBar when trainer.fit is not called (#7674)
Fixed global step update when the epoch is skipped (#7677)
Fixed training loop total batch counter when accumulate grad batches was enabled (#7692)
[1.3.2] - 2021-05-18
[1.3.2] - Changed
DataModules now avoid duplicate {setup,teardown,prepare_data} calls for the same stage (#7238)
[1.3.2] - Fixed
Fixed parsing of multiple training dataloaders (#7433)
Fixed recursive passing of wrong_type keyword argument in pytorch_lightning.utilities.apply_to_collection (#7433)
Fixed setting correct DistribType for ddp_cpu (spawn) backend (#7492)
Fixed incorrect number of calls to LR scheduler when check_val_every_n_epoch > 1 (#7032)
[1.3.1] - 2021-05-11
[1.3.1] - Fixed
[1.3.0] - 2021-05-06
[1.3.0] - Added
Added support for the EarlyStopping callback to run at the end of the training epoch (#6944)
Added synchronization points before and after setup hooks are run (#7202)
Added a teardown hook to ClusterEnvironment (#6942)
Added utils for metrics to scalar conversions (#7180)
Added utils for NaN/Inf detection for gradients and parameters (#6834)
Added more explicit exception message when trying to execute trainer.test() or trainer.validate() with fast_dev_run=True (#6667)
Added LightningCLI class to provide simple reproducibility with minimum boilerplate training CLI (#4492, #6862, #7156, #7299)
Added gradient_clip_algorithm argument to Trainer for gradient clipping by value (#6123)
Added a way to print to terminal without breaking up the progress bar (#5470)
Added support to checkpoint after training steps in ModelCheckpoint callback (#6146)
Added TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED} (#7173)
Added Trainer.validate() method to perform one evaluation epoch over the validation set (#4948)
Added LightningEnvironment for Lightning-specific DDP (#5915)
Added teardown() hook to LightningDataModule (#4673)
Added auto_insert_metric_name parameter to ModelCheckpoint (#6277)
Added arg to self.log that enables users to give custom names when dealing with multiple dataloaders (#6274)
Added teardown method to BaseProfiler to enable subclasses defining post-profiling steps outside of __del__ (#6370)
Added setup method to BaseProfiler to enable subclasses defining pre-profiling steps for every process (#6633)
Added no return warning to predict (#6139)
Added Trainer.predict config validation (#6543)
Added AbstractProfiler interface (#6621)
Added support for including module names for forward in the autograd trace of PyTorchProfiler (#6349)
Added support for the PyTorch 1.8.1 autograd profiler (#6618)
Added outputs parameter to callback’s on_validation_epoch_end & on_test_epoch_end hooks (#6120)
Added configure_sharded_model hook (#6679)
Added support for precision=64, enabling training with double precision (#6595)
Added support for DDP communication hooks (#6736)
Added artifact_location argument to MLFlowLogger which will be passed to the MlflowClient.create_experiment call (#6677)
Added model parameter to precision plugins’ clip_gradients signature (#6764, #7231)
Added is_last_batch attribute to Trainer (#6825)
Added LightningModule.lr_schedulers() for manual optimization (#6567)
Added MpModelWrapper in TPU Spawn (#7045)
Added max_time Trainer argument to limit training time (#6823)
Added on_predict_{batch,epoch}_{start,end} hooks (#7141)
Added new EarlyStopping parameters stopping_threshold and divergence_threshold (#6868)
Added debug flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219)
Added new UnrepeatedDistributedSampler and IndexBatchSamplerWrapper for tracking distributed predictions (#7215)
Added trainer.predict(return_predictions=None|False|True) (#7215)
Added BasePredictionWriter callback to implement prediction saving (#7127)
Added trainer.tune(scale_batch_size_kwargs, lr_find_kwargs) arguments to configure the tuning algorithms (#7258)
Added tpu_distributed check for TPU Spawn barrier (#7241)
Added device updates to TPU Spawn for Pod training (#7243)
Added warning when missing Callback and using resume_from_checkpoint (#7254)
DeepSpeed single file saving (#6900)
Added Training type Plugins Registry (#6982, #7063, #7214, #7224)
Add ignore param to save_hyperparameters (#6056)
[1.3.0] - Changed
Changed LightningModule.truncated_bptt_steps to be a property (#7323)
Changed the EarlyStopping callback to no longer run EarlyStopping.on_validation_end by default if only training is run. Set check_on_train_epoch_end to run the callback at the end of the train epoch instead of at the end of the validation epoch (#7069)
Renamed pytorch_lightning.callbacks.swa to pytorch_lightning.callbacks.stochastic_weight_avg (#6259)
Refactor RunningStage and TrainerState usage (#4945, #7173)
    Added RunningStage.SANITY_CHECKING
    Added TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
    Changed trainer.evaluating to return True if validating or testing
Changed setup() and teardown() stage argument to take any of {fit,validate,test,predict} (#6386)
Changed profilers to save separate report files per state and rank (#6621)
The trainer no longer tries to save a checkpoint on exception or run callback’s on_train_end functions (#6864)
Changed PyTorchProfiler to use torch.autograd.profiler.record_function to record functions (#6349)
Disabled lr_scheduler.step() in manual optimization (#6825)
Changed warnings and recommendations for dataloaders in ddp_spawn (#6762)
pl.seed_everything will now also set the seed on the DistributedSampler (#7024)
Changed default setting for communication of multi-node training using DDPShardedPlugin (#6937)
trainer.tune() now returns the tuning result (#7258)
LightningModule.from_datasets() now accepts IterableDataset instances as training datasets (#7503)
Changed resume_from_checkpoint warning to an error when the checkpoint file does not exist (#7075)
Automatically set sync_batchnorm for training_type_plugin (#6536)
Allowed training type plugin to delay optimizer creation (#6331)
Removed ModelSummary validation from train loop on_trainer_init (#6610)
Moved save_function to accelerator (#6689)
Improved verbose logging for EarlyStopping callback (#6811)
Run ddp_spawn dataloader checks on Windows (#6930)
Updated mlflow with using resolve_tags (#6746)
Moved save_hyperparameters to its own function (#7119)
Replaced _DataModuleWrapper with __new__ (#7289)
Reset current_fx properties on lightning module in teardown (#7247)
Auto-set DataLoader.worker_init_fn with seed_everything (#6960)
Remove model.trainer call inside of dataloading mixin (#7317)
Split profilers module (#6261)
Ensure accelerator is valid if running interactively (#5970)
Disabled batch transfer in DP mode (#6098)
[1.3.0] - Deprecated¶
Deprecated outputs in both LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#7339)
Deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#7323)
Deprecated LightningModule.grad_norm in favor of pytorch_lightning.utilities.grads.grad_norm (#7292)
Deprecated the save_function property from the ModelCheckpoint callback (#7201)
Deprecated LightningModule.write_predictions and LightningModule.write_predictions_dict (#7066)
Deprecated TrainerLoggingMixin in favor of a separate utilities module for metric handling (#7180)
Deprecated TrainerTrainingTricksMixin in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834)
period has been deprecated in favor of every_n_val_epochs in the ModelCheckpoint callback (#6146)
Deprecated trainer.running_sanity_check in favor of trainer.sanity_checking (#4945)
Deprecated Profiler(output_filename) in favor of dirpath and filename (#6621)
Deprecated PyTorchProfiler(profiled_functions) in favor of record_functions (#6349)
Deprecated @auto_move_data in favor of trainer.predict (#6993)
Deprecated Callback.on_load_checkpoint(checkpoint) in favor of Callback.on_load_checkpoint(trainer, pl_module, checkpoint) (#7253)
Deprecated metrics in favor of torchmetrics (#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131)
Deprecated the LightningModule.datamodule getter and setter methods; access them through Trainer.datamodule instead (#7168)
Deprecated the use of Trainer(gpus="i") (string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
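The two readings of a string gpus value described in the last entry can be sketched in plain Python. This is only an illustration: both helper names are hypothetical and are not part of the library, and the assumption that a count of i maps to device ids 0 to i-1 mirrors how an integer gpus value is usually understood.

```python
from typing import List


def gpus_string_as_index(gpus: str) -> List[int]:
    """Reading being deprecated: the string "i" selects the i-th GPU."""
    return [int(gpus)]


def gpus_string_as_count(gpus: str) -> List[int]:
    """Reading announced for v1.5: the string "i" requests i GPUs.

    Assumption: the requested devices are ids 0 .. i-1.
    """
    return list(range(int(gpus)))


print(gpus_string_as_index("3"))  # [3]
print(gpus_string_as_count("3"))  # [0, 1, 2]
```

Code that relies on the index reading could pass an explicit list of device ids instead, which stays unambiguous under either interpretation.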
[1.3.0] - Removed¶
Removed the exp_save_path property from the LightningModule (#7266)
Removed training loop explicitly calling EarlyStopping.on_validation_end if no validation is run (#7069)
Removed automatic_optimization as a property from the training loop in favor of LightningModule.automatic_optimization (#7130)
Removed evaluation loop legacy returns for *_epoch_end hooks (#6973)
Removed support for passing a bool value to profiler argument of Trainer (#6164)
Removed no return warning from val/test step (#6139)
Removed passing a ModelCheckpoint instance to Trainer(checkpoint_callback) (#6166)
Removed deprecated Trainer argument enable_pl_optimizer and automatic_optimization (#6163)
Removed deprecated metrics (#6161)
    from pytorch_lightning.metrics.functional.classification removed to_onehot, to_categorical, get_num_classes, roc, multiclass_roc, average_precision, precision_recall_curve, multiclass_precision_recall_curve
    from pytorch_lightning.metrics.functional.reduction removed reduce, class_reduce
Removed deprecated ModelCheckpoint arguments prefix, mode="auto" (#6162)
Removed mode='auto' from EarlyStopping (#6167)
Removed epoch and step arguments from ModelCheckpoint.format_checkpoint_name(), these are now included in the metrics argument (#7344)
Removed legacy references for magic keys in the Result object (#6016)
Removed deprecated LightningModule hparams setter (#6207)
Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the "log"/"progress_bar" magic keys. Use self.log instead (#6734)
Removed trainer.fit() return value of 1. It has no return now (#7237)
Removed logger_connector legacy code (#6733)
Removed unused mixin attributes (#6487)
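To illustrate the format_checkpoint_name() change listed above, the sketch below shows a name being built from a single metrics dictionary that carries epoch and step alongside logged values. It is a simplified stand-in written for this note, not the ModelCheckpoint implementation; the function name sketch_format_name and the "name=value" output style are assumptions.

```python
import re
from typing import Dict, Union


def sketch_format_name(template: str, metrics: Dict[str, Union[int, float]]) -> str:
    """Hypothetical helper: expand "{name}" to "name=value" using only `metrics`."""
    # "{epoch}" becomes "epoch={epoch}", "{val_loss:.2f}" becomes "val_loss={val_loss:.2f}".
    labelled = re.sub(r"\{(\w+)", r"\1={\1", template)
    return labelled.format(**metrics)


# epoch and step travel inside the metrics dict rather than as separate arguments.
metrics = {"epoch": 3, "step": 120, "val_loss": 0.1234}
name = sketch_format_name("{epoch}-{step}-{val_loss:.2f}", metrics)
print(name)  # epoch=3-step=120-val_loss=0.12
```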
[1.3.0] - Fixed¶
Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
Fixed attaching train and validation dataloaders when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 (#7207)
Added a barrier in the accelerator teardown to synchronize processes before execution finishes (#6814)
Fixed multi-node DDP sub-process launch by using local_rank instead of global_rank for main process assertion (#7061)
Fixed incorrect removal of WORLD_SIZE environment variable in DDP training when launching with torch distributed/torchelastic (#6942)
Made the Plugin.reduce method more consistent across all Plugins to reflect a mean-reduction by default (#6011)
Move lightning module to correct device type when using LightningDistributedWrapper (#6070)
Do not print top-k verbose log with ModelCheckpoint(monitor=None) (#6109)
Fixed ModelCheckpoint(save_top_k=0, save_last=True) not saving the last checkpoint (#6136)
Fixed .teardown(stage='fit') and .on_fit_{start,end}() getting called during trainer.test (#6386)
Fixed LightningModule all_gather on cpu tensors (#6416)
Fixed torch distributed not available in setup hook for DDP (#6506)
Fixed trainer.tuner.{lr_find,scale_batch_size} not setting the Trainer state properly (#7258)
Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
Fixed pickle error checker to now check for pickle.PickleError to catch all pickle errors (#6917)
Fixed a bug where the outputs object passed to LightningModule.training_epoch_end was different from the object passed to the on_train_epoch_end hook (#6969)
Fixed a bug where the outputs passed to train_batch_end would be lists even when using a single optimizer and no truncated backprop through time steps (#6969)
Fixed bug for trainer error handling which would cause hang for distributed training (#6864)
Fixed self.device not returning the correct device in replicas of data-parallel (#6414)
Fixed lr_find trying beyond num_training steps and suggesting a too high learning rate (#7076)
Fixed logger creating incorrect version folder in DDP with repeated Trainer.fit calls (#7077)
Fixed metric objects passed directly to self.log not being reset correctly (#7055)
Fixed CombinedLoader in distributed settings for validation / testing (#7102)
Fixed the save_dir in WandbLogger when the run was initiated externally (#7106)
Fixed num_sanity_val_steps affecting reproducibility of training data shuffling (#7014)
Fixed resetting device after fitting/evaluating/predicting (#7188)
Fixed bug where trainer.tuner.scale_batch_size(max_trials=0) would not return the correct batch size result (#7262)
Fixed metrics not being properly logged with precision=16 and manual_optimization (#7228)
Fixed BaseFinetuning properly reloading optimizer_states when using resume_from_checkpoint (#6891)
Fixed parameters_to_ignore not properly set to DDPWrapper (#7239)
Fixed parsing of fast_dev_run=True with the built-in ArgumentParser (#7240)
Fixed handling an IterableDataset that fails to produce a batch at the beginning of an epoch (#7294)
Fixed LightningModule.save_hyperparameters() when attempting to save an empty container (#7268)
Fixed apex not properly instantiated when running with ddp (#7274)
Fixed optimizer state not moved to GPU (#7277)
Fixed custom init args for WandbLogger (#6989)
Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
Fixed examples (#6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398)
Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
Updated logic for checking TPUs availability (#6767)
Resolve TPU miss rendezvous (#6781)
Fixed auto-scaling mode when calling tune method on trainer (#7321)
Fixed finetuning complex models correctly unfreezes (#6880)
Ensure we set the eval/train flag correctly on accelerator model (#6877)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic (#6802)
Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
Fixed the gradient_clip_algorithm has no effect (#6928)
Fixed CUDA OOM detection and handling (#6934)
Fixed unfreeze_and_add_param_group expects modules rather than module (#6822)
Fixed DDP + SyncBN when move on device (#6838)
Fixed missing arguments in lr_find call (#6784)
Fixed set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
Fixed NeptuneLogger.log_text(step=None) (#7194)
[1.2.9] - 2021-04-20¶
[1.2.9] - Fixed¶
[1.2.8] - 2021-04-14¶
[1.2.8] - Added¶
Added TPUSpawn + IterableDataset error message (#6875)
[1.2.8] - Fixed¶
Fixed process rank not being available right away after Trainer instantiation (#6941)
Fixed sync_dist for tpus (#6950)
Fixed AttributeError for require_backward_grad_sync when running manual optimization with sharded plugin (#6915)
Fixed --gpus default for parser returned by Trainer.add_argparse_args (#6898)
Fixed TPU Spawn all gather (#6896)
Fixed EarlyStopping logic when min_epochs or min_steps requirement is not met (#6705)
Fixed csv extension check (#6436)
Fixed checkpoint issue when using Horovod distributed backend (#6958)
Fixed tensorboard exception raising (#6901)
Fixed setting the eval/train flag correctly on accelerator model (#6983)
Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
Fixed bug where BaseFinetuning.flatten_modules() was duplicating leaf node parameters (#6879)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic:
[1.2.7] - 2021-04-06¶
[1.2.7] - Fixed¶
Fixed a bug with omegaconf and xm.save (#6741)
Fixed an issue with IterableDataset when len is not defined (#6828)
Sanitize None params during pruning (#6836)
Enforce an epoch scheduler interval when using SWA (#6588)
Fixed TPU Colab hang issue, post training (#6816)
Fixed a bug where TensorBoardLogger would give a warning and not log correctly to a symbolic link save_dir (#6730)
Fixed bug where predict could not be used when progress_bar_refresh_rate=0 (#6884)
[1.2.6] - 2021-03-30¶
[1.2.6] - Changed¶
Changed the behavior of on_epoch_start to run at the beginning of validation & test epoch (#6498)
[1.2.6] - Removed¶
Removed legacy code to include step dictionary returns in callback_metrics. Use self.log_dict instead. (#6682)
[1.2.6] - Fixed¶
Fixed DummyLogger.log_hyperparams raising a TypeError when running with fast_dev_run=True (#6398)
Fixed error on TPUs when there was no ModelCheckpoint (#6654)
Fixed trainer.test freeze on TPUs (#6654)
Fixed a bug where gradients were disabled after calling Trainer.predict (#6657)
Fixed bug where no TPUs were detected in a TPU pod env (#6719)
[1.2.5] - 2021-03-23¶
[1.2.5] - Changed¶
[1.2.5] - Fixed¶
[1.2.4] - 2021-03-16¶
[1.2.4] - Changed¶
Changed the default of find_unused_parameters back to True in DDP and DDP Spawn (#6438)
[1.2.4] - Fixed¶
Expose DeepSpeed loss parameters to allow users to fix loss instability (#6115)
Fixed DP reduction with collection (#6324)
Fixed an issue where the tuner would not tune the learning rate if also tuning the batch size (#4688)
Fixed broadcast to use PyTorch broadcast_object_list and add reduce_decision (#6410)
Fixed logger creating directory structure too early in DDP (#6380)
Fixed DeepSpeed additional memory use on rank 0 when default device not set early enough (#6460)
Fixed an issue with Tuner.scale_batch_size not finding the batch size attribute in the datamodule (#5968)
Fixed an exception in the layer summary when the model contains torch.jit scripted submodules (#6511)
Fixed train loop config being run during Trainer.predict (#6541)
[1.2.3] - 2021-03-09¶
[1.2.3] - Fixed¶
Fixed ModelPruning(make_pruning_permanent=True) pruning buffers getting removed when saved during training (#6073)
Fixed _stable_1d_sort to work when n >= N (#6177)
Fixed AttributeError when logger=None on TPU (#6221)
Fixed PyTorch Profiler with emit_nvtx (#6260)
Fixed trainer.test from best_path hangs after calling trainer.fit (#6272)
Fixed SingleTPU calling all_gather (#6296)
Ensure we check DeepSpeed/Sharded in multi-node DDP (#6297)
Check LightningOptimizer doesn’t delete optimizer hooks (#6305)
Resolve memory leak for evaluation (#6326)
Ensure that clip gradients is only called if the value is greater than 0 (#6330)
Fixed Trainer not resetting lightning_optimizers when calling Trainer.fit() multiple times (#6372)
[1.2.2] - 2021-03-02¶
[1.2.2] - Added¶
Added checkpoint parameter to callback’s on_save_checkpoint hook (#6072)
[1.2.2] - Changed¶
[1.2.2] - Fixed¶
Fixed epoch level schedulers not being called when val_check_interval < 1.0 (#6075)
Fixed multiple early stopping callbacks (#6197)
Fixed incorrect usage of detach(), cpu(), to() (#6216)
Fixed LBFGS optimizer support which didn’t converge in automatic optimization (#6147)
Prevent WandbLogger from dropping values (#5931)
Fixed error thrown when using valid distributed mode in multi node (#6297)
[1.2.1] - 2021-02-23¶
[1.2.1] - Fixed¶
[1.2.0] - 2021-02-18¶
[1.2.0] - Added¶
Added DataType, AverageMethod and MDMCAverageMethod enum in metrics (#5657)
Added support for summarized model total params size in megabytes (#5590)
Added support for multiple train loaders (#1959)
The Accuracy metric now generalizes to Top-k accuracy for (multi-dimensional) multi-class inputs using the top_k parameter (#4838)
The Accuracy metric now enables the computation of subset accuracy for multi-label or multi-dimensional multi-class inputs with the subset_accuracy parameter (#4838)
Added HammingDistance metric to compute the hamming distance (loss) (#4838)
Added max_fpr parameter to auroc metric for computing partial auroc metric (#3790)
Added StatScores metric to compute the number of true positives, false positives, true negatives and false negatives (#4839)
Added R2Score metric (#5241)
Added LambdaCallback (#5347)
Added BackboneLambdaFinetuningCallback (#5377)
Accelerator all_gather supports collection (#5221)
Added image_gradients functional metric to compute the image gradients of a given input image. (#5056)
Added MetricCollection (#4318)
Added .clone() method to metrics (#4318)
Added IoU class interface (#4704)
Support to tie weights after moving model to TPU via on_post_move_to_device hook
Added missing val/test hooks in LightningModule (#5467)
The Recall and Precision metrics (and their functional counterparts recall and precision) can now be generalized to Recall@K and Precision@K with the use of top_k parameter (#4842)
Added PyTorchProfiler (#5560)
Added compositional metrics (#5464)
Added Trainer method predict(...) for high performance predictions (#5579)
Added on_before_batch_transfer and on_after_batch_transfer data hooks (#3671)
Added AUC/AUROC class interface (#5479)
Added PredictLoop object (#5752)
Added LightningModule.configure_callbacks to enable the definition of model-specific callbacks (#5621)
Added dim to PSNR metric for mean-squared-error reduction (#5957)
Added proximal policy optimization template to pl_examples (#5394)
Added log_graph to CometLogger (#5295)
Added possibility for nested loaders (#5404)
Added sync_step to Wandb logger (#5351)
Added StochasticWeightAveraging callback (#5640)
Added LightningDataModule.from_datasets(...) (#5133)
Added PL_TORCH_DISTRIBUTED_BACKEND env variable to select backend (#5981)
Added Trainer flag to activate Stochastic Weight Averaging (SWA) Trainer(stochastic_weight_avg=True) (#6038)
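As a rough illustration of what the top_k generalization of accuracy mentioned above means, the following plain-Python sketch counts a prediction as correct when the target class is among the k highest-scoring classes. It is not the library's metric: the function top_k_accuracy and its argument names are assumptions made for this example, and it handles only simple multi-class score rows.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], targets: Sequence[int], top_k: int = 1) -> float:
    """Fraction of samples whose target is among the top_k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += target in ranked[:top_k]
    return hits / len(targets)


scores = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3], [0.2, 0.3, 0.5]]
targets = [2, 0, 1]
top1 = top_k_accuracy(scores, targets, top_k=1)  # only the second sample is a hit
top2 = top_k_accuracy(scores, targets, top_k=2)  # every target is in the top two
```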
[1.2.0] - Changed¶
The stat_scores metric now calculates stat scores over all classes and gains new parameters, in line with the new StatScores metric (#4839)
Changed computer_vision_fine_tunning example to use BackboneLambdaFinetuningCallback (#5377)
Changed automatic casting for LoggerConnector metrics (#5218)
Changed iou [func] to allow float input (#4704)
Metric compute() method will no longer automatically call reset() (#5409)
Set PyTorch 1.4 as min requirements, also for testing and examples torchvision>=0.5 and torchtext>=0.5 (#5418)
Changed callbacks argument in Trainer to allow Callback input (#5446)
Changed the default of find_unused_parameters to False in DDP (#5185)
Changed ModelCheckpoint version suffixes to start at 1 (#5008)
Progress bar metrics tensors are now converted to float (#5692)
Changed the default value for the progress_bar_refresh_rate Trainer argument in Google COLAB notebooks to 20 (#5516)
Extended support for purely iteration-based training (#5726)
Made LightningModule.global_rank, LightningModule.local_rank and LightningModule.logger read-only properties (#5730)
Forced ModelCheckpoint callbacks to run after all others to guarantee all states are saved to the checkpoint (#5731)
Refactored Accelerators and Plugins:
    Added base classes for plugins (#5715)
    Added parallel plugins for DP, DDP, DDPSpawn, DDP2 and Horovod (#5714)
    Precision Plugins (#5718)
    Added new Accelerators for CPU, GPU and TPU (#5719)
    Added RPC and Sharded plugins (#5732)
    Added missing LightningModule-wrapper logic to new plugins and accelerator (#5734)
    Moved device-specific teardown logic from training loop to accelerator (#5973)
    Moved accelerator_connector.py to the connectors subfolder (#6033)
    Trainer only references accelerator (#6039)
    Made parallel devices optional across all plugins (#6051)
Enabled self.log in callbacks (#5094)
Renamed xxx_AVAILABLE as protected (#5082)
Unified module names in Utils (#5199)
Refactor: clean trainer device & distributed getters (#5300)
Simplified training phase as LightningEnum (#5419)
Updated metrics to use LightningEnum (#5689)
Changed the sequence of on_train_batch_end, on_batch_end & on_train_epoch_end, on_epoch_end hooks (#5688)
Refactored setup_training and remove test_mode (#5388)
Disabled training with zero num_training_batches when insufficient limit_train_batches (#5703)
Refactored EpochResultStore (#5522)
Update lr_finder to check for attribute if not running fast_dev_run (#5990)
LightningOptimizer manual optimizer is more flexible and exposes toggle_model (#5771)
MlflowLogger limit parameter value length to 250 char (#5893)
Re-introduced fix for Hydra directory sync with multiple process (#5993)
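The entry above about compute() no longer calling reset() automatically can be pictured with a toy accumulator. This is a minimal sketch, assuming a metric follows an update/compute/reset pattern; the RunningMean class is hypothetical and does not come from the library.

```python
class RunningMean:
    """Hypothetical stand-in for a metric with update/compute/reset methods."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        self.total += value
        self.count += 1

    def compute(self) -> float:
        # Reads the accumulated state without clearing it.
        return self.total / self.count if self.count else 0.0

    def reset(self) -> None:
        self.total = 0.0
        self.count = 0


metric = RunningMean()
metric.update(1.0)
metric.update(3.0)
first = metric.compute()   # mean of 1.0 and 3.0
metric.update(5.0)
second = metric.compute()  # earlier values still count, since compute() did not reset
metric.reset()             # clearing state is now an explicit step
metric.update(5.0)
third = metric.compute()
```

Under this behavior, code that expected a fresh state after each compute() would need an explicit reset() call at the point where accumulation should start over.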
[1.2.0] - Deprecated¶
Function stat_scores_multiple_classes is deprecated in favor of stat_scores (#4839)
Moved accelerators and plugins to their legacy pkg (#5645)
Deprecated LightningDistributedDataParallel in favor of new wrapper module LightningDistributedModule (#5185)
Deprecated LightningDataParallel in favor of new wrapper module LightningParallelModule (#5670)
Renamed utils modules (#5199)
    argparse_utils >> argparse
    model_utils >> model_helpers
    warning_utils >> warnings
    xla_device_utils >> xla_device
Deprecated using 'val_loss' to set the ModelCheckpoint monitor (#6012)
Deprecated .get_model() with explicit .lightning_module property (#6035)
Deprecated Trainer attribute accelerator_backend in favor of accelerator (#6034)
[1.2.0] - Removed¶
[1.2.0] - Fixed¶
Fixed distributed setting and ddp_cpu only with num_processes>1 (#5297)
Fixed num_workers for Windows example (#5375)
Fixed loading yaml (#5619)
Fixed support custom DataLoader with DDP if they can be re-instantiated (#5745)
Fixed repeated .fit() calls ignore max_steps iteration bound (#5936)
Fixed throwing MisconfigurationError on unknown mode (#5255)
Resolve bug with Finetuning (#5744)
Fixed ModelCheckpoint race condition in file existence check (#5155)
Fixed some compatibility with PyTorch 1.8 (#5864)
Fixed forward cache (#5895)
Fixed recursive detach of tensors to CPU (#6007)
Fixed passing wrong strings for scheduler interval doesn’t throw an error (#5923)
Fixed wrong requires_grad state after return None with multiple optimizers (#5738)
Fixed add on_epoch_end hook at the end of validation, test epoch (#5986)
Fixed missing process_dataloader call for TPUSpawn when in distributed mode (#6015)
Fixed progress bar flickering by appending 0 to floats/strings (#6009)
Fixed synchronization issues with TPU training (#6027)
Fixed hparams.yaml saved twice when using TensorBoardLogger (#5953)
Fixed fairscale compatible with PT 1.8 (#5996)
Ensured process_dataloader is called when tpu_cores > 1 to use Parallel DataLoader (#6015)
Attempted SLURM auto resume call when non-shell call fails (#6002)
Fixed wrapping optimizers upon assignment (#6006)
Fixed allowing hashing of metrics with lists in their state (#5939)
[1.1.8] - 2021-02-08¶
[1.1.8] - Fixed¶
[1.1.7] - 2021-02-03¶
[1.1.7] - Fixed¶
Fixed TensorBoardLogger not closing SummaryWriter on finalize (#5696)
Fixed filtering of pytorch “unsqueeze” warning when using DP (#5622)
Fixed num_classes argument in F1 metric (#5663)
Fixed log_dir property (#5537)
Fixed a race condition in ModelCheckpoint when checking if a checkpoint file exists (#5144)
Remove unnecessary intermediate layers in Dockerfiles (#5697)
Fixed auto learning rate ordering (#5638)
[1.1.6] - 2021-01-26¶
[1.1.6] - Changed¶
[1.1.6] - Fixed¶
Fixed toggle_optimizer to reset requires_grad state (#5574)
Fixed FileNotFoundError for best checkpoint when using DDP with Hydra (#5629)
Fixed an error when logging a progress bar metric with a reserved name (#5620)
Fixed Metric’s state_dict not included when child modules (#5614)
Fixed Neptune logger creating multiple experiments when GPUs > 1 (#3256)
Fixed duplicate logs appearing in console when using the python logging module (#5509)
Fixed tensor printing in trainer.test() (#5138)
Fixed not using dataloader when hparams present (#4559)
[1.1.5] - 2021-01-19¶
[1.1.5] - Fixed¶
[1.1.4] - 2021-01-12¶
[1.1.4] - Added¶
Add automatic optimization property setter to lightning module (#5169)
[1.1.4] - Changed¶
Changed deprecated enable_pl_optimizer=True (#5244)
[1.1.4] - Fixed¶
Fixed transfer_batch_to_device for DDP with len(devices_ids) == 1 (#5195)
Logging only on not should_accumulate() during training (#5417)
Resolve interpolation bug with Hydra (#5406)
Check environ before selecting a seed to prevent warning message (#4743)
Fixed signature mismatch in model_to_device of DDPCPUHPCAccelerator (#5505)
[1.1.3] - 2021-01-05¶
[1.1.3] - Added¶
[1.1.3] - Changed¶
[1.1.3] - Fixed¶
Fixed trainer.test returning non-test metrics (#5214)
Fixed metric state reset (#5273)
Fixed --num-nodes on DDPSequentialPlugin (#5327)
Fixed invalid value for weights_summary (#5296)
Fixed Trainer.test not using the latest best_model_path (#5161)
Fixed existence check for hparams not using underlying filesystem (#5250)
Fixed LightningOptimizer AMP bug (#5191)
Fixed casted key to string in _flatten_dict (#5354)
[1.1.2] - 2020-12-23¶
[1.1.2] - Added¶
[1.1.2] - Removed¶
enable_pl_optimizer=False by default to temporarily fix AMP issues (#5163)
[1.1.2] - Fixed¶
Metric reduction with Logging (#5150)
Remove nan loss in manual optimization (#5121)
Un-balanced logging properly supported (#5119)
Fix hanging in DDP HPC accelerators (#5157)
Fix reset TensorRunningAccum (#5106)
Updated DALIClassificationLoader to not use deprecated arguments (#4925)
Corrected call to torch.no_grad (#5124)
[1.1.1] - 2020-12-15¶
[1.1.1] - Added¶
Add a notebook example to reach a quick baseline of ~94% accuracy on CIFAR10 using Resnet in Lightning (#4818)
[1.1.1] - Changed¶
[1.1.1] - Removed¶
[1.1.1] - Fixed¶
Fixed trainer by default None in DDPAccelerator (#4915)
Fixed LightningOptimizer to expose optimizer attributes (#5095)
Do not warn when the name key is used in the lr_scheduler dict (#5057)
Check if optimizer supports closure (#4981)
Add deprecated metric utility functions back to functional (#5067, #5068)
Allow any input in to_onnx and to_torchscript (#4378)
Fixed DDPHPCAccelerator hangs in DDP construction by calling init_device (#5157)
[1.1.0] - 2020-12-09¶
[1.1.0] - Added¶
Added “monitor” key to saved ModelCheckpoints (#4383)
Added ConfusionMatrix class interface (#4348)
Added multiclass AUROC metric (#4236)
Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
Added optimizer hooks in callbacks (#4379)
Added option to log momentum (#4384)
Added current_score to ModelCheckpoint.on_save_checkpoint (#4721)
Added logging using self.log in train and evaluation for epoch end hooks (#4552, #4495, #4439, #4684, #4913)
Added ability for DDP plugin to modify optimizer state saving (#4675)
Added prefix argument in loggers (#4557)
Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
Added PrecisionRecallCurve, ROC, AveragePrecision class metric (#4549)
Added custom Apex and NativeAMP as Precision plugins (#4355)
Added DALI MNIST example (#3721)
Added sharded plugin for DDP for multi-gpu training memory optimizations (#4639, #4686, #4737, #4773)
Added experiment_id to the NeptuneLogger (#3462)
Added PyTorch Geometric integration example with Lightning (#4568)
Added all_gather method to LightningModule which allows gradient based tensor synchronizations for use-cases such as negative sampling. (#5012)
Enabled self.log in most functions (#4969)
Added changeable extension variable for ModelCheckpoint (#4977)
[1.1.0] - Changed¶
Tuner algorithms will be skipped if fast_dev_run=True (#3903)
WandbLogger does not force wandb reinit arg to True anymore and creates a run only when needed (#4648)
Changed automatic_optimization to be a model attribute (#4602)
Changed Simple Profiler report to order by percentage time spent + num calls (#4880)
Simplify optimization Logic (#4984)
Classification metrics overhaul (#4837)
Updated fast_dev_run to accept integer representing num_batches (#4629)
Refactored optimizer (#4658)
[1.1.0] - Deprecated¶
[1.1.0] - Removed¶
[1.1.0] - Fixed¶
Added feature to move tensors to CPU before saving (#4309)
Fixed LoggerConnector to have logged metrics on root device in DP (#4138)
Auto convert tensors to contiguous format when gather_all (#4907)
Fixed PYTHONPATH for ddp test model (#4528)
Fixed allowing logger to support indexing (#4595)
Fixed DDP and manual_optimization (#4976)
[1.0.8] - 2020-11-24¶
[1.0.8] - Added¶
[1.0.8] - Changed¶
Consistently use step=trainer.global_step in LearningRateMonitor independently of logging_interval (#4376)
Metric states are no longer added to state_dict by default (#4685)
Renamed class metric Fbeta >> FBeta (#4656)
Model summary: add 1 decimal place (#4745)
Do not override PYTHONWARNINGS (#4700)
Moved init_ddp_connection from DDP to DDPPlugin (#4407)
[1.0.8] - Fixed¶
Fixed checkpoint hparams dict casting when omegaconf is available (#4770)
Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
Updated SSIM metric (#4566)
Fixed batch_arg_name - add batch_arg_name to all calls to _adjust_batch_size bug (#4812)
Fixed torchtext data to GPU (#4785)
Fixed a crash bug in MLFlow logger (#4716)
[1.0.7] - 2020-11-17¶
[1.0.7] - Added¶
Added lambda closure to manual_optimizer_step (#4618)
[1.0.7] - Changed¶
[1.0.7] - Fixed¶
Prevent crash if sync_dist=True on CPU (#4626)
Fixed average pbar Metrics (#4534)
Fixed setup callback hook to correctly pass the LightningModule through (#4608)
Allowing decorate model init with saving hparams inside (#4662)
Fixed split_idx set by LoggerConnector in on_trainer_init to Trainer (#4697)
[1.0.6] - 2020-11-11¶
[1.0.6] - Added¶
Added metrics aggregation in Horovod and fixed early stopping (#3775)
Added manual_optimizer_step which works with AMP Native and accumulated_grad_batches (#4485)
Added persistent(mode) method to metrics, to enable and disable metric states being added to state_dict (#4482)
Added congratulations at the end of our notebooks (#4555)
Added parameters move_metrics_to_cpu in Trainer to disable gpu leak (#4592)
[1.0.6] - Changed¶
[1.0.6] - Fixed¶
Fixed feature-lack in hpc_load (#4526)
Fixed metrics states being overridden in DDP mode (#4482)
Fixed lightning_getattr, lightning_hasattr not finding the correct attributes in datamodule (#4347)
Fixed automatic optimization AMP by manual_optimization_step (#4485)
Replace MisconfigurationException with warning in ModelCheckpoint Callback (#4560)
Fixed logged keys in mlflow logger (#4412)
Fixed is_picklable by catching AttributeError (#4508)
Fixed multi test dataloaders dict AttributeError error (#4480)
Fixed show progress bar only for progress_rank 0 on DDP_SLURM (#4437)
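The is_picklable entry above describes a check that treats AttributeError as a sign that an object cannot be pickled. A minimal sketch of such a check is shown below. It is written for illustration only and is not the library's code; catching pickle.PickleError alongside AttributeError is an assumption about what a robust version would look like.

```python
import pickle


def is_picklable(obj: object) -> bool:
    """Return True if `obj` can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PickleError, AttributeError):
        # Depending on the object and Python version, unpicklable callables
        # surface as either of these exception types.
        return False


print(is_picklable({"lr": 0.1}))    # plain data pickles fine
print(is_picklable(lambda x: x))    # lambdas cannot be pickled by the standard pickler
```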
[1.0.5] - 2020-11-03¶
[1.0.5] - Added¶
[1.0.5] - Changed¶
W&B log in sync with Trainer step (#4405)
Hook on_after_backward is called only when optimizer_step is being called (#4439)
Moved track_and_norm_grad into training loop and called only when optimizer_step is being called (#4439)
Changed type checker with explicit cast of ref_model object (#4457)
Changed distributed_backend -> accelerator (#4429)
[1.0.5] - Deprecated¶
Deprecated passing ModelCheckpoint instance to checkpoint_callback Trainer argument (#4336)
[1.0.5] - Fixed¶
Disable saving checkpoints if not trained (#4372)
Fixed error using auto_select_gpus=True with gpus=-1 (#4209)
Disabled training when limit_train_batches=0 (#4371)
Fixed that metrics do not store computational graph for all seen data (#4313)
Fixed AMP unscale for on_after_backward (#4439)
Fixed TorchScript export when module includes Metrics (#4428)
Fixed TorchScript trace method’s data to device and docstring (#4360)
Fixed CSV logger warning (#4419)
Fixed skip DDP parameter sync (#4301)
Fixed WandbLogger _sanitize_callable function (#4422)
Fixed AMP Native _unscale gradient (#4441)
[1.0.4] - 2020-10-27¶
[1.0.4] - Added¶
Added dirpath and filename parameter in ModelCheckpoint (#4213)
Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
Added strict option to the scheduler dictionary (#3586)
Added fsspec support for profilers (#4162)
Added autogenerated helptext to Trainer.add_argparse_args (#4344)
Added support for string values in Trainer’s profiler parameter (#3656)
Added optimizer_closure to optimizer.step when supported (#4190)
Added unification of regression metrics (#4166)
Added checkpoint load from Bytes (#4314)
[1.0.4] - Changed¶
[1.0.4] - Deprecated¶
[1.0.4] - Fixed¶
Fixed setting device ids in DDP (#4297)
Fixed synchronization of best model path in ddp_accelerator (#4323)
Fixed WandbLogger not uploading checkpoint artifacts at the end of training (#4341)
Fixed FBeta computation (#4183)
Fixed accumulation across batches has completed before breaking training loop (#4278)
Fixed ModelCheckpoint don’t increase current_epoch and global_step when not training (#4291)
Fixed COMET_EXPERIMENT_KEY environment variable usage in comet logger (#4230)
[1.0.3] - 2020-10-20¶
[1.0.3] - Added¶
Added persistent flag to Metric.add_state (#4195)
[1.0.3] - Changed¶
[1.0.3] - Fixed¶
[1.0.2] - 2020-10-15¶
[1.0.2] - Added¶
Added trace functionality to the function to_torchscript (#4142)
[1.0.2] - Changed¶
Called on_load_checkpoint before loading state_dict (#4057)
[1.0.2] - Removed¶
Removed duplicate metric vs step log for train loop (#4173)
[1.0.2] - Fixed¶
[1.0.1] - 2020-10-14¶
[1.0.1] - Added¶
Added getstate/setstate method for torch.save serialization (#4127)
[1.0.0] - 2020-10-13¶
[1.0.0] - Added¶
Added Explained Variance Metric + metric fix (#4013)
Added Metric <-> Lightning Module integration tests (#4008)
Added parsing OS env vars in Trainer (#4022)
Added classification metrics (#4043)
Updated explained variance metric (#4024)
Enabled plugins (#4041)
Enabled custom clusters (#4048)
Enabled passing in custom accelerators (#4050)
Added LightningModule.toggle_optimizer (#4058)
Added LightningModule.manual_backward (#4063)
Added output argument to *_epoch_end hooks (#3967)
[1.0.0] - Changed¶
[1.0.0] - Removed¶
Removed support for EvalResult and TrainResult (#3968)
Removed deprecated trainer flags: overfit_pct, log_save_interval, row_log_interval (#3969)
Removed deprecated early_stop_callback (#3982)
Removed deprecated model hooks (#3980)
Removed deprecated callbacks (#3979)
Removed trainer argument in LightningModule.backward (#4056)
[1.0.0] - Fixed¶
[0.10.0] - 2020-10-07¶
[0.10.0] - Added¶
Enable PyTorch 1.7 compatibility (#3541)
Added LightningModule.to_torchscript to support exporting as ScriptModule (#3258)
Added warning when dropping unpicklable hparams (#2874)
Added EMB similarity (#3349)
Added ModelCheckpoint.to_yaml method (#3048)
Allow ModelCheckpoint monitor to be None, meaning it will always save (#3630)
Disabled optimizers setup during testing (#3059)
Added support for datamodules to save and load checkpoints when training (#3563)
Added support for datamodule in learning rate finder (#3425)
Added gradient clip test for native AMP (#3754)
Added dist lib to enable syncing anything across devices (#3762)
Added broadcast to TPUBackend (#3814)
Added XLADeviceUtils class to check XLA device type (#3274)
[0.10.0] - Changed¶
Refactored accelerator backends:
moved TPU xxx_step to backend (#3118)
refactored DDP backend forward (#3119)
refactored GPU backend __step (#3120)
remove obscure forward call in eval + CPU backend ___step (#3123)
reduced all simplified forward (#3126)
added hook base method (#3127)
refactor eval loop to use hooks - use test_mode for if so we can split later (#3129)
moved ___step_end hooks (#3130)
training forward refactor (#3134)
training AMP scaling refactor (#3135)
eval step scaling factor (#3136)
add eval loop object to streamline eval loop (#3138)
refactored dataloader process hook (#3139)
refactored inner eval loop (#3141)
final inner eval loop hooks (#3154)
clean up hooks in run_evaluation (#3156)
clean up data reset (#3161)
expand eval loop out (#3165)
moved hooks around in eval loop (#3195)
remove _evaluate fx (#3197)
Trainer.fit hook clean up (#3198)
DDPs train hooks (#3203)
reduced accelerator selection (#3211)
group prepare data hook (#3212)
added data connector (#3285)
modular is_overridden (#3290)
adding Trainer.tune() (#3293)
move run_pretrain_routine -> setup_training (#3294)
move train outside of setup training (#3297)
move prepare_data to data connector (#3307)
moved accelerator router (#3309)
train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
duplicate data interface definition up into DataHooks class (#3344)
inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
all logging related calls in a connector (#3395)
added model connector (#3407)
moved eval loop logging to loggers (#3408)
moved eval loop (#3412, #3408)
move lr_finder (#3434)
move specific accelerator code (#3457)
group connectors (#3472)
apex plugin (#3502)
precision plugins (#3504)
Result - make monitor default to checkpoint_on to simplify (#3571)
reference to the Trainer on the LightningDataModule (#3684)
add .log to lightning module (#3686, #3699, #3701, #3704, #3715)
enable tracking original metric when step and epoch are both true (#3685)
deprecated results obj, added support for simpler comms (#3681)
move backends back to individual files (#3712)
fixes logging for eval steps (#3763)
decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806, #3817, #3819, #3927)
remove weight loading hack for ddp_cpu (#3808)
separate torchelastic from DDP (#3810)
separate SLURM from DDP (#3809)
decoupled DDP2 (#3816)
bug fix with logging val epoch end + monitor (#3812)
callback system and init DDP (#3836)
epoch can now log independently (#3843)
test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
fixed init_slurm_connection causing hostname errors (#3856)
moves init apex from LM to apex connector (#3923)
moves sync bn to each backend (#3925)
moves configure ddp to each backend (#3924)
Deprecation warning (#3844)
Changed LearningRateLogger to LearningRateMonitor (#3251)
Used fsspec instead of gfile for all IO (#3320)
Swapped torch.load for fsspec load in DDP spawn backend (#3787)
Swapped torch.load for fsspec load in cloud_io loading (#3692)
Added support for to_disk() to use remote filepaths with fsspec (#3930)
Updated model_checkpoint’s to_yaml to use fsspec open (#3801)
Fixed fsspec is inconsistent when doing fs.ls (#3805)
Refactor GPUStatsMonitor to improve training speed (#3257)
Changed IoU score behavior for classes absent in target and pred (#3098)
Changed IoU remove_bg bool to ignore_index optional int (#3098)
Changed defaults of save_top_k and save_last to None in ModelCheckpoint (#3680)
row_log_interval and log_save_interval are now based on training loop’s global_step instead of epoch-internal batch index (#3667)
Silenced some warnings. verified ddp refactors (#3483)
Cleaning up stale logger tests (#3490)
Allow ModelCheckpoint monitor to be None (#3633)
Enable None model checkpoint default (#3669)
Skipped best_model_path if checkpoint_callback is None (#2962)
Used raise .. from .. to explicitly chain exceptions (#3750)
Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
Write predictions in LightningModule instead of EvalResult (#3882)
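The explicit exception chaining mentioned in #3750 is a plain Python feature. The sketch below is a minimal illustration using only the standard library; load_config and the file path are hypothetical names chosen for the example and are not part of Lightning.

```python
def load_config(path):
    """Hypothetical loader that re-raises a low-level error with more context."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as err:
        # `from err` stores the original exception on __cause__, so the
        # traceback shows both errors and how they relate.
        raise RuntimeError(f"could not load config from {path!r}") from err


try:
    load_config("/nonexistent/config.yaml")
except RuntimeError as exc:
    cause = exc.__cause__  # the original OSError (here a FileNotFoundError)
```

Without the from clause Python still records the first error as implicit context, but the traceback then reads as if a second, unrelated failure happened while handling the first.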
[0.10.0] - Deprecated¶
Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
Deprecate early_stop_callback Trainer argument (#3845)
Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748)
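For code that still passes the old Trainer argument names, the rename in #3748 can be handled with a small keyword mapping. This is only a sketch: migrate_trainer_kwargs is a hypothetical helper, not something Lightning provides, and the only facts taken from the changelog are the two old/new name pairs.

```python
import warnings

# Old name -> new name, as listed in the 0.10.0 deprecation entry.
RENAMED = {
    "row_log_interval": "log_every_n_steps",
    "log_save_interval": "flush_logs_every_n_steps",
}


def migrate_trainer_kwargs(kwargs):
    """Hypothetical helper: return a copy of kwargs using the new names."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = RENAMED.get(key, key)
        if new_key != key:
            warnings.warn(f"{key} is deprecated, use {new_key}", DeprecationWarning)
        migrated[new_key] = value
    return migrated


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the example output quiet
    new_kwargs = migrate_trainer_kwargs({"row_log_interval": 10, "max_epochs": 3})
```

The resulting dictionary could then be passed on as keyword arguments, with unrelated keys such as max_epochs left untouched.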
[0.10.0] - Removed¶
Removed experimental Metric API (#3943, #3949, #3946), listed changes before final removal:
Added hooks to metric module interface (#2528)
Added error when AUROC metric is used for multiclass problems (#3350)
Fixed ModelCheckpoint with save_top_k=-1 option not tracking the best models when a monitor metric is available (#3735)
Fixed counter-intuitive error being thrown in Accuracy metric for zero target tensor (#3764)
Fixed aggregation of metrics (#3517)
Fixed Metric aggregation (#3321)
Fixed RMSLE metric (#3188)
Renamed reduction to class_reduction in classification metrics (#3322)
Changed class_reduction similar to sklearn for classification metrics (#3322)
Renaming of precision recall metric (#3308)
[0.10.0] - Fixed¶
Fixed on_train_batch_start hook to end epoch early (#3700)
Fixed num_sanity_val_steps is clipped to limit_val_batches (#2917)
Fixed ONNX model save on GPU (#3145)
Fixed GpuUsageLogger to work on different platforms (#3008)
Fixed auto-scale batch size not dumping auto_lr_find parameter (#3151)
Fixed batch_outputs with optimizer frequencies (#3229)
Fixed setting batch size in LightningModule.datamodule when using auto_scale_batch_size (#3266)
Fixed Horovod distributed backend compatibility with native AMP (#3404)
Fixed batch size auto scaling exceeding the size of the dataset (#3271)
Fixed getting experiment_id from MLFlow only once instead of each training loop (#3394)
Fixed overfit_batches which now correctly disables shuffling for the training loader. (#3501)
Fixed gradient norm tracking for row_log_interval > 1 (#3489)
Fixed ModelCheckpoint name formatting (#3164)
Fixed example implementation of AutoEncoder (#3190)
Fixed invalid paths when remote logging with TensorBoard (#3236)
Fixed change t() to transpose() as XLA devices do not support .t() on 1-dim tensor (#3252)
Fixed (weights only) checkpoints loading without PL (#3287)
Fixed gather_all_tensors cross GPUs in DDP (#3319)
Fixed CometML save dir (#3419)
Fixed forward key metrics (#3467)
Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
Fixed global step increment in training loop when training_epoch_end hook is used (#3673)
Fixed dataloader shuffling not getting turned off with overfit_batches > 0 and distributed_backend = "ddp" (#3534)
Fixed determinism in DDPSpawnBackend when using seed_everything in main process (#3335)
Fixed ModelCheckpoint period to actually save every period epochs (#3630)
Fixed val_progress_bar total with num_sanity_val_steps (#3751)
Fixed Tuner dump: add current_epoch to dumped_params (#3261)
Fixed current_epoch and global_step properties mismatch between Trainer and LightningModule (#3785)
Fixed learning rate scheduler for optimizers with internal state (#3897)
Fixed tbptt_reduce_fx when non-floating tensors are logged (#3796)
Fixed model checkpoint frequency (#3852)
Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
Fixed TrainerEvaluationLoopMixin activates model.train() at the end (#3858)
Fixed overfit_batches when using with multiple val/test_dataloaders (#3857)
Fixed enables training_step to return None (#3862)
Fixed init nan for checkpointing (#3863)
Fixed for load_from_checkpoint (#2776)
Fixes incorrect batch_sizes when Dataloader returns a dict with multiple tensors (#3668)
Fixed unexpected signature for validation_step (#3947)
[0.9.0] - 2020-08-20¶
[0.9.0] - Added¶
Added basic CSVLogger (#2721)
Added SSIM metrics (#2671)
Added BLEU metrics (#2535)
Added support to export a model to ONNX format (#2596)
Added support for Trainer(num_sanity_val_steps=-1) to check all validation data before training (#2246)
Added struct. output:
Added class LightningDataModule (#2668)
Added support for PyTorch 1.6 (#2745)
Added call DataModule hooks implicitly in trainer (#2755)
Added support for Mean in DDP Sync (#2568)
Added remaining sklearn metrics: AveragePrecision, BalancedAccuracy, CohenKappaScore, DCG, Hamming, Hinge, Jaccard, MeanAbsoluteError, MeanSquaredError, MeanSquaredLogError, MedianAbsoluteError, R2Score, MeanPoissonDeviance, MeanGammaDeviance, MeanTweedieDeviance, ExplainedVariance (#2562)
Added support for limit_{mode}_batches (int) to work with infinite dataloader (IterableDataset) (#2840)
Added support returning python scalars in DP (#1935)
Added support to Tensorboard logger for OmegaConf hparams (#2846)
Added tracking of basic states in Trainer (#2541)
Tracks all outputs including TBPTT and multiple optimizers (#2890)
Added GPU Usage Logger (#2932)
Added strict=False for load_from_checkpoint (#2819)
Added saving test predictions on multiple GPUs (#2926)
Auto log the computational graph for loggers that support this (#3003)
Added warning when changing monitor and using results obj (#3014)
Added a hook transfer_batch_to_device to the LightningDataModule (#3038)
[0.9.0] - Changed¶
Truncated long version numbers in progress bar (#2594)
Enabling val/test loop disabling (#2692)
Refactored into accelerator module:
Using .comet.config file for CometLogger (#1913)
Updated hooks arguments - breaking for setup and teardown (#2850)
Using gfile to support remote directories (#2164)
Moved optimizer creation after device placement for DDP backends (#2904)
Support **DictConfig for hparam serialization (#2519)
Removed callback metrics from test results obj (#2994)
Re-enabled naming metrics in ckpt name (#3060)
Changed progress bar epoch counting to start from 0 (#3061)
[0.9.0] - Deprecated¶
Deprecated Trainer attribute ckpt_path, which will now be set by weights_save_path (#2681)
[0.9.0] - Removed¶
Removed deprecated: (#2760)
core decorator data_loader
Module hook on_sanity_check_start and loading load_from_metrics
package pytorch_lightning.logging
Trainer arguments: show_progress_bar, num_tpu_cores, use_amp, print_nan_grads
LR Finder argument num_accumulation_steps
[0.9.0] - Fixed¶
Fixed accumulate_grad_batches for last batch (#2853)
Fixed setup call while testing (#2624)
Fixed local rank zero casting (#2640)
Fixed single scalar return from training (#2587)
Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
Fixed dtype and device properties not getting updated in submodules (#2657)
Fixed fast_dev_run to run for all dataloaders (#2581)
Fixed save_dir in loggers getting ignored by default value of weights_save_path when user did not specify weights_save_path (#2681)
Fixed weights_save_path getting ignored when logger=False is passed to Trainer (#2681)
Fixed TPU multi-core and Float16 (#2632)
Fixed test metrics not being logged with LoggerCollection (#2723)
Fixed data transfer to device when using torchtext.data.Field and include_lengths is True (#2689)
Fixed shuffle argument for distributed sampler (#2789)
Fixed logging interval (#2694)
Fixed loss value in the progress bar is wrong when accumulate_grad_batches > 1 (#2738)
Fixed correct CWD for ddp sub-processes when using Hydra (#2719)
Fixed selecting GPUs using CUDA_VISIBLE_DEVICES (#2739)
Fixed false num_classes warning in metrics (#2781)
Fixed shell injection vulnerability in subprocess call (#2786)
Fixed LR finder and hparams compatibility (#2821)
Fixed ModelCheckpoint not saving the latest information when save_last=True (#2881)
Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
Fixed apex gradient clipping (#2829)
Fixed save apex scaler states (#2828)
Fixed a model loading issue with inheritance and variable positional arguments (#2911)
Fixed passing non_blocking=True when transferring a batch object that does not support it (#2910)
Fixed checkpointing to remote file paths (#2925)
Fixed adding val step argument to metrics (#2986)
Fixed an issue that caused Trainer.test() to stall in ddp mode (#2997)
Fixed gathering of results with tensors of varying shape (#3020)
Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
Fixed automatic batch scaling not working with half precision (#3045)
Fixed setting device to root gpu (#3042)
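The changelog does not say how the shell injection issue (#2786) was fixed. As general background, the usual way to avoid this class of bug in Python is to pass the command as an argument list instead of a shell string. The sketch below shows that pattern under that assumption; the untrusted value and the child command are made up for the example.

```python
import subprocess
import sys

# A value that would be dangerous if interpolated into a shell command line.
untrusted = "hello; echo injected"

# Passing a list (and leaving shell=False, the default) hands each element to
# the child process as one literal argument; no shell ever parses the string.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    capture_output=True,
    text=True,
    check=True,
)
output = result.stdout.strip()
```

Here the child simply prints its first argument back, so the semicolon and the text after it arrive as data rather than being run as a second command.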
[0.8.5] - 2020-07-09¶
[0.8.5] - Added¶
[0.8.5] - Removed¶
Removed auto val reduce (#2462)
[0.8.5] - Fixed¶
Flattening Wandb Hyperparameters (#2459)
Fixed using the same DDP python interpreter and actually running (#2482)
Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
Made TensorBoardLogger and CometLogger pickleable (#2518)
Fixed a problem with MLflowLogger creating multiple run folders (#2502)
Fixed global_step increment (#2455)
Fixed TPU hanging example (#2488)
Fixed argparse default value bug (#2526)
Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
Fixed Trainer .fit() returning last not best weights in “ddp_spawn” (#2565)
Fixed passing (do not pass) TPU weights back on test (#2566)
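The Dice/IoU entry above (#2545) refers to a common numerical guard. The sketch below shows the idea in plain Python for binary masks. Where eps is added and its default value are assumptions made for illustration, not the library's actual implementation.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient for two equal-length binary sequences (sketch)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # With empty pred and target the unguarded ratio is 0 / 0, which is NaN
    # for tensors (and a ZeroDivisionError for plain Python numbers).
    return (2 * intersection + eps) / (denominator + eps)


empty = dice_score([0, 0, 0], [0, 0, 0])    # both masks empty
perfect = dice_score([1, 0, 1], [1, 0, 1])  # identical non-empty masks
```

Adding eps to both numerator and denominator makes two empty masks score as a perfect match. Adding it only to the denominator would score them as zero instead, so the choice is a design decision rather than a pure bug fix.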
[0.8.4] - 2020-07-01¶
[0.8.4] - Added¶
[0.8.4] - Changed¶
Enabled no returns from eval (#2446)
[0.8.4] - Fixed¶
[0.8.3] - 2020-06-29¶
[0.8.3] - Fixed¶
[0.8.2] - 2020-06-28¶
[0.8.2] - Added¶
Added TorchText support for moving data to GPU (#2379)
[0.8.2] - Changed¶
[0.8.2] - Removed¶
Moved TrainsLogger to Bolts (#2384)
[0.8.2] - Fixed¶
Fixed parsing TPU arguments and TPU tests (#2094)
Fixed number batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
Fixed an issue with forward hooks not being removed after model summary (#2298)
Fix for load_from_checkpoint() not working with absolute path on Windows (#2294)
Fixed an issue how _has_len handles NotImplementedError e.g. raised by torchtext.data.Iterator (#2293), (#2307)
Fixed average_precision metric (#2319)
Fixed ROC metric for CUDA tensors (#2304)
Fixed lost compatibility with custom datatypes implementing .to (#2335)
Fixed loading model with kwargs (#2387)
Fixed sum(0) for trainer.num_val_batches (#2268)
Fixed checking if the parameters are a DictConfig Object (#2216)
Fixed SLURM weights saving (#2341)
Fixed swaps LR scheduler order (#2356)
Fixed adding tensorboard hparams logging test (#2342)
Fixed use model ref for tear down (#2360)
Fixed logger crash on DDP (#2388)
Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
Fixed loading past checkpoints from v0.7.x (#2405)
Fixed loading model without arguments (#2403)
Fixed Windows compatibility issue (#2358)
[0.8.1] - 2020-06-19¶
[0.8.1] - Fixed¶
[0.8.0] - 2020-06-18¶
[0.8.0] - Added¶
Added overfit_batches, limit_{val|test}_batches flags (overfit now uses training set for all three) (#2213)
Added metrics
Allow dataloaders without sampler field present (#1907)
Added option save_last to save the model at the end of every epoch in ModelCheckpoint (#1908)
Early stopping checks on_validation_end (#1458)
Speed up single-core TPU training by loading data using ParallelLoader (#2033)
Added a model hook transfer_batch_to_device that enables moving custom data structures to the target device (#1756)
Added black formatter for the code with code-checker on pull (#1610)
Added back the slow spawn ddp implementation as ddp_spawn (#2115)
Added loading checkpoints from URLs (#1667)
Added a callback method on_keyboard_interrupt for handling KeyboardInterrupt events during training (#2134)
Added a decorator auto_move_data that moves data to the correct device when using the LightningModule for inference (#1905)
Added ckpt_path option to LightningModule.test(...) to load particular checkpoint (#2190)
Added setup and teardown hooks for model (#2229)
[0.8.0] - Changed¶
Allow user to select individual TPU core to train on (#1729)
Removed non-finite values from loss in LRFinder (#1862)
Allow passing model hyperparameters as complete kwarg list (#1896)
Renamed ModelCheckpoint’s attributes best to best_model_score and kth_best_model to kth_best_model_path (#1799)
Re-Enable Logger’s ImportErrors (#1938)
Changed the default value of the Trainer argument weights_summary from full to top (#2029)
Raise an error when lightning replaces an existing sampler (#2020)
Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
Remove explicit flush from tensorboard logger (#2126)
Changed epoch indexing from 1 instead of 0 (#2206)
[0.8.0] - Deprecated¶
Deprecated flags: (#2213)
overfit_pct in favour of overfit_batches
val_percent_check in favour of limit_val_batches
test_percent_check in favour of limit_test_batches
Deprecated ModelCheckpoint’s attributes best and kth_best_model (#1799)
Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Deprecated Trainer proc_rank in favour of global_rank (#2166, #2269)
[0.8.0] - Removed¶
Removed unintended Trainer argument progress_bar_callback, the callback should be passed in by Trainer(callbacks=[...]) instead (#1855)
Removed obsolete self._device in Trainer (#1849)
Removed deprecated API (#2073)
Packages: pytorch_lightning.pt_overrides, pytorch_lightning.root_module
Modules: pytorch_lightning.logging.comet_logger, pytorch_lightning.logging.mlflow_logger, pytorch_lightning.logging.test_tube_logger, pytorch_lightning.overrides.override_data_parallel, pytorch_lightning.core.model_saving, pytorch_lightning.core.root_module
Trainer arguments: add_row_log_interval, default_save_path, gradient_clip, nb_gpu_nodes, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps
Trainer attributes: nb_gpu_nodes, num_gpu_nodes, gradient_clip, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps, default_save_path, tng_tqdm_dic
[0.8.0] - Fixed¶
Run graceful training teardown on interpreter exit (#1631)
Fixed user warning when apex was used together with learning rate schedulers (#1873)
Fixed multiple calls of EarlyStopping callback (#1863)
Fixed an issue with Trainer.from_argparse_args when passing in unknown Trainer args (#1932)
Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
Fixed root node resolution for SLURM cluster with dash in host name (#1954)
Fixed LearningRateLogger in multi-scheduler setting (#1944)
Fixed test configuration check and testing (#1804)
Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
Fixed save_weights_only in ModelCheckpoint (#1780)
Allow use of same WandbLogger instance for multiple training loops (#2055)
Fixed an issue with _auto_collect_arguments collecting local variables that are not constructor arguments and not working for signatures that have the instance not named self (#2048)
Fixed mistake in parameters’ grad norm tracking (#2012)
Fixed CPU and hanging GPU crash (#2118)
Fixed an issue with the model summary and example_input_array depending on a specific ordering of the submodules in a LightningModule (#1773)
Fixed TPU logging (#2230)
[0.7.6] - 2020-05-16¶
[0.7.6] - Added¶
Added callback for logging learning rates (#1498)
Added transfer learning example (for a binary classification task in computer vision) (#1564)
Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723).
Added auto scaling of batch size (#1638)
The progress bar metrics now also get updated in training_epoch_end (#1724)
Enable NeptuneLogger to work with distributed_backend=ddp (#1753)
Added option to provide seed to random generators to ensure reproducibility (#1572)
Added override for hparams in load_from_ckpt (#1797)
Added support multi-node distributed execution under torchelastic (#1811, #1818)
Added dummy logger for internally disabling logging for some features (#1836)
[0.7.6] - Changed¶
Enable non-blocking for device transfers to GPU (#1843)
Replace meta_tags.csv with hparams.yaml (#1271)
Reduction when batch_size < num_gpus (#1609)
Updated LightningTemplateModel to look more like Colab example (#1577)
Don’t convert namedtuple to tuple when transferring the batch to target device (#1589)
Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
Args should come after the last positional argument (#1807)
Made ddp the default if no backend specified with multiple GPUs (#1789)
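The namedtuple entry above (#1589) concerns recursive batch traversal: a generic "rebuild the tuple" step silently turns a namedtuple into a plain tuple and loses the field names. The sketch below shows one way to preserve the type. map_leaves is a hypothetical helper written for this illustration, and multiplying by 10 stands in for a device transfer.

```python
from collections import namedtuple


def map_leaves(data, fn):
    """Hypothetical helper: apply fn to every leaf of a nested batch."""
    if isinstance(data, tuple) and hasattr(data, "_fields"):
        # namedtuple: rebuild with the same class so field names survive.
        return type(data)(*(map_leaves(item, fn) for item in data))
    if isinstance(data, (list, tuple)):
        return type(data)(map_leaves(item, fn) for item in data)
    if isinstance(data, dict):
        return {key: map_leaves(value, fn) for key, value in data.items()}
    return fn(data)


Batch = namedtuple("Batch", ["inputs", "labels"])
moved = map_leaves(Batch([1, 2], 3), lambda x: x * 10)
```

The namedtuple branch has to come before the generic tuple branch, because a namedtuple class takes its fields as separate positional arguments and cannot be constructed from a single iterable the way tuple can.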
[0.7.6] - Deprecated¶
Deprecated tags_csv in favor of hparams_file (#1271)
[0.7.6] - Fixed¶
Fixed broken link in PR template (#1675)
Fixed ModelCheckpoint not None checking filepath (#1654)
Trainer now calls on_load_checkpoint() when resuming from a checkpoint (#1666)
Fixed sampler logic for ddp with iterable dataset (#1734)
Fixed _reset_eval_dataloader() for IterableDataset (#1560)
Fixed Horovod distributed backend to set the root_gpu property (#1669)
Fixed wandb logger global_step affects other loggers (#1492)
Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
Fixed a bug in Trainer that prepended the checkpoint path with version_ when it shouldn’t (#1748)
Fixed lr key name in case of param groups in LearningRateLogger (#1719)
Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
Fixed num processes wasn’t being set properly and auto sampler was ddp failing (#1819)
Fixed bugs in semantic segmentation example (#1824)
Fixed saving native AMP scaler state (#1777)
Fixed native amp + ddp (#1788)
Fixed hparam logging with metrics (#1647)
[0.7.5] - 2020-04-27¶
[0.7.5] - Changed¶
Allow logging of metrics together with hparams (#1630)
[0.7.5] - Removed¶
Removed Warning from trainer loop (#1634)
[0.7.5] - Fixed¶
[0.7.4] - 2020-04-26¶
[0.7.4] - Added¶
Added flag replace_sampler_ddp to manually disable sampler replacement in DDP (#1513)
Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
Added learning rate finder (#1347)
Added support for DDP mode in clusters without SLURM (#1387)
Added test_dataloaders parameter to Trainer.test() (#1434)
Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True (#1475)
Added speed parity tests (max 1 sec difference per epoch) (#1482)
Added ddp_cpu backend for testing ddp without GPUs (#1158)
Added Horovod support as a distributed backend Trainer(distributed_backend='horovod') (#1529)
Added support for 8 core distributed training on Kaggle TPU’s (#1568)
[0.7.4] - Changed¶
Changed the default behaviour to no longer include a NaN check with each training iteration (#1475)
Decoupled the progress bar from trainer; it is a callback now and can be customized or even be replaced entirely (#1450).
Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
Updated semantic segmentation example with custom U-Net and logging (#1371)
Disabled val and test shuffling (#1600)
[0.7.4] - Deprecated¶
Deprecated training_tqdm_dict in favor of progress_bar_dict (#1450).
[0.7.4] - Removed¶
Removed test_dataloaders parameter from Trainer.fit() (#1434)
[0.7.4] - Fixed¶
Added the possibility to pass nested metrics dictionaries to loggers (#1582)
Fixed memory leak from opt return (#1528)
Fixed saving checkpoint before deleting old ones (#1453)
Fixed loggers - flushing last logged metrics even before continue, e.g. trainer.test() results (#1459)
Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
Fixed LightningModule - mixing hparams and arguments in LightningModule.__init__() crashes load_from_checkpoint() (#1505)
Added a missing call to the on_before_zero_grad model hook (#1493).
Allow use of sweeps with WandbLogger (#1512)
Fixed a bug that caused the callbacks Trainer argument to reference a global variable (#1534).
Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
Fixed do not copy the batch when training on a single GPU (#1576, #1579)
Fixed soft checkpoint removing on DDP (#1408)
Fixed automatic parser bug (#1585)
Fixed bool conversion from string (#1606)
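Two entries above (#1571, #1606) deal with boolean command-line arguments. The changelog does not describe the root cause. One well-known way such bugs arise with argparse is using type=bool, which treats every non-empty string as true. The sketch below demonstrates that pitfall and one possible converter. str_to_bool and the accepted spellings are assumptions for illustration, not Lightning's implementation.

```python
import argparse


def str_to_bool(text):
    """Hypothetical converter; the accepted spellings are an assumption."""
    lowered = text.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {text!r}")


parser = argparse.ArgumentParser()
# bool("False") is True because the string is non-empty.
parser.add_argument("--naive", type=bool, default=False)
parser.add_argument("--parsed", type=str_to_bool, default=False)

args = parser.parse_args(["--naive", "False", "--parsed", "False"])
```

After parsing, args.naive is True even though the user typed False, while args.parsed is False as intended.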
[0.7.3] - 2020-04-09¶
[0.7.3] - Added¶
Added rank_zero_warn for warning only in rank 0 (#1428)
[0.7.3] - Fixed¶
[0.7.2] - 2020-04-07¶
[0.7.2] - Added¶
Added same step loggers’ metrics aggregation (#1278)
Added parity test between a vanilla MNIST model and lightning model (#1284)
Added parity test between a vanilla RNN model and lightning model (#1351)
Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
Added support for hierarchical dict (#1152)
Added TrainsLogger class (#1122)
Added type hints to pytorch_lightning.core (#946)
Added support for IterableDataset in validation and testing (#1104)
Added support for non-primitive types in hparams for TensorboardLogger (#1130)
Added a check that stops the training when loss or weights contain NaN or inf values. (#1097)
Added support for IterableDataset when val_check_interval=1.0 (default), this will trigger validation at the end of each epoch. (#1283)
Added summary method to Profilers. (#1259)
Added informative errors if user defined dataloader has zero length (#1280)
Added testing for python 3.8 (#915)
Added model configuration checking (#1199)
Added support for optimizer frequencies through LightningModule.configure_optimizers() (#1269)
Added option to run without an optimizer by returning None from configure_optimizers. (#1279)
Added a warning when the number of data loader workers is small. (#1378)
[0.7.2] - Changed¶
Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
Changed progress_bar_refresh_rate trainer flag to disable progress bar when set to 0. (#1108)
Enhanced load_from_checkpoint to also forward params to the model (#1307)
Updated references to self.forward() to instead use the __call__ interface. (#1211)
Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
Allow to upload models on W&B (#1339)
On DP and DDP2 unsqueeze is automated now (#1319)
Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
Did not interfere with a default sampler (#1318)
Remove default Adam optimizer (#1317)
Give warnings for unimplemented required lightning methods (#1317)
Made evaluate method private >> Trainer._evaluate(...). (#1260)
Simplify the PL examples structure (shallower and more readable) (#1247)
Changed min max gpu memory to be on their own plots (#1358)
Remove .item which causes sync issues (#1254)
Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
Change default logger to dedicated one (#1064)
[0.7.2] - Deprecated¶
[0.7.2] - Removed¶
[0.7.2] - Fixed¶
Fixed model_checkpoint when saving all models (#1359)
Trainer.add_argparse_args classmethod fixed. Now it adds a type for the arguments (#1147)
Fixed bug related to type checking of ReduceLROnPlateau lr schedulers (#1126)
Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
Fixed a bug that created an extra dataloader with active reload_dataloaders_every_epoch (#1196)
Fixed all warnings and errors in the docs build process (#1191)
Fixed an issue where val_percent_check=0 would not disable validation (#1251)
Fixed average of incomplete TensorRunningMean (#1309)
Fixed WandbLogger.watch with wandb.init() (#1311)
Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
Fixed a bug that would cause trainer.test() to run on the validation set when overloading validation_epoch_end and test_end (#1353)
Fixed WandbLogger.watch - use of the watch method without importing wandb (#1311)
Fixed WandbLogger to be used with ‘ddp’ - allow reinits in sub-processes (#1149, #1360)
Made training_epoch_end behave like validation_epoch_end (#1357)
Fixed fast_dev_run running validation twice (#1365)
Fixed pickle error from quick patch __code__ (#1352)
Fixed checkpointing interval (#1272)
Fixed validation and training loops run the partial dataset (#1192)
Fixed running on_validation_end only on main process in DDP (#1125)
Fixed load_spawn_weights only in proc rank 0 (#1385)
Fixes using deprecated use_amp attribute (#1145)
Fixed Tensorboard logger error: lightning_logs directory not exists in multi-node DDP on nodes with rank != 0 (#1377)
Fixed Unimplemented backend XLA error on TPU (#1387)
[0.7.1] - 2020-03-07¶
[0.7.1] - Fixed¶
Fixes print issues and data_loader (#1080)
[0.7.0] - 2020-03-06¶
[0.7.0] - Added¶
Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
Added progress_bar_refresh_rate=50 flag for trainer. Throttle refresh rate on notebooks (#926)
Updated governance docs
Added a check to ensure that the metric used for early stopping exists before training commences (#542)
Added optimizer_idx argument to backward hook (#733)
Added entity argument to WandbLogger to be passed to wandb.init (#783)
Added a tool for profiling training runs (#782)
Improved flexibility for naming of TensorBoard logs, can now set version to a str to just save to that directory, and use name='' to prevent experiment-name directory (#804)
Added option to specify step key when logging metrics (#808)
Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759)
Added Tensor Processing Unit (TPU) support (#868)
Split callbacks in multiple files (#849)
Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
Added support for step-based learning rate scheduling (#941)
Added support for logging hparams as dict (#1029)
Checkpoint and early stopping now work without val. step (#1041)
Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
Added type hints for function arguments (#912, )
Added TPU gradient clipping (#963)
Added max/min number of steps in Trainer (#728)
[0.7.0] - Changed¶
Improved NeptuneLogger by adding close_after_fit argument to allow logging after training (#908)
Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
Changed Checkpoint path parameter from filepath to dirpath (#1016)
Froze models hparams as Namespace property (#1029)
Dropped logging config in package init (#1015)
Renames model steps (#1051)
training_end >> training_epoch_end
validation_end >> validation_epoch_end
test_end >> test_epoch_end
Refactor dataloading, supports infinite dataloader (#955)
Create single file in TensorBoardLogger (#777)
[0.7.0] - Deprecated¶
[0.7.0] - Removed¶
[0.7.0] - Fixed¶
Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
Fixed a bug where the model checkpointer didn’t write to the same directory as the logger (#771)
Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
Fixed port collision on DDP (#1010)
Fixed/tested pass overrides (#918)
Fixed comet logger to log after train (#892)
Remove deprecated args to learning rate step function (#890)
[0.6.0] - 2020-01-21¶
[0.6.0] - Added¶
Added support for resuming from a specific checkpoint via resume_from_checkpoint argument (#516)
Added support for ReduceLROnPlateau scheduler (#320)
Added support for Apex mode O2 in conjunction with Data Parallel (#493)
Added option (save_top_k) to save the top k models in the ModelCheckpoint class (#128)
Added on_train_start and on_train_end hooks to ModelHooks (#598)
Added TensorBoardLogger (#607)
Added support for weight summary of model with multiple inputs (#543)
Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
Added option to disable validation by setting val_percent_check=0 (#649)
Added NeptuneLogger class (#648)
Added WandbLogger class (#627)
[0.6.0] - Changed¶
Changed the default progress bar to print to stdout instead of stderr (#531)
Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
Renamed total_batch_nb to total_batches, nb_val_batches to num_val_batches, nb_training_batches to num_training_batches, max_nb_epochs to max_epochs, min_nb_epochs to min_epochs, nb_test_batches to num_test_batches, and nb_val_batches to num_val_batches (#567)
Changed gradient logging to use parameter names instead of indexes (#660)
Changed the default logger to TensorBoardLogger (#609)
Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
[0.6.0] - Deprecated¶
[0.6.0] - Removed¶
Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)
[0.6.0] - Fixed¶
Fixed a bug which occurred when using Adagrad with cuda (#554)
Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
Fixed support for PyTorch 1.1.0 (#552)
Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
Fixed a bug where batch ‘segments’ would remain on the GPU when using truncated_bptt > 1 (#532)
Fixed a bug when using IterableDataset (#547)
Fixed a bug where .item was called on non-tensor objects (#602)
Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
Fixed a bug where early stopping would begin two epochs early (#617)
Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
Fixed a bug when batches did not have a .copy method (#701)
Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
Fixed a bug where on_train_end was not called when early stopping (#723)
[0.5.3] - 2019-11-06¶
[0.5.3] - Added¶
Added option to disable default logger, checkpointer, and early stopping by passing logger=False, checkpoint_callback=False and early_stop_callback=False respectively
Added CometLogger for use with Comet.ml
Added val_check_interval argument to Trainer allowing validation to be performed at every given number of batches
Added functionality to save and load hyperparameters using the standard checkpoint mechanism
Added call to torch.cuda.empty_cache before training starts
Added option for user to override the call to backward
Added support for truncated backprop through time via the truncated_bptt_steps argument in Trainer
Added option to operate on all outputs from training_step in DDP2
Added a hook for modifying DDP init
Added a hook for modifying Apex
[0.5.3] - Changed¶
Changed experiment version to be padded with zeros (e.g. /dir/version_9 becomes /dir/version_0009)
Changed callback metrics to include any metrics given in logs or progress bar
Changed the default for save_best_only in ModelCheckpoint to True
Added tng_data_loader for backwards compatibility
Renamed MLFlowLogger.client to MLFlowLogger.experiment for consistency
Moved global_step increment to happen after the batch has been processed
Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
Changed progress bar functionality to add multiple progress bars for train/val/test
Changed calls to print to use logging instead
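The zero-padded experiment version from this section can be sketched with plain string formatting. The function name version_dir and the width of 4 are assumptions taken from the /dir/version_0009 example above, not from the library's code. One practical effect of padding is that a lexicographic sort of directory names then matches the numeric order.

```python
def version_dir(root, version, width=4):
    """Hypothetical helper: build a zero-padded version directory path."""
    # width=4 is inferred from the example /dir/version_0009 in this changelog.
    return f"{root}/version_{version:0{width}d}"


padded = [version_dir("/dir", v) for v in (10, 9, 2)]
unpadded = [f"/dir/version_{v}" for v in (10, 9, 2)]

# Padded names sort in numeric order; unpadded names do not.
print(sorted(padded))
print(sorted(unpadded))
```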
[0.5.3] - Deprecated¶
Deprecated tng_dataloader
[0.5.3] - Fixed¶
Fixed an issue where the number of batches was off by one during training
Fixed a bug that occurred when setting a checkpoint callback and early_stop_callback=False
Fixed an error when importing CometLogger
Fixed a bug where the gpus argument had some unexpected behaviour
Fixed a bug where the computed total number of batches was sometimes incorrect
Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
Fixed a bug when using the log_gpu_memory='min_max' option in Trainer
Fixed a bug where checkpointing would sometimes erase the current directory
[0.5.2] - 2019-10-10¶
[0.5.2] - Added¶
Added weights_summary argument to Trainer to be set to full (full summary), top (just top level modules) or other
Added tags argument to MLFlowLogger
[0.5.2] - Changed¶
Changed default for amp_level to O1
[0.5.2] - Removed¶
Removed the print_weights_summary argument from Trainer
[0.5.2] - Fixed¶
Fixed a bug where logs were not written properly
Fixed a bug where logger.finalize wasn’t called after training is complete
Fixed callback metric errors in DDP
Fixed a bug where TestTubeLogger didn’t log to the correct directory
[0.5.1] - 2019-10-05¶
[0.5.1] - Added¶
Added the LightningLoggerBase class for experiment loggers
Added MLFlowLogger for logging with mlflow
Added TestTubeLogger for logging with test_tube
Added a different implementation of DDP (distributed_backend='ddp2') where every node has one model using all GPUs
Added support for optimisers which require a closure (e.g. LBFGS)
Added automatic MASTER_PORT default for DDP when not set manually
Added new GPU memory logging options 'min_max' (log only the min/max utilization) and 'all' (log all the GPU memory)
[0.5.1] - Changed¶
Changed schedulers to always be called with the current epoch
Changed test_tube to an optional dependency
Changed data loaders to internally use a getter instead of a python property
Disabled auto GPU loading when restoring weights to prevent out of memory errors
Changed logging, early stopping and checkpointing to occur by default
[0.5.1] - Fixed¶
Fixed a bug with samplers that do not specify set_epoch
Fixed a bug when using the MLFlowLogger with unsupported data types, this will now raise a warning
Fixed a bug where gradient norms were always zero using track_grad_norm
Fixed a bug which causes a crash when logging memory
[0.5.0] - 2019-09-26¶
[0.5.0] - Changed¶
Changed data_batch argument to batch throughout
Changed batch_i argument to batch_idx throughout
Changed tng_dataloader method to train_dataloader
Changed on_tng_metrics method to on_training_metrics
Changed gradient_clip argument to gradient_clip_val
Changed add_log_row_interval to row_log_interval
[0.5.0] - Fixed¶
Fixed a bug with tensorboard logging in multi-gpu setup
[0.4.9] - 2019-09-16¶
[0.4.9] - Added¶
Added the flag log_gpu_memory to Trainer to deactivate logging of GPU memory utilization
Added SLURM resubmit functionality (port from test-tube)
Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
Added option to use single gpu per node with DistributedDataParallel
[0.4.9] - Changed¶
Changed functionality of validation_end and test_end with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
Changed gpu API to take integers as well (e.g. gpus=2 instead of gpus=[0, 1])
All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
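The integer form of the gpus argument mentioned in this section can be read as shorthand for the first N device indices. The sketch below shows one way such a value could be normalized. The function normalize_gpus is a hypothetical name for illustration; the Trainer's actual parsing logic is not shown in this changelog and may differ.

```python
def normalize_gpus(gpus):
    """Hypothetical helper: turn an int or a sequence of ints into a list of device indices."""
    if gpus is None:
        return []
    if isinstance(gpus, bool):
        # bool is a subclass of int in Python; reject it to avoid surprises.
        raise TypeError("gpus must be an int or a sequence of ints")
    if isinstance(gpus, int):
        if gpus < 0:
            raise ValueError("gpus must be non-negative")
        # Assumption: gpus=2 means devices 0 and 1, per the example above.
        return list(range(gpus))
    return list(gpus)


print(normalize_gpus(2))
print(normalize_gpus([0, 1]))
```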
[0.4.9] - Fixed¶
Fixed a bug where data types that implement .to but not .cuda would not be properly moved onto the GPU
Fixed a bug where data would not be re-shuffled every epoch when using a DistributedSampler
[0.4.8] - 2019-08-31¶
[0.4.8] - Added¶
Added test_step and test_end methods, used when Trainer.test is called
Added GradientAccumulationScheduler callback which can be used to schedule changes to the number of accumulation batches
Added option to skip the validation sanity check by setting nb_sanity_val_steps = 0
[0.4.8] - Fixed¶
Fixed a bug when setting nb_sanity_val_steps = 0
[0.4.7] - 2019-08-24¶
[0.4.7] - Changed¶
Changed the default val_check_interval to 1.0
Changed defaults for nb_val_batches, nb_tng_batches and nb_test_batches to 0
[0.4.7] - Fixed¶
Fixed a bug where the full validation set was used despite setting val_percent_check
Fixed a bug where an Exception was thrown when using a data set containing a single batch
Fixed a bug where an Exception was thrown if no val_dataloader was given
Fixed a bug where tuples were not properly transferred to the GPU
Fixed a bug where data of a non standard type was not properly handled by the trainer
Fixed a bug when loading data as a tuple
Fixed a bug where AttributeError could be suppressed by the Trainer
[0.4.6] - 2019-08-15¶
[0.4.6] - Added¶
Added support for data to be given as a dict or list with a single gpu
Added support for configure_optimizers to return a single optimizer, two lists (optimizers and schedulers), or a single list
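The three return shapes described for configure_optimizers can be illustrated by a small normalizer that always yields a pair of lists. This is only a sketch under assumptions: normalize_optimizers is a hypothetical function, the strings stand in for real optimizer and scheduler objects, and the library's own handling is not reproduced here.

```python
def normalize_optimizers(result):
    """Hypothetical helper: map any supported return shape to (optimizers, schedulers)."""
    if isinstance(result, (list, tuple)):
        # Two lists: the first holds optimizers, the second holds schedulers.
        if len(result) == 2 and all(isinstance(item, (list, tuple)) for item in result):
            optimizers, schedulers = result
            return list(optimizers), list(schedulers)
        # A single list: optimizers only, no schedulers.
        return list(result), []
    # A single optimizer object.
    return [result], []


# Strings are placeholders for optimizer and scheduler instances.
print(normalize_optimizers("adam"))
print(normalize_optimizers(["adam", "sgd"]))
print(normalize_optimizers((["adam"], ["step_lr"])))
```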
[0.4.6] - Fixed¶
Fixed a bug where returning just an optimizer list (i.e. without schedulers) from configure_optimizers would throw an Exception
[0.4.5] - 2019-08-13¶
[0.4.5] - Added¶
Added optimizer_step method that can be overridden to change the standard optimizer behaviour
[0.4.4] - 2019-08-12¶
[0.4.4] - Added¶
Added support for multiple validation dataloaders
Added support for latest test-tube logger (optimised for torch==1.2.0)
[0.4.4] - Changed¶
validation_step and val_dataloader are now optional
lr_scheduler is now activated after epoch
[0.4.4] - Fixed¶
Fixed a bug where a warning would show when using lr_scheduler in torch>1.1.0
Fixed a bug where an Exception would be thrown if using torch.DistributedDataParallel without using a DistributedSampler, this now throws a Warning instead
[0.4.3] - 2019-08-10¶
[0.4.3] - Fixed¶
Fixed a bug where accumulate gradients would scale the loss incorrectly
[0.4.2] - 2019-08-08¶
[0.4.2] - Changed¶
Changed install requirement to torch==1.2.0
[0.4.1] - 2019-08-08¶
[0.4.1] - Changed¶
Changed install requirement to torch==1.1.0
[0.4.0] - 2019-08-08¶
[0.4.0] - Added¶
Added 16-bit support for a single GPU
Added support for training continuation (preserves epoch, global step etc.)
[0.4.0] - Changed¶
Changed training_step and validation_step, outputs will no longer be automatically reduced
[0.4.0] - Removed¶
Removed need for Experiment object in Trainer
[0.4.0] - Fixed¶
Fixed issues with reducing outputs from generative models (such as images and text)
[0.3.6] - 2019-07-25¶
[0.3.6] - Added¶
Added a decorator to do lazy data loading internally
[0.3.6] - Fixed¶
Fixed a bug where Experiment object was not process safe, potentially causing logs to be overwritten