Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[1.5.7] - 2021-12-21
[1.5.7] - Fixed
Fixed NeptuneLogger when using DDP (#11030)
Fixed a bug to disable logging hyperparameters in logger if there are no hparams (#11105)
Avoid the deprecated onnx.export(example_outputs=...) in torch 1.10 (#11116)
Fixed an issue when torch-scripting a LightningModule after training with Trainer(sync_batchnorm=True) (#11078)
Fixed an AttributeError occurring when using a CombinedLoader (multiple dataloaders) for prediction (#11111)
Fixed bug where Trainer(track_grad_norm=..., logger=False) would fail (#11114)
Fixed an incorrect warning being produced by the model summary when using bf16 precision on CPU (#11161)
[1.5.7] - Changed
[1.5.6] - 2021-12-15
[1.5.6] - Fixed
Fixed a bug where the DeepSpeedPlugin arguments cpu_checkpointing and contiguous_memory_optimization were not being forwarded to deepspeed correctly (#10874)
Fixed an issue with NeptuneLogger causing checkpoints to be uploaded with a duplicated file extension (#11015)
Fixed support for logging within callbacks returned from LightningModule (#10991)
Fixed running sanity check with RichProgressBar (#10913)
Fixed support for CombinedLoader while checking for warning raised with eval dataloaders (#10994)
The TQDM progress bar now correctly shows the on_epoch logged values on train epoch end (#11069)
Fixed bug where the TQDM updated the training progress bar during trainer.validate (#11069)
[1.5.5] - 2021-12-07
[1.5.5] - Fixed
Disabled batch_size extraction for torchmetrics instances because they accumulate the metrics internally (#10815)
Fixed an issue with SignalConnector not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611)
Fixed SignalConnector._has_already_handler check for callable type (#10483)
Fixed an issue to return the results for each dataloader separately instead of duplicating them for each (#10810)
Improved exception message if rich version is less than 10.2.2 (#10839)
Fixed uploading best model checkpoint in NeptuneLogger (#10369)
Fixed early schedule reset logic in PyTorch profiler that was causing data leak (#10837)
Fixed a bug that caused incorrect batch indices to be passed to the BasePredictionWriter hooks when using a dataloader with num_workers > 0 (#10870)
Fixed an issue with item assignment on the logger on rank > 0 for those who support it (#10917)
Fixed importing torch_xla.debug for torch-xla<1.8 (#10836)
Fixed an issue with DDPSpawnPlugin and related plugins leaving a temporary checkpoint behind (#10934)
Fixed a TypeError occurring in the SignalConnector.teardown() method (#10961)
[1.5.4] - 2021-11-30
[1.5.4] - Fixed
Fixed support for --key.help=class with the LightningCLI (#10767)
Fixed _compare_version for python packages (#10762)
Fixed TensorBoardLogger SummaryWriter not being closed before spawning the processes (#10777)
Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
Fixed the default logging level for batch hooks associated with training from on_step=False, on_epoch=True to on_step=True, on_epoch=False (#10756)
[1.5.4] - Removed
[1.5.3] - 2021-11-24
[1.5.3] - Fixed
Fixed ShardedTensor state dict hook registration to check if torch distributed is available (#10621)
Fixed an issue with self.log not respecting a tensor’s dtype when applying computations (#10076)
Fixed LightningLite _wrap_init popping nonexistent keys from DataLoader signature parameters (#10613)
Fixed signals being registered within threads (#10610)
Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in LightningModule.log (#10408)
Fixed Trainer(move_metrics_to_cpu=True) not moving the evaluation logged results to CPU (#10631)
Fixed the {validation,test}_step outputs getting moved to CPU with Trainer(move_metrics_to_cpu=True) (#10631)
Fixed an issue with collecting logged test results with multiple dataloaders (#10522)
[1.5.2] - 2021-11-16
[1.5.2] - Fixed
Fixed CombinedLoader and max_size_cycle not receiving a DistributedSampler (#10374)
Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in utilities.apply_to_collection (#9702)
Fixed isinstance not working with init_meta_context, materialized model not being moved to the device (#10493)
Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to failure (#10463)
Squeeze the early stopping monitor to remove empty tensor dimensions (#10461)
Fixed sampler replacement logic with overfit_batches to only replace the sampler when SequentialSampler is not used (#10486)
Fixed scripting causing false positive deprecation warnings (#10470, #10555)
Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from DeviceDtypeModuleMixin (#10559)
[1.5.1] - 2021-11-09
[1.5.1] - Fixed
Fixed apply_to_collection(defaultdict) (#10316)
Fixed failure when DataLoader(batch_size=None) is passed (#10345)
Fixed interception of __init__ arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
Fixed issue with pickling CSVLogger after a call to CSVLogger.save (#10388)
Fixed an import error being caused by PostLocalSGD when torch.distributed is not available (#10359)
Fixed the logging with on_step=True in epoch-level hooks causing unintended side-effects. Logging with on_step=True in epoch-level hooks will now correctly raise an error (#10409)
Fixed deadlocks for distributed training with RichProgressBar (#10428)
Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
Fixed dataloader workers with persistent_workers being deleted on every iteration (#10434)
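The apply_to_collection(defaultdict) entry (#10316) does not say what failed. One plausible difficulty, assumed here, is that a defaultdict cannot be rebuilt like a plain dict because its constructor expects default_factory as the first argument. The sketch below is a simplified stand-in and not Lightning's implementation; apply_to_leaves is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Callable


def apply_to_leaves(data: Any, dtype: type, fn: Callable[[Any], Any]) -> Any:
    """Recursively apply ``fn`` to every element of type ``dtype`` (hypothetical helper)."""
    if isinstance(data, dtype):
        return fn(data)
    if isinstance(data, defaultdict):
        # Rebuild with the original factory so missing keys keep their default behaviour.
        converted = {k: apply_to_leaves(v, dtype, fn) for k, v in data.items()}
        return defaultdict(data.default_factory, converted)
    if isinstance(data, dict):
        return {k: apply_to_leaves(v, dtype, fn) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(apply_to_leaves(v, dtype, fn) for v in data)
    # Anything else is returned unchanged.
    return data


counts = defaultdict(list, {"a": [1, 2], "b": [3]})
doubled = apply_to_leaves(counts, int, lambda x: x * 2)
```

The result is still a defaultdict with list as its factory, so looking up a missing key continues to yield an empty list.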
[1.5.0] - 2021-11-02
[1.5.0] - Added
Added support for monitoring the learning rate without schedulers in LearningRateMonitor (#9786)
Added registration of ShardedTensor state dict hooks in LightningModule.__init__ if the PyTorch version supports ShardedTensor (#8944)
Added error handling including calling of on_keyboard_interrupt() and on_exception() for all entrypoints (fit, validate, test, predict) (#8819)
Added a flavor of training_step that takes dataloader_iter as an argument (#8807)
Added a state_key property to the Callback base class (#6886)
Added progress tracking to loops:
    Integrated TrainingEpochLoop.total_batch_idx (#8598)
    Added BatchProgress and integrated TrainingEpochLoop.is_last_batch (#9657)
    Avoid optional Tracker attributes (#9320)
    Reset current progress counters when restarting an epoch loop that had already finished (#9371)
    Call reset_on_restart in the loop’s reset hook instead of when loading a checkpoint (#9561)
    Use completed over processed in reset_on_restart (#9656)
    Renamed reset_on_epoch to reset_on_run (#9658)
Added batch_size and rank_zero_only arguments for log_dict to match log (#8628)
Added a check for unique GPU ids (#8666)
Added ResultCollection state_dict to the Loop state_dict and added support for distributed reload (#8641)
Added DeepSpeed collate checkpoint utility function (#8701)
Added a handles_accumulate_grad_batches property to the training type plugins (#8856)
Added a warning to WandbLogger when reusing a wandb run (#8714)
Added log_graph argument for watch method of WandbLogger (#8662)
LightningCLI additions:
    Added LightningCLI(run=False|True) to choose whether to run a Trainer subcommand (#8751)
    Added support to call any trainer function from the LightningCLI via subcommands (#7508)
    Allow easy trainer re-instantiation (#7508)
    Automatically register all optimizers and learning rate schedulers (#9565)
    Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
    Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
    Support passing lists of callbacks via command line (#8815)
    Support shorthand notation to instantiate models (#9588)
    Support shorthand notation to instantiate datamodules (#10011)
    Added multifile option to LightningCLI to enable/disable config saving to preserve multiple files structure (#9073)
Fault-tolerant training:
    Added FastForwardSampler and CaptureIterableDataset injection to data loading utilities (#8366)
    Added DataFetcher to control fetching flow (#8890)
    Added SharedCycleIteratorState to prevent infinite loop (#8889)
    Added CaptureMapDataset for state management in map-style datasets (#8891)
    Added Fault Tolerant Training to DataFetcher (#8891)
    Replaced old prefetch iterator with new DataFetcher in training loop (#8953)
    Added partial support for global random state fault-tolerance in map-style datasets (#8950)
    Converted state to tuple explicitly when setting Python random state (#9401)
    Added support for restarting an optimizer loop (multiple optimizers) (#9537)
    Added support for restarting within Evaluation Loop (#9563)
    Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
    Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
    Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
Checkpoint saving and loading extensibility:
    Added CheckpointIO plugin to expose checkpoint IO from training type plugin (#8743)
    Refactored CheckpointConnector to offload validation logic to the CheckpointIO plugin (#9045)
    Added remove_checkpoint to CheckpointIO plugin by moving the responsibility out of the ModelCheckpoint callback (#9373)
    Added XLACheckpointIO plugin (#9972)
Loop customization:
    Added Closure and AbstractClosure classes (#8642)
    Refactored TrainingBatchLoop and extracted OptimizerLoop, splitting off automatic optimization into its own loop (#9191)
    Removed TrainingBatchLoop.backward(); manual optimization now calls directly into Accelerator.backward() and automatic optimization handles backward in new OptimizerLoop (#9265)
    Extracted ManualOptimization logic from TrainingBatchLoop into its own separate loop class (#9266)
    Marked OptimizerLoop.backward as protected (#9514)
    Marked FitLoop.should_accumulate as protected (#9515)
    Marked several methods in PredictionLoop as protected: on_predict_start, on_predict_epoch_end, on_predict_end, on_predict_model_eval (#9516)
    Marked several methods in EvaluationLoop as protected: get_max_batches, on_evaluation_model_eval, on_evaluation_model_train, on_evaluation_start, on_evaluation_epoch_start, on_evaluation_epoch_end, on_evaluation_end, reload_evaluation_dataloaders (#9516)
    Marked several methods in EvaluationEpochLoop as protected: on_evaluation_batch_start, evaluation_step, evaluation_step_end (#9516)
    Added yielding_training_step example (#9983)
Added support for saving and loading state of multiple callbacks of the same type (#7187)
Added DeepSpeed Stage 1 support (#8974)
Added Python dataclass support for LightningDataModule (#8272)
Added sanitization of tensors when they get logged as hyperparameters in TensorBoardLogger (#9031)
Added InterBatchParallelDataFetcher (#9020)
Added DataLoaderIterDataFetcher (#9020)
Added DataFetcher within Fit / Evaluation Loop (#9047)
Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
Added Rich integration:
Added input validation logic for precision (#9080)
Added support for CPU AMP autocast (#9084)
Added on_exception callback hook (#9183)
Added a warning to DeepSpeed when inferring batch size (#9221)
Added ModelSummary callback (#9344)
Added log_images, log_text and log_table to WandbLogger (#9545)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Added get_device_stats to the Accelerator interface and added its implementation for GPU and TPU (#9586)
Added a warning when an unknown key is encountered in the optimizer configuration, and when OneCycleLR is used with "interval": "epoch" (#9666)
Added DeviceStatsMonitor callback (#9712)
Added enable_progress_bar to the Trainer constructor (#9664)
Added pl_legacy_patch load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166)
Added support for torch.use_deterministic_algorithms (#9121)
Added automatic parameters tying for TPUs (#9525)
Added support for torch.autograd.set_detect_anomaly through Trainer constructor argument detect_anomaly (#9848)
Added enable_model_summary flag to Trainer (#9699)
Added strategy argument to Trainer (#8597)
Added init_meta_context, materialize_module utilities (#9920)
Added TPUPrecisionPlugin (#10020)
Added torch.bfloat16 support:
Added kfold example for loop customization (#9965)
LightningLite:
    Added PrecisionPlugin.forward_context, making it the default implementation for all {train,val,test,predict}_step_context() methods (#9988)
    Added DDPSpawnPlugin.spawn() for spawning new processes of a given function (#10018, #10022)
    Added TrainingTypePlugin.{_setup_model, _setup_optimizer} methods (#9994, #10064)
    Implemented DataParallelPlugin._setup_model (#10010)
    Implemented DeepSpeedPlugin._setup_model_and_optimizers (#10009, #10064)
    Implemented {DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers (#10028, #10064)
    Added optional model argument to the optimizer_step methods in accelerators and plugins (#10023)
    Updated precision attributes in DeepSpeedPlugin (#10164)
    Added the ability to return a result from rank 0 in DDPSpawnPlugin.spawn (#10162)
    Added pytorch_lightning.lite package (#10175)
    Added LightningLite documentation (#10043)
    Added LightningLite examples (#9987)
    Make the _LiteDataLoader an iterator and add support for custom dataloaders (#10279)
Added use_omegaconf argument to save_hparams_to_yaml plugin (#9170)
Added ckpt_path argument for Trainer.fit() (#10061)
Added auto_device_count method to Accelerators (#10222)
Added support for devices="auto" (#10264)
Added a filename argument in ModelCheckpoint.format_checkpoint_name (#9818)
Added support for empty gpus list to run on CPU (#10246)
Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
[1.5.0] - Changed
Trainer now raises a MisconfigurationException when its methods are called with ckpt_path="best" but a checkpoint callback isn’t configured (#9841)
Setting Trainer(accelerator="ddp_cpu") now does not spawn a subprocess if num_processes is kept 1 along with num_nodes > 1 (#9603)
Module imports are now catching ModuleNotFoundError instead of ImportError (#9867)
pytorch_lightning.loggers.neptune.NeptuneLogger is now consistent with the new neptune-client API; the old neptune-client API is supported by NeptuneClient from the neptune-contrib repo (#6867)
Parsing of enums type hyperparameters to be saved in the hparams.yaml file by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170)
Parsing of the gpus Trainer argument has changed: gpus="n" (str) no longer selects the GPU index n and instead selects the first n devices (#8770)
iteration_count and other index attributes in the loops have been replaced with progress dataclasses (#8477)
The trainer.lightning_module reference is now properly set at the very beginning of a run (#8536)
The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
The Trainer functions reset_{train,val,test,predict}_dataloader, reset_train_val_dataloaders, and request_dataloader model argument is now optional (#8536)
Saved checkpoints will no longer use the type of a Callback as the key to avoid issues with unpickling (#6886)
Improved string conversion for ResultCollection (#8622)
LightningCLI changes:
    LightningCLI.init_parser now returns the parser instance (#8721)
    LightningCLI.add_core_arguments_to_parser, LightningCLI.parse_arguments now take a parser argument (#8721)
    LightningCLI.instantiate_trainer now takes a config and a list of callbacks (#8721)
    Split LightningCLI.add_core_arguments_to_parser into LightningCLI.add_default_arguments_to_parser + LightningCLI.add_core_arguments_to_parser (#8721)
The accelerator and training type plugin setup hooks no longer have a model argument (#8536)
The accelerator and training type plugin update_global_step hook has been removed (#8856)
The coverage of self.log-ing in any LightningModule or Callback hook has been improved (#8498)
self.log-ing without a Trainer reference now raises a warning instead of an exception (#9733)
Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
Trainer.request_dataloader now takes a RunningStage enum instance (#8858)
Changed rank_zero_warn to NotImplementedError in the {train, val, test, predict}_dataloader hooks that Lightning(Data)Module uses (#9161)
Moved block_ddp_sync_behaviour out of TrainingBatchLoop to loop utilities (#9192)
Executing the optimizer_closure is now required when overriding the optimizer_step hook (#9360)
Changed logging of LightningModule and LightningDataModule hyperparameters to raise an exception only if there are colliding keys with different values (#9496)
seed_everything now fails when an invalid seed value is passed instead of selecting a random seed (#8787)
The Trainer now calls TrainingTypePlugin collective APIs directly instead of going through the Accelerator reference (#9677, #9901)
The tuner now uses a unique filename to save a temporary checkpoint (#9682)
Changed HorovodPlugin.all_gather to return a torch.Tensor instead of a list (#9696)
Changed Trainer connectors to be protected attributes:
    Configuration Validator (#9779)
The current_epoch and global_step attributes now get restored irrespective of the Trainer task (#9413)
Trainer now raises an exception when requesting amp_level with native amp_backend (#9755)
Update the logic to check for accumulation steps with deepspeed (#9826)
pytorch_lightning.utilities.grads.grad_norm now raises an exception if parameter norm_type <= 0 (#9765)
Updated error message for interactive incompatible plugins (#9896)
Moved the optimizer_step and clip_gradients hook from the Accelerator and TrainingTypePlugin into the PrecisionPlugin (#10143, #10029)
NativeMixedPrecisionPlugin and its subclasses now take an optional GradScaler instance (#10055)
Trainer is now raising a MisconfigurationException instead of a warning if Trainer.{validate/test} is missing required methods (#10016)
Changed default value of the max_steps Trainer argument from None to -1 (#9460)
LightningModule now raises an error when calling log(on_step=False, on_epoch=False) (#10227)
Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
Raised MisconfigurationException when total length of dataloader across ranks is zero, and give warning when total length is non-zero, but only local rank length is zero. (#9827)
Changed the model size calculation using ByteCounter (#10123)
Enabled on_load_checkpoint for LightningDataModule for all trainer_fn (#10238)
Allowed separate config files for parameters with class type when LightningCLI is in subclass_mode=False (#10286)
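The gpus parsing change listed above (#8770) is easy to misread, so here is a small illustration of the semantics as the entry states them: a bare number, whether passed as an int or a str, is read as a device count. This is a sketch and not Lightning's parser; first_n_devices and its available parameter are hypothetical, and other accepted forms of the gpus argument are deliberately left out.

```python
from typing import List, Union


def first_n_devices(gpus: Union[int, str], available: int) -> List[int]:
    """Hypothetical helper: read a bare number as a device count, not an index."""
    count = int(gpus)  # "2" and 2 are treated the same way
    if not 0 <= count <= available:
        raise ValueError(f"requested {count} devices, but {available} are available")
    # Select the first `count` device indices.
    return list(range(count))


selected = first_n_devices("2", available=4)
```

Under the earlier behavior described in the entry, the string "2" would have selected the device with index 2 instead of the first two devices.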
[1.5.0] - Deprecated
Deprecated Trainer argument terminate_on_nan in favor of detect_anomaly (#9175)
Deprecated Trainer.terminate_on_nan public attribute access (#9849)
Deprecated LightningModule.summarize() in favor of pytorch_lightning.utilities.model_summary.summarize() (#8513)
Deprecated LightningModule.model_size (#8343)
Deprecated DataModule properties: train_transforms, val_transforms, test_transforms, size, dims (#8851)
Deprecated add_to_queue, get_from_queue from LightningModule in favor of corresponding methods in the DDPSpawnPlugin (#9118)
Deprecated LightningModule.get_progress_bar_dict and Trainer.progress_bar_dict in favor of pytorch_lightning.callbacks.progress.base.get_standard_metrics and ProgressBarBase.get_metrics (#8985)
Deprecated prepare_data_per_node flag on Trainer and set it as a property of DataHooks, accessible in the LightningModule and LightningDataModule (#8958)
Deprecated the TestTubeLogger (#9065)
Deprecated on_{train/val/test/predict}_dataloader() from LightningModule and LightningDataModule (#9098)
Deprecated on_keyboard_interrupt callback hook in favor of new on_exception hook (#9260)
Deprecated passing process_position to the Trainer constructor in favor of adding the ProgressBar callback with process_position directly to the list of callbacks (#9222)
Deprecated passing flush_logs_every_n_steps as a Trainer argument, instead pass it to the logger init if supported (#9366)
Deprecated LightningLoggerBase.close, LoggerCollection.close in favor of LightningLoggerBase.finalize, LoggerCollection.finalize (#9422)
Deprecated passing progress_bar_refresh_rate to the Trainer constructor in favor of adding the ProgressBar callback with refresh_rate directly to the list of callbacks, or passing enable_progress_bar=False to disable the progress bar (#9616)
Deprecated LightningDistributed and moved the broadcast logic to DDPPlugin and DDPSpawnPlugin directly (#9691)
Deprecated passing stochastic_weight_avg to the Trainer constructor in favor of adding the StochasticWeightAveraging callback directly to the list of callbacks (#8989)
Deprecated Accelerator collective API barrier, broadcast, and all_gather in favor of calling the TrainingTypePlugin collective API directly (#9677)
Deprecated checkpoint_callback from the Trainer constructor in favor of enable_checkpointing (#9754)
Deprecated the LightningModule.on_post_move_to_device method (#9525)
Deprecated pytorch_lightning.core.decorators.parameter_validation in favor of pytorch_lightning.utilities.parameter_tying.set_shared_parameters (#9525)
Deprecated passing weights_summary to the Trainer constructor in favor of adding the ModelSummary callback with max_depth directly to the list of callbacks (#9699)
Deprecated log_gpu_memory, gpu_metrics, and util funcs in favor of DeviceStatsMonitor callback (#9921)
Deprecated GPUStatsMonitor and XLAStatsMonitor in favor of DeviceStatsMonitor callback (#9924)
Deprecated setting Trainer(max_steps=None); to turn off the limit, set Trainer(max_steps=-1) (default) (#9460)
Deprecated access to the AcceleratorConnector.is_slurm_managing_tasks attribute and marked it as protected (#10101)
Deprecated access to the AcceleratorConnector.configure_slurm_ddp method and marked it as protected (#10101)
Deprecated passing resume_from_checkpoint to the Trainer constructor in favor of trainer.fit(ckpt_path=) (#10061)
Deprecated ClusterEnvironment.creates_children() in favor of ClusterEnvironment.creates_processes_externally (property) (#10106)
Deprecated PrecisionPlugin.master_params() in favor of PrecisionPlugin.main_params() (#10105)
Deprecated lr_sch_names from LearningRateMonitor (#10066)
Deprecated ProgressBar callback in favor of TQDMProgressBar (#10134)
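Several of the deprecations above concern Trainer constructor arguments, and a configuration written for an earlier release may use more than one of them. The sketch below shows one way to scan a keyword dictionary for those names before upgrading. It does not import Lightning; find_deprecated_kwargs and DEPRECATED_TRAINER_KWARGS are hypothetical names, and the hints are paraphrased from the entries in this section.

```python
from typing import Any, Dict, List, Tuple

# Replacement hints paraphrased from the 1.5.0 deprecation entries.
DEPRECATED_TRAINER_KWARGS: Dict[str, str] = {
    "terminate_on_nan": "use detect_anomaly (#9175)",
    "checkpoint_callback": "use enable_checkpointing (#9754)",
    "resume_from_checkpoint": "pass ckpt_path to trainer.fit() (#10061)",
    "stochastic_weight_avg": "add the StochasticWeightAveraging callback (#8989)",
    "weights_summary": "add the ModelSummary callback with max_depth (#9699)",
    "progress_bar_refresh_rate": "add a progress bar callback with refresh_rate, "
    "or set enable_progress_bar=False (#9616)",
    "process_position": "add a progress bar callback with process_position (#9222)",
    "flush_logs_every_n_steps": "pass it to the logger init if supported (#9366)",
}


def find_deprecated_kwargs(kwargs: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (argument, hint) pairs for deprecated Trainer arguments (hypothetical helper)."""
    found = [
        (name, DEPRECATED_TRAINER_KWARGS[name])
        for name in kwargs
        if name in DEPRECATED_TRAINER_KWARGS
    ]
    # max_steps itself is still valid; only the value None is deprecated (#9460).
    if "max_steps" in kwargs and kwargs["max_steps"] is None:
        found.append(("max_steps", "use max_steps=-1 to turn off the limit (#9460)"))
    return found


old_config = {"max_epochs": 10, "checkpoint_callback": True, "max_steps": None}
hints = find_deprecated_kwargs(old_config)
```

A lookup table keeps the check declarative, so entries can be added or removed as a project moves between releases.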
[1.5.0] - Removed
Removed deprecated metrics (#8586)
Removed the deprecated outputs argument in both the LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#8587)
Removed the deprecated TrainerLoggingMixin class (#8609)
Removed the deprecated TrainerTrainingTricksMixin class (#8679)
Removed the deprecated optimizer_idx from training_step as an accepted argument in manual optimization (#8576)
Removed support for the deprecated on_save_checkpoint signature. The hook now takes a checkpoint positional parameter (#8697)
Removed support for the deprecated on_load_checkpoint signature. The hook now takes a pl_module positional parameter (#8697)
Removed the deprecated save_function property in ModelCheckpoint (#8680)
Removed the deprecated model argument from ModelCheckpoint.save_checkpoint (#8688)
Removed the deprecated sync_step argument from WandbLogger (#8763)
Removed the deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#8826)
Removed LightningModule.write_predictions and LightningModule.write_predictions_dict (#8850)
Removed on_reset_*_dataloader hooks in TrainingType Plugins and Accelerators (#8858)
Removed deprecated GradInformation module in favor of pytorch_lightning.utilities.grads (#8831)
Removed TrainingTypePlugin.on_save and Accelerator.on_save (#9023)
Removed {Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step (#9746)
Removed deprecated connect_precision_plugin and connect_training_type_plugin from Accelerator (#9019)
Removed on_train_epoch_end from Accelerator (#9035)
Removed InterBatchProcessor in favor of DataLoaderIterDataFetcher (#9052)
Removed Plugin in base_plugin.py in favor of accessing TrainingTypePlugin and PrecisionPlugin directly instead (#9066)
Removed teardown from ParallelPlugin (#8943)
Removed deprecated profiled_functions argument from PyTorchProfiler (#9178)
Removed deprecated pytorch_lightning.utilities.argparse_utils module (#9166)
Removed deprecated property Trainer.running_sanity_check in favor of Trainer.sanity_checking (#9209)
Removed deprecated BaseProfiler.output_filename arg from it and its descendants in favor of dirpath and filename (#9214)
Removed deprecated property ModelCheckpoint.period in favor of ModelCheckpoint.every_n_epochs (#9213)
Removed deprecated auto_move_data decorator (#9231)
Removed deprecated property LightningModule.datamodule in favor of Trainer.datamodule (#9233)
Removed deprecated properties DeepSpeedPlugin.cpu_offload* in favor of offload_optimizer, offload_parameters and pin_memory (#9244)
Removed deprecated property AcceleratorConnector.is_using_torchelastic in favor of TorchElasticEnvironment.is_using_torchelastic() (#9729)
Removed pytorch_lightning.utilities.debugging.InternalDebugger (#9680)
Removed call_configure_sharded_model_hook property from Accelerator and TrainingTypePlugin (#9612)
Removed TrainerProperties mixin and moved property definitions directly into Trainer (#9495)
Removed a redundant warning with ModelCheckpoint(monitor=None) callback (#9875)
Removed epoch from trainer.logged_metrics (#9904)
Removed should_rank_save_checkpoint property from Trainer (#9433)
Removed deprecated distributed_backend from Trainer (#10017)
Removed process_idx from the {DDPSpawnPlugin,TPUSpawnPlugin}.new_process methods (#10022)
Removed automatic patching of {train,val,test,predict}_dataloader() on the LightningModule (#9764)
Removed pytorch_lightning.trainer.connectors.OptimizerConnector (#10120)
[1.5.0] - Fixed
Fixed ImageNet evaluation in example (#10179)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
Fixed move_metrics_to_cpu moving the loss to CPU while training on device (#9308)
Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
Fixed an issue with freeing memory of datafetchers during teardown (#9387)
Fixed a bug where the training step output needed to be deepcopy-ed (#9349)
Fixed an issue with freeing memory allocated by the data iterators in Loop.on_run_end (#9386, #9915)
Fixed BasePredictionWriter not returning the batch indices in a non-distributed setting (#9432)
Fixed an error when running in XLA environments with no TPU attached (#9572)
Fixed check on torchmetrics logged whose compute() output is a multielement tensor (#9582)
Fixed gradient accumulation for DDPShardedPlugin (#9122)
Fixed missing DeepSpeed distributed call (#9540)
Fixed an issue with wrapped LightningModule during evaluation; the LightningModule no longer gets wrapped with data-parallel modules when not fitting in DDPPlugin, DDPSpawnPlugin, DDPShardedPlugin, DDPSpawnShardedPlugin (#9096)
Fixed trainer.accumulate_grad_batches to be an int on init. The default value for it is now None inside Trainer (#9652)
Fixed broadcast in DDPPlugin and DDPSpawnPlugin to respect the src input (#9691)
Fixed self.log(on_epoch=True, reduce_fx=sum) for the on_batch_start and on_train_batch_start hooks (#9791)
Fixed self.log(on_epoch=True) for the on_batch_start and on_train_batch_start hooks (#9780)
Fixed restoring training state during Trainer.fit only (#9413)
Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
Fixed DeepSpeed GPU device IDs (#9847)
Reset val_dataloader in tuner/batch_size_scaling (#9857)
Fixed use of LightningCLI in computer_vision_fine_tuning.py example (#9934)
Fixed issue with non-init dataclass fields in apply_to_collection (#9963)
Reset val_dataloader in tuner/batch_size_scaling for binsearch (#9975)
Fixed logic to check for spawn in dataloader TrainerDataLoadingMixin._worker_check (#9902)
Fixed train_dataloader getting loaded twice when resuming from a checkpoint during Trainer.fit() (#9671)
Fixed LearningRateMonitor logging with multiple param groups optimizer with no scheduler (#10044)
Fixed undesired side effects being caused by Trainer patching dataloader methods on the LightningModule (#9764)
Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
Fixed on_before_optimizer_step getting called before the optimizer closure (including backward) has run (#10167)
Fixed monitor value in ModelCheckpoint getting moved to the wrong device in a special case where it becomes NaN (#10118)
Fixed creation of dirpath in BaseProfiler if it doesn’t exist (#10073)
Fixed incorrect handling of sigterm (#10189)
Fixed bug where log(on_step=True, on_epoch=True, sync_dist=True) wouldn’t reduce the value on step (#10227)
Fixed an issue with pl.utilities.seed.reset_seed converting the PL_SEED_WORKERS environment variable to bool (#10099)
Fixed iterating over a logger collection when fast_dev_run > 0 (#10232)
Fixed batch_size in ResultCollection not being reset to 1 on epoch end (#10242)
Fixed distrib_type not being set when training plugin instances are being passed to the Trainer (#10251)
[1.4.9] - 2021-09-30
[1.4.8] - 2021-09-22
Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Fixed add_argparse_args raising TypeError when args are typed as typing.Generic in Python 3.6 (#9554)
Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)
[1.4.7] - 2021-09-14
[1.4.6] - 2021-09-07
Fixed an issue with export to ONNX format when a model has multiple inputs (#8800)
Removed deprecation warnings being called for on_{task}_dataloader (#9279)
Fixed save/load/resume from checkpoint for DeepSpeed Plugin (#8397, #8644, #8627)
Fixed EarlyStopping running on train epoch end when check_val_every_n_epoch>1 is set (#9156)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8333)
Fixed the Apex and DeepSpeed plugin closure running after the on_before_optimizer_step hook (#9288)
Fixed the Native AMP plugin closure not running with manual optimization (#9288)
Fixed bug where data-loading functions were not getting the correct running stage passed (#8858)
Fixed intra-epoch evaluation outputs staying in memory when the respective *_epoch_end hook wasn’t overridden (#9261)
Fixed error handling in DDP process reconciliation when _sync_dir was not initialized (#9267)
Fixed PyTorch Profiler not enabled for manual optimization (#9316)
Fixed inspection of other args when a container is specified in save_hyperparameters (#9125)
Fixed signature of Timer.on_train_epoch_end and StochasticWeightAveraging.on_train_epoch_end to prevent unwanted deprecation warnings (#9347)
[1.4.5] - 2021-08-31
Fixed reduction using self.log(sync_dist=True, reduce_fx={mean,max}) (#9142)
Fixed not setting a default value for max_epochs if max_time was specified on the Trainer constructor (#9072)
Fixed the CometLogger so that it no longer modifies the metrics in place; it now creates a copy of the metrics before performing any operations (#9150)
Fixed DDP “CUDA error: initialization error” due to a copy instead of deepcopy on ResultCollection (#9239)
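Two of the 1.4.5 fixes above come down to copy semantics in Python: working on a copy instead of mutating a caller's dictionary (#9150), and using deepcopy where a shallow copy would still share nested objects (#9239). The snippet below illustrates both with plain dictionaries. It is a generic sketch, not the CometLogger or ResultCollection code, and add_prefix is a hypothetical helper.

```python
import copy
from typing import Dict


def add_prefix(metrics: Dict[str, float], prefix: str) -> Dict[str, float]:
    """Return a renamed copy and leave the caller's dictionary untouched."""
    return {f"{prefix}/{name}": value for name, value in metrics.items()}


metrics = {"loss": 0.25, "acc": 0.9}
prefixed = add_prefix(metrics, "train")

# A shallow copy shares nested objects with the original; a deep copy does not.
state = {"batch_sizes": [32, 32]}
shallow = copy.copy(state)
deep = copy.deepcopy(state)
state["batch_sizes"].append(16)
```

After the append, shallow["batch_sizes"] reflects the change because it refers to the same list as state, while deep["batch_sizes"] keeps the original two entries.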
[1.4.4] - 2021-08-24
[1.4.3] - 2021-08-17
Fixed plateau scheduler stepping on incomplete epoch (#8861)
Fixed infinite loop with CycleIterator and multiple loaders (#8889)
Fixed StochasticWeightAveraging with a list of learning rates not applying them to each param group (#8747)
Restore original loaders if replaced by entrypoint (#8885)
Fixed lost reference to _Metadata object in ResultMetricCollection (#8932)
Ensure the existence of DDPPlugin._sync_dir in reconciliate_processes (#8939)
[1.4.2] - 2021-08-10¶
Fixed recursive call for `apply_to_collection(include_none=False)` (#8719)
Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer (#8804)
Fixed comments and exception message for `metrics_to_scalars` (#8782)
Fixed typo in `LightningLoggerBase.after_save_checkpoint` docstring (#8737)
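As an illustration of the `include_none` fix listed above (#8719), here is a minimal pure-Python sketch of a recursive collection mapper. It is a simplified, hypothetical re-implementation written for this note, not the library's actual code: the name `apply_to_collection_sketch` and the restriction to dicts, lists and tuples are assumptions. The point it shows is that `include_none` has to be forwarded on every recursive call; otherwise nested `None` results survive even when the caller asked to drop them.

```python
from typing import Any, Callable, Tuple, Type, Union


def apply_to_collection_sketch(
    data: Any,
    dtype: Union[Type, Tuple[Type, ...]],
    function: Callable[[Any], Any],
    include_none: bool = True,
) -> Any:
    """Apply ``function`` to every element of type ``dtype`` in a nested collection."""
    if isinstance(data, dtype):
        return function(data)

    if isinstance(data, dict):
        out = {}
        for key, value in data.items():
            # Forward include_none so nested levels honour the caller's choice.
            result = apply_to_collection_sketch(value, dtype, function, include_none=include_none)
            if include_none or result is not None:
                out[key] = result
        return out

    if isinstance(data, (list, tuple)):
        items = []
        for value in data:
            result = apply_to_collection_sketch(value, dtype, function, include_none=include_none)
            if include_none or result is not None:
                items.append(result)
        return type(data)(items)

    # Any other object is returned unchanged.
    return data
```

Calling it with a function that maps odd integers to `None` and `include_none=False` drops those entries at every nesting level, not only at the top.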
[1.4.1] - 2021-08-03¶
Fixed `trainer.fit_loop.split_idx` always returning `None` (#8601)
Fixed references for `ResultCollection.extra` (#8622)
Fixed reference issues during epoch end result collection (#8621)
Fixed horovod auto-detection when horovod is not installed and the launcher is `mpirun` (#8610)
Fixed an issue with `training_step` outputs not getting collected correctly for `training_epoch_end` (#8613)
Fixed distributed types support for CPUs (#8667)
Fixed a deadlock issue with DDP and torchelastic (#8655)
Fixed `accelerator=ddp` choice for CPU (#8645)
[1.4.0] - 2021-07-27¶
[1.4.0] - Added¶
Added `extract_batch_size` utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
Added support for named parameter groups in `LearningRateMonitor` (#7987)
Added `dataclass` support for `pytorch_lightning.utilities.apply_to_collection` (#7935)
Added support to `LightningModule.to_torchscript` for saving to custom filesystems with `fsspec` (#7617)
Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow
Added LightningCLI support for config files on object stores (#7521)
Added `ModelPruning(prune_on_train_epoch_end=True|False)` to choose when to apply pruning (#7704)
Added support for checkpointing based on a provided time interval during training (#7515)
Progress tracking
Added support for passing a `LightningDataModule` positionally as the second argument to `trainer.{validate,test,predict}` (#7431)
Added argument `trainer.predict(ckpt_path)` (#7430)
Added `clip_grad_by_value` support for TPUs (#7025)
Added support for passing any class to `is_overridden` (#7918)
Added `sub_dir` parameter to `TensorBoardLogger` (#6195)
Added correct `dataloader_idx` to batch transfer hooks (#6241)
Added `include_none=bool` argument to `apply_to_collection` (#7769)
Added `apply_to_collections` to apply a function to two zipped collections (#7769)
Added `ddp_fully_sharded` support (#7487)
Added `should_rank_save_checkpoint` property to Training Plugins (#7684)
Added `log_grad_norm` hook to `LightningModule` to customize the logging of gradient norms (#7873)
Added `save_config_filename` init argument to `LightningCLI` to ease resolving name conflicts (#7741)
Added `save_config_overwrite` init argument to `LightningCLI` to ease overwriting existing config files (#8059)
Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
Added trainer stage hooks for Training Plugins and Accelerators (#7864)
Added the `on_before_optimizer_step` hook (#8048)
Added IPU Accelerator (#7867)
Fault-tolerant training
  Added `{,load_}state_dict` to `ResultCollection` (#7948)
  Added `{,load_}state_dict` to `Loops` (#8197)
  Added `FastForwardSampler` and `CaptureIterableDataset` (#8307)
  Set `Loop.restarting=False` at the end of the first iteration (#8362)
  Save the loops state with the checkpoint (opt-in) (#8362)
  Save a checkpoint to restore the state on exception (opt-in) (#8362)
  Added `state_dict` and `load_state_dict` utilities for `CombinedLoader` + utilities for dataloader (#8364)
Added `rank_zero_only` to `LightningModule.log` function (#7966)
Added `metric_attribute` to `LightningModule.log` function (#7966)
Added a warning if `Trainer(log_every_n_steps)` is a value too high for the training dataloader (#7734)
Added LightningCLI support for argument links applied on instantiation (#7895)
Added LightningCLI support for configurable callbacks that should always be present (#7964)
Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
Added support for `torch.nn.UninitializedParameter` in `ModelSummary` (#7642)
Added support for `LightningModule.save_hyperparameters` when `LightningModule` is a dataclass (#7992)
Added support for overriding `optimizer_zero_grad` and `optimizer_step` when using accumulate_grad_batches (#7980)
Added `logger` boolean flag to `save_hyperparameters` (#7960)
Added support for calling scripts using the module syntax (`python -m package.script`) (#8073)
Added support for optimizers and learning rate schedulers to `LightningCLI` (#8093)
Added XLA Profiler (#8014)
Added `PrecisionPlugin.{pre,post}_backward` (#8328)
Added `on_load_checkpoint` and `on_save_checkpoint` hooks to the `PrecisionPlugin` base class (#7831)
Added `max_depth` parameter in `ModelSummary` (#8062)
Added `XLAStatsMonitor` callback (#8235)
Added `restore` function and `restarting` attribute to base `Loop` (#8247)
Added support for `save_hyperparameters` in `LightningDataModule` (#3792)
Added the `ModelCheckpoint(save_on_train_epoch_end)` to choose when to run the saving logic (#8389)
Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` (#5102)
Added support for `accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'` (#7808)
Added `tpu_spawn_debug` to plugin registry (#7933)
Enabled traditional/manual launching of DDP processes through `LOCAL_RANK` and `NODE_RANK` environment variable assignments (#7480)
Added `quantize_on_fit_end` argument to `QuantizationAwareTraining` (#8464)
Added experimental support for loop specialization (#8226)
Added support for `devices` flag to Trainer (#8440)
Added private `prevent_trainer_and_dataloaders_deepcopy` context manager on the `LightningModule` (#8472)
Added support for providing callables to the Lightning CLI instead of types (#8400)
[1.4.0] - Changed¶
Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
Changed the `Trainer`’s `checkpoint_callback` argument to allow only boolean values (#7539)
Log epoch metrics before the `on_evaluation_end` hook (#7272)
Explicitly disallow calling `self.log(on_epoch=False)` during epoch-only or single-call hooks (#7874)
Changed these `Trainer` methods to be protected: `call_setup_hook`, `call_configure_sharded_model`, `pre_dispatch`, `dispatch`, `post_dispatch`, `call_teardown_hook`, `run_train`, `run_sanity_check`, `run_evaluate`, `run_evaluation`, `run_predict`, `track_output_for_epoch_end`
Changed `metrics_to_scalars` to work with any collection or value (#7888)
Changed `clip_grad_norm` to use `torch.nn.utils.clip_grad_norm_` (#7025)
Validation is now always run inside the training epoch scope (#7357)
`ModelCheckpoint` now runs at the end of the training epoch by default (#8389)
`EarlyStopping` now runs at the end of the training epoch by default (#8286)
Refactored Loops
  Moved attributes `global_step`, `current_epoch`, `max/min_steps`, `max/min_epochs`, `batch_idx`, and `total_batch_idx` to TrainLoop (#7437)
  Refactored result handling in training loop (#7506)
  Moved attributes `hiddens` and `split_idx` to TrainLoop (#7507)
  Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
  Simplified “should run validation” logic (#7682)
  Simplified logic for updating the learning rate for schedulers (#7682)
  Removed the `on_epoch` guard from the “should stop” validation check (#7701)
  Refactored internal loop interface; added new classes `FitLoop`, `TrainingEpochLoop`, `TrainingBatchLoop` (#7871, #8077)
  Removed `pytorch_lightning/trainer/training_loop.py` (#7985)
  Refactored evaluation loop interface; added new classes `DataLoaderLoop`, `EvaluationLoop`, `EvaluationEpochLoop` (#7990, #8077)
  Removed `pytorch_lightning/trainer/evaluation_loop.py` (#8056)
  Restricted public access to several internal functions (#8024)
  Refactored trainer `_run_*` functions and separate evaluation loops (#8065)
  Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` (#7700, #8077)
  Removed `pytorch_lightning/trainer/predict_loop.py` (#8094)
  Moved result teardown to the loops (#8245)
  Improve `Loop` API to better handle children `state_dict` and `progress` (#8334)
Refactored logging
  Renamed and moved `core/step_result.py` to `trainer/connectors/logger_connector/result.py` (#7736)
  Dramatically simplify the `LoggerConnector` (#7882)
  `trainer.{logged,progress_bar,callback}_metrics` are now updated on-demand (#7882)
  Completely overhaul the `Result` object in favor of `ResultMetric` (#7882)
  Improve epoch-level reduction time and overall memory usage (#7882)
  Allow passing `self.log(batch_size=...)` (#7891)
  Each of the training loops now keeps its own results collection (#7891)
  Remove `EpochResultStore` and `HookResultStore` in favor of `ResultCollection` (#7909)
  Remove `MetricsHolder` (#7909)
Moved `ignore_scalar_return_in_dp` warning suppression to the DataParallelPlugin class (#7421)
Changed the behaviour when logging evaluation step metrics to no longer append `/epoch_*` to the metric name (#7351)
Raised `ValueError` when a `None` value is `self.log`-ed (#7771)
Changed `resolve_training_type_plugins` to allow setting `num_nodes` and `sync_batchnorm` from `Trainer` setting (#7026)
Default `seed_everything(workers=True)` in the `LightningCLI` (#7504)
Changed `model.state_dict()` in `CheckpointConnector` to allow `training_type_plugin` to customize the model’s `state_dict()` (#7474)
`MLFlowLogger` now uses the env variable `MLFLOW_TRACKING_URI` as default tracking URI (#7457)
Changed `Trainer` arg and functionality from `reload_dataloaders_every_epoch` to `reload_dataloaders_every_n_epochs` (#5043)
Changed `WandbLogger(log_model={True/'all'})` to log models as artifacts (#6231)
MLFlowLogger now accepts `run_name` as a constructor argument (#7622)
Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic (#7579)
`Trainer.fit` now raises an error when using manual optimization with unsupported features such as `gradient_clip_val` or `accumulate_grad_batches` (#7788)
Accelerator hooks are called regardless of whether `LightningModule` overrides the same hooks (#7826)
Moved profilers to their own file (#7822)
The `on_after_backward` hook is now called on accumulating iterations. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328)
The mixed precision loss is no longer unscaled before the `on_after_backward` hook. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328)
The `TrainingTypePlugin.{pre,post}_backward` hooks no longer take the `optimizer, opt_idx, should_accumulate` arguments (#8328)
The `PrecisionPlugin.backward` hook no longer returns a value (#8328)
The `PrecisionPlugin.backward` hook no longer takes a `should_accumulate` argument (#8328)
Added the `on_before_backward` hook (#7865)
`LightningCLI` now aborts with a clearer message if config already exists and disables save config during `fast_dev_run` (#7963)
Saved the `LightningCLI` config on `setup` and only on the main process (#8017)
Dropped the `LightningCLI` `ArgumentParser` when pickling (#8017)
Skip `broadcast` if distributed not initialized for the spawn plugins (#8017)
`Trainer(resume_from_checkpoint=...)` now restores the model directly after `LightningModule.setup()`, which is before `LightningModule.configure_sharded_model()` (#7652)
Moved `torch.cuda.set_device()` to enable collective calls earlier in setup (#8312)
Used XLA utility API to move data to CPU (Single TPU core) (#8078)
Improved error messages in `replace_sampler` when the `DataLoader` attributes are not included in the signature or the signature is missing optional arguments (#8519)
Moved `DeviceDtypeModuleMixin` and `HyperparametersMixin` mixin to `core` (#8396)
Return the `default_root_dir` as the `log_dir` when the logger is a `LoggerCollection` (#8187)
[1.4.0] - Deprecated¶
Deprecated `LightningModule.loaded_optimizer_states_dict` (#8229)
Standardized the dataloaders arguments of `trainer.{fit,validate,test,tune}` (#7431)
Deprecated `DataModule` properties: `has_prepared_data`, `has_setup_fit`, `has_setup_validate`, `has_setup_test`, `has_setup_predict`, `has_teardown_fit`, `has_teardown_validate`, `has_teardown_test`, `has_teardown_predict` (#7657)
Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` (#7422)
Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` (#7026)
Deprecated `self.log(sync_dist_op)` in favor of `self.log(reduce_fx)` (#7891)
Deprecated `is_overridden(model=...)` in favor of `is_overridden(instance=...)` (#7918)
Deprecated automatically detaching returned extras with grads (#7994)
Deprecated default value of `monitor` argument in EarlyStopping callback to enforce `monitor` as a required argument (#7907)
Deprecated importing `rank_zero_{warn,deprecation}` directly from `pytorch_lightning.utilities.distributed` (#8085)
Deprecated the use of `CheckpointConnector.hpc_load()` in favor of `CheckpointConnector.restore()` (#7652)
Deprecated `ModelCheckpoint(every_n_val_epochs)` in favor of `ModelCheckpoint(every_n_epochs)` (#8383)
Deprecated `DDPPlugin.task_idx` in favor of `DDPPlugin.local_rank` (#8203)
Deprecated the `Trainer.train_loop` property in favor of `Trainer.fit_loop` (#8025)
Deprecated the `Trainer.disable_validation` property in favor of `not Trainer.enable_validation` (#8291)
Deprecated `mode` parameter in `ModelSummary` in favor of `max_depth` (#8062)
Deprecated `reload_dataloaders_every_epoch` argument of `Trainer` in favor of `reload_dataloaders_every_n_epochs` (#5043)
Deprecated `distributed_backend` argument for `Trainer` (#8575)
[1.4.0] - Removed¶
Dropped official support/testing for PyTorch <1.6 (#8288)
Removed `ProfilerConnector` (#7654)
Pruned deprecated classification metrics from `pytorch_lightning.metrics.functional.classification` (#7499)
Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` (#7510)
Removed deprecated trainer attributes `get_model` and `accelerator_backend` (#7502)
Removed support for automatically monitoring the `val_loss` key with `ModelCheckpoint`. Pass your `monitor` of choice to the `ModelCheckpoint` instance instead (#8293)
Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please open a discussion explaining your use case if you relied on these (#7644)
Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` (#7503)
Removed `RPCPlugin` and `RPCSequentialPlugin`. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
Removed deprecated trainer attributes `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` (#7501)
Removed deprecated `optimizer` argument in `LightningModule.manual_backward()`; toggling optimizers in manual optimization should be done using `LightningModule.{un}toggle_optimizer()` (#8287)
Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
Removed environment variable `PL_EXP_VERSION` from DDP subprocesses (#7403)
[1.4.0] - Fixed¶
Fixed the `GPUStatsMonitor` callbacks to use the correct GPU IDs if `CUDA_VISIBLE_DEVICES` is set (#8260)
Fixed `lr_scheduler` checkpointed state by calling `update_lr_schedulers` before saving checkpoints (#7877)
Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
Fixed `None` loss keys getting added in `training_epoch_end` when using manual optimization and not returning a loss (#7772)
Fixed a bug where `precision=64` with `accelerator='ddp_spawn'` would throw a pickle error (#6924)
Do not override the existing `epoch` value in `logged_metrics` when already logged by the user (#7982)
Support for manual optimization with DeepSpeed (#7970)
Fixed `dataloader_idx` argument value when predicting with only one `DataLoader` (#7941)
Fixed passing the `stage` argument of `Callback.{setup,teardown}` as a keyword (#7973)
Fixed metrics generated during validation sanity checking so that they are cleaned on end (#8171)
Fixed `log_gpu_memory` metrics not being added to `logging` when nothing else is logged (#8174)
Fixed a bug where calling `log` with a `Metric` instance would raise an error if it was a nested attribute of the model (#8181)
Fixed a bug where using `precision=64` would cause buffers with complex dtype to be cast to real (#8208)
Fixed `is_overridden` returning true for wrapped functions with no changes (#8296)
Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states (#8145)
Fixed `self.optimizers()` not returning a single optimizer if it had been wrapped (#8326)
Fixed the `on_after_backward` hook not getting called when using manual optimization and no plugins (#8328)
Fixed the `LightningModule.backward` hook only getting called with the `apex` plugin when using manual optimization (#8328)
Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks (#7378)
Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator (#6208)
Fixed missing call to `LightningModule.untoggle_optimizer` in training loop when running gradient accumulation with multiple optimizers (#8284)
Fixed hash of LightningEnum to work with value instead of name (#8421)
Fixed a bug where an extra checkpoint was saved at the end of training if the `val_check_interval` did not align with the number of training batches (#7724)
Fixed `move_data_to_device` to return the batch if the object `to` function didn’t return `self` (#8433)
Fixed progress bar updates for Pod Training (#8258)
Fixed clearing dataloader references before attaching new dataloaders in consecutive `Trainer.{fit,validate,test,predict}` runs (#8442)
Fixed memory leaks on GPU by moving `optimizer_states`, `ResultCollection.extra`, `ResultMetric` attributes, and `LoggerConnector` metrics to `cpu`. Also, delete the DDP wrapper on `teardown` (#8490)
Fixed `SWA` callback using LightningModule `prevent_trainer_and_dataloaders_deepcopy` to avoid OOM (#8472)
Fixed `ModelPruning` callback `on_save_checkpoint` to avoid making a `deepcopy` potentially leading to OOM (#8472)
Fixed the sampler replacement logic for `DataLoader`s which do not define all `DataLoader` attributes as `__init__` parameters (#8519)
Fixed DeepSpeed Windows support (#8488)
Fixed DeepSpeed not properly setting the trainer `lr_schedulers` attribute (#8527)
Fixed experiment version and log-dir divergence in DDP when using multiple `Trainer` instances in sequence (#7403)
Enabled manual optimization for TPUs (#8458)
Fixed `accumulate_grad_batches` not being recomputed during model reload (#5334)
Fixed a `TypeError` when wrapping optimizers in the `HorovodPlugin` and running `Trainer.test` (#7840)
Fixed `BackboneFinetuning` restoration (#8501)
Fixed `lr_scheduler` with metric (e.g. `torch.optim.lr_scheduler.ReduceLROnPlateau`) when using `automatic_optimization = False` (#7643)
Fixed `DeepSpeed` breaking with no schedulers (#8580)
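The `LightningEnum` hash fix listed above (#8421) can be pictured with a small standalone sketch. The classes below are hypothetical stand-ins written for this note, not the library's implementation: they assume a string-valued enum that compares case-insensitively against plain strings, and show why the hash should then be derived from the value rather than the member name, so that objects that compare equal also hash equally.

```python
from enum import Enum


class StrEnumSketch(str, Enum):
    """Hypothetical string enum that compares case-insensitively with plain strings."""

    def __eq__(self, other: object) -> bool:
        other_value = other.value if isinstance(other, Enum) else other
        return self.value.lower() == str(other_value).lower()

    def __hash__(self) -> int:
        # Hash the lowercased value, not the member name, so the hash agrees with __eq__.
        return hash(self.value.lower())


class Backend(StrEnumSketch):
    # Hypothetical member whose name and value differ in case.
    DDP_SPAWN = "ddp_spawn"
```

With this definition a member can be used interchangeably with its string value as a dictionary key; hashing the name `"DDP_SPAWN"` instead would generally send the member and the string `"ddp_spawn"` to different hash buckets even though they compare equal.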
[1.3.8] - 2021-07-01¶
[1.3.8] - Fixed¶
Fixed a sync deadlock when checkpointing a `LightningModule` that uses a torchmetrics 0.4 `Metric` (#8218)
Fixed compatibility with TorchMetrics v0.4 (#8206)
Added torchelastic check when sanitizing GPUs (#8095)
Fixed a DDP info message that was never shown (#8111)
Fixed metrics deprecation message at module import level (#8163)
Fixed a bug where an infinite recursion would be triggered when using the `BaseFinetuning` callback on a model that contains a `ModuleDict` (#8170)
Added a mechanism to detect a deadlock for `DDP` when only one process triggers an `Exception`. The mechanism will kill the processes when it happens (#8167)
Fixed NCCL error when selecting non-consecutive device ids (#8165)
Fixed SWA to also work with `IterableDataset` (#8172)
[1.3.7] - 2021-06-22¶
[1.3.7] - Fixed¶
Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error (#7975)
Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
Fixed setting a `DistributedSampler` when using a distributed plugin in a custom accelerator (#7814)
Improved `PyTorchProfiler` chrome traces names (#8009)
Fixed moving the best score to device in `EarlyStopping` callback for TPU devices (#7959)
Fixed access to `callback_metrics` in ddp_spawn (#7916)
[1.3.6] - 2021-06-15¶
[1.3.6] - Fixed¶
Fixed logs overwriting issue for remote filesystems (#7889)
Fixed `DataModule.prepare_data` only being callable on the global rank 0 process (#7945)
Fixed setting `worker_init_fn` to seed dataloaders correctly when using DDP (#7942)
Fixed `BaseFinetuning` callback to properly handle parent modules with parameters (#7931)
[1.3.5] - 2021-06-08¶
[1.3.5] - Added¶
Added warning to Training Step output (#7779)
[1.3.5] - Fixed¶
[1.3.5] - Changed¶
Moved `training_output` validation to after `train_step_end` (#7868)
[1.3.4] - 2021-06-01¶
[1.3.4] - Fixed¶
[1.3.3] - 2021-05-27¶
[1.3.3] - Changed¶
Moved the call to `untoggle_optimizer(opt_idx)` out of the closure function (#7563)
[1.3.3] - Fixed¶
Fixed `ProgressBar` pickling after calling `trainer.predict` (#7608)
Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
Fixed dataloaders not being reset when tuning the model (#7566)
Fixed print errors in `ProgressBar` when `trainer.fit` is not called (#7674)
Fixed global step update when the epoch is skipped (#7677)
Fixed training loop total batch counter when accumulate grad batches was enabled (#7692)
[1.3.2] - 2021-05-18¶
[1.3.2] - Changed¶
`DataModule`s now avoid duplicate `{setup,teardown,prepare_data}` calls for the same stage (#7238)
[1.3.2] - Fixed¶
Fixed parsing of multiple training dataloaders (#7433)
Fixed recursive passing of `wrong_type` keyword argument in `pytorch_lightning.utilities.apply_to_collection` (#7433)
Fixed setting correct `DistribType` for `ddp_cpu` (spawn) backend (#7492)
Fixed incorrect number of calls to LR scheduler when `check_val_every_n_epoch > 1` (#7032)
[1.3.1] - 2021-05-11¶
[1.3.1] - Fixed¶
[1.3.0] - 2021-05-06¶
[1.3.0] - Added¶
Added support for the `EarlyStopping` callback to run at the end of the training epoch (#6944)
Added synchronization points before and after `setup` hooks are run (#7202)
Added a `teardown` hook to `ClusterEnvironment` (#6942)
Added utils for metrics to scalar conversions (#7180)
Added utils for NaN/Inf detection for gradients and parameters (#6834)
Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` (#6667)
Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training CLI (#4492, #6862, #7156, #7299)
Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value (#6123)
Added a way to print to terminal without breaking up the progress bar (#5470)
Added support to checkpoint after training steps in `ModelCheckpoint` callback (#6146)
Added `TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED}` (#7173)
Added `Trainer.validate()` method to perform one evaluation epoch over the validation set (#4948)
Added `LightningEnvironment` for Lightning-specific DDP (#5915)
Added `teardown()` hook to LightningDataModule (#4673)
Added `auto_insert_metric_name` parameter to `ModelCheckpoint` (#6277)
Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders (#6274)
Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` (#6370)
Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process (#6633)
Added no return warning to predict (#6139)
Added `Trainer.predict` config validation (#6543)
Added `AbstractProfiler` interface (#6621)
Added support for including module names for forward in the autograd trace of `PyTorchProfiler` (#6349)
Added support for the PyTorch 1.8.1 autograd profiler (#6618)
Added `outputs` parameter to callback’s `on_validation_epoch_end` & `on_test_epoch_end` hooks (#6120)
Added `configure_sharded_model` hook (#6679)
Added support for `precision=64`, enabling training with double precision (#6595)
Added support for DDP communication hooks (#6736)
Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call (#6677)
Added `model` parameter to precision plugins’ `clip_gradients` signature (#6764, #7231)
Added `is_last_batch` attribute to `Trainer` (#6825)
Added `LightningModule.lr_schedulers()` for manual optimization (#6567)
Added `MpModelWrapper` in TPU Spawn (#7045)
Added `max_time` Trainer argument to limit training time (#6823)
Added `on_predict_{batch,epoch}_{start,end}` hooks (#7141)
Added new `EarlyStopping` parameters `stopping_threshold` and `divergence_threshold` (#6868)
Added `debug` flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219)
Added new `UnrepeatedDistributedSampler` and `IndexBatchSamplerWrapper` for tracking distributed predictions (#7215)
Added `trainer.predict(return_predictions=None|False|True)` (#7215)
Added `BasePredictionWriter` callback to implement prediction saving (#7127)
Added `trainer.tune(scale_batch_size_kwargs, lr_find_kwargs)` arguments to configure the tuning algorithms (#7258)
Added `tpu_distributed` check for TPU Spawn barrier (#7241)
Added device updates to TPU Spawn for Pod training (#7243)
Added warning when missing `Callback` and using `resume_from_checkpoint` (#7254)
DeepSpeed single file saving (#6900)
Added Training type Plugins Registry (#6982, #7063, #7214, #7224)
Added `ignore` param to `save_hyperparameters` (#6056)
[1.3.0] - Changed¶
Changed `LightningModule.truncated_bptt_steps` to be a property (#7323)
Changed the `EarlyStopping` callback to no longer run `EarlyStopping.on_validation_end` by default if only training is run. Set `check_on_train_epoch_end` to run the callback at the end of the train epoch instead of at the end of the validation epoch (#7069)
Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` (#6259)
Refactored `RunningStage` and `TrainerState` usage (#4945, #7173)
  Added `RunningStage.SANITY_CHECKING`
  Added `TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}`
  Changed `trainer.evaluating` to return `True` if validating or testing
Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` (#6386)
Changed profilers to save separate report files per state and rank (#6621)
The trainer no longer tries to save a checkpoint on exception or run callback’s `on_train_end` functions (#6864)
Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions (#6349)
Disabled `lr_scheduler.step()` in manual optimization (#6825)
Changed warnings and recommendations for dataloaders in `ddp_spawn` (#6762)
`pl.seed_everything` will now also set the seed on the `DistributedSampler` (#7024)
Changed default setting for communication of multi-node training using `DDPShardedPlugin` (#6937)
`trainer.tune()` now returns the tuning result (#7258)
`LightningDataModule.from_datasets()` now accepts `IterableDataset` instances as training datasets (#7503)
Changed `resume_from_checkpoint` warning to an error when the checkpoint file does not exist (#7075)
Automatically set `sync_batchnorm` for `training_type_plugin` (#6536)
Allowed training type plugin to delay optimizer creation (#6331)
Removed ModelSummary validation from train loop on_trainer_init (#6610)
Moved `save_function` to accelerator (#6689)
Improved verbose logging for `EarlyStopping` callback (#6811)
Run ddp_spawn dataloader checks on Windows (#6930)
Updated mlflow with using `resolve_tags` (#6746)
Moved `save_hyperparameters` to its own function (#7119)
Replaced `_DataModuleWrapper` with `__new__` (#7289)
Reset `current_fx` properties on lightning module in teardown (#7247)
Auto-set `DataLoader.worker_init_fn` with `seed_everything` (#6960)
Remove `model.trainer` call inside of dataloading mixin (#7317)
Split profilers module (#6261)
Ensure accelerator is valid if running interactively (#5970)
Disabled batch transfer in DP mode (#6098)
[1.3.0] - Deprecated¶
Deprecated `outputs` in both `LightningModule.on_train_epoch_end` and `Callback.on_train_epoch_end` hooks (#7339)
Deprecated `Trainer.truncated_bptt_steps` in favor of `LightningModule.truncated_bptt_steps` (#7323)
Deprecated `LightningModule.grad_norm` in favor of `pytorch_lightning.utilities.grads.grad_norm` (#7292)
Deprecated the `save_function` property from the `ModelCheckpoint` callback (#7201)
Deprecated `LightningModule.write_predictions` and `LightningModule.write_predictions_dict` (#7066)
Deprecated `TrainerLoggingMixin` in favor of a separate utilities module for metric handling (#7180)
Deprecated `TrainerTrainingTricksMixin` in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834)
`period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback (#6146)
Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` (#4945)
Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` (#6621)
Deprecated `PyTorchProfiler(profiled_functions)` in favor of `record_functions` (#6349)
Deprecated `@auto_move_data` in favor of `trainer.predict` (#6993)
Deprecated `Callback.on_load_checkpoint(checkpoint)` in favor of `Callback.on_load_checkpoint(trainer, pl_module, checkpoint)` (#7253)
Deprecated metrics in favor of `torchmetrics` (#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131)
Deprecated the `LightningModule.datamodule` getter and setter methods; access them through `Trainer.datamodule` instead (#7168)
Deprecated the use of `Trainer(gpus="i")` (string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
[1.3.0] - Removed¶
Removed the `exp_save_path` property from the `LightningModule` (#7266)
Removed training loop explicitly calling `EarlyStopping.on_validation_end` if no validation is run (#7069)
Removed `automatic_optimization` as a property from the training loop in favor of `LightningModule.automatic_optimization` (#7130)
Removed evaluation loop legacy returns for `*_epoch_end` hooks (#6973)
Removed support for passing a bool value to `profiler` argument of Trainer (#6164)
Removed no return warning from val/test step (#6139)
Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` (#6166)
Removed deprecated Trainer arguments `enable_pl_optimizer` and `automatic_optimization` (#6163)
Removed deprecated metrics (#6161)
  from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
  from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`
Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` (#6162)
Removed `mode='auto'` from `EarlyStopping` (#6167)
Removed `epoch` and `step` arguments from `ModelCheckpoint.format_checkpoint_name()`, these are now included in the `metrics` argument (#7344)
Removed legacy references for magic keys in the `Result` object (#6016)
Removed deprecated `LightningModule` `hparams` setter (#6207)
Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead (#6734)
Removed `trainer.fit()` return value of `1`. It has no return now (#7237)
Removed `logger_connector` legacy code (#6733)
Removed unused mixin attributes (#6487)
[1.3.0] - Fixed¶
Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
Fixed attaching train and validation dataloaders when
reload_dataloaders_every_epoch=True
andnum_sanity_val_steps=0
(#7207)Added a barrier in the accelerator
teardown
to synchronize processes before execution finishes (#6814)Fixed multi-node DDP sub-process launch by using
local_rank
instead ofglobal_rank
for main process assertion (#7061)Fixed incorrect removal of
WORLD_SIZE
environment variable in DDP training when launching with torch distributed/torchelastic (#6942)Made the
Plugin.reduce
method more consistent across all Plugins to reflect a mean-reduction by default (#6011)Move lightning module to correct device type when using LightningDistributedWrapper (#6070)
Do not print top-k verbose log with
ModelCheckpoint(monitor=None)
(#6109)Fixed
ModelCheckpoint(save_top_k=0, save_last=True)
not saving thelast
checkpoint (#6136)Fixed
.teardown(stage='fit')
and.on_fit_{start,end}()
getting called duringtrainer.test
(#6386)Fixed LightningModule
all_gather
on cpu tensors (#6416)Fixed torch distributed not available in setup hook for DDP (#6506)
Fixed
trainer.tuner.{lr_find,scale_batch_size}
not setting theTrainer
state properly (#7258)Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
Fixed pickle error checker to now check for pickle.PickleError to catch all pickle errors (#6917)
Fixed a bug where the outputs object passed to LightningModule.training_epoch_end was different from the object passed to the on_train_epoch_end hook (#6969)
Fixed a bug where the outputs passed to train_batch_end would be lists even when using a single optimizer and no truncated backprop through time steps (#6969)
Fixed bug for trainer error handling which would cause a hang in distributed training (#6864)
Fixed
self.device
not returning the correct device in replicas of data-parallel (#6414)Fixed
lr_find
trying beyondnum_training
steps and suggesting a too high learning rate (#7076)Fixed logger creating incorrect version folder in DDP with repeated
Trainer.fit
calls (#7077)Fixed metric objects passed directly to
self.log
not being reset correctly (#7055)Fixed
CombinedLoader
in distributed settings for validation / testing (#7102)Fixed the save_dir in
WandbLogger
when the run was initiated externally (#7106)Fixed
num_sanity_val_steps
affecting reproducibility of training data shuffling (#7014)Fixed resetting device after
fitting/evaluating/predicting
(#7188)Fixed bug where
trainer.tuner.scale_batch_size(max_trials=0)
would not return the correct batch size result (#7262)Fixed metrics not being properly logged with
precision=16
andmanual_optimization
(#7228)Fixed
BaseFinetuning
properly reloadingoptimizer_states
when usingresume_from_checkpoint
(#6891)Fixed
parameters_to_ignore
not properly set to DDPWrapper (#7239)Fixed parsing of
fast_dev_run=True
with the built-inArgumentParser
(#7240)Fixed handling an
IterableDataset
that fails to produce a batch at the beginning of an epoch (#7294)Fixed
LightningModule.save_hyperparameters()
when attempting to save an empty container (#7268)Fixed
apex
not properly instantiated when running withddp
(#7274)Fixed optimizer
state
not moved toGPU
(#7277)Fixed custom init args for
WandbLogger
(#6989)Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
Fixed examples ( #6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398 )
Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
Updated logic for checking TPUs availability (#6767)
Resolve TPU miss rendezvous (#6781)
Fixed auto-scaling mode when calling tune method on trainer (#7321)
Fixed finetuning so that complex models are unfrozen correctly (#6880)
Ensure we set the eval/train flag correctly on accelerator model (#6877)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic (#6802)
Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
Fixed gradient_clip_algorithm having no effect (#6928)
Fixed CUDA OOM detection and handling (#6934)
Fixed unfreeze_and_add_param_group expecting modules rather than module (#6822)
Fixed DDP + SyncBN when moving to device (#6838)
Fixed missing arguments in lr_find call (#6784)
Fixed set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
Fixed NeptuneLogger.log_text(step=None) (#7194)
[1.2.9] - 2021-04-20¶
[1.2.9] - Fixed¶
[1.2.8] - 2021-04-14¶
[1.2.8] - Added¶
Added TPUSpawn + IterableDataset error message (#6875)
[1.2.8] - Fixed¶
Fixed process rank not being available right away after
Trainer
instantiation (#6941)Fixed
sync_dist
for tpus (#6950)Fixed
AttributeError
forrequire_backward_grad_sync
when running manual optimization with sharded plugin (#6915)Fixed
--gpus
default for parser returned byTrainer.add_argparse_args
(#6898)Fixed TPU Spawn all gather (#6896)
Fixed
EarlyStopping
logic whenmin_epochs
ormin_steps
requirement is not met (#6705)Fixed csv extension check (#6436)
Fixed checkpoint issue when using Horovod distributed backend (#6958)
Fixed tensorboard exception raising (#6901)
Fixed setting the eval/train flag correctly on accelerator model (#6983)
Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
Fixed bug where
BaseFinetuning.flatten_modules()
was duplicating leaf node parameters (#6879)Set better defaults for
rank_zero_only.rank
when training is launched with SLURM and torchelastic:
[1.2.7] - 2021-04-06¶
[1.2.7] - Fixed¶
Fixed a bug with omegaconf and xm.save (#6741)
Fixed an issue with IterableDataset when len is not defined (#6828)
Sanitize None params during pruning (#6836)
Enforce an epoch scheduler interval when using SWA (#6588)
Fixed TPU Colab hang issue, post training (#6816)
Fixed a bug where TensorBoardLogger would give a warning and not log correctly to a symbolic link save_dir (#6730)
Fixed bug where predict could not be used when progress_bar_refresh_rate=0 (#6884)
[1.2.6] - 2021-03-30¶
[1.2.6] - Changed¶
Changed the behavior of on_epoch_start to run at the beginning of validation & test epoch (#6498)
[1.2.6] - Removed¶
Removed legacy code to include step dictionary returns in callback_metrics. Use self.log_dict instead. (#6682)
[1.2.6] - Fixed¶
Fixed DummyLogger.log_hyperparams raising a TypeError when running with fast_dev_run=True (#6398)
Fixed error on TPUs when there was no ModelCheckpoint (#6654)
Fixed trainer.test freeze on TPUs (#6654)
Fixed a bug where gradients were disabled after calling Trainer.predict (#6657)
Fixed bug where no TPUs were detected in a TPU pod env (#6719)
[1.2.5] - 2021-03-23¶
[1.2.5] - Changed¶
[1.2.5] - Fixed¶
[1.2.4] - 2021-03-16¶
[1.2.4] - Changed¶
Changed the default of find_unused_parameters back to True in DDP and DDP Spawn (#6438)
[1.2.4] - Fixed¶
Expose DeepSpeed loss parameters to allow users to fix loss instability (#6115)
Fixed DP reduction with collection (#6324)
Fixed an issue where the tuner would not tune the learning rate if also tuning the batch size (#4688)
Fixed broadcast to use PyTorch
broadcast_object_list
and addreduce_decision
(#6410)Fixed logger creating directory structure too early in DDP (#6380)
Fixed DeepSpeed additional memory use on rank 0 when default device not set early enough (#6460)
Fixed an issue with Tuner.scale_batch_size not finding the batch size attribute in the datamodule (#5968)
Fixed an exception in the layer summary when the model contains torch.jit scripted submodules (#6511)
Fixed the train loop config being run during Trainer.predict (#6541)
[1.2.3] - 2021-03-09¶
[1.2.3] - Fixed¶
Fixed ModelPruning(make_pruning_permanent=True) pruning buffers getting removed when saved during training (#6073)
Fixed _stable_1d_sort to work when n >= N (#6177)
Fixed AttributeError when logger=None on TPU (#6221)
Fixed PyTorch Profiler with emit_nvtx (#6260)
Fixed trainer.test from best_path hanging after calling trainer.fit (#6272)
Fixed SingleTPU calling all_gather (#6296)
Ensure we check DeepSpeed/Sharded in multi-node DDP (#6297)
Check LightningOptimizer doesn’t delete optimizer hooks (#6305)
Resolve memory leak for evaluation (#6326)
Ensure that clip gradients is only called if the value is greater than 0 (#6330)
Fixed Trainer not resetting lightning_optimizers when calling Trainer.fit() multiple times (#6372)
[1.2.2] - 2021-03-02¶
[1.2.2] - Added¶
Added
checkpoint
parameter to callback’son_save_checkpoint
hook (#6072)
[1.2.2] - Changed¶
[1.2.2] - Fixed¶
Fixed epoch level schedulers not being called when
val_check_interval < 1.0
(#6075)Fixed multiple early stopping callbacks (#6197)
Fixed incorrect usage of
detach()
,cpu()
,to()
(#6216)Fixed LBFGS optimizer support which didn’t converge in automatic optimization (#6147)
Prevent WandbLogger from dropping values (#5931)
Fixed error thrown when using valid distributed mode in multi node (#6297)
[1.2.1] - 2021-02-23¶
[1.2.1] - Fixed¶
[1.2.0] - 2021-02-18¶
[1.2.0] - Added¶
Added
DataType
,AverageMethod
andMDMCAverageMethod
enum in metrics (#5657)Added support for summarized model total params size in megabytes (#5590)
Added support for multiple train loaders (#1959)
Added
Accuracy
metric now generalizes to Top-k accuracy for (multi-dimensional) multi-class inputs using thetop_k
parameter (#4838)Added
Accuracy
metric now enables the computation of subset accuracy for multi-label or multi-dimensional multi-class inputs with thesubset_accuracy
parameter (#4838)Added
HammingDistance
metric to compute the hamming distance (loss) (#4838)Added
max_fpr
parameter toauroc
metric for computing partial auroc metric (#3790)Added
StatScores
metric to compute the number of true positives, false positives, true negatives and false negatives (#4839)Added
R2Score
metric (#5241)Added
LambdaCallback
(#5347)Added
BackboneLambdaFinetuningCallback
(#5377)Accelerator
all_gather
supports collection (#5221)Added
image_gradients
functional metric to compute the image gradients of a given input image. (#5056)Added
MetricCollection
(#4318)Added
.clone()
method to metrics (#4318)Added
IoU
class interface (#4704)Support to tie weights after moving model to TPU via
on_post_move_to_device
hookAdded missing val/test hooks in
LightningModule
(#5467)The
Recall
andPrecision
metrics (and their functional counterpartsrecall
andprecision
) can now be generalized to Recall@K and Precision@K with the use oftop_k
parameter (#4842)Added
PyTorchProfiler
(#5560)Added compositional metrics (#5464)
Added Trainer method predict(...) for high performance predictions (#5579)
Added on_before_batch_transfer and on_after_batch_transfer data hooks (#3671)
Added AUC/AUROC class interface (#5479)
Added PredictLoop object (#5752)
Added LightningModule.configure_callbacks to enable the definition of model-specific callbacks (#5621)
Added dim to PSNR metric for mean-squared-error reduction (#5957)
Added proximal policy optimization template to pl_examples (#5394)
Added
log_graph
toCometLogger
(#5295)Added possibility for nested loaders (#5404)
Added
sync_step
to Wandb logger (#5351)Added
StochasticWeightAveraging
callback (#5640)Added
LightningDataModule.from_datasets(...)
(#5133)Added
PL_TORCH_DISTRIBUTED_BACKEND
env variable to select backend (#5981)Added
Trainer
flag to activate Stochastic Weight Averaging (SWA)Trainer(stochastic_weight_avg=True)
(#6038)
[1.2.0] - Changed¶
Changed
stat_scores
metric now calculates stat scores over all classes and gains new parameters, in line with the newStatScores
metric (#4839)Changed
computer_vision_fine_tunning
example to useBackboneLambdaFinetuningCallback
(#5377)Changed
automatic casting
for LoggerConnectormetrics
(#5218)Changed
iou
[func] to allow float input (#4704)Metric
compute()
method will no longer automatically callreset()
(#5409)Set PyTorch 1.4 as min requirements, also for testing and examples
torchvision>=0.5
andtorchtext>=0.5
(#5418)Changed
callbacks
argument inTrainer
to allowCallback
input (#5446)Changed the default of
find_unused_parameters
toFalse
in DDP (#5185)Changed
ModelCheckpoint
version suffixes to start at 1 (#5008)Progress bar metrics tensors are now converted to float (#5692)
Changed the default value for the
progress_bar_refresh_rate
Trainer argument in Google COLAB notebooks to 20 (#5516)Extended support for purely iteration-based training (#5726)
Made
LightningModule.global_rank
,LightningModule.local_rank
andLightningModule.logger
read-only properties (#5730)Forced
ModelCheckpoint
callbacks to run after all others to guarantee all states are saved to the checkpoint (#5731)Refactored Accelerators and Plugins:
Added base classes for plugins (#5715)
Added parallel plugins for DP, DDP, DDPSpawn, DDP2 and Horovod (#5714)
Precision Plugins (#5718)
Added new Accelerators for CPU, GPU and TPU (#5719)
Added RPC and Sharded plugins (#5732)
Added missing
LightningModule
-wrapper logic to new plugins and accelerator (#5734)Moved device-specific teardown logic from training loop to accelerator (#5973)
Moved accelerator_connector.py to the connectors subfolder (#6033)
Trainer only references accelerator (#6039)
Made parallel devices optional across all plugins (#6051)
Enabled
self.log
in callbacks (#5094)Renamed xxx_AVAILABLE as protected (#5082)
Unified module names in Utils (#5199)
Refactor: clean trainer device & distributed getters (#5300)
Simplified training phase as LightningEnum (#5419)
Updated metrics to use LightningEnum (#5689)
Changed the sequence of on_train_batch_end, on_batch_end & on_train_epoch_end, on_epoch_end hooks (#5688)
Refactored setup_training and removed test_mode (#5388)
Disabled training with zero num_training_batches when insufficient limit_train_batches (#5703)
Refactored EpochResultStore (#5522)
Update lr_finder to check for attribute if not running fast_dev_run (#5990)
LightningOptimizer manual optimizer is more flexible and exposes toggle_model (#5771)
MlflowLogger limits parameter value length to 250 characters (#5893)
Re-introduced fix for Hydra directory sync with multiple processes (#5993)
[1.2.0] - Deprecated¶
Function stat_scores_multiple_classes is deprecated in favor of stat_scores (#4839)
Moved accelerators and plugins to their legacy package (#5645)
Deprecated LightningDistributedDataParallel in favor of new wrapper module LightningDistributedModule (#5185)
Deprecated LightningDataParallel in favor of new wrapper module LightningParallelModule (#5670)
Renamed utils modules (#5199)
argparse_utils >> argparse
model_utils >> model_helpers
warning_utils >> warnings
xla_device_utils >> xla_device
Deprecated using 'val_loss' to set the ModelCheckpoint monitor (#6012)
Deprecated .get_model() in favor of the explicit .lightning_module property (#6035)
Deprecated Trainer attribute accelerator_backend in favor of accelerator (#6034)
[1.2.0] - Removed¶
[1.2.0] - Fixed¶
Fixed distributed setting and ddp_cpu only with num_processes>1 (#5297)
Fixed num_workers for Windows example (#5375)
Fixed loading yaml (#5619)
Fixed support for custom DataLoader with DDP if they can be re-instantiated (#5745)
Fixed repeated .fit() calls ignoring the max_steps iteration bound (#5936)
Fixed throwing MisconfigurationError on unknown mode (#5255)
Resolve bug with Finetuning (#5744)
Fixed ModelCheckpoint race condition in file existence check (#5155)
Fixed some compatibility with PyTorch 1.8 (#5864)
Fixed forward cache (#5895)
Fixed recursive detach of tensors to CPU (#6007)
Fixed passing wrong strings for scheduler interval not throwing an error (#5923)
Fixed wrong requires_grad state after return None with multiple optimizers (#5738)
Fixed on_epoch_end hook not being called at the end of validation and test epochs (#5986)
Fixed missing process_dataloader call for TPUSpawn when in distributed mode (#6015)
Fixed progress bar flickering by appending 0 to floats/strings (#6009)
Fixed synchronization issues with TPU training (#6027)
Fixed hparams.yaml saved twice when using TensorBoardLogger (#5953)
Fixed fairscale compatibility with PyTorch 1.8 (#5996)
Ensured process_dataloader is called when tpu_cores > 1 to use Parallel DataLoader (#6015)
Attempted SLURM auto resume call when non-shell call fails (#6002)
Fixed wrapping optimizers upon assignment (#6006)
Fixed allowing hashing of metrics with lists in their state (#5939)
[1.1.8] - 2021-02-08¶
[1.1.8] - Fixed¶
[1.1.7] - 2021-02-03¶
[1.1.7] - Fixed¶
Fixed
TensorBoardLogger
not closingSummaryWriter
onfinalize
(#5696)Fixed filtering of pytorch “unsqueeze” warning when using DP (#5622)
Fixed
num_classes
argument in F1 metric (#5663)Fixed
log_dir
property (#5537)Fixed a race condition in
ModelCheckpoint
when checking if a checkpoint file exists (#5144)Remove unnecessary intermediate layers in Dockerfiles (#5697)
Fixed auto learning rate ordering (#5638)
[1.1.6] - 2021-01-26¶
[1.1.6] - Changed¶
[1.1.6] - Fixed¶
Fixed
toggle_optimizer
to resetrequires_grad
state (#5574)Fixed FileNotFoundError for best checkpoint when using DDP with Hydra (#5629)
Fixed an error when logging a progress bar metric with a reserved name (#5620)
Fixed Metric’s state_dict not being included when metrics are child modules (#5614)
Fixed Neptune logger creating multiple experiments when GPUs > 1 (#3256)
Fixed duplicate logs appearing in console when using the python logging module (#5509)
Fixed tensor printing in trainer.test() (#5138)
Fixed not using dataloader when hparams present (#4559)
[1.1.5] - 2021-01-19¶
[1.1.5] - Fixed¶
[1.1.4] - 2021-01-12¶
[1.1.4] - Added¶
Add automatic optimization property setter to lightning module (#5169)
[1.1.4] - Changed¶
Changed deprecated
enable_pl_optimizer=True
(#5244)
[1.1.4] - Fixed¶
Fixed
transfer_batch_to_device
for DDP withlen(devices_ids) == 1
(#5195)Logging only on
not should_accumulate()
during training (#5417)Resolve interpolation bug with Hydra (#5406)
Check environ before selecting a seed to prevent warning message (#4743)
Fixed signature mismatch in
model_to_device
ofDDPCPUHPCAccelerator
(#5505)
[1.1.3] - 2021-01-05¶
[1.1.3] - Added¶
[1.1.3] - Changed¶
[1.1.3] - Fixed¶
Fixed
trainer.test
returning non-test metrics (#5214)Fixed metric state reset (#5273)
Fixed
--num-nodes
onDDPSequentialPlugin
(#5327)Fixed invalid value for
weights_summary
(#5296)Fixed
Trainer.test
not using the latestbest_model_path
(#5161)Fixed existence check for hparams not using underlying filesystem (#5250)
Fixed
LightningOptimizer
AMP bug (#5191)Fixed casted key to string in
_flatten_dict
(#5354)
[1.1.2] - 2020-12-23¶
[1.1.2] - Added¶
[1.1.2] - Removed¶
enable_pl_optimizer=False
by default to temporarily fix AMP issues (#5163)
[1.1.2] - Fixed¶
Metric reduction with Logging (#5150)
Remove nan loss in manual optimization (#5121)
Un-balanced logging properly supported (#5119)
Fix hanging in DDP HPC accelerators (#5157)
Fix reset
TensorRunningAccum
(#5106)Updated
DALIClassificationLoader
to not use deprecated arguments (#4925)Corrected call to
torch.no_grad
(#5124)
[1.1.1] - 2020-12-15¶
[1.1.1] - Added¶
Add a notebook example to reach a quick baseline of ~94% accuracy on CIFAR10 using Resnet in Lightning (#4818)
[1.1.1] - Changed¶
[1.1.1] - Removed¶
[1.1.1] - Fixed¶
Fixed trainer by default
None
inDDPAccelerator
(#4915)Fixed
LightningOptimizer
to expose optimizer attributes (#5095)Do not warn when the
name
key is used in thelr_scheduler
dict (#5057)Check if optimizer supports closure (#4981)
Add deprecated metric utility functions back to functional ( #5067, #5068)
Allow any input in
to_onnx
andto_torchscript
(#4378)Fixed
DDPHPCAccelerator
hangs in DDP construction by callinginit_device
(#5157)
[1.1.0] - 2020-12-09¶
[1.1.0] - Added¶
Added “monitor” key to saved
ModelCheckpoints
(#4383)Added
ConfusionMatrix
class interface (#4348)Added multiclass AUROC metric (#4236)
Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
Added optimizer hooks in callbacks (#4379)
Added option to log momentum (#4384)
Added
current_score
toModelCheckpoint.on_save_checkpoint
(#4721)Added logging using
self.log
in train and evaluation for epoch end hooks ( #4552, #4495, #4439, #4684, #4913)Added ability for DDP plugin to modify optimizer state saving (#4675)
Added
prefix
argument in loggers (#4557)Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
Added
PrecisionRecallCurve, ROC, AveragePrecision
class metric (#4549)Added custom
Apex
andNativeAMP
asPrecision plugins
(#4355)Added
DALI MNIST
example (#3721)Added
sharded plugin
for DDP for multi-gpu training memory optimizations ( #4639, #4686, #4737, #4773)Added
experiment_id
to the NeptuneLogger (#3462)Added
Pytorch Geometric
integration example with Lightning (#4568)Added
all_gather
method toLightningModule
which allows gradient based tensor synchronizations for use-cases such as negative sampling. (#5012)Enabled
self.log
in most functions (#4969)Added changeable extension variable for
ModelCheckpoint
(#4977)
[1.1.0] - Changed¶
Tuner algorithms will be skipped if
fast_dev_run=True
(#3903)WandbLogger
does not force wandbreinit
arg to True anymore and creates a run only when needed (#4648)Changed
automatic_optimization
to be a model attribute (#4602)Changed
Simple Profiler
report to order by percentage time spent + num calls (#4880)Simplify optimization Logic (#4984)
Classification metrics overhaul (#4837)
Updated
fast_dev_run
to accept integer representing num_batches (#4629)Refactored optimizer (#4658)
[1.1.0] - Deprecated¶
[1.1.0] - Removed¶
[1.1.0] - Fixed¶
Added feature to move tensors to CPU before saving (#4309)
Fixed
LoggerConnector
to have logged metrics on root device in DP (#4138)Auto convert tensors to contiguous format when
gather_all
(#4907)Fixed
PYTHONPATH
for ddp test model (#4528)Fixed allowing logger to support indexing (#4595)
Fixed DDP and manual_optimization (#4976)
[1.0.8] - 2020-11-24¶
[1.0.8] - Added¶
[1.0.8] - Changed¶
Consistently use step=trainer.global_step in LearningRateMonitor independently of logging_interval (#4376)
Metric states are no longer added to state_dict by default (#4685)
Renamed class metric Fbeta >> FBeta (#4656)
Model summary: add 1 decimal place (#4745)
Do not override PYTHONWARNINGS (#4700)
Moved init_ddp_connection from DDP to DDPPlugin (#4407)
[1.0.8] - Fixed¶
Fixed checkpoint hparams dict casting when omegaconf is available (#4770)
Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
Updated SSIM metric (#4566)
Fixed batch_arg_name bug by adding batch_arg_name to all calls to _adjust_batch_size (#4812)
Fixed torchtext data to GPU (#4785)
Fixed a crash bug in MLFlow logger (#4716)
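The incomplete progress bar entry (#4577) concerns epochs whose batch count is not a multiple of the refresh rate. The following is a minimal sketch of that idea, not the library's implementation: the helper name progress_increments is hypothetical, and it assumes the bar advances once every refresh_rate batches and has to flush the remainder on the last batch.

```python
def progress_increments(total_batches: int, refresh_rate: int) -> list:
    """Return the increments a progress bar would receive over one epoch.

    Hypothetical helper for illustration; not part of pytorch_lightning.
    """
    increments = []
    for batch_idx in range(1, total_batches + 1):
        if batch_idx % refresh_rate == 0:
            # Regular refresh: advance by one full refresh interval.
            increments.append(refresh_rate)
        elif batch_idx == total_batches:
            # Last batch: flush the leftover batches so the bar completes.
            increments.append(total_batches % refresh_rate)
    return increments


increments = progress_increments(total_batches=10, refresh_rate=4)
print(increments, sum(increments))
```

Without the final flush branch, the increments in this example would sum to 8 rather than 10, which is the kind of incomplete bar the entry describes.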
[1.0.7] - 2020-11-17¶
[1.0.7] - Added¶
Added lambda closure to
manual_optimizer_step
(#4618)
[1.0.7] - Changed¶
[1.0.7] - Fixed¶
Prevent crash if
sync_dist=True
on CPU (#4626)Fixed average pbar Metrics (#4534)
Fixed
setup
callback hook to correctly pass the LightningModule through (#4608)Allowing decorate model init with saving
hparams
inside (#4662)Fixed
split_idx
set byLoggerConnector
inon_trainer_init
toTrainer
(#4697)
[1.0.6] - 2020-11-11¶
[1.0.6] - Added¶
Added metrics aggregation in Horovod and fixed early stopping (#3775)
Added manual_optimizer_step which works with AMP Native and accumulated_grad_batches (#4485)
Added persistent(mode) method to metrics, to enable and disable metric states being added to state_dict (#4482)
Added congratulations at the end of our notebooks (#4555)
Added parameter move_metrics_to_cpu in Trainer to disable gpu leak (#4592)
[1.0.6] - Changed¶
[1.0.6] - Fixed¶
Fixed feature-lack in hpc_load (#4526)
Fixed metrics states being overridden in DDP mode (#4482)
Fixed lightning_getattr, lightning_hasattr not finding the correct attributes in datamodule (#4347)
Fixed automatic optimization AMP by manual_optimization_step (#4485)
Replace MisconfigurationException with warning in ModelCheckpoint Callback (#4560)
Fixed logged keys in mlflow logger (#4412)
Fixed is_picklable by catching AttributeError (#4508)
Fixed multi test dataloaders dict AttributeError error (#4480)
Fixed showing the progress bar only for progress_rank 0 on DDP_SLURM (#4437)
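The is_picklable entry (#4508) can be illustrated with a minimal sketch. It shows the general shape of such a check and is an assumption, not the library's code: pickling can fail with a pickle.PickleError subclass, but also with AttributeError or TypeError depending on the object and the Python version, so the sketch catches all three.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if obj can be serialized with pickle (illustrative sketch)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PickleError, AttributeError, TypeError):
        # PickleError covers PicklingError; some local objects raise
        # AttributeError, and objects such as locks raise TypeError.
        return False


print(is_picklable({"lr": 0.01}))       # plain data
print(is_picklable(lambda x: x))        # lambda
print(is_picklable(threading.Lock()))   # lock object
```

Catching only one exception type would let the other failures propagate to the caller, which is why a check like this usually lists several.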
[1.0.5] - 2020-11-03¶
[1.0.5] - Added¶
[1.0.5] - Changed¶
W&B log in sync with
Trainer
step (#4405)Hook
on_after_backward
is called only whenoptimizer_step
is being called (#4439)Moved
track_and_norm_grad
intotraining loop
and called only whenoptimizer_step
is being called (#4439)Changed type checker with explicit cast of
ref_model
object (#4457)Changed
distributed_backend
->accelerator
(#4429)
[1.0.5] - Deprecated¶
Deprecated passing
ModelCheckpoint
instance tocheckpoint_callback
Trainer argument (#4336)
[1.0.5] - Fixed¶
Disable saving checkpoints if not trained (#4372)
Fixed error using
auto_select_gpus=True
withgpus=-1
(#4209)Disabled training when
limit_train_batches=0
(#4371)Fixed that metrics do not store computational graph for all seen data (#4313)
Fixed AMP unscale for
on_after_backward
(#4439)Fixed TorchScript export when module includes Metrics (#4428)
Fixed TorchScript trace method’s data to device and docstring (#4360)
Fixed CSV logger warning (#4419)
Fixed skip DDP parameter sync (#4301)
Fixed
WandbLogger
_sanitize_callable function (#4422)Fixed
AMP Native
_unscale
gradient (#4441)
[1.0.4] - 2020-10-27¶
[1.0.4] - Added¶
Added
dirpath
andfilename
parameter inModelCheckpoint
(#4213)Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
Added
strict
option to the scheduler dictionary (#3586)Added
fsspec
support for profilers (#4162)Added autogenerated helptext to
Trainer.add_argparse_args
(#4344)Added support for string values in
Trainer
’sprofiler
parameter (#3656)Added
optimizer_closure
tooptimizer.step
when supported (#4190)Added unification of regression metrics (#4166)
Added checkpoint load from Bytes (#4314)
[1.0.4] - Changed¶
[1.0.4] - Deprecated¶
[1.0.4] - Fixed¶
Fixed setting device ids in DDP (#4297)
Fixed synchronization of best model path in ddp_accelerator (#4323)
Fixed WandbLogger not uploading checkpoint artifacts at the end of training (#4341)
Fixed FBeta computation (#4183)
Ensured accumulation across batches has completed before breaking the training loop (#4278)
Fixed ModelCheckpoint to not increase current_epoch and global_step when not training (#4291)
Fixed COMET_EXPERIMENT_KEY environment variable usage in comet logger (#4230)
[1.0.3] - 2020-10-20¶
[1.0.3] - Added¶
Added persistent flag to
Metric.add_state
(#4195)
[1.0.3] - Changed¶
[1.0.3] - Fixed¶
[1.0.2] - 2020-10-15¶
[1.0.2] - Added¶
Added trace functionality to the function
to_torchscript
(#4142)
[1.0.2] - Changed¶
Called
on_load_checkpoint
before loadingstate_dict
(#4057)
[1.0.2] - Removed¶
Removed duplicate metric vs step log for train loop (#4173)
[1.0.2] - Fixed¶
[1.0.1] - 2020-10-14¶
[1.0.1] - Added¶
Added getstate/setstate method for torch.save serialization (#4127)
[1.0.0] - 2020-10-13¶
[1.0.0] - Added¶
Added Explained Variance Metric + metric fix (#4013)
Added Metric <-> Lightning Module integration tests (#4008)
Added parsing OS env vars in
Trainer
(#4022)Added classification metrics (#4043)
Updated explained variance metric (#4024)
Enabled plugins (#4041)
Enabled custom clusters (#4048)
Enabled passing in custom accelerators (#4050)
Added
LightningModule.toggle_optimizer
(#4058)Added
LightningModule.manual_backward
(#4063)Added
output
argument to*_epoch_end
hooks (#3967)
[1.0.0] - Changed¶
[1.0.0] - Removed¶
Removed support for EvalResult and TrainResult (#3968)
Removed deprecated trainer flags:
overfit_pct
,log_save_interval
,row_log_interval
(#3969)Removed deprecated early_stop_callback (#3982)
Removed deprecated model hooks (#3980)
Removed deprecated callbacks (#3979)
Removed trainer argument in LightningModule.backward (#4056)
[1.0.0] - Fixed¶
[0.10.0] - 2020-10-07¶
[0.10.0] - Added¶
Enable PyTorch 1.7 compatibility (#3541)
Added
LightningModule.to_torchscript
to support exporting asScriptModule
(#3258)Added warning when dropping unpicklable
hparams
(#2874)Added EMB similarity (#3349)
Added
ModelCheckpoint.to_yaml
method (#3048)Allow
ModelCheckpoint
monitor to beNone
, meaning it will always save (#3630)Disabled optimizers setup during testing (#3059)
Added support for datamodules to save and load checkpoints when training (#3563)
Added support for datamodule in learning rate finder (#3425)
Added gradient clip test for native AMP (#3754)
Added dist lib to enable syncing anything across devices (#3762)
Added
broadcast
toTPUBackend
(#3814)Added
XLADeviceUtils
class to check XLA device type (#3274)
[0.10.0] - Changed¶
Refactored accelerator backends:
moved TPU
xxx_step
to backend (#3118)refactored DDP backend
forward
(#3119)refactored GPU backend
__step
(#3120)remove obscure forward call in eval + CPU backend
___step
(#3123)reduced all simplified forward (#3126)
added hook base method (#3127)
refactor eval loop to use hooks - use
test_mode
for if so we can split later (#3129)moved
___step_end
hooks (#3130)training forward refactor (#3134)
training AMP scaling refactor (#3135)
eval step scaling factor (#3136)
add eval loop object to streamline eval loop (#3138)
refactored dataloader process hook (#3139)
refactored inner eval loop (#3141)
final inner eval loop hooks (#3154)
clean up hooks in
run_evaluation
(#3156)clean up data reset (#3161)
expand eval loop out (#3165)
moved hooks around in eval loop (#3195)
remove
_evaluate
fx (#3197)Trainer.fit
hook clean up (#3198)DDPs train hooks (#3203)
reduced accelerator selection (#3211)
group prepare data hook (#3212)
added data connector (#3285)
modular is_overridden (#3290)
adding
Trainer.tune()
(#3293)move
run_pretrain_routine
->setup_training
(#3294)move train outside of setup training (#3297)
move
prepare_data
to data connector (#3307)moved accelerator router (#3309)
train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
duplicate data interface definition up into DataHooks class (#3344)
inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
all logging related calls in a connector (#3395)
added model connector (#3407)
moved eval loop logging to loggers (#3408)
moved eval loop (#3412, #3408)
move
lr_finder
(#3434)move specific accelerator code (#3457)
group connectors (#3472)
apex plugin (#3502)
precision plugins (#3504)
Result - make monitor default to
checkpoint_on
to simplify (#3571)reference to the Trainer on the
LightningDataModule
(#3684)add
.log
to lightning module (#3686, #3699, #3701, #3704, #3715)enable tracking original metric when step and epoch are both true (#3685)
deprecated results obj, added support for simpler comms (#3681)
move backends back to individual files (#3712)
fixes logging for eval steps (#3763)
decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806, #3817, #3819, #3927)
remove weight loading hack for ddp_cpu (#3808)
separate
torchelastic
from DDP (#3810)separate SLURM from DDP (#3809)
decoupled DDP2 (#3816)
bug fix with logging val epoch end + monitor (#3812)
callback system and init DDP (#3836)
epoch can now log independently (#3843)
test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
fixed
init_slurm_connection
causing hostname errors (#3856)moves init apex from LM to apex connector (#3923)
moves sync bn to each backend (#3925)
moves configure ddp to each backend (#3924)
Deprecation warning (#3844)
Changed LearningRateLogger to LearningRateMonitor (#3251)
Used fsspec instead of gfile for all IO (#3320)
Swapped torch.load for fsspec load in DDP spawn backend (#3787)
Swapped torch.load for fsspec load in cloud_io loading (#3692)
Added support for to_disk() to use remote filepaths with fsspec (#3930)
Updated model_checkpoint’s to_yaml to use fsspec open (#3801)
Fixed fsspec being inconsistent when doing fs.ls (#3805)
Refactor
GPUStatsMonitor
to improve training speed (#3257)Changed IoU score behavior for classes absent in target and pred (#3098)
Changed IoU
remove_bg
bool toignore_index
optional int (#3098)Changed defaults of
save_top_k
andsave_last
toNone
in ModelCheckpoint (#3680)row_log_interval
andlog_save_interval
are now based on training loop’sglobal_step
instead of epoch-internal batch index (#3667)Silenced some warnings. verified ddp refactors (#3483)
Cleaning up stale logger tests (#3490)
Allow ModelCheckpoint monitor to be None (#3633)
Enable None model checkpoint default (#3669)
Skipped best_model_path if checkpoint_callback is None (#2962)
Used raise .. from .. to explicitly chain exceptions (#3750)
Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
Write predictions in LightningModule instead of EvalResult (#3882)
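The raise .. from .. entry (#3750) refers to Python's explicit exception chaining. A minimal sketch of the pattern follows; the ConfigError class and the parse_refresh_rate function are hypothetical names defined here for illustration and do not come from the library.

```python
class ConfigError(Exception):
    """Hypothetical error type, defined here only for the example."""


def parse_refresh_rate(value: str) -> int:
    """Convert a string setting to an int, chaining any parsing failure."""
    try:
        return int(value)
    except ValueError as err:
        # `from err` stores the original error in __cause__, so the traceback
        # reports it as the direct cause of the new exception.
        raise ConfigError(f"Invalid refresh rate: {value!r}") from err


try:
    parse_refresh_rate("fast")
except ConfigError as exc:
    print(type(exc.__cause__).__name__)
```

The chained form keeps the low-level error available to the caller through __cause__ while still raising a domain-specific exception.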
[0.10.0] - Deprecated¶
Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
Deprecate early_stop_callback Trainer argument (#3845)
Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748)
[0.10.0] - Removed¶
Removed experimental Metric API (#3943, #3949, #3946), listed changes before final removal:
Added hooks to metric module interface (#2528)
Added error when AUROC metric is used for multiclass problems (#3350)
Fixed ModelCheckpoint with save_top_k=-1 option not tracking the best models when a monitor metric is available (#3735)
Fixed counter-intuitive error being thrown in Accuracy metric for zero target tensor (#3764)
Fixed aggregation of metrics (#3517)
Fixed Metric aggregation (#3321)
Fixed RMSLE metric (#3188)
Renamed reduction to class_reduction in classification metrics (#3322)
Changed class_reduction similar to sklearn for classification metrics (#3322)
Renaming of precision recall metric (#3308)
[0.10.0] - Fixed¶
Fixed on_train_batch_start hook to end epoch early (#3700)
Fixed num_sanity_val_steps is clipped to limit_val_batches (#2917)
Fixed ONNX model save on GPU (#3145)
Fixed GpuUsageLogger to work on different platforms (#3008)
Fixed auto-scale batch size not dumping auto_lr_find parameter (#3151)
Fixed batch_outputs with optimizer frequencies (#3229)
Fixed setting batch size in LightningModule.datamodule when using auto_scale_batch_size (#3266)
Fixed Horovod distributed backend compatibility with native AMP (#3404)
Fixed batch size auto scaling exceeding the size of the dataset (#3271)
Fixed getting experiment_id from MLFlow only once instead of each training loop (#3394)
Fixed overfit_batches which now correctly disables shuffling for the training loader. (#3501)
Fixed gradient norm tracking for row_log_interval > 1 (#3489)
Fixed ModelCheckpoint name formatting (#3164)
Fixed example implementation of AutoEncoder (#3190)
Fixed invalid paths when remote logging with TensorBoard (#3236)
Fixed change t() to transpose() as XLA devices do not support .t() on 1-dim tensor (#3252)
Fixed (weights only) checkpoints loading without PL (#3287)
Fixed gather_all_tensors cross GPUs in DDP (#3319)
Fixed CometML save dir (#3419)
Fixed forward key metrics (#3467)
Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
Fixed global step increment in training loop when training_epoch_end hook is used (#3673)
Fixed dataloader shuffling not getting turned off with overfit_batches > 0 and distributed_backend = "ddp" (#3534)
Fixed determinism in DDPSpawnBackend when using seed_everything in main process (#3335)
Fixed ModelCheckpoint period to actually save every period epochs (#3630)
Fixed val_progress_bar total with num_sanity_val_steps (#3751)
Fixed Tuner dump: add current_epoch to dumped_params (#3261)
Fixed current_epoch and global_step properties mismatch between Trainer and LightningModule (#3785)
Fixed learning rate scheduler for optimizers with internal state (#3897)
Fixed tbptt_reduce_fx when non-floating tensors are logged (#3796)
Fixed model checkpoint frequency (#3852)
Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
Fixed TrainerEvaluationLoopMixin activates model.train() at the end (#3858)
Fixed overfit_batches when using with multiple val/test_dataloaders (#3857)
Fixed enables training_step to return None (#3862)
Fixed init nan for checkpointing (#3863)
Fixed for load_from_checkpoint (#2776)
Fixes incorrect batch_sizes when Dataloader returns a dict with multiple tensors (#3668)
Fixed unexpected signature for validation_step (#3947)
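The confusion-matrix entry (#3465) concerns rows that sum to zero: normalizing such a row divides zero by zero, which in a tensor library yields NaN. The sketch below reproduces the idea in plain Python lists rather than tensors; `normalize_rows` is a hypothetical helper, and guarding the division up front is an assumption standing in for whatever the actual fix does.

```python
def normalize_rows(matrix):
    """Row-normalize a confusion matrix, mapping empty rows to zeros.

    A class that never occurs in the targets has an all-zero row; dividing
    by its sum would be 0/0, so that row is reported as zeros instead.
    """
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# The second class never appears in the targets, so its row is all zeros.
result = normalize_rows([[2, 2], [0, 0]])
print(result)
```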
[0.9.0] - 2020-08-20¶
[0.9.0] - Added¶
Added basic CSVLogger (#2721)
Added SSIM metrics (#2671)
Added BLEU metrics (#2535)
Added support to export a model to ONNX format (#2596)
Added support for Trainer(num_sanity_val_steps=-1) to check all validation data before training (#2246)
Added struct. output:
Added class LightningDataModule (#2668)
Added support for PyTorch 1.6 (#2745)
Added call DataModule hooks implicitly in trainer (#2755)
Added support for Mean in DDP Sync (#2568)
Added remaining sklearn metrics: AveragePrecision, BalancedAccuracy, CohenKappaScore, DCG, Hamming, Hinge, Jaccard, MeanAbsoluteError, MeanSquaredError, MeanSquaredLogError, MedianAbsoluteError, R2Score, MeanPoissonDeviance, MeanGammaDeviance, MeanTweedieDeviance, ExplainedVariance (#2562)
Added support for limit_{mode}_batches (int) to work with infinite dataloader (IterableDataset) (#2840)
Added support returning python scalars in DP (#1935)
Added support to Tensorboard logger for OmegaConf hparams (#2846)
Added tracking of basic states in Trainer (#2541)
Tracks all outputs including TBPTT and multiple optimizers (#2890)
Added GPU Usage Logger (#2932)
Added strict=False for load_from_checkpoint (#2819)
Added saving test predictions on multiple GPUs (#2926)
Auto log the computational graph for loggers that support this (#3003)
Added warning when changing monitor and using results obj (#3014)
Added a hook transfer_batch_to_device to the LightningDataModule (#3038)
[0.9.0] - Changed¶
Truncated long version numbers in progress bar (#2594)
Enabling val/test loop disabling (#2692)
Refactored into accelerator module:
Using .comet.config file for CometLogger (#1913)
Updated hooks arguments - breaking for setup and teardown (#2850)
Using gfile to support remote directories (#2164)
Moved optimizer creation after device placement for DDP backends (#2904)
Support **DictConfig for hparam serialization (#2519)
Removed callback metrics from test results obj (#2994)
Re-enabled naming metrics in ckpt name (#3060)
Changed progress bar epoch counting to start from 0 (#3061)
[0.9.0] - Deprecated¶
Deprecated Trainer attribute ckpt_path, which will now be set by weights_save_path (#2681)
[0.9.0] - Removed¶
Removed deprecated: (#2760)
core decorator data_loader
Module hook on_sanity_check_start and loading load_from_metrics
package pytorch_lightning.logging
Trainer arguments: show_progress_bar, num_tpu_cores, use_amp, print_nan_grads
LR Finder argument num_accumulation_steps
[0.9.0] - Fixed¶
Fixed accumulate_grad_batches for last batch (#2853)
Fixed setup call while testing (#2624)
Fixed local rank zero casting (#2640)
Fixed single scalar return from training (#2587)
Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
Fixed dtype and device properties not getting updated in submodules (#2657)
Fixed fast_dev_run to run for all dataloaders (#2581)
Fixed save_dir in loggers getting ignored by default value of weights_save_path when user did not specify weights_save_path (#2681)
Fixed weights_save_path getting ignored when logger=False is passed to Trainer (#2681)
Fixed TPU multi-core and Float16 (#2632)
Fixed test metrics not being logged with LoggerCollection (#2723)
Fixed data transfer to device when using torchtext.data.Field and include_lengths is True (#2689)
Fixed shuffle argument for distributed sampler (#2789)
Fixed logging interval (#2694)
Fixed loss value in the progress bar is wrong when accumulate_grad_batches > 1 (#2738)
Fixed correct CWD for ddp sub-processes when using Hydra (#2719)
Fixed selecting GPUs using CUDA_VISIBLE_DEVICES (#2739)
Fixed false num_classes warning in metrics (#2781)
Fixed shell injection vulnerability in subprocess call (#2786)
Fixed LR finder and hparams compatibility (#2821)
Fixed ModelCheckpoint not saving the latest information when save_last=True (#2881)
Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
Fixed apex gradient clipping (#2829)
Fixed save apex scaler states (#2828)
Fixed a model loading issue with inheritance and variable positional arguments (#2911)
Fixed passing non_blocking=True when transferring a batch object that does not support it (#2910)
Fixed checkpointing to remote file paths (#2925)
Fixed adding val step argument to metrics (#2986)
Fixed an issue that caused Trainer.test() to stall in ddp mode (#2997)
Fixed gathering of results with tensors of varying shape (#3020)
Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
Fixed automatic batch scaling not working with half precision (#3045)
Fixed setting device to root gpu (#3042)
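The shell-injection entry (#2786) does not say how the subprocess call was changed. A common remedy for this class of bug is to pass the command as an argument list instead of a single string run through a shell. The sketch below shows that general technique only; `run_child` is a hypothetical helper and is not taken from the project.

```python
import subprocess
import sys


def run_child(script_arg: str) -> str:
    """Run a child interpreter that echoes its first argument back."""
    # Passing a list (and leaving shell=False, the default) hands each item
    # to the child as one literal argument, so characters such as ";" or "&&"
    # are never interpreted by a shell.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", script_arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# With shell=True and string interpolation this input could run a second command.
hostile = "data; echo INJECTED"
echoed = run_child(hostile)
print(echoed)
```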
[0.8.5] - 2020-07-09¶
[0.8.5] - Added¶
[0.8.5] - Removed¶
Removed auto val reduce (#2462)
[0.8.5] - Fixed¶
Flattening Wandb Hyperparameters (#2459)
Fixed using the same DDP python interpreter and actually running (#2482)
Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
Made TensorBoardLogger and CometLogger pickleable (#2518)
Fixed a problem with MLflowLogger creating multiple run folders (#2502)
Fixed global_step increment (#2455)
Fixed TPU hanging example (#2488)
Fixed argparse default value bug (#2526)
Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
Fixed Trainer .fit() returning last not best weights in “ddp_spawn” (#2565)
Fixed passing (do not pass) TPU weights back on test (#2566)
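The Dice/IoU entry (#2545) describes adding a small epsilon so that empty predictions and targets do not produce 0/0. The sketch below shows the idea on plain Python lists; `dice_score`, the default `eps`, and the choice to add it to both numerator and denominator are assumptions made for illustration, not the library's exact formula.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient for binary masks given as flat sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # Without eps, two empty masks give 0 / 0; with eps in both places the
    # score is exactly 1.0 for that case, and the effect elsewhere is negligible.
    return (2.0 * intersection + eps) / (denominator + eps)


empty_case = dice_score([0, 0, 0], [0, 0, 0])
partial_case = dice_score([1, 1, 0], [1, 0, 0])
print(empty_case, round(partial_case, 4))
```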
[0.8.4] - 2020-07-01¶
[0.8.4] - Added¶
[0.8.4] - Changed¶
Enabled no returns from eval (#2446)
[0.8.4] - Fixed¶
[0.8.3] - 2020-06-29¶
[0.8.3] - Fixed¶
[0.8.2] - 2020-06-28¶
[0.8.2] - Added¶
Added TorchText support for moving data to GPU (#2379)
[0.8.2] - Changed¶
[0.8.2] - Removed¶
Moved TrainsLogger to Bolts (#2384)
[0.8.2] - Fixed¶
Fixed parsing TPU arguments and TPU tests (#2094)
Fixed number batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
Fixed an issue with forward hooks not being removed after model summary (#2298)
Fix for load_from_checkpoint() not working with absolute path on Windows (#2294)
Fixed an issue how _has_len handles NotImplementedError e.g. raised by torchtext.data.Iterator (#2293), (#2307)
Fixed average_precision metric (#2319)
Fixed ROC metric for CUDA tensors (#2304)
Fixed lost compatibility with custom datatypes implementing .to (#2335)
Fixed loading model with kwargs (#2387)
Fixed sum(0) for trainer.num_val_batches (#2268)
Fixed checking if the parameters are a DictConfig Object (#2216)
Fixed SLURM weights saving (#2341)
Fixed swaps LR scheduler order (#2356)
Fixed adding tensorboard hparams logging test (#2342)
Fixed use model ref for tear down (#2360)
Fixed logger crash on DDP (#2388)
Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
Fixed loading past checkpoints from v0.7.x (#2405)
Fixed loading model without arguments (#2403)
Fixed Windows compatibility issue (#2358)
[0.8.1] - 2020-06-19¶
[0.8.1] - Fixed¶
[0.8.0] - 2020-06-18¶
[0.8.0] - Added¶
Added overfit_batches, limit_{val|test}_batches flags (overfit now uses training set for all three) (#2213)
Added metrics
Allow dataloaders without sampler field present (#1907)
Added option save_last to save the model at the end of every epoch in ModelCheckpoint (#1908)
Early stopping checks on_validation_end (#1458)
Speed up single-core TPU training by loading data using ParallelLoader (#2033)
Added a model hook transfer_batch_to_device that enables moving custom data structures to the target device (#1756)
Added black formatter for the code with code-checker on pull (#1610)
Added back the slow spawn ddp implementation as ddp_spawn (#2115)
Added loading checkpoints from URLs (#1667)
Added a callback method on_keyboard_interrupt for handling KeyboardInterrupt events during training (#2134)
Added a decorator auto_move_data that moves data to the correct device when using the LightningModule for inference (#1905)
Added ckpt_path option to LightningModule.test(...) to load particular checkpoint (#2190)
Added setup and teardown hooks for model (#2229)
[0.8.0] - Changed¶
Allow user to select individual TPU core to train on (#1729)
Removed non-finite values from loss in LRFinder (#1862)
Allow passing model hyperparameters as complete kwarg list (#1896)
Renamed ModelCheckpoint’s attributes best to best_model_score and kth_best_model to kth_best_model_path (#1799)
Re-Enable Logger’s ImportErrors (#1938)
Changed the default value of the Trainer argument weights_summary from full to top (#2029)
Raise an error when lightning replaces an existing sampler (#2020)
Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
Remove explicit flush from tensorboard logger (#2126)
Changed epoch indexing from 1 instead of 0 (#2206)
[0.8.0] - Deprecated¶
Deprecated flags: (#2213)
overfit_pct in favour of overfit_batches
val_percent_check in favour of limit_val_batches
test_percent_check in favour of limit_test_batches
Deprecated ModelCheckpoint’s attributes best and kth_best_model (#1799)
Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Deprecated Trainer proc_rank in favour of global_rank (#2166, #2269)
[0.8.0] - Removed¶
Removed unintended Trainer argument progress_bar_callback, the callback should be passed in by Trainer(callbacks=[...]) instead (#1855)
Removed obsolete self._device in Trainer (#1849)
Removed deprecated API (#2073)
Packages: pytorch_lightning.pt_overrides, pytorch_lightning.root_module
Modules: pytorch_lightning.logging.comet_logger, pytorch_lightning.logging.mlflow_logger, pytorch_lightning.logging.test_tube_logger, pytorch_lightning.overrides.override_data_parallel, pytorch_lightning.core.model_saving, pytorch_lightning.core.root_module
Trainer arguments: add_row_log_interval, default_save_path, gradient_clip, nb_gpu_nodes, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps
Trainer attributes: nb_gpu_nodes, num_gpu_nodes, gradient_clip, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps, default_save_path, tng_tqdm_dic
[0.8.0] - Fixed¶
Run graceful training teardown on interpreter exit (#1631)
Fixed user warning when apex was used together with learning rate schedulers (#1873)
Fixed multiple calls of EarlyStopping callback (#1863)
Fixed an issue with Trainer.from_argparse_args when passing in unknown Trainer args (#1932)
Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
Fixed root node resolution for SLURM cluster with dash in host name (#1954)
Fixed LearningRateLogger in multi-scheduler setting (#1944)
Fixed test configuration check and testing (#1804)
Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
Fixed save_weights_only in ModelCheckpoint (#1780)
Allow use of same WandbLogger instance for multiple training loops (#2055)
Fixed an issue with _auto_collect_arguments collecting local variables that are not constructor arguments and not working for signatures that have the instance not named self (#2048)
Fixed mistake in parameters’ grad norm tracking (#2012)
Fixed CPU and hanging GPU crash (#2118)
Fixed an issue with the model summary and example_input_array depending on a specific ordering of the submodules in a LightningModule (#1773)
Fixed Tpu logging (#2230)
[0.7.6] - 2020-05-16¶
[0.7.6] - Added¶
Added callback for logging learning rates (#1498)
Added transfer learning example (for a binary classification task in computer vision) (#1564)
Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723).
Added auto scaling of batch size (#1638)
The progress bar metrics now also get updated in training_epoch_end (#1724)
Enable NeptuneLogger to work with distributed_backend=ddp (#1753)
Added option to provide seed to random generators to ensure reproducibility (#1572)
Added override for hparams in load_from_ckpt (#1797)
Added support multi-node distributed execution under torchelastic (#1811, #1818)
Added dummy logger for internally disabling logging for some features (#1836)
[0.7.6] - Changed¶
Enable non-blocking for device transfers to GPU (#1843)
Replace meta_tags.csv with hparams.yaml (#1271)
Reduction when batch_size < num_gpus (#1609)
Updated LightningTemplateModel to look more like Colab example (#1577)
Don’t convert namedtuple to tuple when transferring the batch to target device (#1589)
Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
Args should come after the last positional argument (#1807)
Made ddp the default if no backend specified with multiple GPUs (#1789)
[0.7.6] - Deprecated¶
Deprecated tags_csv in favor of hparams_file (#1271)
[0.7.6] - Fixed¶
Fixed broken link in PR template (#1675)
Fixed ModelCheckpoint not None checking filepath (#1654)
Trainer now calls on_load_checkpoint() when resuming from a checkpoint (#1666)
Fixed sampler logic for ddp with iterable dataset (#1734)
Fixed _reset_eval_dataloader() for IterableDataset (#1560)
Fixed Horovod distributed backend to set the root_gpu property (#1669)
Fixed wandb logger global_step affects other loggers (#1492)
Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
Fixed a bug in Trainer that prepended the checkpoint path with version_ when it shouldn’t (#1748)
Fixed lr key name in case of param groups in LearningRateLogger (#1719)
Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
Fixed num processes wasn’t being set properly and auto sampler was ddp failing (#1819)
Fixed bugs in semantic segmentation example (#1824)
Fixed saving native AMP scaler state (#1777)
Fixed native amp + ddp (#1788)
Fixed hparam logging with metrics (#1647)
[0.7.5] - 2020-04-27¶
[0.7.5] - Changed¶
Allow logging of metrics together with hparams (#1630)
[0.7.5] - Removed¶
Removed Warning from trainer loop (#1634)
[0.7.5] - Fixed¶
[0.7.4] - 2020-04-26¶
[0.7.4] - Added¶
Added flag replace_sampler_ddp to manually disable sampler replacement in DDP (#1513)
Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
Added learning rate finder (#1347)
Added support for DDP mode in clusters without SLURM (#1387)
Added test_dataloaders parameter to Trainer.test() (#1434)
Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True (#1475)
Added speed parity tests (max 1 sec difference per epoch) (#1482)
Added ddp_cpu backend for testing ddp without GPUs (#1158)
Added Horovod support as a distributed backend Trainer(distributed_backend='horovod') (#1529)
Added support for 8 core distributed training on Kaggle TPU’s (#1568)
[0.7.4] - Changed¶
Changed the default behaviour to no longer include a NaN check with each training iteration (#1475)
Decoupled the progress bar from trainer; it is a callback now and can be customized or even be replaced entirely (#1450).
Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
Updated semantic segmentation example with custom U-Net and logging (#1371)
Disabled val and test shuffling (#1600)
[0.7.4] - Deprecated¶
Deprecated training_tqdm_dict in favor of progress_bar_dict (#1450).
[0.7.4] - Removed¶
Removed test_dataloaders parameter from Trainer.fit() (#1434)
[0.7.4] - Fixed¶
Added the possibility to pass nested metrics dictionaries to loggers (#1582)
Fixed memory leak from opt return (#1528)
Fixed saving checkpoint before deleting old ones (#1453)
Fixed loggers - flushing last logged metrics even before continue, e.g. trainer.test() results (#1459)
Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
Fixed LightningModule - mixing hparams and arguments in LightningModule.__init__() crashes load_from_checkpoint() (#1505)
Added a missing call to the on_before_zero_grad model hook (#1493).
Allow use of sweeps with WandbLogger (#1512)
Fixed a bug that caused the callbacks Trainer argument to reference a global variable (#1534).
Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
Fixed do not copy the batch when training on a single GPU (#1576, #1579)
Fixed soft checkpoint removing on DDP (#1408)
Fixed automatic parser bug (#1585)
Fixed bool conversion from string (#1606)
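The two boolean-argument entries (#1571, #1606) match a well-known argparse pitfall: with type=bool, any non-empty string, including "False", converts to True. The sketch below shows one conventional workaround with an explicit converter. `str_to_bool` and the accepted spellings are assumptions for this example, and `--fast_dev_run` is used only as a familiar flag name; this is not the project's actual parser code.

```python
import argparse


def str_to_bool(text: str) -> bool:
    """Convert common textual spellings of a boolean (hypothetical helper)."""
    lowered = text.strip().lower()
    if lowered in {"1", "true", "t", "yes", "y"}:
        return True
    if lowered in {"0", "false", "f", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


parser = argparse.ArgumentParser()
# type=bool would call bool("False"), which is True for any non-empty string.
parser.add_argument("--fast_dev_run", type=str_to_bool, default=False)

args = parser.parse_args(["--fast_dev_run", "False"])
print(args.fast_dev_run)
```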
[0.7.3] - 2020-04-09¶
[0.7.3] - Added¶
Added rank_zero_warn for warning only in rank 0 (#1428)
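The idea behind a rank-zero warning is that, in a multi-process run, only one process should emit a given message. The sketch below shows one way such a guard can be built. Reading the rank from a `LOCAL_RANK` environment variable is an assumption made here, and both functions are simplified stand-ins that only borrow the name from the entry above, not the library's implementation.

```python
import functools
import os
import warnings


def rank_zero_only(fn):
    """Run the wrapped function only in the process whose rank is 0."""

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # Assumption: the launcher exports LOCAL_RANK; default to 0 if unset.
        if int(os.environ.get("LOCAL_RANK", 0)) == 0:
            return fn(*args, **kwargs)
        return None

    return wrapped


@rank_zero_only
def rank_zero_warn(message: str) -> None:
    """Emit a UserWarning, but only from rank 0 (simplified stand-in)."""
    warnings.warn(message, UserWarning, stacklevel=3)
```

Calling `rank_zero_warn("...")` in every worker then produces a single warning per job instead of one per process.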
[0.7.3] - Fixed¶
[0.7.2] - 2020-04-07¶
[0.7.2] - Added¶
Added same step loggers’ metrics aggregation (#1278)
Added parity test between a vanilla MNIST model and lightning model (#1284)
Added parity test between a vanilla RNN model and lightning model (#1351)
Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
Added support for hierarchical dict (#1152)
Added TrainsLogger class (#1122)
Added type hints to pytorch_lightning.core (#946)
Added support for IterableDataset in validation and testing (#1104)
Added support for non-primitive types in hparams for TensorboardLogger (#1130)
Added a check that stops the training when loss or weights contain NaN or inf values. (#1097)
Added support for IterableDataset when val_check_interval=1.0 (default), this will trigger validation at the end of each epoch. (#1283)
Added summary method to Profilers. (#1259)
Added informative errors if user defined dataloader has zero length (#1280)
Added testing for python 3.8 (#915)
Added model configuration checking (#1199)
Added support for optimizer frequencies through LightningModule.configure_optimizers() (#1269)
Added option to run without an optimizer by returning None from configure_optimizers. (#1279)
Added a warning when the number of data loader workers is small. (#1378)
[0.7.2] - Changed¶
Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
Changed progress_bar_refresh_rate trainer flag to disable progress bar when set to 0. (#1108)
Enhanced load_from_checkpoint to also forward params to the model (#1307)
Updated references to self.forward() to instead use the __call__ interface. (#1211)
Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
Allow to upload models on W&B (#1339)
On DP and DDP2 unsqueeze is automated now (#1319)
Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
Did not interfere with a default sampler (#1318)
Remove default Adam optimizer (#1317)
Give warnings for unimplemented required lightning methods (#1317)
Made evaluate method private >> Trainer._evaluate(...). (#1260)
Simplify the PL examples structure (shallower and more readable) (#1247)
Changed min max gpu memory to be on their own plots (#1358)
Remove .item which causes sync issues (#1254)
Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
Change default logger to dedicated one (#1064)
[0.7.2] - Deprecated¶
[0.7.2] - Removed¶
[0.7.2] - Fixed¶
Fixed model_checkpoint when saving all models (#1359)
Trainer.add_argparse_args classmethod fixed. Now it adds a type for the arguments (#1147)
Fixed bug related to type checking of ReduceLROnPlateau lr schedulers (#1126)
Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
Fixed a bug that created an extra dataloader with active reload_dataloaders_every_epoch (#1196)
Fixed all warnings and errors in the docs build process (#1191)
Fixed an issue where val_percent_check=0 would not disable validation (#1251)
Fixed average of incomplete TensorRunningMean (#1309)
Fixed WandbLogger.watch with wandb.init() (#1311)
Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
Fixed a bug that would cause trainer.test() to run on the validation set when overloading validation_epoch_end and test_end (#1353)
Fixed WandbLogger.watch - use of the watch method without importing wandb (#1311)
Fixed WandbLogger to be used with ‘ddp’ - allow reinits in sub-processes (#1149, #1360)
Made training_epoch_end behave like validation_epoch_end (#1357)
Fixed fast_dev_run running validation twice (#1365)
Fixed pickle error from quick patch __code__ (#1352)
Fixed checkpointing interval (#1272)
Fixed validation and training loops run the partial dataset (#1192)
Fixed running on_validation_end only on main process in DDP (#1125)
Fixed load_spawn_weights only in proc rank 0 (#1385)
Fixes using deprecated use_amp attribute (#1145)
Fixed Tensorboard logger error: lightning_logs directory not exists in multi-node DDP on nodes with rank != 0 (#1377)
Fixed Unimplemented backend XLA error on TPU (#1387)
[0.7.1] - 2020-03-07¶
[0.7.1] - Fixed¶
Fixes print issues and data_loader (#1080)
[0.7.0] - 2020-03-06¶
[0.7.0] - Added¶
Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
Added progress_bar_refresh_rate=50 flag for trainer. Throttle refresh rate on notebooks (#926)
Updated governance docs
Added a check to ensure that the metric used for early stopping exists before training commences (#542)
Added optimizer_idx argument to backward hook (#733)
Added entity argument to WandbLogger to be passed to wandb.init (#783)
Added a tool for profiling training runs (#782)
Improved flexibility for naming of TensorBoard logs, can now set version to a str to just save to that directory, and use name='' to prevent experiment-name directory (#804)
Added option to specify step key when logging metrics (#808)
Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759)
Added Tensor Processing Unit (TPU) support (#868)
Split callbacks in multiple files (#849)
Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
Added support for step-based learning rate scheduling (#941)
Added support for logging hparams as dict (#1029)
Checkpoint and early stopping now work without val. step (#1041)
Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
Added type hints for function arguments (#912, )
Added TPU gradient clipping (#963)
Added max/min number of steps in Trainer (#728)
[0.7.0] - Changed¶
Improved NeptuneLogger by adding close_after_fit argument to allow logging after training (#908)
Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
Changed Checkpoint path parameter from filepath to dirpath (#1016)
Froze model hparams as Namespace property (#1029)
Dropped logging config in package init (#1015)
Renames model steps (#1051)
training_end >> training_epoch_end
validation_end >> validation_epoch_end
test_end >> test_epoch_end
Refactor dataloading, supports infinite dataloader (#955)
Create single file in TensorBoardLogger (#777)
[0.7.0] - Deprecated¶
[0.7.0] - Removed¶
[0.7.0] - Fixed¶
Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
Fixed a bug where the model checkpointer didn’t write to the same directory as the logger (#771)
Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
Fixed port collision on DDP (#1010)
Fixed/tested pass overrides (#918)
Fixed comet logger to log after train (#892)
Remove deprecated args to learning rate step function (#890)
[0.6.0] - 2020-01-21¶
[0.6.0] - Added¶
Added support for resuming from a specific checkpoint via resume_from_checkpoint argument (#516)
Added support for ReduceLROnPlateau scheduler (#320)
Added support for Apex mode O2 in conjunction with Data Parallel (#493)
Added option (save_top_k) to save the top k models in the ModelCheckpoint class (#128)
Added on_train_start and on_train_end hooks to ModelHooks (#598)
Added TensorBoardLogger (#607)
Added support for weight summary of model with multiple inputs (#543)
Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
Added option to disable validation by setting val_percent_check=0 (#649)
Added NeptuneLogger class (#648)
Added WandbLogger class (#627)
[0.6.0] - Changed¶
Changed the default progress bar to print to stdout instead of stderr (#531)
Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
Renamed total_batch_nb to total_batches, nb_val_batches to num_val_batches, nb_training_batches to num_training_batches, max_nb_epochs to max_epochs, min_nb_epochs to min_epochs, nb_test_batches to num_test_batches, and nb_val_batches to num_val_batches (#567)
Changed gradient logging to use parameter names instead of indexes (#660)
Changed the default logger to TensorBoardLogger (#609)
Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
[0.6.0] - Deprecated¶
[0.6.0] - Removed¶
Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)
[0.6.0] - Fixed¶
Fixed a bug which occurred when using Adagrad with cuda (#554)
Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
Fixed support for PyTorch 1.1.0 (#552)
Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
Fixed a bug where batch ‘segments’ would remain on the GPU when using truncated_bptt > 1 (#532)
Fixed a bug when using IterableDataset (#547)
Fixed a bug where .item was called on non-tensor objects (#602)
Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
Fixed a bug where early stopping would begin two epochs early (#617)
Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
Fixed a bug when batches did not have a .copy method (#701)
Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
Fixed a bug where on_train_end was not called when early stopping (#723)
[0.5.3] - 2019-11-06¶
[0.5.3] - Added¶
Added option to disable default logger, checkpointer, and early stopping by passing logger=False, checkpoint_callback=False and early_stop_callback=False respectively
Added CometLogger for use with Comet.ml
Added val_check_interval argument to Trainer allowing validation to be performed at every given number of batches
Added functionality to save and load hyperparameters using the standard checkpoint mechanism
Added call to torch.cuda.empty_cache before training starts
Added option for user to override the call to backward
Added support for truncated backprop through time via the truncated_bptt_steps argument in Trainer
Added option to operate on all outputs from training_step in DDP2
Added a hook for modifying DDP init
Added a hook for modifying Apex
[0.5.3] - Changed¶
Changed experiment version to be padded with zeros (e.g. /dir/version_9 becomes /dir/version_0009)
Changed callback metrics to include any metrics given in logs or progress bar
Changed the default for save_best_only in ModelCheckpoint to True
Added tng_data_loader for backwards compatibility
Renamed MLFlowLogger.client to MLFlowLogger.experiment for consistency
Moved global_step increment to happen after the batch has been processed
Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
Changed progress bar functionality to add multiple progress bars for train/val/test
Changed calls to print to use logging instead
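The zero-padded version directories mentioned above can be reproduced with Python's format specification. This is a minimal sketch; the helper name and the default width of 4 are assumptions inferred from the version_0009 example.

```python
def version_dir(root: str, version: int, width: int = 4) -> str:
    """Build a version directory path with a zero-padded version number."""
    return f"{root}/version_{version:0{width}d}"


padded = version_dir("/dir", 9)

# With padding, plain string sorting matches numeric order.
ordered = sorted(version_dir("/dir", v) for v in (10, 9))
```

One practical effect of padding is that directory listings sort in numeric order, although the entry does not state this as the motivation.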
[0.5.3] - Deprecated¶
Deprecated tng_dataloader
[0.5.3] - Fixed¶
Fixed an issue where the number of batches was off by one during training
Fixed a bug that occurred when setting a checkpoint callback and early_stop_callback=False
Fixed an error when importing CometLogger
Fixed a bug where the gpus argument had some unexpected behaviour
Fixed a bug where the computed total number of batches was sometimes incorrect
Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
Fixed a bug when using the log_gpu_memory='min_max' option in Trainer
Fixed a bug where checkpointing would sometimes erase the current directory
[0.5.2] - 2019-10-10¶
[0.5.2] - Added¶
Added weights_summary argument to Trainer to be set to full (full summary), top (just top level modules) or other
Added tags argument to MLFlowLogger
[0.5.2] - Changed¶
Changed default for amp_level to O1
[0.5.2] - Removed¶
Removed the print_weights_summary argument from Trainer
[0.5.2] - Fixed¶
Fixed a bug where logs were not written properly
Fixed a bug where logger.finalize wasn’t called after training is complete
Fixed callback metric errors in DDP
Fixed a bug where TestTubeLogger didn’t log to the correct directory
[0.5.1] - 2019-10-05¶
[0.5.1] - Added¶
Added the LightningLoggerBase class for experiment loggers
Added MLFlowLogger for logging with mlflow
Added TestTubeLogger for logging with test_tube
Added a different implementation of DDP (distributed_backend='ddp2') where every node has one model using all GPUs
Added support for optimisers which require a closure (e.g. LBFGS)
Added automatic MASTER_PORT default for DDP when not set manually
Added new GPU memory logging options 'min_max' (log only the min/max utilization) and 'all' (log all the GPU memory)
[0.5.1] - Changed¶
Changed schedulers to always be called with the current epoch
Changed test_tube to an optional dependency
Changed data loaders to internally use a getter instead of a python property
Disabled auto GPU loading when restoring weights to prevent out of memory errors
Changed logging, early stopping and checkpointing to occur by default
[0.5.1] - Fixed¶
Fixed a bug with samplers that do not specify set_epoch
Fixed a bug when using the MLFlowLogger with unsupported data types; this now raises a warning
Fixed a bug where gradient norms were always zero using track_grad_norm
Fixed a bug which caused a crash when logging memory
[0.5.0] - 2019-09-26¶
[0.5.0] - Changed¶
Changed data_batch argument to batch throughout
Changed batch_i argument to batch_idx throughout
Changed tng_dataloader method to train_dataloader
Changed on_tng_metrics method to on_training_metrics
Changed gradient_clip argument to gradient_clip_val
Changed add_log_row_interval to row_log_interval
[0.5.0] - Fixed¶
Fixed a bug with tensorboard logging in multi-gpu setup
[0.4.9] - 2019-09-16¶
[0.4.9] - Added¶
Added the flag log_gpu_memory to Trainer to deactivate logging of GPU memory utilization
Added SLURM resubmit functionality (port from test-tube)
Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
Added option to use single gpu per node with DistributedDataParallel
[0.4.9] - Changed¶
Changed functionality of validation_end and test_end with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
Changed gpu API to take integers as well (e.g. gpus=2 instead of gpus=[0, 1])
All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
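The gpus change above means an integer count and an explicit index list can describe the same devices. A minimal sketch of such a normalisation follows; the function name is hypothetical and the real parsing logic in Trainer may handle more cases.

```python
from typing import List, Union


def normalize_gpus(gpus: Union[int, List[int]]) -> List[int]:
    """Turn an integer GPU count or an explicit index list into an index list."""
    if isinstance(gpus, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("gpus must be an int or a list of ints")
    if isinstance(gpus, int):
        if gpus < 0:
            raise ValueError("gpus must be non-negative")
        return list(range(gpus))
    return list(gpus)


from_count = normalize_gpus(2)
from_list = normalize_gpus([0, 1])
```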
[0.4.9] - Fixed¶
Fixed a bug where data types that implement .to but not .cuda would not be properly moved onto the GPU
Fixed a bug where data would not be re-shuffled every epoch when using a DistributedSampler
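The .to versus .cuda fix above suggests a duck-typed transfer routine. The sketch below is an assumption about the general technique, not the trainer's code: it prefers .to, falls back to .cuda, and recurses into dicts, lists and tuples. FakeTensor is a stand-in so the example runs without PyTorch.

```python
class FakeTensor:
    """Stand-in for a tensor-like object that only implements .to()."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


def move_to_device(data, device):
    """Recursively move tensor-like objects in nested containers to a device."""
    if hasattr(data, "to"):
        return data.to(device)
    if hasattr(data, "cuda"):
        # Fallback for objects exposing only .cuda(); moves to the default GPU.
        return data.cuda()
    if isinstance(data, dict):
        return {key: move_to_device(value, device) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(move_to_device(item, device) for item in data)
    return data  # Plain Python values are returned unchanged.


batch = (FakeTensor(), [FakeTensor()], {"x": FakeTensor()}, 3)
moved = move_to_device(batch, "cuda:0")
```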
[0.4.8] - 2019-08-31¶
[0.4.8] - Added¶
Added test_step and test_end methods, used when Trainer.test is called
Added GradientAccumulationScheduler callback which can be used to schedule changes to the number of accumulation batches
Added option to skip the validation sanity check by setting nb_sanity_val_steps = 0
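To make the idea of scheduling accumulation batches concrete, the sketch below looks up an accumulation factor for a given epoch. The dict format (start epoch mapped to factor) and the function name are assumptions for illustration; this changelog does not document the callback's actual arguments.

```python
from typing import Dict


def accumulation_factor(schedule: Dict[int, int], epoch: int, default: int = 1) -> int:
    """Return the factor of the latest schedule entry that starts at or before `epoch`."""
    factor = default
    for start_epoch in sorted(schedule):
        if epoch >= start_epoch:
            factor = schedule[start_epoch]
    return factor


# Hypothetical schedule: accumulate 4 batches from epoch 2, then 8 from epoch 5.
schedule = {2: 4, 5: 8}
factors = [accumulation_factor(schedule, epoch) for epoch in (0, 2, 4, 7)]
```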
[0.4.8] - Fixed¶
Fixed a bug when setting nb_sanity_val_steps = 0
[0.4.7] - 2019-08-24¶
[0.4.7] - Changed¶
Changed the default val_check_interval to 1.0
Changed defaults for nb_val_batches, nb_tng_batches and nb_test_batches to 0
[0.4.7] - Fixed¶
Fixed a bug where the full validation set was used despite setting val_percent_check
Fixed a bug where an Exception was thrown when using a data set containing a single batch
Fixed a bug where an Exception was thrown if no val_dataloader was given
Fixed a bug where tuples were not properly transferred to the GPU
Fixed a bug where data of a non standard type was not properly handled by the trainer
Fixed a bug when loading data as a tuple
Fixed a bug where AttributeError could be suppressed by the Trainer
[0.4.6] - 2019-08-15¶
[0.4.6] - Added¶
Added support for data to be given as a dict or list with a single gpu
Added support for configure_optimizers to return a single optimizer, two lists (optimizers and schedulers), or a single list
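The three return shapes accepted from configure_optimizers can be reduced to a single form. The following is a hypothetical sketch of that normalisation, using plain strings as stand-ins for optimizer and scheduler objects; it is not the library's implementation.

```python
def normalize_optimizers(result):
    """Return (optimizers, schedulers) lists for any of the three accepted shapes."""
    is_sequence = isinstance(result, (list, tuple))
    # Two lists: (optimizers, schedulers).
    if is_sequence and len(result) == 2 and isinstance(result[0], (list, tuple)):
        optimizers, schedulers = result
        return list(optimizers), list(schedulers)
    # A single list of optimizers, with no schedulers.
    if is_sequence:
        return list(result), []
    # A single optimizer object.
    return [result], []


single = normalize_optimizers("adam")
only_list = normalize_optimizers(["adam", "sgd"])
two_lists = normalize_optimizers((["adam"], ["step_lr"]))
```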
[0.4.6] - Fixed¶
Fixed a bug where returning just an optimizer list (i.e. without schedulers) from configure_optimizers would throw an Exception
[0.4.5] - 2019-08-13¶
[0.4.5] - Added¶
Added optimizer_step method that can be overridden to change the standard optimizer behaviour
[0.4.4] - 2019-08-12¶
[0.4.4] - Added¶
Added support for multiple validation dataloaders
Added support for latest test-tube logger (optimised for torch==1.2.0)
[0.4.4] - Changed¶
validation_step and val_dataloader are now optional
lr_scheduler is now activated after epoch
[0.4.4] - Fixed¶
Fixed a bug where a warning would show when using lr_scheduler in torch>1.1.0
Fixed a bug where an Exception would be thrown if using torch.DistributedDataParallel without using a DistributedSampler; this now throws a Warning instead
[0.4.3] - 2019-08-10¶
[0.4.3] - Fixed¶
Fixed a bug where accumulate gradients would scale the loss incorrectly
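For context on the loss scaling fix above, a common convention is to divide each batch's contribution by the number of accumulation steps so that the accumulated gradient equals the mean over those batches. The sketch below uses plain floats in place of gradients; what exactly was scaled incorrectly before the fix is not stated in the entry.

```python
from typing import List


def accumulated_gradient(batch_grads: List[float], accumulate_steps: int) -> float:
    """Sum per-batch gradients, each scaled by 1 / accumulate_steps."""
    if accumulate_steps <= 0:
        raise ValueError("accumulate_steps must be positive")
    total = 0.0
    for grad in batch_grads[:accumulate_steps]:
        total += grad / accumulate_steps
    return total


# Four batches accumulated with scaling give the mean of their gradients.
mean_grad = accumulated_gradient([1.0, 2.0, 3.0, 6.0], 4)
```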
[0.4.2] - 2019-08-08¶
[0.4.2] - Changed¶
Changed install requirement to torch==1.2.0
[0.4.1] - 2019-08-08¶
[0.4.1] - Changed¶
Changed install requirement to torch==1.1.0
[0.4.0] - 2019-08-08¶
[0.4.0] - Added¶
Added 16-bit support for a single GPU
Added support for training continuation (preserves epoch, global step etc.)
[0.4.0] - Changed¶
Changed training_step and validation_step; outputs will no longer be automatically reduced
[0.4.0] - Removed¶
Removed need for Experiment object in Trainer
[0.4.0] - Fixed¶
Fixed issues with reducing outputs from generative models (such as images and text)
[0.3.6] - 2019-07-25¶
[0.3.6] - Added¶
Added a decorator to do lazy data loading internally
[0.3.6] - Fixed¶
Fixed a bug where Experiment object was not process safe, potentially causing logs to be overwritten