Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[1.5.1] - 2021-11-09¶
[1.5.1] - Fixed¶
Fixed apply_to_collection(defaultdict) (#10316)
Fixed failure when DataLoader(batch_size=None) is passed (#10345)
Fixed interception of __init__ arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
Fixed issue with pickling CSVLogger after a call to CSVLogger.save (#10388)
Fixed an import error being caused by PostLocalSGD when torch.distributed is not available (#10359)
Fixed the logging with on_step=True in epoch-level hooks causing unintended side-effects. Logging with on_step=True in epoch-level hooks will now correctly raise an error (#10409)
Fixed deadlocks for distributed training with RichProgressBar (#10428)
Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
Fixed dataloader workers with persistent_workers being deleted on every iteration (#10434)
[1.5.0] - 2021-11-02¶
[1.5.0] - Added¶
Added support for monitoring the learning rate without schedulers in LearningRateMonitor (#9786)
Added registration of ShardedTensor state dict hooks in LightningModule.__init__ if the PyTorch version supports ShardedTensor (#8944)
Added error handling including calling of on_keyboard_interrupt() and on_exception() for all entrypoints (fit, validate, test, predict) (#8819)
Added a flavor of training_step that takes dataloader_iter as an argument (#8807)
Added a state_key property to the Callback base class (#6886)
Added progress tracking to loops:
    Integrated TrainingEpochLoop.total_batch_idx (#8598)
    Added BatchProgress and integrated TrainingEpochLoop.is_last_batch (#9657)
    Avoid optional Tracker attributes (#9320)
    Reset current progress counters when restarting an epoch loop that had already finished (#9371)
    Call reset_on_restart in the loop's reset hook instead of when loading a checkpoint (#9561)
    Use completed over processed in reset_on_restart (#9656)
    Renamed reset_on_epoch to reset_on_run (#9658)
Added batch_size and rank_zero_only arguments for log_dict to match log (#8628)
Added a check for unique GPU ids (#8666)
Added ResultCollection state_dict to the Loop state_dict and added support for distributed reload (#8641)
Added DeepSpeed collate checkpoint utility function (#8701)
Added a handles_accumulate_grad_batches property to the training type plugins (#8856)
Added a warning to WandbLogger when reusing a wandb run (#8714)
Added log_graph argument for watch method of WandbLogger (#8662)
LightningCLI additions:
    Added LightningCLI(run=False|True) to choose whether to run a Trainer subcommand (#8751)
    Added support to call any trainer function from the LightningCLI via subcommands (#7508)
    Allow easy trainer re-instantiation (#7508)
    Automatically register all optimizers and learning rate schedulers (#9565)
    Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
    Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
    Support passing lists of callbacks via command line (#8815)
    Support shorthand notation to instantiate models (#9588)
    Support shorthand notation to instantiate datamodules (#10011)
    Added multifile option to LightningCLI to enable/disable config saving to preserve multiple files structure (#9073)
Fault-tolerant training:
    Added FastForwardSampler and CaptureIterableDataset injection to data loading utilities (#8366)
    Added DataFetcher to control fetching flow (#8890)
    Added SharedCycleIteratorState to prevent infinite loop (#8889)
    Added CaptureMapDataset for state management in map-style datasets (#8891)
    Added Fault Tolerant Training to DataFetcher (#8891)
    Replaced old prefetch iterator with new DataFetcher in training loop (#8953)
    Added partial support for global random state fault-tolerance in map-style datasets (#8950)
    Converted state to tuple explicitly when setting Python random state (#9401)
    Added support for restarting an optimizer loop (multiple optimizers) (#9537)
    Added support for restarting within Evaluation Loop (#9563)
    Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
    Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
    Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
Checkpoint saving and loading extensibility:
    Added CheckpointIO plugin to expose checkpoint IO from training type plugin (#8743)
    Refactored CheckpointConnector to offload validation logic to the CheckpointIO plugin (#9045)
    Added remove_checkpoint to CheckpointIO plugin by moving the responsibility out of the ModelCheckpoint callback (#9373)
    Added XLACheckpointIO plugin (#9972)
Loop customization:
    Added Closure and AbstractClosure classes (#8642)
    Refactored TrainingBatchLoop and extracted OptimizerLoop, splitting off automatic optimization into its own loop (#9191)
    Removed TrainingBatchLoop.backward(); manual optimization now calls directly into Accelerator.backward() and automatic optimization handles backward in the new OptimizerLoop (#9265)
    Extracted ManualOptimization logic from TrainingBatchLoop into its own separate loop class (#9266)
    Marked OptimizerLoop.backward as protected (#9514)
    Marked FitLoop.should_accumulate as protected (#9515)
    Marked several methods in PredictionLoop as protected: on_predict_start, on_predict_epoch_end, on_predict_end, on_predict_model_eval (#9516)
    Marked several methods in EvaluationLoop as protected: get_max_batches, on_evaluation_model_eval, on_evaluation_model_train, on_evaluation_start, on_evaluation_epoch_start, on_evaluation_epoch_end, on_evaluation_end, reload_evaluation_dataloaders (#9516)
    Marked several methods in EvaluationEpochLoop as protected: on_evaluation_batch_start, evaluation_step, evaluation_step_end (#9516)
    Added yielding_training_step example (#9983)
Added support for saving and loading state of multiple callbacks of the same type (#7187)
Added DeepSpeed Stage 1 support (#8974)
Added Python dataclass support for LightningDataModule (#8272)
Added sanitization of tensors when they get logged as hyperparameters in TensorBoardLogger (#9031)
Added InterBatchParallelDataFetcher (#9020)
Added DataLoaderIterDataFetcher (#9020)
Added DataFetcher within Fit / Evaluation Loop (#9047)
Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
Added Rich integration
Added input validation logic for precision (#9080)
Added support for CPU AMP autocast (#9084)
Added on_exception callback hook (#9183)
Added a warning to DeepSpeed when inferring batch size (#9221)
Added ModelSummary callback (#9344)
Added log_images, log_text and log_table to WandbLogger (#9545)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Added get_device_stats to the Accelerator interface and added its implementation for GPU and TPU (#9586)
Added a warning when an unknown key is encountered in the optimizer configuration, and when OneCycleLR is used with "interval": "epoch" (#9666)
Added DeviceStatsMonitor callback (#9712)
Added enable_progress_bar to the Trainer constructor (#9664)
Added pl_legacy_patch load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166)
Added support for torch.use_deterministic_algorithms (#9121)
Added automatic parameters tying for TPUs (#9525)
Added support for torch.autograd.set_detect_anomaly through Trainer constructor argument detect_anomaly (#9848)
Added enable_model_summary flag to Trainer (#9699)
Added strategy argument to Trainer (#8597); see the sketch after this list
Added init_meta_context, materialize_module utilities (#9920)
Added TPUPrecisionPlugin (#10020)
Added torch.bfloat16 support
Added kfold example for loop customization (#9965)
LightningLite:
    Added PrecisionPlugin.forward_context, making it the default implementation for all {train,val,test,predict}_step_context() methods (#9988)
    Added DDPSpawnPlugin.spawn() for spawning new processes of a given function (#10018, #10022)
    Added TrainingTypePlugin.{_setup_model, _setup_optimizer} methods (#9994, #10064)
    Implemented DataParallelPlugin._setup_model (#10010)
    Implemented DeepSpeedPlugin._setup_model_and_optimizers (#10009, #10064)
    Implemented {DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers (#10028, #10064)
    Added optional model argument to the optimizer_step methods in accelerators and plugins (#10023)
    Updated precision attributes in DeepSpeedPlugin (#10164)
    Added the ability to return a result from rank 0 in DDPSpawnPlugin.spawn (#10162)
    Added pytorch_lightning.lite package (#10175)
    Added LightningLite documentation (#10043)
    Added LightningLite examples (#9987)
    Made the _LiteDataLoader an iterator and added support for custom dataloaders (#10279)
Added use_omegaconf argument to save_hparams_to_yaml plugin (#9170)
Added ckpt_path argument for Trainer.fit() (#10061)
Added auto_device_count method to Accelerators (#10222)
Added support for devices="auto" (#10264)
Added a filename argument in ModelCheckpoint.format_checkpoint_name (#9818)
Added support for empty gpus list to run on CPU (#10246)
Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
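A minimal sketch of how a few of the new Trainer-level options above fit together. TinyModel and the random dataset are hypothetical placeholders used only for illustration; they are not part of this release:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class TinyModel(pl.LightningModule):
        """Hypothetical minimal module used only to exercise the new arguments."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.cross_entropy(self.layer(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    train_loader = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
    )

    trainer = pl.Trainer(
        accelerator="auto",
        devices="auto",             # devices="auto" support (#10264)
        strategy=None,              # new strategy argument (#8597), e.g. "ddp"
        enable_progress_bar=True,   # new constructor flag (#9664)
        enable_model_summary=True,  # new constructor flag (#9699)
        max_epochs=1,
    )
    trainer.fit(TinyModel(), train_loader)

    # Resuming from a checkpoint can now go through fit directly (#10061):
    # trainer.fit(TinyModel(), train_loader, ckpt_path="path/to/checkpoint.ckpt")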
[1.5.0] - Changed¶
Trainer now raises a MisconfigurationException when its methods are called with ckpt_path="best" but a checkpoint callback isn't configured (#9841)
Setting Trainer(accelerator="ddp_cpu") now does not spawn a subprocess if num_processes is kept 1 along with num_nodes > 1 (#9603)
Module imports are now catching ModuleNotFoundError instead of ImportError (#9867)
pytorch_lightning.loggers.neptune.NeptuneLogger is now consistent with the new neptune-client API; the old neptune-client API is supported by NeptuneClient from the neptune-contrib repo (#6867)
Parsing of enums type hyperparameters to be saved in the hparams.yaml file by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170)
Parsing of the gpus Trainer argument has changed: gpus="n" (str) no longer selects the GPU index n and instead selects the first n devices (#8770)
iteration_count and other index attributes in the loops have been replaced with progress dataclasses (#8477)
The trainer.lightning_module reference is now properly set at the very beginning of a run (#8536)
The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
The model argument of the Trainer functions reset_{train,val,test,predict}_dataloader, reset_train_val_dataloaders, and request_dataloader is now optional (#8536)
Saved checkpoints will no longer use the type of a Callback as the key to avoid issues with unpickling (#6886)
Improved string conversion for ResultCollection (#8622)
LightningCLI changes:
    LightningCLI.init_parser now returns the parser instance (#8721)
    LightningCLI.add_core_arguments_to_parser and LightningCLI.parse_arguments now take a parser argument (#8721)
    LightningCLI.instantiate_trainer now takes a config and a list of callbacks (#8721)
    Split LightningCLI.add_core_arguments_to_parser into LightningCLI.add_default_arguments_to_parser + LightningCLI.add_core_arguments_to_parser (#8721)
The accelerator and training type plugin setup hooks no longer have a model argument (#8536)
The accelerator and training type plugin update_global_step hook has been removed (#8856)
The coverage of self.log-ing in any LightningModule or Callback hook has been improved (#8498)
self.log-ing without a Trainer reference now raises a warning instead of an exception (#9733)
Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
Trainer.request_dataloader now takes a RunningStage enum instance (#8858)
Changed rank_zero_warn to NotImplementedError in the {train, val, test, predict}_dataloader hooks that Lightning(Data)Module uses (#9161)
Moved block_ddp_sync_behaviour out of TrainingBatchLoop to loop utilities (#9192)
Executing the optimizer_closure is now required when overriding the optimizer_step hook (#9360); see the sketch after this list
Changed logging of LightningModule and LightningDataModule hyperparameters to raise an exception only if there are colliding keys with different values (#9496)
seed_everything now fails when an invalid seed value is passed instead of selecting a random seed (#8787)
The Trainer now calls TrainingTypePlugin collective APIs directly instead of going through the Accelerator reference (#9677, #9901)
The tuner now uses a unique filename to save a temporary checkpoint (#9682)
Changed HorovodPlugin.all_gather to return a torch.Tensor instead of a list (#9696)
Changed Trainer connectors to be protected attributes:
    Configuration Validator (#9779)
The current_epoch and global_step attributes now get restored irrespective of the Trainer task (#9413)
Trainer now raises an exception when requesting amp_level with native amp_backend (#9755)
Updated the logic to check for accumulation steps with DeepSpeed (#9826)
pytorch_lightning.utilities.grads.grad_norm now raises an exception if parameter norm_type <= 0 (#9765)
Updated error message for interactive incompatible plugins (#9896)
Moved the optimizer_step and clip_gradients hooks from the Accelerator and TrainingTypePlugin into the PrecisionPlugin (#10143, #10029)
NativeMixedPrecisionPlugin and its subclasses now take an optional GradScaler instance (#10055)
Trainer now raises a MisconfigurationException instead of a warning if Trainer.{validate/test} is missing required methods (#10016)
Changed default value of the max_steps Trainer argument from None to -1 (#9460)
LightningModule now raises an error when calling log(on_step=False, on_epoch=False) (#10227)
Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
Raised MisconfigurationException when the total length of dataloader across ranks is zero, and give a warning when the total length is non-zero but only the local rank length is zero (#9827)
Changed the model size calculation using ByteCounter (#10123)
Enabled on_load_checkpoint for LightningDataModule for all trainer_fn (#10238)
Allowed separate config files for parameters with class type when LightningCLI is in subclass_mode=False (#10286)
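A hedged sketch of an optimizer_step override that satisfies the new closure requirement noted above. ManualStepModule is a hypothetical example, not part of the release:

    import torch
    import pytorch_lightning as pl


    class ManualStepModule(pl.LightningModule):
        """Hypothetical module showing that an overridden optimizer_step
        must now execute the closure it receives (#9360)."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(4, 1)

        def training_step(self, batch, batch_idx):
            # assumes the batch is a plain tensor for brevity
            return self.layer(batch).pow(2).mean()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

        def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx=0,
                           optimizer_closure=None, **kwargs):
            # The closure runs training_step, zero_grad and backward;
            # skipping it is no longer allowed in automatic optimization.
            optimizer.step(closure=optimizer_closure)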
[1.5.0] - Deprecated¶
Deprecated Trainer argument terminate_on_nan in favor of detect_anomaly (#9175)
Deprecated Trainer.terminate_on_nan public attribute access (#9849)
Deprecated LightningModule.summarize() in favor of pytorch_lightning.utilities.model_summary.summarize() (#8513)
Deprecated LightningModule.model_size (#8343)
Deprecated DataModule properties: train_transforms, val_transforms, test_transforms, size, dims (#8851)
Deprecated add_to_queue, get_from_queue from LightningModule in favor of corresponding methods in the DDPSpawnPlugin (#9118)
Deprecated LightningModule.get_progress_bar_dict and Trainer.progress_bar_dict in favor of pytorch_lightning.callbacks.progress.base.get_standard_metrics and ProgressBarBase.get_metrics (#8985)
Deprecated prepare_data_per_node flag on Trainer and set it as a property of DataHooks, accessible in the LightningModule and LightningDataModule (#8958)
Deprecated the TestTubeLogger (#9065)
Deprecated on_{train/val/test/predict}_dataloader() from LightningModule and LightningDataModule (#9098)
Deprecated on_keyboard_interrupt callback hook in favor of new on_exception hook (#9260)
Deprecated passing process_position to the Trainer constructor in favor of adding the ProgressBar callback with process_position directly to the list of callbacks (#9222)
Deprecated passing flush_logs_every_n_steps as a Trainer argument, instead pass it to the logger init if supported (#9366)
Deprecated LightningLoggerBase.close, LoggerCollection.close in favor of LightningLoggerBase.finalize, LoggerCollection.finalize (#9422)
Deprecated passing progress_bar_refresh_rate to the Trainer constructor in favor of adding the ProgressBar callback with refresh_rate directly to the list of callbacks, or passing enable_progress_bar=False to disable the progress bar (#9616); see the migration sketch after this list
Deprecated LightningDistributed and moved the broadcast logic to DDPPlugin and DDPSpawnPlugin directly (#9691)
Deprecated passing stochastic_weight_avg to the Trainer constructor in favor of adding the StochasticWeightAveraging callback directly to the list of callbacks (#8989)
Deprecated Accelerator collective API barrier, broadcast, and all_gather in favor of calling the TrainingTypePlugin collective API directly (#9677)
Deprecated checkpoint_callback from the Trainer constructor in favor of enable_checkpointing (#9754)
Deprecated the LightningModule.on_post_move_to_device method (#9525)
Deprecated pytorch_lightning.core.decorators.parameter_validation in favor of pytorch_lightning.utilities.parameter_tying.set_shared_parameters (#9525)
Deprecated passing weights_summary to the Trainer constructor in favor of adding the ModelSummary callback with max_depth directly to the list of callbacks (#9699)
Deprecated log_gpu_memory, gpu_metrics, and util funcs in favor of the DeviceStatsMonitor callback (#9921)
Deprecated GPUStatsMonitor and XLAStatsMonitor in favor of the DeviceStatsMonitor callback (#9924)
Deprecated setting Trainer(max_steps=None); to turn off the limit, set Trainer(max_steps=-1) (default) (#9460)
Deprecated access to the AcceleratorConnector.is_slurm_managing_tasks attribute and marked it as protected (#10101)
Deprecated access to the AcceleratorConnector.configure_slurm_ddp method and marked it as protected (#10101)
Deprecated passing resume_from_checkpoint to the Trainer constructor in favor of trainer.fit(ckpt_path=) (#10061)
Deprecated ClusterEnvironment.creates_children() in favor of ClusterEnvironment.creates_processes_externally (property) (#10106)
Deprecated PrecisionPlugin.master_params() in favor of PrecisionPlugin.main_params() (#10105)
Deprecated lr_sch_names from LearningRateMonitor (#10066)
Deprecated ProgressBar callback in favor of TQDMProgressBar (#10134)
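A rough before/after sketch for the Trainer-flag deprecations listed above; the commented-out lines show the deprecated spellings, and the model passed to fit is assumed to be any LightningModule:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelSummary, TQDMProgressBar

    # Deprecated in 1.5:
    # trainer = pl.Trainer(progress_bar_refresh_rate=10, weights_summary="top",
    #                      checkpoint_callback=True, resume_from_checkpoint="last.ckpt")

    # Replacement:
    trainer = pl.Trainer(
        callbacks=[TQDMProgressBar(refresh_rate=10), ModelSummary(max_depth=1)],
        enable_checkpointing=True,
    )
    # trainer.fit(model, ckpt_path="last.ckpt")  # replaces resume_from_checkpoint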
[1.5.0] - Removed¶
Removed deprecated metrics (#8586)
Removed the deprecated outputs argument in both the LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#8587)
Removed the deprecated TrainerLoggingMixin class (#8609)
Removed the deprecated TrainerTrainingTricksMixin class (#8679)
Removed the deprecated optimizer_idx from training_step as an accepted argument in manual optimization (#8576)
Removed support for the deprecated on_save_checkpoint signature. The hook now takes a checkpoint positional parameter (#8697)
Removed support for the deprecated on_load_checkpoint signature. The hook now takes a pl_module positional parameter (#8697)
Removed the deprecated save_function property in ModelCheckpoint (#8680)
Removed the deprecated model argument from ModelCheckpoint.save_checkpoint (#8688)
Removed the deprecated sync_step argument from WandbLogger (#8763)
Removed the deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#8826)
Removed LightningModule.write_predictions and LightningModule.write_predictions_dict (#8850)
Removed on_reset_*_dataloader hooks in TrainingType Plugins and Accelerators (#8858)
Removed deprecated GradInformation module in favor of pytorch_lightning.utilities.grads (#8831)
Removed TrainingTypePlugin.on_save and Accelerator.on_save (#9023)
Removed {Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step (#9746)
Removed deprecated connect_precision_plugin and connect_training_type_plugin from Accelerator (#9019)
Removed on_train_epoch_end from Accelerator (#9035)
Removed InterBatchProcessor in favor of DataLoaderIterDataFetcher (#9052)
Removed Plugin in base_plugin.py in favor of accessing TrainingTypePlugin and PrecisionPlugin directly instead (#9066)
Removed teardown from ParallelPlugin (#8943)
Removed deprecated profiled_functions argument from PyTorchProfiler (#9178)
Removed deprecated pytorch_lightning.utilities.argparse_utils module (#9166)
Removed deprecated property Trainer.running_sanity_check in favor of Trainer.sanity_checking (#9209)
Removed deprecated BaseProfiler.output_filename arg from it and its descendants in favor of dirpath and filename (#9214)
Removed deprecated property ModelCheckpoint.period in favor of ModelCheckpoint.every_n_epochs (#9213)
Removed deprecated auto_move_data decorator (#9231)
Removed deprecated property LightningModule.datamodule in favor of Trainer.datamodule (#9233)
Removed deprecated properties DeepSpeedPlugin.cpu_offload* in favor of offload_optimizer, offload_parameters and pin_memory (#9244)
Removed deprecated property AcceleratorConnector.is_using_torchelastic in favor of TorchElasticEnvironment.is_using_torchelastic() (#9729)
Removed pytorch_lightning.utilities.debugging.InternalDebugger (#9680)
Removed call_configure_sharded_model_hook property from Accelerator and TrainingTypePlugin (#9612)
Removed TrainerProperties mixin and moved property definitions directly into Trainer (#9495)
Removed a redundant warning with ModelCheckpoint(monitor=None) callback (#9875)
Removed epoch from trainer.logged_metrics (#9904)
Removed should_rank_save_checkpoint property from Trainer (#9433)
Removed deprecated distributed_backend from Trainer (#10017)
Removed process_idx from the {DDPSpawnPlugin,TPUSpawnPlugin}.new_process methods (#10022)
Removed automatic patching of {train,val,test,predict}_dataloader() on the LightningModule (#9764)
Removed pytorch_lightning.trainer.connectors.OptimizerConnector (#10120)
[1.5.0] - Fixed¶
Fixed ImageNet evaluation in example (#10179)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
Fixed move_metrics_to_cpu moving the loss to CPU while training on device (#9308)
Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
Fixed an issue with freeing memory of datafetchers during teardown (#9387)
Fixed a bug where the training step output needed to be deepcopy-ed (#9349)
Fixed an issue with freeing memory allocated by the data iterators in Loop.on_run_end (#9386, #9915)
Fixed BasePredictionWriter not returning the batch indices in a non-distributed setting (#9432)
Fixed an error when running in XLA environments with no TPU attached (#9572)
Fixed check on torchmetrics logged whose compute() output is a multielement tensor (#9582)
Fixed gradient accumulation for DDPShardedPlugin (#9122)
Fixed missing DeepSpeed distributed call (#9540)
Fixed an issue with wrapped LightningModule during evaluation; the LightningModule no longer gets wrapped with data-parallel modules when not fitting in DDPPlugin, DDPSpawnPlugin, DDPShardedPlugin, DDPSpawnShardedPlugin (#9096)
Fixed trainer.accumulate_grad_batches to be an int on init. The default value for it is now None inside Trainer (#9652)
Fixed broadcast in DDPPlugin and DDPSpawnPlugin to respect the src input (#9691)
Fixed self.log(on_epoch=True, reduce_fx=sum) for the on_batch_start and on_train_batch_start hooks (#9791)
Fixed self.log(on_epoch=True) for the on_batch_start and on_train_batch_start hooks (#9780)
Fixed restoring training state during Trainer.fit only (#9413)
Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
Fixed DeepSpeed GPU device IDs (#9847)
Reset val_dataloader in tuner/batch_size_scaling (#9857)
Fixed use of LightningCLI in computer_vision_fine_tuning.py example (#9934)
Fixed issue with non-init dataclass fields in apply_to_collection (#9963)
Reset val_dataloader in tuner/batch_size_scaling for binsearch (#9975)
Fixed logic to check for spawn in dataloader TrainerDataLoadingMixin._worker_check (#9902)
Fixed train_dataloader getting loaded twice when resuming from a checkpoint during Trainer.fit() (#9671)
Fixed LearningRateMonitor logging with multiple param groups optimizer with no scheduler (#10044)
Fixed undesired side effects being caused by Trainer patching dataloader methods on the LightningModule (#9764)
Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
Fixed on_before_optimizer_step getting called before the optimizer closure (including backward) has run (#10167)
Fixed monitor value in ModelCheckpoint getting moved to the wrong device in a special case where it becomes NaN (#10118)
Fixed creation of dirpath in BaseProfiler if it doesn't exist (#10073)
Fixed incorrect handling of sigterm (#10189)
Fixed bug where log(on_step=True, on_epoch=True, sync_dist=True) wouldn't reduce the value on step (#10227)
Fixed an issue with pl.utilities.seed.reset_seed converting the PL_SEED_WORKERS environment variable to bool (#10099)
Fixed iterating over a logger collection when fast_dev_run > 0 (#10232)
Fixed batch_size in ResultCollection not being reset to 1 on epoch end (#10242)
Fixed distrib_type not being set when training plugin instances are being passed to the Trainer (#10251)
[1.4.9] - 2021-09-30¶
[1.4.8] - 2021-09-22¶
Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Fixed add_argparse_args raising TypeError when args are typed as typing.Generic in Python 3.6 (#9554)
Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)
[1.4.7] - 2021-09-14¶
[1.4.6] - 2021-09-07¶
Fixed an issue with export to ONNX format when a model has multiple inputs (#8800)
Removed deprecation warnings being called for on_{task}_dataloader (#9279)
Fixed save/load/resume from checkpoint for DeepSpeed Plugin (#8397, #8644, #8627)
Fixed EarlyStopping running on train epoch end when check_val_every_n_epoch>1 is set (#9156)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8333)
Fixed the Apex and DeepSpeed plugin closure running after the on_before_optimizer_step hook (#9288)
Fixed the Native AMP plugin closure not running with manual optimization (#9288)
Fixed bug where data-loading functions were not getting the correct running stage passed (#8858)
Fixed intra-epoch evaluation outputs staying in memory when the respective *_epoch_end hook wasn't overridden (#9261)
Fixed error handling in DDP process reconciliation when _sync_dir was not initialized (#9267)
Fixed PyTorch Profiler not enabled for manual optimization (#9316)
Fixed inspection of other args when a container is specified in save_hyperparameters (#9125)
Fixed signature of Timer.on_train_epoch_end and StochasticWeightAveraging.on_train_epoch_end to prevent unwanted deprecation warnings (#9347)
[1.4.5] - 2021-08-31¶
Fixed reduction using self.log(sync_dict=True, reduce_fx={mean,max}) (#9142)
Fixed not setting a default value for max_epochs if max_time was specified on the Trainer constructor (#9072)
Fixed the CometLogger so it no longer modifies the metrics in place; it now creates a copy of the metrics before performing any operations (#9150)
Fixed DDP "CUDA error: initialization error" due to a copy instead of deepcopy on ResultCollection (#9239)
[1.4.4] - 2021-08-24¶
[1.4.3] - 2021-08-17¶
Fixed plateau scheduler stepping on incomplete epoch (#8861)
Fixed infinite loop with CycleIterator and multiple loaders (#8889)
Fixed StochasticWeightAveraging with a list of learning rates not applying them to each param group (#8747)
Restore original loaders if replaced by entrypoint (#8885)
Fixed lost reference to _Metadata object in ResultMetricCollection (#8932)
Ensure the existence of DDPPlugin._sync_dir in reconciliate_processes (#8939)
[1.4.2] - 2021-08-10¶
Fixed recursive call for apply_to_collection(include_none=False) (#8719)
Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer (#8804)
Fixed comments and exception message for metrics_to_scalars (#8782)
Fixed typo error in LightningLoggerBase.after_save_checkpoint docstring (#8737)
[1.4.1] - 2021-08-03¶
Fixed trainer.fit_loop.split_idx always returning None (#8601)
Fixed references for ResultCollection.extra (#8622)
Fixed reference issues during epoch end result collection (#8621)
Fixed horovod auto-detection when horovod is not installed and the launcher is mpirun (#8610)
Fixed an issue with training_step outputs not getting collected correctly for training_epoch_end (#8613)
Fixed distributed types support for CPUs (#8667)
Fixed a deadlock issue with DDP and torchelastic (#8655)
Fixed accelerator=ddp choice for CPU (#8645)
[1.4.0] - 2021-07-27¶
[1.4.0] - Added¶
Added extract_batch_size utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
Added support for named parameter groups in LearningRateMonitor (#7987)
Added dataclass support for pytorch_lightning.utilities.apply_to_collection (#7935)
Added support to LightningModule.to_torchscript for saving to custom filesystems with fsspec (#7617)
Added KubeflowEnvironment for use with the PyTorchJob operator in Kubeflow
Added LightningCLI support for config files on object stores (#7521)
Added ModelPruning(prune_on_train_epoch_end=True|False) to choose when to apply pruning (#7704)
Added support for checkpointing based on a provided time interval during training (#7515)
Progress tracking
Added support for passing a LightningDataModule positionally as the second argument to trainer.{validate,test,predict} (#7431)
Added argument trainer.predict(ckpt_path) (#7430)
Added clip_grad_by_value support for TPUs (#7025)
Added support for passing any class to is_overridden (#7918)
Added sub_dir parameter to TensorBoardLogger (#6195)
Added correct dataloader_idx to batch transfer hooks (#6241)
Added include_none=bool argument to apply_to_collection (#7769)
Added apply_to_collections to apply a function to two zipped collections (#7769)
Added ddp_fully_sharded support (#7487)
Added should_rank_save_checkpoint property to Training Plugins (#7684)
Added log_grad_norm hook to LightningModule to customize the logging of gradient norms (#7873)
Added save_config_filename init argument to LightningCLI to ease resolving name conflicts (#7741)
Added save_config_overwrite init argument to LightningCLI to ease overwriting existing config files (#8059)
Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
Added trainer stage hooks for Training Plugins and Accelerators (#7864)
Added the on_before_optimizer_step hook (#8048); see the sketch after this list
Added IPU Accelerator (#7867)
Fault-tolerant training
    Added {,load_}state_dict to ResultCollection (#7948)
    Added {,load_}state_dict to Loops (#8197)
    Added FastForwardSampler and CaptureIterableDataset (#8307)
    Set Loop.restarting=False at the end of the first iteration (#8362)
    Save the loops state with the checkpoint (opt-in) (#8362)
    Save a checkpoint to restore the state on exception (opt-in) (#8362)
    Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
Added rank_zero_only to LightningModule.log function (#7966)
Added metric_attribute to LightningModule.log function (#7966)
Added a warning if Trainer(log_every_n_steps) is a value too high for the training dataloader (#7734)
Added LightningCLI support for argument links applied on instantiation (#7895)
Added LightningCLI support for configurable callbacks that should always be present (#7964)
Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
Added support for torch.nn.UninitializedParameter in ModelSummary (#7642)
Added support for LightningModule.save_hyperparameters when LightningModule is a dataclass (#7992)
Added support for overriding optimizer_zero_grad and optimizer_step when using accumulate_grad_batches (#7980)
Added logger boolean flag to save_hyperparameters (#7960)
Added support for calling scripts using the module syntax (python -m package.script) (#8073)
Added support for optimizers and learning rate schedulers to LightningCLI (#8093)
Added XLA Profiler (#8014)
Added PrecisionPlugin.{pre,post}_backward (#8328)
Added on_load_checkpoint and on_save_checkpoint hooks to the PrecisionPlugin base class (#7831)
Added max_depth parameter in ModelSummary (#8062)
Added XLAStatsMonitor callback (#8235)
Added restore function and restarting attribute to base Loop (#8247)
Added support for save_hyperparameters in LightningDataModule (#3792)
Added the ModelCheckpoint(save_on_train_epoch_end) to choose when to run the saving logic (#8389)
Added LSFEnvironment for distributed training with the LSF resource manager jsrun (#5102)
Added support for accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
Added tpu_spawn_debug to plugin registry (#7933)
Enabled traditional/manual launching of DDP processes through LOCAL_RANK and NODE_RANK environment variable assignments (#7480)
Added quantize_on_fit_end argument to QuantizationAwareTraining (#8464)
Added experimental support for loop specialization (#8226)
Added support for devices flag to Trainer (#8440)
Added private prevent_trainer_and_dataloaders_deepcopy context manager on the LightningModule (#8472)
Added support for providing callables to the Lightning CLI instead of types (#8400)
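A hedged sketch of the new on_before_optimizer_step hook mentioned above, which runs after backward but before the optimizer update. GradInspectModule is a hypothetical example, not part of the release:

    import torch
    import pytorch_lightning as pl


    class GradInspectModule(pl.LightningModule):
        """Hypothetical module using the new on_before_optimizer_step hook (#8048)."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            # assumes the batch is a plain tensor for brevity
            return self.layer(batch).pow(2).mean()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

        def on_before_optimizer_step(self, optimizer, optimizer_idx):
            # Log the total gradient norm once per optimizer step.
            grads = [p.grad.norm() for p in self.parameters() if p.grad is not None]
            if grads:
                self.log("grad_norm", torch.stack(grads).norm())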
[1.4.0] - Changed¶
Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
Changed the Trainer's checkpoint_callback argument to allow only boolean values (#7539)
Log epoch metrics before the on_evaluation_end hook (#7272)
Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
Changed metrics_to_scalars to work with any collection or value (#7888)
Changed clip_grad_norm to use torch.nn.utils.clip_grad_norm_ (#7025)
Validation is now always run inside the training epoch scope (#7357)
ModelCheckpoint now runs at the end of the training epoch by default (#8389)
EarlyStopping now runs at the end of the training epoch by default (#8286)
Refactored Loops
    Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
    Refactored result handling in training loop (#7506)
    Moved attributes hiddens and split_idx to TrainLoop (#7507)
    Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
    Simplified "should run validation" logic (#7682)
    Simplified logic for updating the learning rate for schedulers (#7682)
    Removed the on_epoch guard from the "should stop" validation check (#7701)
    Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
    Removed pytorch_lightning/trainer/training_loop.py (#7985)
    Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
    Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
    Restricted public access to several internal functions (#8024)
    Refactored trainer _run_* functions and separate evaluation loops (#8065)
    Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
    Removed pytorch_lightning/trainer/predict_loop.py (#8094)
    Moved result teardown to the loops (#8245)
    Improve Loop API to better handle children state_dict and progress (#8334)
Refactored logging
    Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
    Dramatically simplify the LoggerConnector (#7882)
    trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
    Completely overhaul the Result object in favor of ResultMetric (#7882)
    Improve epoch-level reduction time and overall memory usage (#7882)
    Allow passing self.log(batch_size=...) (#7891)
    Each of the training loops now keeps its own results collection (#7891)
    Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
    Remove MetricsHolder (#7909)
Moved ignore_scalar_return_in_dp warning suppression to the DataParallelPlugin class (#7421)
Changed the behaviour when logging evaluation step metrics to no longer append /epoch_* to the metric name (#7351)
Raised ValueError when a None value is self.log-ed (#7771)
Changed resolve_training_type_plugins to allow setting num_nodes and sync_batchnorm from Trainer setting (#7026)
Default seed_everything(workers=True) in the LightningCLI (#7504)
Changed model.state_dict() in CheckpointConnector to allow training_type_plugin to customize the model's state_dict() (#7474)
MLflowLogger now uses the env variable MLFLOW_TRACKING_URI as default tracking URI (#7457)
Changed Trainer arg and functionality from reload_dataloaders_every_epoch to reload_dataloaders_every_n_epochs (#5043)
Changed WandbLogger(log_model={True/'all'}) to log models as artifacts (#6231)
MLFlowLogger now accepts run_name as a constructor argument (#7622)
Changed teardown() in Accelerator to allow training_type_plugin to customize teardown logic (#7579)
Trainer.fit now raises an error when using manual optimization with unsupported features such as gradient_clip_val or accumulate_grad_batches (#7788)
Accelerator hooks are called regardless if LightningModule overrides the same hooks (#7826)
Moved profilers to their own file (#7822)
The on_after_backward hook is now called on accumulating iterations. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
The mixed precision loss is no longer unscaled before the on_after_backward hook. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
The TrainingTypePlugin.{pre,post}_backward hooks no longer take the optimizer, opt_idx, should_accumulate arguments (#8328)
The PrecisionPlugin.backward hook no longer returns a value (#8328)
The PrecisionPlugin.backward hook no longer takes a should_accumulate argument (#8328)
Added the on_before_backward hook (#7865)
LightningCLI now aborts with a clearer message if config already exists and disables save config during fast_dev_run (#7963)
Saved the LightningCLI config on setup and only on the main process (#8017)
Dropped the LightningCLI ArgumentParser when pickling (#8017)
Skip broadcast if distributed not initialized for the spawn plugins (#8017)
Trainer(resume_from_checkpoint=...) now restores the model directly after LightningModule.setup(), which is before LightningModule.configure_sharded_model() (#7652)
Moved torch.cuda.set_device() to enable collective calls earlier in setup (#8312)
Used XLA utility API to move data to CPU (Single TPU core) (#8078)
Improved error messages in replace_sampler when the DataLoader attributes are not included in the signature or the signature is missing optional arguments (#8519)
Moved DeviceDtypeModuleMixin and HyperparametersMixin mixins to core (#8396)
Return the default_root_dir as the log_dir when the logger is a LoggerCollection (#8187)
[1.4.0] - Deprecated¶
Deprecated LightningModule.loaded_optimizer_states_dict (#8229)
Standardized the dataloaders arguments of trainer.{fit,validate,test,tune} (#7431)
Deprecated DataModule properties: has_prepared_data, has_setup_fit, has_setup_validate, has_setup_test, has_setup_predict, has_teardown_fit, has_teardown_validate, has_teardown_test, has_teardown_predict (#7657)
Deprecated TrainerModelHooksMixin in favor of pytorch_lightning.utilities.signature_utils (#7422)
Deprecated num_nodes and sync_batchnorm arguments in DDPPlugin and DDPSpawnPlugin (#7026)
Deprecated self.log(sync_dist_op) in favor of self.log(reduce_fx) (#7891)
Deprecated is_overridden(model=...) in favor of is_overridden(instance=...) (#7918)
Deprecated automatically detaching returned extras with grads (#7994)
Deprecated default value of monitor argument in EarlyStopping callback to enforce monitor as a required argument (#7907)
Deprecated importing rank_zero_{warn,deprecation} directly from pytorch_lightning.utilities.distributed (#8085)
Deprecated the use of CheckpointConnector.hpc_load() in favor of CheckpointConnector.restore() (#7652)
Deprecated ModelCheckpoint(every_n_val_epochs) in favor of ModelCheckpoint(every_n_epochs) (#8383); see the migration sketch after this list
Deprecated DDPPlugin.task_idx in favor of DDPPlugin.local_rank (#8203)
Deprecated the Trainer.train_loop property in favor of Trainer.fit_loop (#8025)
Deprecated the Trainer.disable_validation property in favor of not Trainer.enable_validation (#8291)
Deprecated mode parameter in ModelSummary in favor of max_depth (#8062)
Deprecated reload_dataloaders_every_epoch argument of Trainer in favor of reload_dataloaders_every_n_epochs (#5043)
Deprecated distributed_backend argument for Trainer (#8575)
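A rough before/after sketch for two of the 1.4 deprecations listed above; the commented-out lines show the deprecated spellings:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    # Deprecated in 1.4:
    # checkpoint = ModelCheckpoint(every_n_val_epochs=2)
    # trainer = pl.Trainer(reload_dataloaders_every_epoch=True, callbacks=[checkpoint])

    # Replacement:
    checkpoint = ModelCheckpoint(every_n_epochs=2)
    trainer = pl.Trainer(reload_dataloaders_every_n_epochs=1, callbacks=[checkpoint])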
[1.4.0] - Removed¶
Dropped official support/testing for PyTorch <1.6 (#8288)
Removed ProfilerConnector (#7654)
Pruned deprecated classification metrics from pytorch_lightning.metrics.functional.classification (#7499)
Removed deprecated data parallel classes LightningDataParallel and LightningDistributedDataParallel from pytorch_lightning.overrides.data_parallel (#7510)
Removed deprecated trainer attributes - get_model and accelerator_backend (#7502)
Removed support for automatically monitoring the val_loss key with ModelCheckpoint. Pass your monitor of choice to the ModelCheckpoint instance instead (#8293)
Removed support for self.log(tbptt_reduce_fx) and self.log(tbptt_pad_token). Please open a discussion explaining your use-case if you relied on these (#7644)
Removed deprecated utils modules model_utils, warning_utils, xla_device_utils and partially argparse_utils (#7503)
Removed RPCPlugin and RPCSequentialPlugin. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
Removed deprecated trainer attributes - on_cpu, on_tpu, use_tpu, on_gpu, use_dp, use_ddp, use_ddp2, use_horovod, use_single_gpu (#7501)
Removed deprecated optimizer argument in LightningModule.manual_backward(); toggling optimizers in manual optimization should be done using LightningModule.{un}toggle_optimizer() (#8287)
Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
Removed environment variable PL_EXP_VERSION from DDP subprocesses (#7403)
[1.4.0] - Fixed¶
Fixed the GPUStatsMonitor callbacks to use the correct GPU IDs if CUDA_VISIBLE_DEVICES is set (#8260)
Fixed lr_scheduler checkpointed state by calling update_lr_schedulers before saving checkpoints (#7877)
Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
Fixed None loss keys getting added in training_epoch_end when using manual optimization and not returning a loss (#7772)
Fixed a bug where precision=64 with accelerator='ddp_spawn' would throw a pickle error (#6924)
Do not override the existing epoch value in logged_metrics when already logged by the user (#7982)
Support for manual optimization with DeepSpeed (#7970)
Fixed dataloader_idx argument value when predicting with only one DataLoader (#7941)
Fixed passing the stage argument of Callback.{setup,teardown} as a keyword (#7973)
Fixed metrics generated during validation sanity checking to be cleaned up at the end (#8171)
Fixed log_gpu_memory metrics not being added to logging when nothing else is logged (#8174)
Fixed a bug where calling log with a Metric instance would raise an error if it was a nested attribute of the model (#8181)
Fixed a bug where using precision=64 would cause buffers with complex dtype to be cast to real (#8208)
Fixed is_overridden returning true for wrapped functions with no changes (#8296)
Fixed a bug where truncated_bptt_steps would throw an AttributeError when the target RNN has multiple hidden states (#8145)
Fixed self.optimizers() not returning a single optimizer if it had been wrapped (#8326)
Fixed the on_after_backward hook not getting called when using manual optimization and no plugins (#8328)
Fixed the LightningModule.backward hook only getting called with the apex plugin when using manual optimization (#8328)
Fixed moving batch to device before sending it to the on_*_batch_start / on_*_batch_end callbacks and model hooks (#7378)
Fixed passing a custom DDPPlugin when choosing accelerator="ddp_cpu" for the accelerator (#6208)
Fixed missing call to LightningModule.untoggle_optimizer in training loop when running gradient accumulation with multiple optimizers (#8284)
Fixed hash of LightningEnum to work with value instead of name (#8421)
Fixed a bug where an extra checkpoint was saved at the end of training if the val_check_interval did not align with the number of training batches (#7724)
Fixed move_data_to_device to return the batch if the object's to function didn't return self (#8433)
Fixed progress bar updates for Pod Training (#8258)
Fixed clearing dataloader references before attaching new dataloaders in consecutive Trainer.{fit,validate,test,predict} runs (#8442)
Fixed memory leaks on GPU by moving optimizer_states, ResultCollection.extra, ResultMetric attributes, and LoggerConnector metrics to cpu. Also, delete the DDP wrapper on teardown (#8490)
Fixed SWA callback using LightningModule prevent_trainer_and_dataloaders_deepcopy to avoid OOM (#8472)
Fixed ModelPruning callback on_save_checkpoint to avoid making a deepcopy potentially leading to OOM (#8472)
Fixed the sampler replacement logic for DataLoaders which do not define all DataLoader attributes as __init__ parameters (#8519)
Fixed DeepSpeed Windows support (#8488)
Fixed DeepSpeed not properly setting the trainer lr_schedulers attribute (#8527)
Fixed experiment version and log-dir divergence in DDP when using multiple Trainer instances in sequence (#7403)
Enabled manual optimization for TPUs (#8458)
Fixed accumulate_grad_batches not being recomputed during model reload (#5334)
Fixed a TypeError when wrapping optimizers in the HorovodPlugin and running Trainer.test (#7840)
Fixed BackboneFinetuning restoration (#8501)
Fixed lr_scheduler with metric (e.g. torch.optim.lr_scheduler.ReduceLROnPlateau) when using automatic_optimization = False (#7643)
Fixed DeepSpeed breaking with no schedulers (#8580)
[1.3.8] - 2021-07-01¶
[1.3.8] - Fixed¶
Fixed a sync deadlock when checkpointing a LightningModule that uses a torchmetrics 0.4 Metric (#8218)
Fixed compatibility with TorchMetrics v0.4 (#8206)
Added torchelastic check when sanitizing GPUs (#8095)
Fixed a DDP info message that was never shown (#8111)
Fixed metrics deprecation message at module import level (#8163)
Fixed a bug where an infinite recursion would be triggered when using the BaseFinetuning callback on a model that contains a ModuleDict (#8170)
Added a mechanism to detect deadlock for DDP when only one process triggers an Exception. The mechanism will kill the processes when it happens (#8167)
Fixed NCCL error when selecting non-consecutive device ids (#8165)
Fixed SWA to also work with IterableDataset (#8172)
[1.3.7] - 2021-06-22¶
[1.3.7] - Fixed¶
Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error (#7975)
Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
Fixed setting a DistributedSampler when using a distributed plugin in a custom accelerator (#7814)
Improved PyTorchProfiler chrome traces names (#8009)
Fixed moving the best score to device in EarlyStopping callback for TPU devices (#7959)
Fixed access to callback_metrics in ddp_spawn (#7916)
[1.3.6] - 2021-06-15¶
[1.3.6] - Fixed¶
Fixed logs overwriting issue for remote filesystems (#7889)
Fixed a bug where DataModule.prepare_data could only be called on the global rank 0 process (#7945)
Fixed setting worker_init_fn to seed dataloaders correctly when using DDP (#7942)
Fixed BaseFinetuning callback to properly handle parent modules with parameters (#7931)
[1.3.5] - 2021-06-08¶
[1.3.5] - Added¶
Added warning to Training Step output (#7779)
[1.3.5] - Fixed¶
[1.3.5] - Changed¶
Move training_output validation to after train_step_end (#7868)
[1.3.4] - 2021-06-01¶
[1.3.4] - Fixed¶
[1.3.3] - 2021-05-27¶
[1.3.3] - Changed¶
Changed calling of untoggle_optimizer(opt_idx) out of the closure function (#7563)
[1.3.3] - Fixed¶
Fixed ProgressBar pickling after calling trainer.predict (#7608)
Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
Fixed dataloaders not being reset when tuning the model (#7566)
Fixed print errors in ProgressBar when trainer.fit is not called (#7674)
Fixed global step update when the epoch is skipped (#7677)
Fixed training loop total batch counter when accumulate grad batches was enabled (#7692)
[1.3.2] - 2021-05-18¶
[1.3.2] - Changed¶
DataModules now avoid duplicate {setup,teardown,prepare_data} calls for the same stage (#7238)
[1.3.2] - Fixed¶
Fixed parsing of multiple training dataloaders (#7433)
Fixed recursive passing of wrong_type keyword argument in pytorch_lightning.utilities.apply_to_collection (#7433)
Fixed setting correct DistribType for ddp_cpu (spawn) backend (#7492)
Fixed incorrect number of calls to LR scheduler when check_val_every_n_epoch > 1 (#7032)
[1.3.1] - 2021-05-11¶
[1.3.1] - Fixed¶
[1.3.0] - 2021-05-06¶
[1.3.0] - Added¶
Added support for the EarlyStopping callback to run at the end of the training epoch (#6944)
Added synchronization points before and after setup hooks are run (#7202)
Added a teardown hook to ClusterEnvironment (#6942)
Added utils for metrics to scalar conversions (#7180)
Added utils for NaN/Inf detection for gradients and parameters (#6834)
Added more explicit exception message when trying to execute trainer.test() or trainer.validate() with fast_dev_run=True (#6667)
Added LightningCLI class to provide simple reproducibility with minimum boilerplate training CLI (#4492, #6862, #7156, #7299); see the sketch after this list
Added gradient_clip_algorithm argument to Trainer for gradient clipping by value (#6123)
Added a way to print to terminal without breaking up the progress bar (#5470)
Added support to checkpoint after training steps in ModelCheckpoint callback (#6146)
Added TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED} (#7173)
Added Trainer.validate() method to perform one evaluation epoch over the validation set (#4948)
Added LightningEnvironment for Lightning-specific DDP (#5915)
Added teardown() hook to LightningDataModule (#4673)
Added auto_insert_metric_name parameter to ModelCheckpoint (#6277)
Added arg to self.log that enables users to give custom names when dealing with multiple dataloaders (#6274)
Added teardown method to BaseProfiler to enable subclasses defining post-profiling steps outside of __del__ (#6370)
Added setup method to BaseProfiler to enable subclasses defining pre-profiling steps for every process (#6633)
Added no return warning to predict (#6139)
Added Trainer.predict config validation (#6543)
Added AbstractProfiler interface (#6621)
Added support for including module names for forward in the autograd trace of PyTorchProfiler (#6349)
Added support for the PyTorch 1.8.1 autograd profiler (#6618)
Added outputs parameter to callback's on_validation_epoch_end & on_test_epoch_end hooks (#6120)
Added configure_sharded_model hook (#6679)
Added support for precision=64, enabling training with double precision (#6595)
Added support for DDP communication hooks (#6736)
Added artifact_location argument to MLFlowLogger which will be passed to the MlflowClient.create_experiment call (#6677)
Added model parameter to precision plugins' clip_gradients signature (#6764, #7231)
Added is_last_batch attribute to Trainer (#6825)
Added LightningModule.lr_schedulers() for manual optimization (#6567)
Added MpModelWrapper in TPU Spawn (#7045)
Added max_time Trainer argument to limit training time (#6823)
Added on_predict_{batch,epoch}_{start,end} hooks (#7141)
Added new EarlyStopping parameters stopping_threshold and divergence_threshold (#6868)
Added debug flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219)
Added new UnrepeatedDistributedSampler and IndexBatchSamplerWrapper for tracking distributed predictions (#7215)
Added trainer.predict(return_predictions=None|False|True) (#7215)
Added BasePredictionWriter callback to implement prediction saving (#7127)
Added trainer.tune(scale_batch_size_kwargs, lr_find_kwargs) arguments to configure the tuning algorithms (#7258)
Added tpu_distributed check for TPU Spawn barrier (#7241)
Added device updates to TPU Spawn for Pod training (#7243)
Added warning when missing Callback and using resume_from_checkpoint (#7254)
DeepSpeed single file saving (#6900)
Added Training type Plugins Registry (#6982, #7063, #7214, #7224)
Add ignore param to save_hyperparameters (#6056)
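A minimal sketch of the new LightningCLI mentioned above; TinyModel, TinyData, and the random dataset are hypothetical placeholders used only for illustration:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl
    from pytorch_lightning.utilities.cli import LightningCLI


    class TinyModel(pl.LightningModule):
        def __init__(self, learning_rate: float = 0.02):
            super().__init__()
            self.save_hyperparameters()
            self.layer = torch.nn.Linear(16, 1)

        def training_step(self, batch, batch_idx):
            (x,) = batch
            return self.layer(x).pow(2).mean()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=self.hparams.learning_rate)


    class TinyData(pl.LightningDataModule):
        def train_dataloader(self):
            return DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)


    if __name__ == "__main__":
        # Trainer / model / data arguments all become CLI flags, e.g.:
        #   python train.py --trainer.max_epochs=3 --model.learning_rate=0.1
        cli = LightningCLI(TinyModel, TinyData)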
[1.3.0] - Changed¶
Changed LightningModule.truncated_bptt_steps to be a property (#7323)
Changed the EarlyStopping callback from running EarlyStopping.on_validation_end by default if only training is run. Set check_on_train_epoch_end to run the callback at the end of the train epoch instead of at the end of the validation epoch (#7069)
Renamed pytorch_lightning.callbacks.swa to pytorch_lightning.callbacks.stochastic_weight_avg (#6259)
Refactor RunningStage and TrainerState usage (#4945, #7173)
    Added RunningStage.SANITY_CHECKING
    Added TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
    Changed trainer.evaluating to return True if validating or testing
Changed setup() and teardown() stage argument to take any of {fit,validate,test,predict} (#6386)
Changed profilers to save separate report files per state and rank (#6621)
The trainer no longer tries to save a checkpoint on exception or run callback's on_train_end functions (#6864)
Changed PyTorchProfiler to use torch.autograd.profiler.record_function to record functions (#6349)
Disabled lr_scheduler.step() in manual optimization (#6825)
Changed warnings and recommendations for dataloaders in ddp_spawn (#6762)
pl.seed_everything will now also set the seed on the DistributedSampler (#7024)
Changed default setting for communication of multi-node training using DDPShardedPlugin (#6937)
trainer.tune() now returns the tuning result (#7258)
LightningModule.from_datasets() now accepts IterableDataset instances as training datasets (#7503)
Changed resume_from_checkpoint warning to an error when the checkpoint file does not exist (#7075)
Automatically set sync_batchnorm for training_type_plugin (#6536)
Allowed training type plugin to delay optimizer creation (#6331)
Removed ModelSummary validation from train loop on_trainer_init (#6610)
Moved save_function to accelerator (#6689)
Improved verbose logging for EarlyStopping callback (#6811)
Run ddp_spawn dataloader checks on Windows (#6930)
Updated mlflow to use resolve_tags (#6746)
Moved save_hyperparameters to its own function (#7119)
Replaced _DataModuleWrapper with __new__ (#7289)
Reset current_fx properties on lightning module in teardown (#7247)
Auto-set DataLoader.worker_init_fn with seed_everything (#6960)
Remove model.trainer call inside of dataloading mixin (#7317)
Split profilers module (#6261)
Ensure accelerator is valid if running interactively (#5970)
Disabled batch transfer in DP mode (#6098)
[1.3.0] - Deprecated¶
Deprecated outputs in both LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#7339)
Deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#7323)
Deprecated LightningModule.grad_norm in favor of pytorch_lightning.utilities.grads.grad_norm (#7292)
Deprecated the save_function property from the ModelCheckpoint callback (#7201)
Deprecated LightningModule.write_predictions and LightningModule.write_predictions_dict (#7066)
Deprecated TrainerLoggingMixin in favor of a separate utilities module for metric handling (#7180)
Deprecated TrainerTrainingTricksMixin in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834)
period has been deprecated in favor of every_n_val_epochs in the ModelCheckpoint callback (#6146)
Deprecated trainer.running_sanity_check in favor of trainer.sanity_checking (#4945)
Deprecated Profiler(output_filename) in favor of dirpath and filename (#6621)
Deprecated PytorchProfiler(profiled_functions) in favor of record_functions (#6349)
Deprecated @auto_move_data in favor of trainer.predict (#6993)
Deprecated Callback.on_load_checkpoint(checkpoint) in favor of Callback.on_load_checkpoint(trainer, pl_module, checkpoint) (#7253); see the sketch after this list
Deprecated metrics in favor of torchmetrics (#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131)
Deprecated the LightningModule.datamodule getter and setter methods; access them through Trainer.datamodule instead (#7168)
Deprecated the use of Trainer(gpus="i") (string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
[1.3.0] - Removed¶
Removed the exp_save_path property from the LightningModule (#7266)
Removed training loop explicitly calling EarlyStopping.on_validation_end if no validation is run (#7069)
Removed automatic_optimization as a property from the training loop in favor of LightningModule.automatic_optimization (#7130)
Removed evaluation loop legacy returns for *_epoch_end hooks (#6973)
Removed support for passing a bool value to profiler argument of Trainer (#6164)
Removed no return warning from val/test step (#6139)
Removed passing a ModelCheckpoint instance to Trainer(checkpoint_callback) (#6166)
Removed deprecated Trainer argument enable_pl_optimizer and automatic_optimization (#6163)
Removed deprecated metrics (#6161)
from pytorch_lightning.metrics.functional.classification removed to_onehot, to_categorical, get_num_classes, roc, multiclass_roc, average_precision, precision_recall_curve, multiclass_precision_recall_curve
from pytorch_lightning.metrics.functional.reduction removed reduce, class_reduce
Removed deprecated ModelCheckpoint arguments prefix, mode="auto" (#6162)
Removed mode='auto' from EarlyStopping (#6167)
Removed epoch and step arguments from ModelCheckpoint.format_checkpoint_name(), these are now included in the metrics argument (#7344)
Removed legacy references for magic keys in the Result object (#6016)
Removed deprecated LightningModule hparams setter (#6207)
Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the "log"/"progress_bar" magic keys. Use self.log instead (#6734) (see the sketch after this list)
Removed trainer.fit() return value of 1. It has no return now (#7237)
Removed logger_connector legacy code (#6733)
Removed unused mixin attributes (#6487)
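As a minimal sketch of moving off the removed "log"/"progress_bar" magic keys, assuming a toy LightningModule (layer sizes and metric names are illustrative):

```python
import torch
from torch import nn
from pytorch_lightning import LightningModule


class LitRegressor(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self(x), y)
        # Returning {"loss": loss, "progress_bar": {...}, "log": {...}} is removed;
        # self.log surfaces the metric in the progress bar and the configured loggers.
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```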
[1.3.0] - Fixed¶
Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
Fixed attaching train and validation dataloaders when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 (#7207)
Added a barrier in the accelerator teardown to synchronize processes before execution finishes (#6814)
Fixed multi-node DDP sub-process launch by using local_rank instead of global_rank for main process assertion (#7061)
Fixed incorrect removal of WORLD_SIZE environment variable in DDP training when launching with torch distributed/torchelastic (#6942)
Made the Plugin.reduce method more consistent across all Plugins to reflect a mean-reduction by default (#6011)
Move lightning module to correct device type when using LightningDistributedWrapper (#6070)
Do not print top-k verbose log with ModelCheckpoint(monitor=None) (#6109)
Fixed ModelCheckpoint(save_top_k=0, save_last=True) not saving the last checkpoint (#6136)
Fixed .teardown(stage='fit') and .on_fit_{start,end}() getting called during trainer.test (#6386)
Fixed LightningModule all_gather on cpu tensors (#6416)
Fixed torch distributed not available in setup hook for DDP (#6506)
Fixed trainer.tuner.{lr_find,scale_batch_size} not setting the Trainer state properly (#7258) (see the tuner sketch after this list)
Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
Fixed pickle error checker to now check for pickle.PickleError to catch all pickle errors (#6917)
Fixed a bug where the outputs object passed to LightningModule.training_epoch_end was different from the object passed to the on_train_end_epoch hook (#6969)
Fixed a bug where the outputs passed to train_batch_end would be lists even when using a single optimizer and no truncated backprop through time steps (#6969)
Fixed bug for trainer error handling which would cause hang for distributed training (#6864)
Fixed self.device not returning the correct device in replicas of data-parallel (#6414)
Fixed lr_find trying beyond num_training steps and suggesting a too high learning rate (#7076)
Fixed logger creating incorrect version folder in DDP with repeated Trainer.fit calls (#7077)
Fixed metric objects passed directly to self.log not being reset correctly (#7055)
Fixed CombinedLoader in distributed settings for validation / testing (#7102)
Fixed the save_dir in WandbLogger when the run was initiated externally (#7106)
Fixed num_sanity_val_steps affecting reproducibility of training data shuffling (#7014)
Fixed resetting device after fitting/evaluating/predicting (#7188)
Fixed bug where trainer.tuner.scale_batch_size(max_trials=0) would not return the correct batch size result (#7262)
Fixed metrics not being properly logged with precision=16 and manual_optimization (#7228)
Fixed BaseFinetuning properly reloading optimizer_states when using resume_from_checkpoint (#6891)
Fixed parameters_to_ignore not properly set to DDPWrapper (#7239)
Fixed parsing of fast_dev_run=True with the built-in ArgumentParser (#7240)
Fixed handling an IterableDataset that fails to produce a batch at the beginning of an epoch (#7294)
Fixed LightningModule.save_hyperparameters() when attempting to save an empty container (#7268)
Fixed apex not properly instantiated when running with ddp (#7274)
Fixed optimizer state not moved to GPU (#7277)
Fixed custom init args for WandbLogger (#6989)
Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
Fixed examples ( #6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398 )
Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
Updated logic for checking TPUs availability (#6767)
Resolve TPU miss rendezvous (#6781)
Fixed auto-scaling mode when calling tune method on trainer (#7321)
Fixed finetuning complex models correctly unfreezes (#6880)
Ensure we set the eval/train flag correctly on accelerator model (#6877)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic (#6802)
Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
Fixed the gradient_clip_algorithm has no effect (#6928)
Fixed CUDA OOM detection and handling (#6934)
Fixed unfreeze_and_add_param_group expects modules rather than module (#6822)
Fixed DPP + SyncBN when move on device (#6838)
Fixed missing arguments in lr_find call (#6784)
Fixed set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
Fixed NeptuneLogger.log_text(step=None) (#7194)
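The tuner entry points touched by the fixes above (#7258, #7076, #7262) can be exercised with a sketch like the following; the tiny model, synthetic data and tuning options are assumptions for illustration only:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer


class TinyModel(LightningModule):
    def __init__(self, lr=0.01, batch_size=16):
        super().__init__()
        self.save_hyperparameters()  # exposes self.hparams.lr / self.hparams.batch_size
        self.layer = nn.Linear(4, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.hparams.lr)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))
        return DataLoader(data, batch_size=self.hparams.batch_size)


model = TinyModel()
trainer = Trainer(max_epochs=1)
# scale_batch_size updates hparams.batch_size in place; lr_find returns an object with a suggestion
trainer.tuner.scale_batch_size(model, mode="power", max_trials=3)
lr_finder = trainer.tuner.lr_find(model, num_training=20)
print(lr_finder.suggestion())
```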
[1.2.9] - 2021-04-20¶
[1.2.9] - Fixed¶
[1.2.8] - 2021-04-14¶
[1.2.8] - Added¶
Added TPUSpawn + IterableDataset error message (#6875)
[1.2.8] - Fixed¶
Fixed process rank not being available right away after Trainer instantiation (#6941)
Fixed sync_dist for tpus (#6950)
Fixed AttributeError for require_backward_grad_sync when running manual optimization with sharded plugin (#6915)
Fixed --gpus default for parser returned by Trainer.add_argparse_args (#6898)
Fixed TPU Spawn all gather (#6896)
Fixed EarlyStopping logic when min_epochs or min_steps requirement is not met (#6705)
Fixed csv extension check (#6436)
Fixed checkpoint issue when using Horovod distributed backend (#6958)
Fixed tensorboard exception raising (#6901)
Fixed setting the eval/train flag correctly on accelerator model (#6983)
Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
Fixed bug where BaseFinetuning.flatten_modules() was duplicating leaf node parameters (#6879)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic
[1.2.7] - 2021-04-06¶
[1.2.7] - Fixed¶
Fixed a bug with omegaconf and xm.save (#6741)
Fixed an issue with IterableDataset when len is not defined (#6828)
Sanitize None params during pruning (#6836)
Enforce an epoch scheduler interval when using SWA (#6588)
Fixed TPU Colab hang issue, post training (#6816)
Fixed a bug where TensorBoardLogger would give a warning and not log correctly to a symbolic link save_dir (#6730)
Fixed bug where predict could not be used when progress_bar_refresh_rate=0 (#6884)
[1.2.6] - 2021-03-30¶
[1.2.6] - Changed¶
Changed the behavior of on_epoch_start to run at the beginning of validation & test epoch (#6498)
[1.2.6] - Removed¶
Removed legacy code to include step dictionary returns in callback_metrics. Use self.log_dict instead. (#6682)
[1.2.6] - Fixed¶
Fixed DummyLogger.log_hyperparams raising a TypeError when running with fast_dev_run=True (#6398)
Fixed error on TPUs when there was no ModelCheckpoint (#6654)
Fixed trainer.test freeze on TPUs (#6654)
Fixed a bug where gradients were disabled after calling Trainer.predict (#6657)
Fixed bug where no TPUs were detected in a TPU pod env (#6719)
[1.2.5] - 2021-03-23¶
[1.2.5] - Changed¶
[1.2.5] - Fixed¶
[1.2.4] - 2021-03-16¶
[1.2.4] - Changed¶
Changed the default of find_unused_parameters back to True in DDP and DDP Spawn (#6438) (see the sketch below)
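Restoring the faster setting is still possible by configuring the plugin explicitly; a sketch assuming a 2-GPU machine and a model whose parameters all receive gradients:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# The default was reverted to find_unused_parameters=True (#6438);
# opting back into the faster setting when it is known to be safe:
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=False)],
)
```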
[1.2.4] - Fixed¶
Expose DeepSpeed loss parameters to allow users to fix loss instability (#6115)
Fixed DP reduction with collection (#6324)
Fixed an issue where the tuner would not tune the learning rate if also tuning the batch size (#4688)
Fixed broadcast to use PyTorch broadcast_object_list and add reduce_decision (#6410)
Fixed logger creating directory structure too early in DDP (#6380)
Fixed DeepSpeed additional memory use on rank 0 when default device not set early enough (#6460)
Fixed an issue with Tuner.scale_batch_size not finding the batch size attribute in the datamodule (#5968)
Fixed an exception in the layer summary when the model contains torch.jit scripted submodules (#6511)
Fixed when Train loop config was run during Trainer.predict (#6541)
[1.2.3] - 2021-03-09¶
[1.2.3] - Fixed¶
Fixed ModelPruning(make_pruning_permanent=True) pruning buffers getting removed when saved during training (#6073)
Fixed _stable_1d_sort to work when n >= N (#6177)
Fixed AttributeError when logger=None on TPU (#6221)
Fixed PyTorch Profiler with emit_nvtx (#6260)
Fixed trainer.test from best_path hangs after calling trainer.fit (#6272)
Fixed SingleTPU calling all_gather (#6296)
Ensure we check DeepSpeed/Sharded in multi-node DDP (#6297)
Check LightningOptimizer doesn't delete optimizer hooks (#6305)
Resolve memory leak for evaluation (#6326)
Ensure that clip gradients is only called if the value is greater than 0 (#6330)
Fixed Trainer not resetting lightning_optimizers when calling Trainer.fit() multiple times (#6372)
[1.2.2] - 2021-03-02¶
[1.2.2] - Added¶
Added checkpoint parameter to callback's on_save_checkpoint hook (#6072)
[1.2.2] - Changed¶
[1.2.2] - Fixed¶
Fixed epoch level schedulers not being called when val_check_interval < 1.0 (#6075)
Fixed multiple early stopping callbacks (#6197)
Fixed incorrect usage of detach(), cpu(), to() (#6216)
Fixed LBFGS optimizer support which didn't converge in automatic optimization (#6147)
Prevent WandbLogger from dropping values (#5931)
Fixed error thrown when using valid distributed mode in multi node (#6297)
[1.2.1] - 2021-02-23¶
[1.2.1] - Fixed¶
[1.2.0] - 2021-02-18¶
[1.2.0] - Added¶
Added DataType, AverageMethod and MDMCAverageMethod enum in metrics (#5657)
Added support for summarized model total params size in megabytes (#5590)
Added support for multiple train loaders (#1959)
Added Accuracy metric now generalizes to Top-k accuracy for (multi-dimensional) multi-class inputs using the top_k parameter (#4838)
Added Accuracy metric now enables the computation of subset accuracy for multi-label or multi-dimensional multi-class inputs with the subset_accuracy parameter (#4838)
Added HammingDistance metric to compute the hamming distance (loss) (#4838)
Added max_fpr parameter to auroc metric for computing partial auroc metric (#3790)
Added StatScores metric to compute the number of true positives, false positives, true negatives and false negatives (#4839)
Added R2Score metric (#5241)
Added LambdaCallback (#5347)
Added BackboneLambdaFinetuningCallback (#5377)
Accelerator all_gather supports collection (#5221)
Added image_gradients functional metric to compute the image gradients of a given input image. (#5056)
Added MetricCollection (#4318)
Added .clone() method to metrics (#4318)
Added IoU class interface (#4704)
Support to tie weights after moving model to TPU via on_post_move_to_device hook
Added missing val/test hooks in LightningModule (#5467)
The Recall and Precision metrics (and their functional counterparts recall and precision) can now be generalized to Recall@K and Precision@K with the use of top_k parameter (#4842)
Added PyTorchProfiler (#5560)
Added compositional metrics (#5464)
Added Trainer method predict(...) for high performance predictions (#5579)
Added on_before_batch_transfer and on_after_batch_transfer data hooks (#3671)
Added AUC/AUROC class interface (#5479)
Added PredictLoop object (#5752)
Added LightningModule.configure_callbacks to enable the definition of model-specific callbacks (#5621)
Added dim to PSNR metric for mean-squared-error reduction (#5957)
Added proximal policy optimization template to pl_examples (#5394)
Added log_graph to CometLogger (#5295)
Added possibility for nested loaders (#5404)
Added sync_step to Wandb logger (#5351)
Added StochasticWeightAveraging callback (#5640)
Added LightningDataModule.from_datasets(...) (#5133)
Added PL_TORCH_DISTRIBUTED_BACKEND env variable to select backend (#5981)
Added Trainer flag to activate Stochastic Weight Averaging (SWA), Trainer(stochastic_weight_avg=True) (#6038) (see the sketch after this list)
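A minimal sketch of the two ways to enable Stochastic Weight Averaging introduced here; the epoch count and swa_epoch_start value are illustrative:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

# Trainer flag added in #6038:
trainer = Trainer(max_epochs=10, stochastic_weight_avg=True)

# Equivalent, with direct access to the callback's options (#5640):
trainer = Trainer(max_epochs=10, callbacks=[StochasticWeightAveraging(swa_epoch_start=0.8)])
```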
[1.2.0] - Changed¶
Changed
stat_scoresmetric now calculates stat scores over all classes and gains new parameters, in line with the newStatScoresmetric (#4839)Changed
computer_vision_fine_tunningexample to useBackboneLambdaFinetuningCallback(#5377)Changed
automatic castingfor LoggerConnectormetrics(#5218)Changed
iou[func] to allow float input (#4704)Metric
compute()method will no longer automatically callreset()(#5409)Set PyTorch 1.4 as min requirements, also for testing and examples
torchvision>=0.5andtorchtext>=0.5(#5418)Changed
callbacksargument inTrainerto allowCallbackinput (#5446)Changed the default of
find_unused_parameterstoFalsein DDP (#5185)Changed
ModelCheckpointversion suffixes to start at 1 (#5008)Progress bar metrics tensors are now converted to float (#5692)
Changed the default value for the
progress_bar_refresh_rateTrainer argument in Google COLAB notebooks to 20 (#5516)Extended support for purely iteration-based training (#5726)
Made
LightningModule.global_rank,LightningModule.local_rankandLightningModule.loggerread-only properties (#5730)Forced
ModelCheckpointcallbacks to run after all others to guarantee all states are saved to the checkpoint (#5731)Refactored Accelerators and Plugins:
Added base classes for plugins (#5715)
Added parallel plugins for DP, DDP, DDPSpawn, DDP2 and Horovod (#5714)
Precision Plugins (#5718)
Added new Accelerators for CPU, GPU and TPU (#5719)
Added RPC and Sharded plugins (#5732)
Added missing
LightningModule-wrapper logic to new plugins and accelerator (#5734)Moved device-specific teardown logic from training loop to accelerator (#5973)
Moved accelerator_connector.py to the connectors subfolder (#6033)
Trainer only references accelerator (#6039)
Made parallel devices optional across all plugins (#6051)
Enabled
self.login callbacks (#5094)Renamed xxx_AVAILABLE as protected (#5082)
Unified module names in Utils (#5199)
Refactor: clean trainer device & distributed getters (#5300)
Simplified training phase as LightningEnum (#5419)
Updated metrics to use LightningEnum (#5689)
Changed the sequence of on_train_batch_end, on_batch_end & on_train_epoch_end, on_epoch_end hooks (#5688)
Refactored setup_training and remove test_mode (#5388)
Disabled training with zero num_training_batches when insufficient limit_train_batches (#5703)
Refactored EpochResultStore (#5522)
Update lr_finder to check for attribute if not running fast_dev_run (#5990)
LightningOptimizer manual optimizer is more flexible and expose toggle_model (#5771)
MlflowLogger limit parameter value length to 250 char (#5893)
Re-introduced fix for Hydra directory sync with multiple process (#5993)
[1.2.0] - Deprecated¶
Function stat_scores_multiple_classes is deprecated in favor of stat_scores (#4839)
Moved accelerators and plugins to its legacy pkg (#5645)
Deprecated LightningDistributedDataParallel in favor of new wrapper module LightningDistributedModule (#5185)
Deprecated LightningDataParallel in favor of new wrapper module LightningParallelModule (#5670)
Renamed utils modules (#5199)
argparse_utils >> argparse
model_utils >> model_helpers
warning_utils >> warnings
xla_device_utils >> xla_device
Deprecated using 'val_loss' to set the ModelCheckpoint monitor (#6012) (see the sketch after this list)
Deprecated .get_model() with explicit .lightning_module property (#6035)
Deprecated Trainer attribute accelerator_backend in favor of accelerator (#6034)
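A sketch of moving off the deprecated defaults above; the monitored metric name is an assumption and must match a key logged with self.log:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Relying on an implicit 'val_loss' monitor is deprecated (#6012); name the metric explicitly.
checkpoint = ModelCheckpoint(monitor="val_accuracy", mode="max")
trainer = Trainer(callbacks=[checkpoint])

# Deprecated: trainer.get_model(); the lightning_module property is the replacement (#6035).
model = trainer.lightning_module
```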
[1.2.0] - Removed¶
[1.2.0] - Fixed¶
Fixed distributed setting and
ddp_cpuonly withnum_processes>1(#5297)Fixed
num_workersfor Windows example (#5375)Fixed loading yaml (#5619)
Fixed support custom DataLoader with DDP if they can be re-instantiated (#5745)
Fixed repeated
.fit()calls ignore max_steps iteration bound (#5936)Fixed throwing
MisconfigurationErroron unknown mode (#5255)Resolve bug with Finetuning (#5744)
Fixed
ModelCheckpointrace condition in file existence check (#5155)Fixed some compatibility with PyTorch 1.8 (#5864)
Fixed forward cache (#5895)
Fixed recursive detach of tensors to CPU (#6007)
Fixed passing wrong strings for scheduler interval doesn’t throw an error (#5923)
Fixed wrong
requires_gradstate afterreturn Nonewith multiple optimizers (#5738)Fixed add
on_epoch_endhook at the end ofvalidation,testepoch (#5986)Fixed missing
process_dataloadercall forTPUSpawnwhen in distributed mode (#6015)Fixed progress bar flickering by appending 0 to floats/strings (#6009)
Fixed synchronization issues with TPU training (#6027)
Fixed
hparams.yamlsaved twice when usingTensorBoardLogger(#5953)Fixed
fairscalecompatible with PT 1.8 (#5996)Ensured
process_dataloaderis called whentpu_cores > 1to use Parallel DataLoader (#6015)Attempted SLURM auto resume call when non-shell call fails (#6002)
Fixed wrapping optimizers upon assignment (#6006)
Fixed allowing hashing of metrics with lists in their state (#5939)
[1.1.8] - 2021-02-08¶
[1.1.8] - Fixed¶
[1.1.7] - 2021-02-03¶
[1.1.7] - Fixed¶
Fixed
TensorBoardLoggernot closingSummaryWriteronfinalize(#5696)Fixed filtering of pytorch “unsqueeze” warning when using DP (#5622)
Fixed
num_classesargument in F1 metric (#5663)Fixed
log_dirproperty (#5537)Fixed a race condition in
ModelCheckpointwhen checking if a checkpoint file exists (#5144)Remove unnecessary intermediate layers in Dockerfiles (#5697)
Fixed auto learning rate ordering (#5638)
[1.1.6] - 2021-01-26¶
[1.1.6] - Changed¶
[1.1.6] - Fixed¶
Fixed
toggle_optimizerto resetrequires_gradstate (#5574)Fixed FileNotFoundError for best checkpoint when using DDP with Hydra (#5629)
Fixed an error when logging a progress bar metric with a reserved name (#5620)
Fixed
Metric’sstate_dictnot included when child modules (#5614)Fixed Neptune logger creating multiple experiments when GPUs > 1 (#3256)
Fixed duplicate logs appearing in console when using the python logging module (#5509)
Fixed tensor printing in
trainer.test()(#5138)Fixed not using dataloader when
hparamspresent (#4559)
[1.1.5] - 2021-01-19¶
[1.1.5] - Fixed¶
[1.1.4] - 2021-01-12¶
[1.1.4] - Added¶
Add automatic optimization property setter to lightning module (#5169)
[1.1.4] - Changed¶
Changed deprecated
enable_pl_optimizer=True(#5244)
[1.1.4] - Fixed¶
Fixed
transfer_batch_to_devicefor DDP withlen(devices_ids) == 1(#5195)Logging only on
not should_accumulate()during training (#5417)Resolve interpolation bug with Hydra (#5406)
Check environ before selecting a seed to prevent warning message (#4743)
Fixed signature mismatch in
model_to_deviceofDDPCPUHPCAccelerator(#5505)
[1.1.3] - 2021-01-05¶
[1.1.3] - Added¶
[1.1.3] - Changed¶
[1.1.3] - Fixed¶
Fixed
trainer.testreturning non-test metrics (#5214)Fixed metric state reset (#5273)
Fixed
--num-nodesonDDPSequentialPlugin(#5327)Fixed invalid value for
weights_summary(#5296)Fixed
Trainer.testnot using the latestbest_model_path(#5161)Fixed existence check for hparams not using underlying filesystem (#5250)
Fixed
LightningOptimizerAMP bug (#5191)Fixed casted key to string in
_flatten_dict(#5354)
[1.1.2] - 2020-12-23¶
[1.1.2] - Added¶
[1.1.2] - Removed¶
enable_pl_optimizer=Falseby default to temporarily fix AMP issues (#5163)
[1.1.2] - Fixed¶
Metric reduction with Logging (#5150)
Remove nan loss in manual optimization (#5121)
Un-balanced logging properly supported (#5119)
Fix hanging in DDP HPC accelerators (#5157)
Fix reset
TensorRunningAccum(#5106)Updated
DALIClassificationLoaderto not use deprecated arguments (#4925)Corrected call to
torch.no_grad(#5124)
[1.1.1] - 2020-12-15¶
[1.1.1] - Added¶
Add a notebook example to reach a quick baseline of ~94% accuracy on CIFAR10 using Resnet in Lightning (#4818)
[1.1.1] - Changed¶
[1.1.1] - Removed¶
[1.1.1] - Fixed¶
Fixed trainer by default
NoneinDDPAccelerator(#4915)Fixed
LightningOptimizerto expose optimizer attributes (#5095)Do not warn when the
namekey is used in thelr_schedulerdict (#5057)Check if optimizer supports closure (#4981)
Add deprecated metric utility functions back to functional ( #5067, #5068)
Allow any input in
to_onnxandto_torchscript(#4378)Fixed
DDPHPCAcceleratorhangs in DDP construction by callinginit_device(#5157)
[1.1.0] - 2020-12-09¶
[1.1.0] - Added¶
Added “monitor” key to saved
ModelCheckpoints(#4383)Added
ConfusionMatrixclass interface (#4348)Added multiclass AUROC metric (#4236)
Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
Added optimizer hooks in callbacks (#4379)
Added option to log momentum (#4384)
Added
current_scoretoModelCheckpoint.on_save_checkpoint(#4721)Added logging using
self.login train and evaluation for epoch end hooks ( #4552, #4495, #4439, #4684, #4913)Added ability for DDP plugin to modify optimizer state saving (#4675)
Added
prefixargument in loggers (#4557)Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
Added
PrecisionRecallCurve, ROC, AveragePrecisionclass metric (#4549)Added custom
ApexandNativeAMPasPrecision plugins(#4355)Added
DALI MNISTexample (#3721)Added
sharded pluginfor DDP for multi-gpu training memory optimizations ( #4639, #4686, #4737, #4773)Added
experiment_idto the NeptuneLogger (#3462)Added
Pytorch Geometricintegration example with Lightning (#4568)Added
all_gathermethod toLightningModulewhich allows gradient based tensor synchronizations for use-cases such as negative sampling. (#5012)Enabled
self.login most functions (#4969)Added changeable extension variable for
ModelCheckpoint(#4977)
[1.1.0] - Changed¶
Tuner algorithms will be skipped if
fast_dev_run=True(#3903)WandbLoggerdoes not force wandbreinitarg to True anymore and creates a run only when needed (#4648)Changed
automatic_optimizationto be a model attribute (#4602)Changed
Simple Profilerreport to order by percentage time spent + num calls (#4880)Simplify optimization Logic (#4984)
Classification metrics overhaul (#4837)
Updated
fast_dev_runto accept integer representing num_batches (#4629)Refactored optimizer (#4658)
[1.1.0] - Deprecated¶
[1.1.0] - Removed¶
[1.1.0] - Fixed¶
Added feature to move tensors to CPU before saving (#4309)
Fixed
LoggerConnectorto have logged metrics on root device in DP (#4138)Auto convert tensors to contiguous format when
gather_all(#4907)Fixed
PYTHONPATHfor ddp test model (#4528)Fixed allowing logger to support indexing (#4595)
Fixed DDP and manual_optimization (#4976)
[1.0.8] - 2020-11-24¶
[1.0.8] - Added¶
[1.0.8] - Changed¶
Consistently use
step=trainer.global_stepinLearningRateMonitorindependently oflogging_interval(#4376)Metric states are no longer as default added to
state_dict(#4685)Renamed class metric
Fbeta>>FBeta(#4656)Model summary: add 1 decimal place (#4745)
Do not override
PYTHONWARNINGS(#4700)Changed
init_ddp_connectionmoved fromDDPtoDDPPlugin(#4407)
[1.0.8] - Fixed¶
Fixed checkpoint
hparamsdict casting whenomegaconfis available (#4770)Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
Updated SSIM metric (#4566)
Fixed batch_arg_name - add
batch_arg_nameto all calls to_adjust_batch_sizebug (#4812)Fixed
torchtextdata to GPU (#4785)Fixed a crash bug in MLFlow logger (#4716)
[1.0.7] - 2020-11-17¶
[1.0.7] - Added¶
Added lambda closure to
manual_optimizer_step(#4618)
[1.0.7] - Changed¶
[1.0.7] - Fixed¶
Prevent crash if
sync_dist=Trueon CPU (#4626)Fixed average pbar Metrics (#4534)
Fixed
setupcallback hook to correctly pass the LightningModule through (#4608)Allowing decorate model init with saving
hparamsinside (#4662)Fixed
split_idxset byLoggerConnectorinon_trainer_inittoTrainer(#4697)
[1.0.6] - 2020-11-11¶
[1.0.6] - Added¶
Added metrics aggregation in Horovod and fixed early stopping (#3775)
Added
manual_optimizer_stepwhich work withAMP Nativeandaccumulated_grad_batches(#4485)Added
persistent(mode)method to metrics, to enable and disable metric states being added tostate_dict(#4482)Added congratulations at the end of our notebooks (#4555)
Added parameters
move_metrics_to_cpuin Trainer to disable gpu leak (#4592)
[1.0.6] - Changed¶
[1.0.6] - Fixed¶
Fixed feature-lack in
hpc_load(#4526)Fixed metrics states being overridden in DDP mode (#4482)
Fixed
lightning_getattr,lightning_hasattrnot finding the correct attributes in datamodule (#4347)Fixed automatic optimization AMP by
manual_optimization_step(#4485)Replace
MisconfigurationExceptionwith warning inModelCheckpointCallback (#4560)Fixed logged keys in mlflow logger (#4412)
Fixed
is_picklableby catchingAttributeError(#4508)Fixed multi test dataloaders dict
AttributeErrorerror (#4480)Fixed show progress bar only for
progress_rank 0onDDP_SLURM(#4437)
[1.0.5] - 2020-11-03¶
[1.0.5] - Added¶
[1.0.5] - Changed¶
W&B log in sync with
Trainerstep (#4405)Hook
on_after_backwardis called only whenoptimizer_stepis being called (#4439)Moved
track_and_norm_gradintotraining loopand called only whenoptimizer_stepis being called (#4439)Changed type checker with explicit cast of
ref_modelobject (#4457)Changed
distributed_backend->accelerator(#4429)
[1.0.5] - Deprecated¶
Deprecated passing
ModelCheckpointinstance tocheckpoint_callbackTrainer argument (#4336)
[1.0.5] - Fixed¶
Disable saving checkpoints if not trained (#4372)
Fixed error using
auto_select_gpus=Truewithgpus=-1(#4209)Disabled training when
limit_train_batches=0(#4371)Fixed that metrics do not store computational graph for all seen data (#4313)
Fixed AMP unscale for
on_after_backward(#4439)Fixed TorchScript export when module includes Metrics (#4428)
Fixed TorchScript trace method’s data to device and docstring (#4360)
Fixed CSV logger warning (#4419)
Fixed skip DDP parameter sync (#4301)
Fixed
WandbLogger_sanitize_callable function (#4422)Fixed
AMP Native_unscalegradient (#4441)
[1.0.4] - 2020-10-27¶
[1.0.4] - Added¶
Added
dirpathandfilenameparameter inModelCheckpoint(#4213)Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
Added
strictoption to the scheduler dictionary (#3586)Added
fsspecsupport for profilers (#4162)Added autogenerated helptext to
Trainer.add_argparse_args(#4344)Added support for string values in
Trainer’sprofilerparameter (#3656)Added
optimizer_closuretooptimizer.stepwhen supported (#4190)Added unification of regression metrics (#4166)
Added checkpoint load from Bytes (#4314)
[1.0.4] - Changed¶
[1.0.4] - Deprecated¶
[1.0.4] - Fixed¶
Fixed setting device ids in DDP (#4297)
Fixed synchronization of best model path in
ddp_accelerator(#4323)Fixed
WandbLoggernot uploading checkpoint artifacts at the end of training (#4341)Fixed
FBetacomputation (#4183)Fixed
accumulation across batcheshas completedbefore breaking training loop(#4278)Fixed
ModelCheckpointdon’t increase current_epoch and global_step when not training (#4291)Fixed
COMET_EXPERIMENT_KEYenvironment variable usage in comet logger (#4230)
[1.0.3] - 2020-10-20¶
[1.0.3] - Added¶
Added persistent flag to
Metric.add_state(#4195)
[1.0.3] - Changed¶
[1.0.3] - Fixed¶
[1.0.2] - 2020-10-15¶
[1.0.2] - Added¶
Added trace functionality to the function
to_torchscript(#4142)
[1.0.2] - Changed¶
Called
on_load_checkpointbefore loadingstate_dict(#4057)
[1.0.2] - Removed¶
Removed duplicate metric vs step log for train loop (#4173)
[1.0.2] - Fixed¶
[1.0.1] - 2020-10-14¶
[1.0.1] - Added¶
Added getstate/setstate method for torch.save serialization (#4127)
[1.0.0] - 2020-10-13¶
[1.0.0] - Added¶
Added Explained Variance Metric + metric fix (#4013)
Added Metric <-> Lightning Module integration tests (#4008)
Added parsing OS env vars in
Trainer(#4022)Added classification metrics (#4043)
Updated explained variance metric (#4024)
Enabled plugins (#4041)
Enabled custom clusters (#4048)
Enabled passing in custom accelerators (#4050)
Added
LightningModule.toggle_optimizer(#4058)Added
LightningModule.manual_backward(#4063)Added
outputargument to*_epoch_endhooks (#3967)
[1.0.0] - Changed¶
[1.0.0] - Removed¶
Removed support for EvalResult and TrainResult (#3968)
Removed deprecated trainer flags:
overfit_pct,log_save_interval,row_log_interval(#3969)Removed deprecated early_stop_callback (#3982)
Removed deprecated model hooks (#3980)
Removed deprecated callbacks (#3979)
Removed
trainerargument inLightningModule.backward#4056)
[1.0.0] - Fixed¶
[0.10.0] - 2020-10-07¶
[0.10.0] - Added¶
Enable PyTorch 1.7 compatibility (#3541)
Added
LightningModule.to_torchscriptto support exporting asScriptModule(#3258)Added warning when dropping unpicklable
hparams(#2874)Added EMB similarity (#3349)
Added
ModelCheckpoint.to_yamlmethod (#3048)Allow
ModelCheckpointmonitor to beNone, meaning it will always save (#3630)Disabled optimizers setup during testing (#3059)
Added support for datamodules to save and load checkpoints when training (#3563)
Added support for datamodule in learning rate finder (#3425)
Added gradient clip test for native AMP (#3754)
Added dist lib to enable syncing anything across devices (#3762)
Added
broadcasttoTPUBackend(#3814)Added
XLADeviceUtilsclass to check XLA device type (#3274)
[0.10.0] - Changed¶
Refactored accelerator backends:
moved TPU
xxx_stepto backend (#3118)refactored DDP backend
forward(#3119)refactored GPU backend
__step(#3120)remove obscure forward call in eval + CPU backend
___step(#3123)reduced all simplified forward (#3126)
added hook base method (#3127)
refactor eval loop to use hooks - use
test_modefor if so we can split later (#3129)moved
___step_endhooks (#3130)training forward refactor (#3134)
training AMP scaling refactor (#3135)
eval step scaling factor (#3136)
add eval loop object to streamline eval loop (#3138)
refactored dataloader process hook (#3139)
refactored inner eval loop (#3141)
final inner eval loop hooks (#3154)
clean up hooks in
run_evaluation(#3156)clean up data reset (#3161)
expand eval loop out (#3165)
moved hooks around in eval loop (#3195)
remove
_evaluatefx (#3197)Trainer.fithook clean up (#3198)DDPs train hooks (#3203)
reduced accelerator selection (#3211)
group prepare data hook (#3212)
added data connector (#3285)
modular is_overridden (#3290)
adding
Trainer.tune()(#3293)move
run_pretrain_routine->setup_training(#3294)move train outside of setup training (#3297)
move
prepare_datato data connector (#3307)moved accelerator router (#3309)
train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
duplicate data interface definition up into DataHooks class (#3344)
inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
all logging related calls in a connector (#3395)
added model connector (#3407)
moved eval loop logging to loggers (#3408)
moved eval loop (#3412, #3408)
move
lr_finder(#3434)move specific accelerator code (#3457)
group connectors (#3472)
apex plugin (#3502)
precision plugins (#3504)
Result - make monitor default to
checkpoint_onto simplify (#3571)reference to the Trainer on the
LightningDataModule(#3684)add
.logto lightning module (#3686, #3699, #3701, #3704, #3715)enable tracking original metric when step and epoch are both true (#3685)
deprecated results obj, added support for simpler comms (#3681)
move backends back to individual files (#3712)
fixes logging for eval steps (#3763)
decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806, #3817, #3819, #3927)
remove weight loading hack for ddp_cpu (#3808)
separate
torchelasticfrom DDP (#3810)separate SLURM from DDP (#3809)
decoupled DDP2 (#3816)
bug fix with logging val epoch end + monitor (#3812)
callback system and init DDP (#3836)
epoch can now log independently (#3843)
test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
fixed
init_slurm_connectioncausing hostname errors (#3856)moves init apex from LM to apex connector (#3923)
moves sync bn to each backend (#3925)
moves configure ddp to each backend (#3924)
Deprecation warning (#3844)
Changed
LearningRateLoggertoLearningRateMonitor(#3251)Used
fsspecinstead ofgfilefor all IO (#3320)Swaped
torch.loadforfsspecload in DDP spawn backend (#3787)Swaped
torch.loadforfsspecload in cloud_io loading (#3692)Added support for
to_disk()to use remote filepaths withfsspec(#3930)Updated model_checkpoint’s to_yaml to use
fsspecopen (#3801)Fixed
fsspecis inconsistent when doingfs.ls(#3805)
Refactor
GPUStatsMonitorto improve training speed (#3257)Changed IoU score behavior for classes absent in target and pred (#3098)
Changed IoU
remove_bgbool toignore_indexoptional int (#3098)Changed defaults of
save_top_kandsave_lasttoNonein ModelCheckpoint (#3680)row_log_intervalandlog_save_intervalare now based on training loop’sglobal_stepinstead of epoch-internal batch index (#3667)Silenced some warnings. verified ddp refactors (#3483)
Cleaning up stale logger tests (#3490)
Allow
ModelCheckpointmonitor to beNone(#3633)Enable
Nonemodel checkpoint default (#3669)Skipped
best_model_pathifcheckpoint_callbackisNone(#2962)Used
raise .. from ..to explicitly chain exceptions (#3750)Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
Write predictions in LightningModule instead of EvalResult #3882
[0.10.0] - Deprecated¶
Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
Deprecate early_stop_callback Trainer argument (#3845)
Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748) (see the sketch after this list)
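A sketch of the renamed logging flags, with interval values chosen only for illustration:

```python
from pytorch_lightning import Trainer

# row_log_interval -> log_every_n_steps, log_save_interval -> flush_logs_every_n_steps (#3748)
trainer = Trainer(log_every_n_steps=50, flush_logs_every_n_steps=100)
```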
[0.10.0] - Removed¶
Removed experimental Metric API (#3943, #3949, #3946), listed changes before final removal:
Added hooks to metric module interface (#2528)
Added error when AUROC metric is used for multiclass problems (#3350)
Fixed
ModelCheckpointwithsave_top_k=-1option not tracking the best models when a monitor metric is available (#3735)Fixed counter-intuitive error being thrown in
Accuracymetric for zero target tensor (#3764)Fixed aggregation of metrics (#3517)
Fixed Metric aggregation (#3321)
Fixed RMSLE metric (#3188)
Renamed
reductiontoclass_reductionin classification metrics (#3322)Changed
class_reductionsimilar to sklearn for classification metrics (#3322)Renaming of precision recall metric (#3308)
[0.10.0] - Fixed¶
Fixed
on_train_batch_starthook to end epoch early (#3700)Fixed
num_sanity_val_stepsis clipped tolimit_val_batches(#2917)Fixed ONNX model save on GPU (#3145)
Fixed
GpuUsageLoggerto work on different platforms (#3008)Fixed auto-scale batch size not dumping
auto_lr_findparameter (#3151)Fixed
batch_outputswith optimizer frequencies (#3229)Fixed setting batch size in
LightningModule.datamodulewhen usingauto_scale_batch_size(#3266)Fixed Horovod distributed backend compatibility with native AMP (#3404)
Fixed batch size auto scaling exceeding the size of the dataset (#3271)
Fixed getting
experiment_idfrom MLFlow only once instead of each training loop (#3394)Fixed
overfit_batcheswhich now correctly disables shuffling for the training loader. (#3501)Fixed gradient norm tracking for
row_log_interval > 1(#3489)Fixed
ModelCheckpointname formatting (#3164)Fixed example implementation of AutoEncoder (#3190)
Fixed invalid paths when remote logging with TensorBoard (#3236)
Fixed change
t()totranspose()as XLA devices do not support.t()on 1-dim tensor (#3252)Fixed (weights only) checkpoints loading without PL (#3287)
Fixed
gather_all_tensorscross GPUs in DDP (#3319)Fixed CometML save dir (#3419)
Fixed forward key metrics (#3467)
Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
Fixed global step increment in training loop when
training_epoch_endhook is used (#3673)Fixed dataloader shuffling not getting turned off with
overfit_batches > 0anddistributed_backend = "ddp"(#3534)Fixed determinism in
DDPSpawnBackendwhen usingseed_everythingin main process (#3335)Fixed
ModelCheckpointperiodto actually save everyperiodepochs (#3630)Fixed
val_progress_bartotal withnum_sanity_val_steps(#3751)Fixed Tuner dump: add
current_epochto dumped_params (#3261)Fixed
current_epochandglobal_stepproperties mismatch betweenTrainerandLightningModule(#3785)Fixed learning rate scheduler for optimizers with internal state (#3897)
Fixed
tbptt_reduce_fxwhen non-floating tensors are logged (#3796)Fixed model checkpoint frequency (#3852)
Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
Fixed
TrainerEvaluationLoopMixinactivatesmodel.train()at the end (#3858)Fixed
overfit_batcheswhen using with multiple val/test_dataloaders (#3857)Fixed enables
training_stepto returnNone(#3862)Fixed init nan for checkpointing (#3863)
Fixed for
load_from_checkpoint(#2776)Fixes incorrect
batch_sizeswhen Dataloader returns a dict with multiple tensors (#3668)Fixed unexpected signature for
validation_step(#3947)
[0.9.0] - 2020-08-20¶
[0.9.0] - Added¶
Added basic
CSVLogger(#2721)Added SSIM metrics (#2671)
Added BLEU metrics (#2535)
Added support to export a model to ONNX format (#2596)
Added support for
Trainer(num_sanity_val_steps=-1)to check all validation data before training (#2246)Added struct. output:
Added class
LightningDataModule(#2668)Added support for PyTorch 1.6 (#2745)
Added call DataModule hooks implicitly in trainer (#2755)
Added support for Mean in DDP Sync (#2568)
Added remaining
sklearnmetrics:AveragePrecision,BalancedAccuracy,CohenKappaScore,DCG,Hamming,Hinge,Jaccard,MeanAbsoluteError,MeanSquaredError,MeanSquaredLogError,MedianAbsoluteError,R2Score,MeanPoissonDeviance,MeanGammaDeviance,MeanTweedieDeviance,ExplainedVariance(#2562)Added support for
limit_{mode}_batches (int)to work with infinite dataloader (IterableDataset) (#2840)Added support returning python scalars in DP (#1935)
Added support to Tensorboard logger for OmegaConf
hparams(#2846)Added tracking of basic states in
Trainer(#2541)Tracks all outputs including TBPTT and multiple optimizers (#2890)
Added GPU Usage Logger (#2932)
Added
strict=Falseforload_from_checkpoint(#2819)Added saving test predictions on multiple GPUs (#2926)
Auto log the computational graph for loggers that support this (#3003)
Added warning when changing monitor and using results obj (#3014)
Added a hook
transfer_batch_to_deviceto theLightningDataModule(#3038)
[0.9.0] - Changed¶
Truncated long version numbers in progress bar (#2594)
Enabling val/test loop disabling (#2692)
Refactored into
acceleratormodule:Using
.comet.configfile forCometLogger(#1913)Updated hooks arguments - breaking for
setupandteardown(#2850)Using
gfileto support remote directories (#2164)Moved optimizer creation after device placement for DDP backends (#2904)
Support
**DictConfigforhparamserialization (#2519)Removed callback metrics from test results obj (#2994)
Re-enabled naming metrics in ckpt name (#3060)
Changed progress bar epoch counting to start from 0 (#3061)
[0.9.0] - Deprecated¶
Deprecated Trainer attribute
ckpt_path, which will now be set byweights_save_path(#2681)
[0.9.0] - Removed¶
Removed deprecated: (#2760)
core decorator
data_loaderModule hook
on_sanity_check_startand loadingload_from_metricspackage
pytorch_lightning.loggingTrainer arguments:
show_progress_bar,num_tpu_cores,use_amp,print_nan_gradsLR Finder argument
num_accumulation_steps
[0.9.0] - Fixed¶
Fixed
accumulate_grad_batchesfor last batch (#2853)Fixed setup call while testing (#2624)
Fixed local rank zero casting (#2640)
Fixed single scalar return from training (#2587)
Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
Fixed
dtypeanddeviceproperties not getting updated in submodules (#2657)Fixed
fast_dev_runto run for all dataloaders (#2581)Fixed
save_dirin loggers getting ignored by default value ofweights_save_pathwhen user did not specifyweights_save_path(#2681)Fixed
weights_save_pathgetting ignored whenlogger=Falseis passed to Trainer (#2681)Fixed TPU multi-core and Float16 (#2632)
Fixed test metrics not being logged with
LoggerCollection(#2723)Fixed data transfer to device when using
torchtext.data.Fieldandinclude_lengths is True(#2689)Fixed shuffle argument for distributed sampler (#2789)
Fixed logging interval (#2694)
Fixed loss value in the progress bar is wrong when
accumulate_grad_batches > 1(#2738)Fixed correct CWD for ddp sub-processes when using Hydra (#2719)
Fixed selecting GPUs using
CUDA_VISIBLE_DEVICES(#2739)Fixed false
num_classeswarning in metrics (#2781)Fixed shell injection vulnerability in subprocess call (#2786)
Fixed LR finder and
hparamscompatibility (#2821)Fixed
ModelCheckpointnot saving the latest information whensave_last=True(#2881)Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
Fixed apex gradient clipping (#2829)
Fixed save apex scaler states (#2828)
Fixed a model loading issue with inheritance and variable positional arguments (#2911)
Fixed passing
non_blocking=Truewhen transferring a batch object that does not support it (#2910)Fixed checkpointing to remote file paths (#2925)
Fixed adding val step argument to metrics (#2986)
Fixed an issue that caused
Trainer.test()to stall in ddp mode (#2997)Fixed gathering of results with tensors of varying shape (#3020)
Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
Fixed automatic batch scaling not working with half precision (#3045)
Fixed setting device to root gpu (#3042)
[0.8.5] - 2020-07-09¶
[0.8.5] - Added¶
[0.8.5] - Removed¶
Removed auto val reduce (#2462)
[0.8.5] - Fixed¶
Flattening Wandb Hyperparameters (#2459)
Fixed using the same DDP python interpreter and actually running (#2482)
Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
Made
TensorBoardLoggerandCometLoggerpickleable (#2518)Fixed a problem with
MLflowLoggercreating multiple run folders (#2502)Fixed global_step increment (#2455)
Fixed TPU hanging example (#2488)
Fixed
argparsedefault value bug (#2526)Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
Fixed Trainer
.fit()returning last not best weights in “ddp_spawn” (#2565)Fixed passing (do not pass) TPU weights back on test (#2566)
[0.8.4] - 2020-07-01¶
[0.8.4] - Added¶
[0.8.4] - Changed¶
Enabled no returns from eval (#2446)
[0.8.4] - Fixed¶
[0.8.3] - 2020-06-29¶
[0.8.3] - Fixed¶
[0.8.2] - 2020-06-28¶
[0.8.2] - Added¶
Added TorchText support for moving data to GPU (#2379)
[0.8.2] - Changed¶
[0.8.2] - Removed¶
Moved
TrainsLoggerto Bolts (#2384)
[0.8.2] - Fixed¶
Fixed parsing TPU arguments and TPU tests (#2094)
Fixed number batches in case of multiple dataloaders and
limit_{*}_batches(#1920, #2226)Fixed an issue with forward hooks not being removed after model summary (#2298)
Fix for
load_from_checkpoint()not working with absolute path on Windows (#2294)Fixed an issue how _has_len handles
NotImplementedErrore.g. raised bytorchtext.data.Iterator(#2293), (#2307)Fixed
average_precisionmetric (#2319)Fixed ROC metric for CUDA tensors (#2304)
Fixed lost compatibility with custom datatypes implementing
.to(#2335)Fixed loading model with kwargs (#2387)
Fixed sum(0) for
trainer.num_val_batches(#2268)Fixed checking if the parameters are a
DictConfigObject (#2216)Fixed SLURM weights saving (#2341)
Fixed swaps LR scheduler order (#2356)
Fixed adding tensorboard
hparamslogging test (#2342)Fixed use model ref for tear down (#2360)
Fixed logger crash on DDP (#2388)
Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
Fixed loading past checkpoints from v0.7.x (#2405)
Fixed loading model without arguments (#2403)
Fixed Windows compatibility issue (#2358)
[0.8.1] - 2020-06-19¶
[0.8.1] - Fixed¶
[0.8.0] - 2020-06-18¶
[0.8.0] - Added¶
Added
overfit_batches,limit_{val|test}_batchesflags (overfit now uses training set for all three) (#2213)Added metrics
Allow dataloaders without sampler field present (#1907)
Added option
save_lastto save the model at the end of every epoch inModelCheckpoint(#1908)Early stopping checks
on_validation_end(#1458)Speed up single-core TPU training by loading data using
ParallelLoader(#2033)Added a model hook
transfer_batch_to_devicethat enables moving custom data structures to the target device (#1756)Added black formatter for the code with code-checker on pull (#1610)
Added back the slow spawn ddp implementation as
ddp_spawn(#2115)Added loading checkpoints from URLs (#1667)
Added a callback method
on_keyboard_interruptfor handling KeyboardInterrupt events during training (#2134)Added a decorator
auto_move_datathat moves data to the correct device when using the LightningModule for inference (#1905)Added
ckpt_pathoption toLightningModule.test(...)to load particular checkpoint (#2190)Added
setupandteardownhooks for model (#2229)
[0.8.0] - Changed¶
Allow user to select individual TPU core to train on (#1729)
Removed non-finite values from loss in
LRFinder(#1862)Allow passing model hyperparameters as complete kwarg list (#1896)
Renamed
ModelCheckpoint’s attributesbesttobest_model_scoreandkth_best_modeltokth_best_model_path(#1799)Re-Enable Logger’s
ImportErrors (#1938)Changed the default value of the Trainer argument
weights_summaryfromfulltotop(#2029)Raise an error when lightning replaces an existing sampler (#2020)
Enabled
prepare_datafrom correct processes - clarify local vs global rank (#2166)Remove explicit flush from tensorboard logger (#2126)
Changed epoch indexing from 1 instead of 0 (#2206)
[0.8.0] - Deprecated¶
Deprecated flags: (#2213)
overfit_pct in favour of overfit_batches
val_percent_check in favour of limit_val_batches
test_percent_check in favour of limit_test_batches (see the sketch after this list)
Deprecated ModelCheckpoint's attributes best and kth_best_model (#1799)
Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Deprecated Trainer proc_rank in favour of global_rank (#2166, #2269)
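A sketch of the replacement flags, with fractions chosen purely for illustration:

```python
from pytorch_lightning import Trainer

# overfit_pct, val_percent_check and test_percent_check are deprecated (#2213);
# the *_batches flags accept either a fraction of each dataset or an absolute batch count.
trainer = Trainer(overfit_batches=0.01, limit_val_batches=0.25, limit_test_batches=0.25)
```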
[0.8.0] - Removed¶
Removed unintended Trainer argument
progress_bar_callback, the callback should be passed in byTrainer(callbacks=[...])instead (#1855)Removed obsolete
self._devicein Trainer (#1849)Removed deprecated API (#2073)
Packages:
pytorch_lightning.pt_overrides,pytorch_lightning.root_moduleModules:
pytorch_lightning.logging.comet_logger,pytorch_lightning.logging.mlflow_logger,pytorch_lightning.logging.test_tube_logger,pytorch_lightning.overrides.override_data_parallel,pytorch_lightning.core.model_saving,pytorch_lightning.core.root_moduleTrainer arguments:
add_row_log_interval,default_save_path,gradient_clip,nb_gpu_nodes,max_nb_epochs,min_nb_epochs,nb_sanity_val_stepsTrainer attributes:
nb_gpu_nodes,num_gpu_nodes,gradient_clip,max_nb_epochs,min_nb_epochs,nb_sanity_val_steps,default_save_path,tng_tqdm_dic
[0.8.0] - Fixed¶
Run graceful training teardown on interpreter exit (#1631)
Fixed user warning when apex was used together with learning rate schedulers (#1873)
Fixed multiple calls of
EarlyStoppingcallback (#1863)Fixed an issue with
Trainer.from_argparse_argswhen passing in unknown Trainer args (#1932)Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
Fixed root node resolution for SLURM cluster with dash in host name (#1954)
Fixed
LearningRateLoggerin multi-scheduler setting (#1944)Fixed test configuration check and testing (#1804)
Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
Fixed
save_weights_onlyin ModelCheckpoint (#1780)Allow use of same
WandbLoggerinstance for multiple training loops (#2055)Fixed an issue with
_auto_collect_argumentscollecting local variables that are not constructor arguments and not working for signatures that have the instance not namedself(#2048)Fixed mistake in parameters’ grad norm tracking (#2012)
Fixed CPU and hanging GPU crash (#2118)
Fixed an issue with the model summary and
example_input_arraydepending on a specific ordering of the submodules in a LightningModule (#1773)Fixed Tpu logging (#2230)
[0.7.6] - 2020-05-16¶
[0.7.6] - Added¶
Added callback for logging learning rates (#1498)
Added transfer learning example (for a binary classification task in computer vision) (#1564)
Added type hints in
Trainer.fit()andTrainer.test()to reflect that also a list of dataloaders can be passed in (#1723).Added auto scaling of batch size (#1638)
The progress bar metrics now also get updated in
training_epoch_end(#1724)Enable
NeptuneLoggerto work withdistributed_backend=ddp(#1753)Added option to provide seed to random generators to ensure reproducibility (#1572)
Added override for hparams in
load_from_ckpt(#1797)Added support multi-node distributed execution under
torchelastic(#1811, #1818)Added dummy logger for internally disabling logging for some features (#1836)
[0.7.6] - Changed¶
Enable
non-blockingfor device transfers to GPU (#1843)Replace mata_tags.csv with hparams.yaml (#1271)
Reduction when
batch_size < num_gpus(#1609)Updated LightningTemplateModel to look more like Colab example (#1577)
Don’t convert
namedtupletotuplewhen transferring the batch to target device (#1589)Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
Args should come after the last positional argument (#1807)
Made ddp the default if no backend specified with multiple GPUs (#1789)
[0.7.6] - Deprecated¶
Deprecated
tags_csvin favor ofhparams_file(#1271)
[0.7.6] - Fixed¶
Fixed broken link in PR template (#1675)
Fixed ModelCheckpoint not None checking filepath (#1654)
Trainer now calls
on_load_checkpoint()when resuming from a checkpoint (#1666)Fixed sampler logic for ddp with iterable dataset (#1734)
Fixed
_reset_eval_dataloader()for IterableDataset (#1560)Fixed Horovod distributed backend to set the
root_gpuproperty (#1669)Fixed wandb logger
global_stepaffects other loggers (#1492)Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
Fixed a bug in Trainer that prepended the checkpoint path with
version_when it shouldn’t (#1748)Fixed lr key name in case of param groups in LearningRateLogger (#1719)
Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
Fixed num processes wasn’t being set properly and auto sampler was ddp failing (#1819)
Fixed bugs in semantic segmentation example (#1824)
Fixed saving native AMP scaler state (#1777)
Fixed native amp + ddp (#1788)
Fixed
hparamlogging with metrics (#1647)
[0.7.5] - 2020-04-27¶
[0.7.5] - Changed¶
Allow logging of metrics together with
hparams(#1630)
[0.7.5] - Removed¶
Removed Warning from trainer loop (#1634)
[0.7.5] - Fixed¶
[0.7.4] - 2020-04-26¶
[0.7.4] - Added¶
Added flag replace_sampler_ddp to manually disable sampler replacement in DDP (#1513)
Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems (see the sketch after this list)
Added learning rate finder (#1347)
Added support for DDP mode in clusters without SLURM (#1387)
Added test_dataloaders parameter to Trainer.test() (#1434)
Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True (#1475)
Added speed parity tests (max 1 sec difference per epoch) (#1482)
Added ddp_cpu backend for testing ddp without GPUs (#1158)
Added Horovod support as a distributed backend Trainer(distributed_backend='horovod') (#1529)
Added support for 8 core distributed training on Kaggle TPU's (#1568)
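A sketch combining two of the flags introduced here, assuming a machine whose GPUs run in exclusive mode:

```python
from pytorch_lightning import Trainer

# auto_select_gpus picks free devices instead of failing on busy ones;
# terminate_on_nan stops training as soon as the loss or a weight becomes NaN/inf (#1475).
trainer = Trainer(gpus=2, auto_select_gpus=True, terminate_on_nan=True)
```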
[0.7.4] - Changed¶
Changed the default behaviour to no longer include a NaN check with each training iteration (#1475)
Decoupled the progress bar from the trainer; it is a callback now and can be customized or even be replaced entirely (#1450).
Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
Updated semantic segmentation example with custom U-Net and logging (#1371)
Disabled val and test shuffling (#1600)
[0.7.4] - Deprecated¶
Deprecated
training_tqdm_dictin favor ofprogress_bar_dict(#1450).
[0.7.4] - Removed¶
Removed
test_dataloadersparameter fromTrainer.fit()(#1434)
[0.7.4] - Fixed¶
Added the possibility to pass nested metrics dictionaries to loggers (#1582)
Fixed memory leak from opt return (#1528)
Fixed saving checkpoint before deleting old ones (#1453)
Fixed loggers - flushing last logged metrics even before continue, e.g.
trainer.test()results (#1459)Fixed optimizer configuration when
configure_optimizersreturns dict withoutlr_scheduler(#1443)Fixed
LightningModule- mixing hparams and arguments inLightningModule.__init__()crashes load_from_checkpoint() (#1505)Added a missing call to the
on_before_zero_gradmodel hook (#1493).Allow use of sweeps with
WandbLogger(#1512)Fixed a bug that caused the
callbacksTrainer argument to reference a global variable (#1534).Fixed a bug that set all boolean CLI arguments from
Trainer.add_argparse_argsalways to True (#1571)Fixed do not copy the batch when training on a single GPU (#1576, #1579)
Fixed soft checkpoint removing on DDP (#1408)
Fixed automatic parser bug (#1585)
Fixed bool conversion from string (#1606)
[0.7.3] - 2020-04-09¶
[0.7.3] - Added¶
Added
rank_zero_warnfor warning only in rank 0 (#1428)
[0.7.3] - Fixed¶
[0.7.2] - 2020-04-07¶
[0.7.2] - Added¶
Added same step loggers’ metrics aggregation (#1278)
Added parity test between a vanilla MNIST model and lightning model (#1284)
Added parity test between a vanilla RNN model and lightning model (#1351)
Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
Added support for hierarchical
dict(#1152)Added
TrainsLoggerclass (#1122)Added type hints to
pytorch_lightning.core(#946)Added support for
IterableDatasetin validation and testing (#1104)Added support for non-primitive types in
hparamsforTensorboardLogger(#1130)Added a check that stops the training when loss or weights contain
NaNorinfvalues. (#1097)Added support for
IterableDatasetwhenval_check_interval=1.0(default), this will trigger validation at the end of each epoch. (#1283)Added
summarymethod to Profilers. (#1259)Added informative errors if user defined dataloader has zero length (#1280)
Added testing for python 3.8 (#915)
Added model configuration checking (#1199)
Added support for optimizer frequencies through
LightningModule.configure_optimizers()(#1269)Added option to run without an optimizer by returning
Nonefromconfigure_optimizers. (#1279)Added a warning when the number of data loader workers is small. (#1378)
[0.7.2] - Changed¶
Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
Changed progress_bar_refresh_rate trainer flag to disable progress bar when set to 0. (#1108)
Enhanced load_from_checkpoint to also forward params to the model (#1307)
Updated references to self.forward() to instead use the __call__ interface. (#1211)
Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
Allow to upload models on W&B (#1339)
On DP and DDP2 unsqueeze is automated now (#1319)
Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
Did not interfere with a default sampler (#1318)
Remove default Adam optimizer (#1317)
Give warnings for unimplemented required lightning methods (#1317)
Made
evaluatemethod private >>Trainer._evaluate(...). (#1260)Simplify the PL examples structure (shallower and more readable) (#1247)
Changed min max gpu memory to be on their own plots (#1358)
Remove
.itemwhich causes sync issues (#1254)Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
Change default logger to dedicated one (#1064)
[0.7.2] - Deprecated¶
[0.7.2] - Removed¶
[0.7.2] - Fixed¶
Fixed
model_checkpointwhen saving all models (#1359)Trainer.add_argparse_argsclassmethod fixed. Now it adds a type for the arguments (#1147)Fixed bug related to type checking of
ReduceLROnPlateaulr schedulers(#1126)Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
Fixed a bug that created an extra dataloader with active
reload_dataloaders_every_epoch(#1196)Fixed all warnings and errors in the docs build process (#1191)
Fixed an issue where
val_percent_check=0would not disable validation (#1251)Fixed average of incomplete
TensorRunningMean(#1309)Fixed
WandbLogger.watchwithwandb.init()(#1311)Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
Fixed a bug that would cause
trainer.test()to run on the validation set when overloadingvalidation_epoch_endandtest_end(#1353)Fixed
WandbLogger.watch- use of the watch method without importingwandb(#1311)Fixed
WandbLoggerto be used with ‘ddp’ - allow reinits in sub-processes (#1149, #1360)Made
training_epoch_endbehave likevalidation_epoch_end(#1357)Fixed
fast_dev_runrunning validation twice (#1365)Fixed pickle error from quick patch
__code__(#1352)Fixed checkpointing interval (#1272)
Fixed validation and training loops run the partial dataset (#1192)
Fixed running
on_validation_endonly on main process in DDP (#1125)Fixed
load_spawn_weightsonly in proc rank 0 (#1385)Fixes using deprecated
use_ampattribute (#1145)Fixed Tensorboard logger error: lightning_logs directory not exists in multi-node DDP on nodes with rank != 0 (#1377)
Fixed
Unimplemented backend XLAerror on TPU (#1387)
[0.7.1] - 2020-03-07¶
[0.7.1] - Fixed¶
Fixes
printissues anddata_loader(#1080)
[0.7.0] - 2020-03-06¶
[0.7.0] - Added¶
Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
Added progress_bar_refresh_rate=50 flag for trainer to throttle the refresh rate on notebooks (#926)
Updated governance docs
Added a check to ensure that the metric used for early stopping exists before training commences (#542)
Added optimizer_idx argument to the backward hook (#733)
Added entity argument to WandbLogger to be passed to wandb.init (#783)
Added a tool for profiling training runs (#782)
Improved flexibility for naming of TensorBoard logs: can now set version to a str to just save to that directory, and use name='' to prevent the experiment-name directory (#804)
Added option to specify step key when logging metrics (#808)
Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759); see the sketch after this list
Added Tensor Processing Unit (TPU) support (#868)
Split callbacks into multiple files (#849)
Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
Added support for step-based learning rate scheduling (#941)
Added support for logging hparams as dict (#1029)
Checkpoint and early stopping now work without val. step (#1041)
Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
Added type hints for function arguments (#912)
Added TPU gradient clipping (#963)
Added max/min number of steps in Trainer (#728)
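A minimal sketch of two of the 0.7.0 additions above: passing multiple loggers to the Trainer (#903) and handing dataloaders straight to Trainer.fit() (#759). Here model, train_loader and val_loader are assumed placeholders, and the exact keyword names may differ slightly between releases.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# Any iterable of loggers is accepted, e.g. a list or tuple.
loggers = [
    TensorBoardLogger("logs/", name="run_a"),
    TensorBoardLogger("logs/", name="run_b"),
]
trainer = pl.Trainer(logger=loggers, max_epochs=1)

# Dataloaders can now be passed to fit() instead of being defined on the model.
trainer.fit(model, train_dataloader=train_loader, val_dataloader=val_loader)
```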
[0.7.0] - Changed¶
Improved NeptuneLogger by adding a close_after_fit argument to allow logging after training (#908)
Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
Changed checkpoint path parameter from filepath to dirpath (#1016)
Froze the model's hparams as a Namespace property (#1029)
Dropped logging config in package init (#1015)
Renamed model steps (#1051): training_end >> training_epoch_end, validation_end >> validation_epoch_end, test_end >> test_epoch_end; see the sketch after this list
Refactored dataloading, now supports infinite dataloaders (#955)
Create single file in TensorBoardLogger (#777)
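The #1051 renames map directly onto the epoch-level hooks of a LightningModule; a sketch with placeholder bodies:

```python
import pytorch_lightning as pl


class RenamedHooks(pl.LightningModule):
    def training_epoch_end(self, outputs):    # previously: training_end
        ...

    def validation_epoch_end(self, outputs):  # previously: validation_end
        ...

    def test_epoch_end(self, outputs):        # previously: test_end
        ...
```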
[0.7.0] - Deprecated¶
[0.7.0] - Removed¶
[0.7.0] - Fixed¶
Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
Fixed a bug where the model checkpointer didn’t write to the same directory as the logger (#771)
Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
Fixed port collision on DDP (#1010)
Fixed/tested pass overrides (#918)
Fixed comet logger to log after train (#892)
Removed deprecated args to the learning rate step function (#890)
[0.6.0] - 2020-01-21¶
[0.6.0] - Added¶
Added support for resuming from a specific checkpoint via the resume_from_checkpoint argument (#516)
Added support for the ReduceLROnPlateau scheduler (#320)
Added support for Apex mode O2 in conjunction with Data Parallel (#493)
Added an option (save_top_k) to save the top k models in the ModelCheckpoint class (#128); see the sketch after this list
Added on_train_start and on_train_end hooks to ModelHooks (#598)
Added TensorBoardLogger (#607)
Added support for weight summary of models with multiple inputs (#543)
Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
Added option to disable validation by setting val_percent_check=0 (#649)
Added NeptuneLogger class (#648)
Added WandbLogger class (#627)
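A sketch of two of the 0.6.0 additions above: keeping only the best k checkpoints (#128) and resuming from a specific checkpoint (#516). The paths are placeholders, and the argument names follow these entries, so they may differ in later releases.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the 3 best checkpoints according to the monitored metric.
checkpoint_callback = ModelCheckpoint(filepath="checkpoints/", save_top_k=3)

trainer = pl.Trainer(
    checkpoint_callback=checkpoint_callback,
    resume_from_checkpoint="checkpoints/epoch_4.ckpt",  # placeholder path
)
```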
[0.6.0] - Changed¶
Changed the default progress bar to print to stdout instead of stderr (#531)
Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
Renamed total_batch_nb to total_batches, nb_val_batches to num_val_batches, nb_training_batches to num_training_batches, max_nb_epochs to max_epochs, min_nb_epochs to min_epochs, and nb_test_batches to num_test_batches (#567)
Changed gradient logging to use parameter names instead of indexes (#660)
Changed the default logger to TensorBoardLogger (#609)
Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
[0.6.0] - Deprecated¶
[0.6.0] - Removed¶
Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)
[0.6.0] - Fixed¶
Fixed a bug which occurred when using Adagrad with CUDA (#554)
Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
Fixed support for PyTorch 1.1.0 (#552)
Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
Fixed a bug where batch ‘segments’ would remain on the GPU when using truncated_bptt > 1 (#532)
Fixed a bug when using IterableDataset (#547)
Fixed a bug where .item was called on non-tensor objects (#602)
Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
Fixed a bug where early stopping would begin two epochs early (#617)
Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
Fixed a bug when batches did not have a .copy method (#701)
Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
Fixed a bug where on_train_end was not called when early stopping (#723)
[0.5.3] - 2019-11-06¶
[0.5.3] - Added¶
Added option to disable the default logger, checkpointer, and early stopping by passing logger=False, checkpoint_callback=False and early_stop_callback=False respectively; see the sketch after this list
Added CometLogger for use with Comet.ml
Added val_check_interval argument to Trainer, allowing validation to be performed every given number of batches
Added functionality to save and load hyperparameters using the standard checkpoint mechanism
Added call to torch.cuda.empty_cache before training starts
Added option for the user to override the call to backward
Added support for truncated backprop through time via the truncated_bptt_steps argument in Trainer
Added option to operate on all outputs from training_step in DDP2
Added a hook for modifying DDP init
Added a hook for modifying Apex
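An illustrative Trainer configuration combining several of the 0.5.3 options above (disabling the defaults, batch-based validation checks, and truncated BPTT); the values are arbitrary and the argument names follow these entries.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    logger=False,               # disable the default logger
    checkpoint_callback=False,  # disable the default checkpointer
    early_stop_callback=False,  # disable default early stopping
    val_check_interval=100,     # run validation every 100 training batches
    truncated_bptt_steps=2,     # enable truncated backprop through time
)
```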
[0.5.3] - Changed¶
Changed experiment version to be padded with zeros (e.g. /dir/version_9 becomes /dir/version_0009)
Changed callback metrics to include any metrics given in logs or progress bar
Changed the default for save_best_only in ModelCheckpoint to True
Added tng_data_loader for backwards compatibility
Renamed MLFlowLogger.client to MLFlowLogger.experiment for consistency
Moved global_step increment to happen after the batch has been processed
Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
Changed progress bar functionality to add multiple progress bars for train/val/test
Changed calls to print to use logging instead
[0.5.3] - Deprecated¶
Deprecated tng_dataloader
[0.5.3] - Fixed¶
Fixed an issue where the number of batches was off by one during training
Fixed a bug that occurred when setting a checkpoint callback and early_stop_callback=False
Fixed an error when importing CometLogger
Fixed a bug where the gpus argument had some unexpected behaviour
Fixed a bug where the computed total number of batches was sometimes incorrect
Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
Fixed a bug when using the log_gpu_memory='min_max' option in Trainer
Fixed a bug where checkpointing would sometimes erase the current directory
[0.5.2] - 2019-10-10¶
[0.5.2] - Added¶
Added weights_summary argument to Trainer, which can be set to full (full summary), top (just top-level modules) or other
Added tags argument to MLFlowLogger
[0.5.2] - Changed¶
Changed default for amp_level to O1
[0.5.2] - Removed¶
Removed the print_weights_summary argument from Trainer
[0.5.2] - Fixed¶
Fixed a bug where logs were not written properly
Fixed a bug where logger.finalize wasn’t called after training is complete
Fixed callback metric errors in DDP
Fixed a bug where TestTubeLogger didn’t log to the correct directory
[0.5.1] - 2019-10-05¶
[0.5.1] - Added¶
Added the LightningLoggerBase class for experiment loggers
Added MLFlowLogger for logging with mlflow
Added TestTubeLogger for logging with test_tube
Added a different implementation of DDP (distributed_backend='ddp2') where every node has one model using all GPUs
Added support for optimisers which require a closure (e.g. LBFGS)
Added automatic MASTER_PORT default for DDP when not set manually
Added new GPU memory logging options 'min_max' (log only the min/max utilization) and 'all' (log all the GPU memory); see the sketch after this list
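A sketch of the ddp2 backend and the new GPU memory logging options from the entries above; the gpus count is arbitrary.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp2",  # one model per node, spanning all its GPUs
    log_gpu_memory="min_max",    # or "all" to log every GPU's memory
)
```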
[0.5.1] - Changed¶
Changed schedulers to always be called with the current epoch
Changed test_tube to an optional dependency
Changed data loaders to internally use a getter instead of a python property
Disabled auto GPU loading when restoring weights to prevent out of memory errors
Changed logging, early stopping and checkpointing to occur by default
[0.5.1] - Fixed¶
Fixed a bug with samplers that do not specify set_epoch
Fixed a bug when using the MLFlowLogger with unsupported data types; this will now raise a warning
Fixed a bug where gradient norms were always zero when using track_grad_norm
Fixed a bug which caused a crash when logging memory
[0.5.0] - 2019-09-26¶
[0.5.0] - Changed¶
Changed data_batch argument to batch throughout (see the sketch after this list)
Changed batch_i argument to batch_idx throughout
Changed tng_dataloader method to train_dataloader
Changed on_tng_metrics method to on_training_metrics
Changed gradient_clip argument to gradient_clip_val
Changed add_log_row_interval to row_log_interval
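The 0.5.0 renames show up directly in user code; a sketch with placeholder bodies:

```python
import pytorch_lightning as pl


class RenamedArguments(pl.LightningModule):
    def training_step(self, batch, batch_idx):  # previously: data_batch, batch_i
        ...

    def train_dataloader(self):                 # previously: tng_dataloader
        ...
```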
[0.5.0] - Fixed¶
Fixed a bug with tensorboard logging in multi-gpu setup
[0.4.9] - 2019-09-16¶
[0.4.9] - Added¶
Added the flag log_gpu_memory to Trainer to deactivate logging of GPU memory utilization
Added SLURM resubmit functionality (port from test-tube)
Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
Added option to use a single gpu per node with DistributedDataParallel
[0.4.9] - Changed¶
Changed functionality of validation_end and test_end with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
Changed gpu API to take integers as well (e.g. gpus=2 instead of gpus=[0, 1]); see the sketch after this list
All models are now loaded onto the CPU to avoid device and out-of-memory issues in PyTorch
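A sketch of the integer form of the gpus argument described above; both Trainers request the same two devices.

```python
import pytorch_lightning as pl

trainer_by_count = pl.Trainer(gpus=2)       # request 2 GPUs by count
trainer_by_index = pl.Trainer(gpus=[0, 1])  # request the same GPUs by index
```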
[0.4.9] - Fixed¶
Fixed a bug where data types that implement .to but not .cuda would not be properly moved onto the GPU
Fixed a bug where data would not be re-shuffled every epoch when using a DistributedSampler
[0.4.8] - 2019-08-31¶
[0.4.8] - Added¶
Added test_step and test_end methods, used when Trainer.test is called
Added GradientAccumulationScheduler callback which can be used to schedule changes to the number of accumulation batches (see the sketch after this list)
Added option to skip the validation sanity check by setting nb_sanity_val_steps = 0
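A minimal sketch of the GradientAccumulationScheduler callback added here; the scheduling dict (epoch -> number of accumulated batches) is illustrative, and how the callback is attached to the Trainer has varied between versions.

```python
from pytorch_lightning.callbacks import GradientAccumulationScheduler

# Accumulate 4 batches from epoch 5 onward and 8 batches from epoch 10 onward.
accumulator = GradientAccumulationScheduler(scheduling={5: 4, 10: 8})
```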
[0.4.8] - Fixed¶
Fixed a bug when setting nb_sanity_val_steps = 0
[0.4.7] - 2019-08-24¶
[0.4.7] - Changed¶
Changed the default val_check_interval to 1.0
Changed defaults for nb_val_batches, nb_tng_batches and nb_test_batches to 0
[0.4.7] - Fixed¶
Fixed a bug where the full validation set was used despite setting val_percent_check
Fixed a bug where an Exception was thrown when using a data set containing a single batch
Fixed a bug where an Exception was thrown if no val_dataloader was given
Fixed a bug where tuples were not properly transferred to the GPU
Fixed a bug where data of a non-standard type was not properly handled by the trainer
Fixed a bug when loading data as a tuple
Fixed a bug where an AttributeError could be suppressed by the Trainer
[0.4.6] - 2019-08-15¶
[0.4.6] - Added¶
Added support for data to be given as a dict or list with a single gpu
Added support for configure_optimizers to return a single optimizer, two lists (optimizers and schedulers), or a single list; see the sketch after this list
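A sketch of the three return forms configure_optimizers now accepts, written as plain functions for brevity; in practice each body would be the configure_optimizers method of a LightningModule, and params is a placeholder for the model parameters.

```python
import torch


def single_optimizer(params):
    return torch.optim.Adam(params)            # a single optimizer


def single_list(params):
    return [torch.optim.Adam(params)]          # a single list of optimizers


def optimizers_and_schedulers(params):
    opt = torch.optim.SGD(params, lr=0.01)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10)
    return [opt], [sched]                      # two lists: optimizers, schedulers
```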
[0.4.6] - Fixed¶
Fixed a bug where returning just an optimizer list (i.e. without schedulers) from configure_optimizers would throw an Exception
[0.4.5] - 2019-08-13¶
[0.4.5] - Added¶
Added optimizer_step method that can be overridden to change the standard optimizer behaviour; see the sketch below
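An illustrative override of the new optimizer_step hook; the exact signature has changed across releases, so treat the arguments here as placeholders.

```python
import pytorch_lightning as pl


class CustomOptimizerStep(pl.LightningModule):
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       *args, **kwargs):
        # For example, scale the learning rate during a warm-up period
        # before performing the standard step.
        optimizer.step()
        optimizer.zero_grad()
```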
[0.4.4] - 2019-08-12¶
[0.4.4] - Added¶
Added support for multiple validation dataloaders
Added support for the latest test-tube logger (optimised for torch==1.2.0)
[0.4.4] - Changed¶
validation_step and val_dataloader are now optional
lr_scheduler is now activated after epoch
[0.4.4] - Fixed¶
Fixed a bug where a warning would show when using lr_scheduler in torch>1.1.0
Fixed a bug where an Exception would be thrown if using torch.DistributedDataParallel without using a DistributedSampler; this now throws a Warning instead
[0.4.3] - 2019-08-10¶
[0.4.3] - Fixed¶
Fixed a bug where accumulate gradients would scale the loss incorrectly
[0.4.2] - 2019-08-08¶
[0.4.2] - Changed¶
Changed install requirement to torch==1.2.0
[0.4.1] - 2019-08-08¶
[0.4.1] - Changed¶
Changed install requirement to torch==1.1.0
[0.4.0] - 2019-08-08¶
[0.4.0] - Added¶
Added 16-bit support for a single GPU
Added support for training continuation (preserves epoch, global step etc.)
[0.4.0] - Changed¶
Changed training_step and validation_step; outputs will no longer be automatically reduced
[0.4.0] - Removed¶
Removed need for Experiment object in Trainer
[0.4.0] - Fixed¶
Fixed issues with reducing outputs from generative models (such as images and text)
[0.3.6] - 2019-07-25¶
[0.3.6] - Added¶
Added a decorator to do lazy data loading internally
[0.3.6] - Fixed¶
Fixed a bug where the Experiment object was not process safe, potentially causing logs to be overwritten