accelerators¶
The Accelerator base class for Lightning PyTorch.
Accelerator for CPU devices.
Accelerator for NVIDIA CUDA devices.
Accelerator for HPU devices.
Accelerator for IPU devices.
Accelerator for TPU devices.
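In practice the accelerator is usually chosen through the Trainer rather than instantiated directly. A minimal sketch, assuming pytorch_lightning is installed:

```python
from pytorch_lightning import Trainer

# Let Lightning pick whatever hardware is available ("cpu", "gpu", "tpu", ...),
# or request a specific accelerator and device count explicitly.
trainer = Trainer(accelerator="auto", devices=1)
# trainer = Trainer(accelerator="gpu", devices=2)
```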
callbacks¶
Finetune a backbone model based on a user-defined learning rate scheduling.
This class implements the base logic for writing your own finetuning callback.
Base class to implement how the predictions should be stored.
The BatchSizeFinder callback tries to find the largest batch size for a given model that does not result in an out-of-memory (OOM) error.
Abstract base class used to build new callbacks.
Automatically monitors and logs device stats during the training, validation, and testing stages.
Monitor a metric and stop training when it stops improving.
Change gradient accumulation factor according to scheduling.
Create a simple callback on the fly using lambda functions.
The LearningRateFinder callback enables the user to do a range test of good initial learning rates, to reduce the amount of guesswork in picking a good starting learning rate.
Automatically monitors and logs the learning rate for learning rate schedulers during training.
Save the model periodically by monitoring a quantity.
Model pruning callback, using PyTorch's prune utilities.
Generates a summary of all layers in a LightningModule.
The base class for progress bars in Lightning.
Quantization allows speeding up inference and decreasing memory requirements by performing computations and storing tensors at lower bitwidths (such as INT8 or FLOAT16) than floating-point precision.
Generates a summary of all layers in a LightningModule with rich text formatting.
Create a progress bar with rich text formatting.
Implements the Stochastic Weight Averaging (SWA) callback to average a model.
The Timer callback tracks the time spent in the training, validation, and test loops and interrupts the Trainer if the given time limit for the training loop is reached.
This is the default progress bar used by Lightning.
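Most of these callbacks are simply passed to the Trainer. A short sketch, assuming the LightningModule logs a "val_loss" metric (a placeholder name):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# Stop training when the monitored metric stops improving, and keep only the
# best-scoring checkpoint.
early_stopping = EarlyStopping(monitor="val_loss", mode="min", patience=3)
checkpointing = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)

trainer = Trainer(max_epochs=20, callbacks=[early_stopping, checkpointing])
```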
cli¶
Implementation of a configurable command line tool for pytorch-lightning.
Extension of jsonargparse's ArgumentParser for pytorch-lightning.
Saves a LightningCLI config to the log_dir when training starts.
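A hedged sketch of a LightningCLI entry point; MyModel and MyDataModule are hypothetical stand-ins for your own LightningModule and LightningDataModule:

```python
# train.py (hypothetical entry point)
from pytorch_lightning.cli import LightningCLI

from my_project import MyDataModule, MyModel  # assumed user code


def main():
    # Exposes Trainer, model and datamodule arguments on the command line,
    # e.g. `python train.py fit --trainer.max_epochs=10`.
    LightningCLI(MyModel, MyDataModule)


if __name__ == "__main__":
    main()
```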
core¶
Hooks to be used with checkpointing.
Hooks to be used for data-related operations.
Hooks to be used in a LightningModule.
A DataModule standardizes the training, validation, and test splits, data preparation, and transforms.
This class wraps the user's optimizers and correctly handles the backward and optimizer_step logic across accelerators, AMP, and accumulate_grad_batches.
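As an illustration of the DataModule hooks above, here is a minimal sketch in which random tensors stand in for a real dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import LightningDataModule


class RandomDataModule(LightningDataModule):
    """Toy DataModule: random tensors stand in for a real dataset."""

    def setup(self, stage=None):
        # Called on every process; build the splits here.
        self.train_set = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        self.val_set = TensorDataset(torch.randn(16, 32), torch.randint(0, 2, (16,)))

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=8)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=8)
```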
loggers¶
Abstract base class used to build new loggers.
Comet Logger
CSV Logger
MLflow Logger
Neptune Logger
TensorBoard Logger
Weights and Biases Logger
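Loggers are handed to the Trainer, which also accepts a list in order to log to several backends at once. A small sketch using the built-in TensorBoard and CSV loggers:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger, TensorBoardLogger

loggers = [
    TensorBoardLogger(save_dir="lightning_logs", name="example"),
    CSVLogger(save_dir="lightning_logs", name="example"),
]
trainer = Trainer(logger=loggers)
```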
loops¶
Base Classes¶
Base class to loop over all dataloaders.
Basic Loops interface.
Training¶
Runs over a single batch of data.
Runs over all batches in a dataloader (one epoch).
This loop iterates over the epochs to run the training.
A special loop implementing what is known in Lightning as Manual Optimization, where the optimization happens entirely in the training_step and the user is therefore responsible for back-propagating gradients and making calls to the optimizers.
Runs over a sequence of optimizers.
Validation and Testing¶
This is the loop performing the evaluation.
Loops over all dataloaders for evaluation.
Prediction¶
Loop performing prediction on arbitrary sequentially used dataloaders.
Loop to run over dataloaders for prediction.
plugins¶
precision¶
Precision plugin for ColossalAI integration.
Precision plugin for DeepSpeed integration.
Plugin for training with double (torch.float64) precision.
Native AMP for Fully Sharded Training.
Native AMP for Fully Sharded Native Training.
Plugin that enables bfloat16/half support on HPUs.
Precision plugin for IPU integration.
Plugin for Automatic Mixed Precision (AMP) training with torch.autocast.
Base class for all plugins handling the precision-specific parts of the training.
Native AMP for Sharded Training.
Plugin that enables bfloat16 on TPUs.
Precision plugin for TPU integration.
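The precision plugins are normally selected indirectly through the Trainer's `precision` flag; a brief sketch:

```python
from pytorch_lightning import Trainer

# 32-bit is the default; 16-bit mixed precision or bfloat16 selects the matching
# precision plugin for the chosen accelerator and strategy.
trainer = Trainer(accelerator="gpu", devices=1, precision=16)
# trainer = Trainer(accelerator="gpu", devices=1, precision="bf16")
```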
environments¶
Specification of a cluster environment.
Environment for distributed training using the PyTorchJob operator from Kubeflow.
The default environment used by Lightning for a single node or free cluster (not managed).
An environment for running on clusters managed by the LSF resource manager.
Cluster environment for training on a cluster managed by SLURM.
Environment for fault-tolerant and elastic training with torchelastic.
Cluster environment for training on a TPU Pod with the PyTorch/XLA library.
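Cluster environments are normally detected automatically; passing one through `plugins` lets you adjust its behaviour. A sketch using the SLURM environment (the device and node counts are placeholders):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Disable SLURM's automatic job requeueing while keeping the rest of the
# auto-detected cluster configuration.
trainer = Trainer(
    accelerator="gpu",
    devices=2,
    num_nodes=2,
    plugins=[SLURMEnvironment(auto_requeue=False)],
)
```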
io¶
Interface to save/load checkpoints as they are saved through the Strategy.
CheckpointIO to save checkpoints for HPU training strategies.
CheckpointIO that utilizes torch.save and torch.load to save and load checkpoints respectively, common for most use cases.
CheckpointIO that utilizes xm.save to save checkpoints for TPU training strategies.
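A custom checkpoint I/O plugin is plugged in the same way. A minimal sketch that subclasses the torch-based implementation and only adds a log line (the class name is hypothetical):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.io import TorchCheckpointIO


class VerboseCheckpointIO(TorchCheckpointIO):
    """Illustrative CheckpointIO that announces every save before delegating."""

    def save_checkpoint(self, checkpoint, path, storage_options=None):
        print(f"writing checkpoint to {path}")
        super().save_checkpoint(checkpoint, path, storage_options=storage_options)


trainer = Trainer(plugins=[VerboseCheckpointIO()])
```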
others¶
Abstract base class for creating plugins that wrap layers of a model with synchronization logic for multiprocessing.
A plugin that wraps all batch normalization layers of a model with synchronization logic for multiprocessing.
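The synchronized batch-norm plugin is typically enabled through a Trainer flag when training with a distributed strategy:

```python
from pytorch_lightning import Trainer

# Replace BatchNorm layers with their synchronized counterpart across processes.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", sync_batchnorm=True)
```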
profiler¶
This profiler uses Python's cProfile to record more detailed information about the time spent in each function call during a given action.
This class should be used when you don't want the (small) overhead of profiling.
If you wish to write a custom profiler, you should inherit from this class.
This profiler uses PyTorch's Autograd Profiler and lets you inspect the cost of different operators inside your model, both on the CPU and GPU.
This profiler simply records the duration of actions (in seconds) and reports the mean duration of each action and the total time spent over the entire training run.
The XLA Profiler will help you debug and optimize training workload performance for your models using Cloud TPU performance tools.
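The built-in profilers can be selected by name on the Trainer; a brief sketch:

```python
from pytorch_lightning import Trainer

# "simple" reports mean/total durations per action, "advanced" uses cProfile,
# and "pytorch" uses the autograd profiler.
trainer = Trainer(profiler="simple")
```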
trainer¶
Customize every aspect of training via flags.
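A hedged sketch of a Trainer configured through a few of its commonly used flags (the values are illustrative only):

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=10,              # stop after 10 epochs
    accelerator="auto",         # pick available hardware
    devices=1,
    gradient_clip_val=0.5,      # clip gradients by norm
    accumulate_grad_batches=2,  # effective batch size = 2x dataloader batch size
    log_every_n_steps=50,
)
```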
strategies¶
Strategy for training using the Bagua library, with advanced distributed training algorithms and system optimizations.
ColossalAI strategy.
Strategy for Fully Sharded Data Parallel provided by torch.distributed.
Plugin for Fully Sharded Data Parallel provided by FairScale.
Optimizer and gradient sharded training provided by FairScale.
Optimizer sharded training provided by FairScale.
Spawns processes using the torch.multiprocessing.spawn method and joins processes after training finishes.
Strategy for multi-process single-device training on one or multiple nodes.
Implements data-parallel training in a single process, i.e., the model gets replicated to each device and each gets a split of the data.
Provides capabilities to run training using the DeepSpeed library, with training optimizations for large billion-parameter models.
Provides capabilities to train using the Hivemind library, training collaboratively across the internet with unreliable machines.
Strategy for distributed training on multiple HPU devices.
Plugin for training on IPU devices.
Plugin for training with multiple processes in parallel.
Strategy that handles communication on a single device.
Strategy for training on a single HPU device.
Strategy for training on a single TPU device.
Base class for all strategies that change the behaviour of the training, validation, and test loop.
Strategy for training on multiple TPU devices using the torch_xla.distributed.xla_multiprocessing.spawn method.
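Strategies are usually selected by their registered name on the Trainer; a brief sketch (device counts are placeholders):

```python
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")
# e.g. sharded training for very large models:
# trainer = Trainer(accelerator="gpu", devices=4, strategy="deepspeed_stage_2", precision=16)
```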
tuner¶
Tuner class to tune your model.
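In the Lightning release this index appears to document, tuning is driven through Trainer flags and `trainer.tune()`; a hedged sketch, where `model` stands in for your own LightningModule:

```python
from pytorch_lightning import Trainer

trainer = Trainer(auto_lr_find=True, auto_scale_batch_size="power")
# trainer.tune(model)  # runs the learning-rate finder and the batch-size scaler
```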
utilities¶
Utilities used for collections.
Utilities for argument parsing within Lightning components.
Utilities related to data saving/loading.
Utilities that can be used with DeepSpeed.
Utilities that can be used with distributed training.
Helper functions to detect NaN/Inf values.
Utilities related to memory.
Utilities used for parameter parsing.
Utilities that can be used for calling functions on a particular rank.
Utilities to help with reproducibility of models.
Warning-related utilities. |
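For reproducibility, the seeding utility is usually combined with the Trainer's deterministic flag; a brief sketch:

```python
from pytorch_lightning import Trainer, seed_everything

# Seed Python, NumPy and PyTorch (and, with workers=True, dataloader workers).
seed_everything(42, workers=True)
trainer = Trainer(deterministic=True)
```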