Plugins¶
Plugins let you customize internals of the Trainer, such as the precision, checkpointing, or cluster environment implementation.
Under the hood, the Lightning Trainer uses plugins in the training routine and adds them automatically based on the Trainer arguments you provide.
There are three types of plugins in Lightning with different responsibilities:
Precision plugins
CheckpointIO plugins
Cluster environments
You can make the Trainer use one or more plugins by adding them to the plugins argument like so:
trainer = Trainer(plugins=[plugin1, plugin2, ...])
By default, the plugins get selected based on the rest of the Trainer settings, such as the strategy.
Precision Plugins¶
Precision plugins let you train with numerical representations that have lower precision than 32-bit floating-point (for example, 16-bit) or higher precision, such as 64-bit floating-point.
# Training with 16-bit precision
trainer = Trainer(precision=16)
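To make the precision trade-off concrete, here is a small sketch that uses only Python's standard library (it does not use Lightning or PyTorch). It stores the same value in IEEE 754 half, single, and double precision through the struct module and compares the round-trip error. The helper name round_trip and the sample value 0.1 are assumptions chosen for illustration.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack `value` into the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1  # illustrative sample value, not taken from the Lightning docs

# "<e" = 16-bit half, "<f" = 32-bit single, "<d" = 64-bit double
as_half = round_trip(value, "<e")
as_single = round_trip(value, "<f")
as_double = round_trip(value, "<d")

# Absolute error introduced by storing the value in each format
err_half = abs(as_half - value)
err_single = abs(as_single - value)
err_double = abs(as_double - value)

for name, fmt, err in [
    ("half", "<e", err_half),
    ("single", "<f", err_single),
    ("double", "<d", err_double),
]:
    print(f"{name:>6}: {struct.calcsize(fmt)} bytes, round-trip error {err:.3e}")
```

Lower precision halves the memory per value at each step but increases the rounding error, which is the trade-off a precision plugin manages during training.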
The built-in precision plugins are listed below.
Precision plugin for DeepSpeed integration.
Plugin for training with double precision.
Plugin for training with half precision.
Precision plugin for training with Fully Sharded Data Parallel (FSDP).
Plugin for Automatic Mixed Precision (AMP) training.
Base class for all plugins handling the precision-specific parts of the training.
Plugin for training with XLA.
Plugin for training with fp8 precision via NVIDIA's Transformer Engine.
Plugin for quantizing weights with bitsandbytes.
More information regarding precision with Lightning can be found here.
CheckpointIO Plugins¶
As part of our commitment to extensibility, we have abstracted Lightning’s checkpointing logic into the CheckpointIO plugin.
With this, you have the ability to customize the checkpointing logic to match the needs of your infrastructure.
Below is a list of built-in plugins for checkpointing.
Interface to save/load checkpoints as they are saved through the
CheckpointIO that utilizes
CheckpointIO that utilizes
Learn more about custom checkpointing with Lightning here.
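As a rough illustration of customizing checkpointing logic, the sketch below is a standalone class that does not import Lightning. It mirrors the three operations a CheckpointIO plugin provides (save_checkpoint, load_checkpoint, remove_checkpoint). The class name PickleCheckpointIO, the use of pickle for serialization, and the write-then-rename strategy are all assumptions made for this example, not Lightning's implementation.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


class PickleCheckpointIO:
    """Hypothetical stand-in; a real plugin would subclass Lightning's CheckpointIO."""

    def save_checkpoint(
        self,
        checkpoint: Dict[str, Any],
        path: str,
        storage_options: Optional[Any] = None,  # unused in this sketch
    ) -> None:
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file in the same directory, then rename it,
        # so a crash mid-write never leaves a half-written checkpoint at `path`.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(checkpoint, handle)
        os.replace(tmp_name, target)

    def load_checkpoint(
        self,
        path: str,
        map_location: Optional[Any] = None,  # unused in this sketch
    ) -> Dict[str, Any]:
        with open(path, "rb") as handle:
            return pickle.load(handle)

    def remove_checkpoint(self, path: str) -> None:
        # Removing a checkpoint that is already gone is treated as a no-op.
        if os.path.exists(path):
            os.remove(path)
```

With Lightning installed, a class like this would instead inherit from CheckpointIO and be passed to the Trainer through the plugins argument shown earlier.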
Cluster Environments¶
You can define the interface of your own cluster environment based on the requirements of your infrastructure.
Specification of a cluster environment.
Environment for distributed training using the PyTorchJob operator from Kubeflow.
The default environment used by Lightning for a single node or free cluster (not managed).
An environment for running on clusters managed by the LSF resource manager.
Cluster environment for training on a cluster managed by SLURM.
Environment for fault-tolerant and elastic training with torchelastic.
Cluster environment for training on a TPU Pod with the PyTorch/XLA library.
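To give a feel for what a cluster environment has to answer (where the main process lives, how many processes exist, and which rank this one is), here is a standalone sketch that reads those values from environment variables. It does not import Lightning, and it covers only part of the interface. The class name EnvVarClusterEnvironment and every MYCLUSTER_* variable name are hypothetical, and detect taking a mapping argument is a simplification made so the example can be exercised without a real cluster.

```python
import os
from typing import Mapping


class EnvVarClusterEnvironment:
    """Hypothetical stand-in; a real plugin would subclass Lightning's ClusterEnvironment."""

    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        self._env = env

    @staticmethod
    def detect(env: Mapping[str, str] = os.environ) -> bool:
        # Treat the presence of the (made-up) world-size variable as the signal
        # that the process was launched by this scheduler.
        return "MYCLUSTER_WORLD_SIZE" in env

    @property
    def creates_processes_externally(self) -> bool:
        # The scheduler, not the training script, spawns one process per device.
        return True

    @property
    def main_address(self) -> str:
        return self._env.get("MYCLUSTER_MAIN_ADDRESS", "127.0.0.1")

    @property
    def main_port(self) -> int:
        return int(self._env.get("MYCLUSTER_MAIN_PORT", "29500"))

    def world_size(self) -> int:
        return int(self._env.get("MYCLUSTER_WORLD_SIZE", "1"))

    def global_rank(self) -> int:
        return int(self._env.get("MYCLUSTER_GLOBAL_RANK", "0"))

    def local_rank(self) -> int:
        return int(self._env.get("MYCLUSTER_LOCAL_RANK", "0"))

    def node_rank(self) -> int:
        return int(self._env.get("MYCLUSTER_NODE_RANK", "0"))


# Example: simulate the variables a scheduler might set for one worker process.
fake_env = {
    "MYCLUSTER_WORLD_SIZE": "8",
    "MYCLUSTER_GLOBAL_RANK": "5",
    "MYCLUSTER_LOCAL_RANK": "1",
    "MYCLUSTER_NODE_RANK": "1",
    "MYCLUSTER_MAIN_ADDRESS": "10.0.0.1",
    "MYCLUSTER_MAIN_PORT": "12345",
}
cluster = EnvVarClusterEnvironment(fake_env)
print(cluster.main_address, cluster.main_port, cluster.global_rank(), cluster.world_size())
```

Passing the mapping in explicitly keeps the sketch testable; a plugin used with the Trainer would read the process environment directly.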