Plugins¶
Plugins allow custom integrations with the internals of the Trainer, such as a custom precision or distributed implementation.
Under the hood, the Lightning Trainer uses plugins in the training routine and adds them automatically depending on the provided Trainer arguments. For example:
from pytorch_lightning import Trainer

# Plugins selected automatically for the arguments below:
# accelerator: GPUAccelerator
# training type: DDPPlugin
# precision: NativeMixedPrecisionPlugin
trainer = Trainer(gpus=4, precision=16)
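To make the idea of automatic selection concrete, here is a minimal toy sketch of a resolver that maps Trainer-style arguments to plugin class names. It does not import Lightning, and `resolve_plugins` and `PluginSelection` are hypothetical names. The fallback rules (`CPUAccelerator`, `SingleDevicePlugin`, `PrecisionPlugin` for 32-bit) are simplifying assumptions; the real Trainer considers many more arguments.

```python
from typing import NamedTuple


class PluginSelection(NamedTuple):
    accelerator: str
    training_type: str
    precision: str


def resolve_plugins(gpus: int = 0, precision: int = 32) -> PluginSelection:
    """Toy resolver (an assumption, not Lightning's actual logic)."""
    accelerator = "GPUAccelerator" if gpus > 0 else "CPUAccelerator"
    # More than one device needs a distributed training type plugin.
    training_type = "DDPPlugin" if gpus > 1 else "SingleDevicePlugin"
    precision_plugins = {
        32: "PrecisionPlugin",
        16: "NativeMixedPrecisionPlugin",
    }
    if precision not in precision_plugins:
        raise ValueError(f"precision={precision} is not handled by this sketch")
    return PluginSelection(accelerator, training_type, precision_plugins[precision])


selection = resolve_plugins(gpus=4, precision=16)
print(selection)
```

For `gpus=4, precision=16` the sketch returns the same three names as the comments in the example above.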
We expose Accelerators and Plugins mainly for expert users who want to extend Lightning for:
- New hardware (like the TPU plugin)
- Distributed backends (e.g. a backend not yet supported by PyTorch itself) 
- Clusters (e.g. customized access to the cluster’s environment interface) 
There are two types of Plugins in Lightning with different responsibilities:
TrainingTypePlugin¶
- Launch and tear down training processes (if applicable)
- Set up communication between processes (NCCL, GLOO, MPI, …)
- Provide a unified communication interface for reduction, broadcast, etc. 
- Provide access to the wrapped LightningModule 
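The responsibilities above can be sketched as a small interface. This is a toy illustration in plain Python, not Lightning's `TrainingTypePlugin`: the class names `ToyTrainingTypePlugin` and `ToySingleDevicePlugin` and the exact method signatures are assumptions chosen for readability. With a single process there is nothing to communicate, so each collective degenerates to an identity operation.

```python
from abc import ABC, abstractmethod
from typing import Any


class ToyTrainingTypePlugin(ABC):
    """Hypothetical stand-in for a training type plugin."""

    def __init__(self, model: Any) -> None:
        self._model = model

    @property
    def model(self) -> Any:
        # Gives callers access to the wrapped module.
        return self._model

    @abstractmethod
    def setup(self) -> None:
        """Launch processes and set up communication, if applicable."""

    @abstractmethod
    def reduce(self, value: float) -> float:
        """Combine this process's value with the values of all other processes."""

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        """Send obj from process src to every process."""

    @abstractmethod
    def teardown(self) -> None:
        """Release whatever setup() acquired."""


class ToySingleDevicePlugin(ToyTrainingTypePlugin):
    """One process only, so every collective is a local no-op."""

    def __init__(self, model: Any) -> None:
        super().__init__(model)
        self.is_setup = False

    def setup(self) -> None:
        self.is_setup = True

    def reduce(self, value: float) -> float:
        return value

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        return obj

    def teardown(self) -> None:
        self.is_setup = False


plugin = ToySingleDevicePlugin(model="my-model")
plugin.setup()
print(plugin.reduce(3.5), plugin.broadcast({"seed": 7}), plugin.model)
plugin.teardown()
```

A multi-process implementation would keep the same interface and replace the method bodies with calls into a communication backend.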
PrecisionPlugin¶
- Perform pre- and post-backward/optimizer-step operations, such as scaling gradients
- Provide context managers for forward, training_step, etc. 
- Gradient clipping 
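As a rough illustration of these three responsibilities, the toy class below scales a loss before the backward pass, unscales gradients before the optimizer step, clips gradients by their global norm, and exposes a context manager for the forward pass. It works on plain floats and does not use Lightning or PyTorch. `ToyMixedPrecisionPlugin` and its method names are hypothetical, and the static loss scale is a simplifying assumption.

```python
import math
from contextlib import contextmanager
from typing import Iterator, List


class ToyMixedPrecisionPlugin:
    """Hypothetical precision plugin operating on plain Python floats."""

    def __init__(self, loss_scale: float = 1024.0) -> None:
        self.loss_scale = loss_scale
        self.in_forward = False

    @contextmanager
    def forward_context(self) -> Iterator[None]:
        # A real plugin would enable reduced-precision casting here;
        # this sketch only records that the context is active.
        self.in_forward = True
        try:
            yield
        finally:
            self.in_forward = False

    def pre_backward(self, loss: float) -> float:
        # Scale the loss so that small gradients stay representable.
        return loss * self.loss_scale

    def pre_optimizer_step(self, grads: List[float]) -> List[float]:
        # Undo the scaling before the optimizer sees the gradients.
        return [g / self.loss_scale for g in grads]

    @staticmethod
    def clip_gradients(grads: List[float], max_norm: float) -> List[float]:
        # Rescale all gradients together if their L2 norm exceeds max_norm.
        total_norm = math.sqrt(sum(g * g for g in grads))
        if total_norm <= max_norm:
            return list(grads)
        factor = max_norm / total_norm
        return [g * factor for g in grads]


plugin = ToyMixedPrecisionPlugin(loss_scale=8.0)
with plugin.forward_context():
    loss = 0.5  # stands in for the output of a forward pass
scaled_loss = plugin.pre_backward(loss)
grads = plugin.pre_optimizer_step([8.0, 16.0])
clipped = plugin.clip_gradients([3.0, 4.0], max_norm=1.0)
print(scaled_loss, grads, clipped)
```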
Furthermore, for multi-node training, Lightning provides cluster environment plugins that allow advanced users to configure Lightning to integrate with a custom cluster.
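A cluster environment essentially tells the training processes where to rendezvous and which rank each one holds. The sketch below shows one way this could look for a custom cluster that publishes those details through environment variables. It is plain Python and does not subclass Lightning's cluster environment base class; `ToyClusterEnvironment` and all `MYCLUSTER_*` variable names are hypothetical.

```python
import os
from typing import Mapping, Optional


class ToyClusterEnvironment:
    """Hypothetical environment reading rendezvous details from env variables."""

    def __init__(self, env: Optional[Mapping[str, str]] = None) -> None:
        # Accept an explicit mapping so the class is easy to test.
        self._env = os.environ if env is None else env

    def master_address(self) -> str:
        return self._env["MYCLUSTER_MASTER_ADDR"]

    def master_port(self) -> int:
        return int(self._env["MYCLUSTER_MASTER_PORT"])

    def world_size(self) -> int:
        return int(self._env["MYCLUSTER_NUM_PROCS"])

    def global_rank(self) -> int:
        return int(self._env["MYCLUSTER_PROC_ID"])

    def local_rank(self) -> int:
        return int(self._env["MYCLUSTER_LOCAL_ID"])

    def node_rank(self) -> int:
        return int(self._env["MYCLUSTER_NODE_ID"])


# Example values as a scheduler might export them for one process.
fake_env = {
    "MYCLUSTER_MASTER_ADDR": "10.0.0.1",
    "MYCLUSTER_MASTER_PORT": "29500",
    "MYCLUSTER_NUM_PROCS": "8",
    "MYCLUSTER_PROC_ID": "5",
    "MYCLUSTER_LOCAL_ID": "1",
    "MYCLUSTER_NODE_ID": "1",
}
cluster = ToyClusterEnvironment(fake_env)
print(cluster.master_address(), cluster.master_port(), cluster.global_rank())
```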
Create a custom plugin¶
Expert users may choose to extend an existing plugin by overriding its methods …
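from pytorch_lightning.plugins import DDPPlugin


class CustomDDPPlugin(DDPPlugin):
    def configure_ddp(self):
        # Wrap the model in your own distributed wrapper instead of the default one.
        # MyCustomDistributedDataParallel is a placeholder for a class you define.
        self._model = MyCustomDistributedDataParallel(
            self.model,
            device_ids=...,
        )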
or by subclassing the base classes TrainingTypePlugin or
PrecisionPlugin to create new ones. These custom plugins
can then be passed into the Trainer directly or via a (custom) accelerator:
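from pytorch_lightning import Trainer

# custom plugins (CustomPrecisionPlugin is a placeholder for your own PrecisionPlugin subclass)
trainer = Trainer(plugins=[CustomDDPPlugin(), CustomPrecisionPlugin()])

# fully custom accelerator and plugins (MyAccelerator is a placeholder for your own accelerator class)
accelerator = MyAccelerator(
    precision_plugin=CustomPrecisionPlugin(),
    training_type_plugin=CustomDDPPlugin(),
)
trainer = Trainer(accelerator=accelerator)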
The built-in plugins are listed below.
Warning
The Plugin API is in beta and subject to change. For help setting up custom plugins/accelerators, please reach out to us at support@pytorchlightning.ai
Training Type Plugins¶
| Base class for all training type plugins that change the behaviour of the training, validation and test-loop. | |
| Plugin that handles communication on a single device. | |
| Plugin for training with multiple processes in parallel. | |
| Implements data-parallel training in a single process, i.e., the model gets replicated to each device and each gets a split of the data. | |
| Plugin for multi-process single-device training on one or multiple nodes. | |
| DDP2 behaves like DP in one node, but synchronization across nodes behaves like in DDP. | |
| Optimizer and gradient sharded training provided by FairScale. | |
| Optimizer sharded training provided by FairScale. | |
| Plugin that spawns the training processes. | |
| Provides capabilities to run training using the DeepSpeed library, with training optimizations for large billion parameter models. | |
| Plugin for Horovod distributed training integration. | |
| Plugin for training on a single TPU device. | |
| Plugin for training on multiple TPU devices. | |
Precision Plugins¶
| Base class for all plugins handling the precision-specific parts of the training. | |
| Plugin for native mixed precision training. | |
| Mixed Precision for Sharded Training | |
| Mixed Precision Plugin based on Nvidia/Apex (https://github.com/NVIDIA/apex) | |
| Precision plugin for DeepSpeed integration. | |
| Plugin that enables bfloat16 on TPUs. | |
| Plugin for training with double precision. | |
Cluster Environments¶
| Specification of a cluster environment. | |
| The default environment used by Lightning for a single node or free cluster (not managed). | |
| An environment for running on clusters managed by the LSF resource manager. | |
| Environment for fault-tolerant and elastic training with torchelastic. | |
| Environment for distributed training using the PyTorchJob operator from Kubeflow. | |
| Cluster environment for training on a cluster managed by SLURM. | |