Shortcuts

Plugins

Plugins allow custom integrations to the internals of the Trainer such as custom precision, checkpointing or cluster environment implementation.

Under the hood, the Lightning Trainer is using plugins in the training routine, added automatically depending on the provided Trainer arguments.

There are three types of Plugins in Lightning with different responsibilities:

  • Precision Plugins

  • CheckpointIO Plugins

  • Cluster Environments

You can make the Trainer use one or multiple plugins by adding it to the plugins argument like so:

trainer = Trainer(plugins=[plugin1, plugin2, ...])

By default, the plugins get selected based on the rest of the Trainer settings such as the strategy.


Precision Plugins

We provide precision plugins for you to benefit from numerical representations with lower precision than 32-bit floating-point or higher precision, such as 64-bit floating-point.

# Training with 16-bit precision
trainer = Trainer(precision=16)

The full list of built-in precision plugins is listed below.

ColossalAIPrecisionPlugin

Precision plugin for ColossalAI integration.

DeepSpeedPrecisionPlugin

Precision plugin for DeepSpeed integration.

DoublePrecisionPlugin

Plugin for training with double (torch.float64) precision.

FullyShardedNativeMixedPrecisionPlugin

Native AMP for Fully Sharded Training.

FullyShardedNativeNativeMixedPrecisionPlugin

Native AMP for Fully Sharded Native Training.

HPUPrecisionPlugin

Plugin that enables bfloat/half support on HPUs.

IPUPrecisionPlugin

Precision plugin for IPU integration.

MixedPrecisionPlugin

Plugin for Automatic Mixed Precision (AMP) training with torch.autocast.

PrecisionPlugin

Base class for all plugins handling the precision-specific parts of the training.

ShardedNativeMixedPrecisionPlugin

Native AMP for Sharded Training.

TPUBf16PrecisionPlugin

Plugin that enables bfloats on TPUs.

TPUPrecisionPlugin

Precision plugin for TPU integration.

More information regarding precision with Lightning can be found here


CheckpointIO Plugins

As part of our commitment to extensibility, we have abstracted Lightning’s checkpointing logic into the CheckpointIO plugin. With this, you have the ability to customize the checkpointing logic to match the needs of your infrastructure.

Below is a list of built-in plugins for checkpointing.

AsyncCheckpointIO

AsyncCheckpointIO enables saving the checkpoints asynchronously in a thread.

CheckpointIO

Interface to save/load checkpoints as they are saved through the Strategy.

HPUCheckpointIO

CheckpointIO to save checkpoints for HPU training strategies.

TorchCheckpointIO

CheckpointIO that utilizes torch.save() and torch.load() to save and load checkpoints respectively, common for most use cases.

XLACheckpointIO

CheckpointIO that utilizes xm.save() to save checkpoints for TPU training strategies.

Learn more about custom checkpointing with Lightning here.


Cluster Environments

You can define the interface of your own cluster environment based on the requirements of your infrastructure.

ClusterEnvironment

Specification of a cluster environment.

KubeflowEnvironment

Environment for distributed training using the PyTorchJob operator from Kubeflow

LightningEnvironment

The default environment used by Lightning for a single node or free cluster (not managed).

LSFEnvironment

An environment for running on clusters managed by the LSF resource manager.

SLURMEnvironment

Cluster environment for training on a cluster managed by SLURM.

TorchElasticEnvironment

Environment for fault-tolerant and elastic training with torchelastic

XLAEnvironment

Cluster environment for training on a TPU Pod with the PyTorch/XLA library.