
GPU training (Expert)

Audience: Experts creating new scaling techniques such as DeepSpeed or FSDP


Lightning lets experts who research new ways of optimizing distributed training and inference create their own strategies and plug them into Lightning.

For example, Lightning worked closely with the Microsoft team to develop a DeepSpeed integration and with the Facebook (Meta) team to develop an FSDP integration.


What is a Strategy?

The Strategy controls how the model is distributed across training, evaluation, and prediction by the Trainer. It is selected by passing a strategy alias ("ddp", "ddp_spawn", "deepspeed", and so on) or a custom strategy instance to the strategy parameter of the Trainer.

The Strategy in PyTorch Lightning handles the following responsibilities:

  • Launches and tears down training processes (if applicable).

  • Sets up communication between processes (NCCL, GLOO, MPI, and so on).

  • Provides a unified communication interface for reduction, broadcast, and so on (see the sketch below this list).

  • Owns the LightningModule.

  • Handles/owns optimizers and schedulers.
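
The following is a minimal, hedged sketch of that communication interface, assuming the Lightning 1.x API where the configured strategy is reachable as self.trainer.strategy inside a LightningModule (MyModel and the per-batch "loss" outputs are hypothetical names used only for illustration):

import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        # sum of the per-batch losses produced by this process
        local_sum = torch.stack([out["loss"] for out in outputs]).sum()
        # the strategy reduces across all processes, whatever backend is in use (NCCL, GLOO, ...)
        global_mean = self.trainer.strategy.reduce(local_sum, reduce_op="mean")
        self.log("val_loss_mean", global_mean)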

Strategy is a composition of one Accelerator, one Precision Plugin, a CheckpointIO plugin, and other optional plugins such as the ClusterEnvironment.

(Figure: the Strategy as a composition of the Accelerator and several plugins.)
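
To make this composition concrete, here is a hedged sketch that assembles a DDPStrategy explicitly from its parts. It assumes a recent Lightning 1.x release where CUDAAccelerator, NativeMixedPrecisionPlugin, TorchCheckpointIO, and LightningEnvironment are available under these import paths; in practice Lightning picks suitable defaults for all of them.

from pytorch_lightning import Trainer
from pytorch_lightning.accelerators import CUDAAccelerator
from pytorch_lightning.plugins import NativeMixedPrecisionPlugin, TorchCheckpointIO
from pytorch_lightning.plugins.environments import LightningEnvironment
from pytorch_lightning.strategies import DDPStrategy

# spell out every component the Strategy composes: accelerator, precision, checkpoint IO, cluster environment
strategy = DDPStrategy(
    accelerator=CUDAAccelerator(),
    precision_plugin=NativeMixedPrecisionPlugin(precision=16, device="cuda"),
    checkpoint_io=TorchCheckpointIO(),
    cluster_environment=LightningEnvironment(),
)
trainer = Trainer(strategy=strategy, devices=4)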

We expose Strategies mainly for expert users that want to extend Lightning for new hardware support or new distributed backends (e.g. a backend not yet supported by PyTorch itself).


Selecting a Built-in Strategy

Built-in strategies can be selected in two ways.

  1. Pass the shorthand name to the strategy Trainer argument

  2. Import a Strategy from pytorch_lightning.strategies, instantiate it, and pass it to the strategy Trainer argument

The latter allows you to configure further options on the specific strategy. Here are some examples:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Training with the DistributedDataParallel strategy on 4 GPUs
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)

# Training with the DistributedDataParallel strategy on 4 GPUs, with options configured
trainer = Trainer(strategy=DDPStrategy(find_unused_parameters=False), accelerator="gpu", devices=4)

# Training with the DDP Spawn strategy using auto accelerator selection
trainer = Trainer(strategy="ddp_spawn", accelerator="auto", devices=4)

# Training with the DeepSpeed strategy on available GPUs
trainer = Trainer(strategy="deepspeed", accelerator="gpu", devices="auto")

# Training with the DDP strategy using 3 CPU processes
trainer = Trainer(strategy="ddp", accelerator="cpu", devices=3)

# Training with the DDP Spawn strategy on 8 TPU cores
trainer = Trainer(strategy="ddp_spawn", accelerator="tpu", devices=8)

# Training with the default IPU strategy on 8 IPUs
trainer = Trainer(accelerator="ipu", devices=8)

The table below lists all relevant strategies available in Lightning together with their corresponding shorthand names:

Strategy Classes and Nicknames

  • bagua (BaguaStrategy): Strategy for training using the Bagua library, with advanced distributed training algorithms and system optimizations.

  • collaborative (HivemindStrategy): Strategy for training collaboratively on local machines or unreliable GPUs across the internet.

  • colossalai (ColossalAIStrategy): Strategy for training with the Colossal-AI library, which provides a collection of parallel components and aims to let you write distributed deep learning models much as you would write a model for a single machine.

  • fsdp_native (DDPFullyShardedNativeStrategy): Strategy for Fully Sharded Data Parallel provided by PyTorch.

  • fsdp (DDPFullyShardedStrategy): Strategy for Fully Sharded Data Parallel provided by FairScale.

  • ddp_sharded (DDPShardedStrategy): Optimizer and gradient sharded training provided by FairScale.

  • ddp_sharded_spawn (DDPSpawnShardedStrategy): Optimizer sharded training provided by FairScale.

  • ddp_spawn (DDPSpawnStrategy): Spawns processes using the torch.multiprocessing.spawn() method and joins processes after training finishes.

  • ddp (DDPStrategy): Strategy for multi-process single-device training on one or multiple nodes.

  • dp (DataParallelStrategy): Implements data-parallel training in a single process: the model is replicated to each device and each device receives a split of the data.

  • deepspeed (DeepSpeedStrategy): Provides capabilities to run training using the DeepSpeed library, with training optimizations for large, billion-parameter models.

  • horovod (HorovodStrategy): Strategy for Horovod distributed training integration.

  • hpu_parallel (HPUParallelStrategy): Strategy for distributed training on multiple HPU devices.

  • hpu_single (SingleHPUStrategy): Strategy for training on a single HPU device.

  • ipu_strategy (IPUStrategy): Plugin for training on IPU devices.

  • tpu_spawn (TPUSpawnStrategy): Strategy for training on multiple TPU devices using the torch_xla.distributed.xla_multiprocessing.spawn() method.

  • single_tpu (SingleTPUStrategy): Strategy for training on a single TPU device.


Create a Custom Strategy

Every strategy in Lightning is a subclass of one of the main base classes: Strategy, SingleDeviceStrategy, or ParallelStrategy.

(Figure: the Strategy base classes.)

As an expert user, you may choose to extend either an existing built-in Strategy or create a completely new one by subclassing the base classes.

from pytorch_lightning.strategies import DDPStrategy


class CustomDDPStrategy(DDPStrategy):
    def configure_ddp(self):
        # wrap the LightningModule with your own DistributedDataParallel variant
        # (MyCustomDistributedDataParallel is a placeholder for your wrapper class)
        self.model = MyCustomDistributedDataParallel(
            self.model,
            device_ids=...,
        )

    def setup(self, trainer):
        # you can access the accelerator and plugins directly
        self.accelerator.setup()
        self.precision_plugin.connect(...)

The custom strategy can then be passed into the Trainer directly via the strategy parameter.

# custom strategy
trainer = Trainer(strategy=CustomDDPStrategy())

Since the strategy also hosts the Accelerator and various plugins, you can customize all of them to work together as you like:

# custom strategy, with new accelerator and plugins
accelerator = MyAccelerator()
precision_plugin = MyPrecisionPlugin()
strategy = CustomDDPStrategy(accelerator=accelerator, precision_plugin=precision_plugin)
trainer = Trainer(strategy=strategy)
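
If you are targeting new hardware or a new backend instead of tweaking DDP, you would typically start from one of the base classes. The following is a minimal, hedged sketch assuming the Lightning 1.x SingleDeviceStrategy constructor (which takes the target device plus the usual accelerator, precision, and checkpoint plugins); MyDeviceStrategy is a hypothetical name:

import torch

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import SingleDeviceStrategy


class MyDeviceStrategy(SingleDeviceStrategy):
    """Hypothetical strategy that keeps the model and all batches on a single device."""

    strategy_name = "my_device"

    def __init__(self, device=torch.device("cpu"), **kwargs):
        # forward the chosen device and any plugins (accelerator, precision_plugin, ...) to the base class
        super().__init__(device=device, **kwargs)


trainer = Trainer(strategy=MyDeviceStrategy())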

Strategy Registry

Warning

The Strategy Registry is experimental and subject to change.

Lightning includes a registry that holds information about training strategies and allows for the registration of new custom strategies.

Strategies are registered under string identifiers such as "ddp", "deepspeed_stage_2_offload", and so on. The registry also stores the optional description and initialization parameters that were defined for each Strategy during registration.
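
Because the registry is experimental, the exact accessors below may change between releases; as a hedged illustration, it behaves like a dictionary keyed by the shorthand names, so each entry can be inspected for the registered class, description, and init parameters:

from pytorch_lightning.strategies import StrategyRegistry

# every registered shorthand name, e.g. "ddp", "deepspeed_stage_3_offload", ...
print(StrategyRegistry.available_strategies())

# one entry: a dict holding the registered strategy class, its description, and its init parameters
print(StrategyRegistry["deepspeed_stage_2_offload"])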

# Training with the DDP Strategy with `find_unused_parameters` as False
trainer = Trainer(strategy="ddp_find_unused_parameters_false", accelerator="gpu", devices=4)

# Training with DeepSpeed ZeRO Stage 3 and CPU Offload
trainer = Trainer(strategy="deepspeed_stage_3_offload", accelerator="gpu", devices=3)

# Training with the TPU Spawn Strategy with `debug` as True
trainer = Trainer(strategy="tpu_spawn_debug", accelerator="tpu", devices=8)

Additionally, you can register your own custom strategies and pass them to the strategy argument by their registered name:

from pathlib import Path
from typing import Any, Dict, Union

from pytorch_lightning.plugins import CheckpointIO
from pytorch_lightning.strategies import DDPStrategy, StrategyRegistry


class CustomCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint: Dict[str, Any], path: Union[str, Path]) -> None:
        ...

    def load_checkpoint(self, path: Union[str, Path]) -> Dict[str, Any]:
        ...

    # CheckpointIO also declares remove_checkpoint; implement it so the plugin can be instantiated
    def remove_checkpoint(self, path: Union[str, Path]) -> None:
        ...


custom_checkpoint_io = CustomCheckpointIO()

# Register the DDP Strategy with your custom CheckpointIO plugin
StrategyRegistry.register(
    "ddp_custom_checkpoint_io",
    DDPStrategy,
    description="DDP Strategy with custom checkpoint io plugin",
    checkpoint_io=custom_checkpoint_io,
)

trainer = Trainer(strategy="ddp_custom_checkpoint_io", accelerator="gpu", devices=2)
