###################
What is a Strategy?
###################

Strategy controls the model distribution across training, evaluation, and prediction to be used by the :doc:`Trainer <../common/trainer>`.
You select it by passing a strategy alias (``"ddp"``, ``"ddp_spawn"``, ``"deepspeed"``, and so on) or a custom strategy instance to the ``strategy`` parameter of the Trainer.

The Strategy in PyTorch Lightning handles the following responsibilities:

* Launch and teardown of training processes (if applicable).
* Setup communication between processes (NCCL, GLOO, MPI, and so on).
* Provide a unified communication interface for reduction, broadcast, and so on.
* Owns the :class:`~pytorch_lightning.core.module.LightningModule`.
* Handles/owns optimizers and schedulers.

Strategy is a composition of one :doc:`Accelerator <../extensions/accelerator>`, one :ref:`Precision Plugin `, a :ref:`CheckpointIO ` plugin and other optional plugins such as the :ref:`ClusterEnvironment `.

.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/strategies/overview.jpeg
    :alt: Illustration of the Strategy as a composition of the Accelerator and several plugins

We expose Strategies mainly for expert users who want to extend Lightning for new hardware support or new distributed backends (e.g. a backend not yet supported by `PyTorch `_ itself).

----------

*****************************
Selecting a Built-in Strategy
*****************************

Built-in strategies can be selected in two ways:

1. Pass the shorthand name to the ``strategy`` Trainer argument.
2. Import a Strategy from :mod:`pytorch_lightning.strategies`, instantiate it, and pass it to the ``strategy`` Trainer argument.

The latter allows you to configure further options on the specific strategy. Here are some examples:

.. code-block:: python

    # Training with the DistributedDataParallel strategy on 4 GPUs
    trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)

    # Training with the DistributedDataParallel strategy on 4 GPUs, with options configured
    trainer = Trainer(strategy=DDPStrategy(find_unused_parameters=False), accelerator="gpu", devices=4)

    # Training with the DDP Spawn strategy using auto accelerator selection
    trainer = Trainer(strategy="ddp_spawn", accelerator="auto", devices=4)

    # Training with the DeepSpeed strategy on available GPUs
    trainer = Trainer(strategy="deepspeed", accelerator="gpu", devices="auto")

    # Training with the DDP strategy using 3 CPU processes
    trainer = Trainer(strategy="ddp", accelerator="cpu", devices=3)

    # Training with the DDP Spawn strategy on 8 TPU cores
    trainer = Trainer(strategy="ddp_spawn", accelerator="tpu", devices=8)

    # Training with the default IPU strategy on 8 IPUs
    trainer = Trainer(accelerator="ipu", devices=8)
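Strategies that wrap a third-party library also expose that library's tuning options through their constructor. The example below is a minimal sketch, assuming the ``deepspeed`` package is installed; the specific values (ZeRO stage 3 with optimizer offloading) are purely illustrative:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DeepSpeedStrategy

    # Illustrative configuration: ZeRO stage 3 with optimizer states offloaded to CPU
    trainer = Trainer(
        strategy=DeepSpeedStrategy(stage=3, offload_optimizer=True),
        accelerator="gpu",
        devices=4,
        precision=16,
    )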
The table below lists all relevant strategies available in Lightning with their corresponding shorthand names:

.. list-table:: Strategy Classes and Nicknames
   :widths: 20 20 20
   :header-rows: 1

   * - Name
     - Class
     - Description
   * - bagua
     - :class:`~pytorch_lightning.strategies.BaguaStrategy`
     - Strategy for training using the Bagua library, with advanced distributed training algorithms and system optimizations. :ref:`Learn more. `
   * - collaborative
     - :class:`~pytorch_lightning.strategies.HivemindStrategy`
     - Strategy for training collaboratively on local machines or unreliable GPUs across the internet. :ref:`Learn more. `
   * - colossalai
     - :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
     - Strategy for training with the Colossal-AI library, which provides a collection of parallel components and aims to let you write distributed deep learning models the same way you write models for your laptop. `Learn more. `__
   * - fsdp_native
     - :class:`~pytorch_lightning.strategies.DDPFullyShardedNativeStrategy`
     - Strategy for Fully Sharded Data Parallel. :ref:`Learn more. `
   * - ddp_spawn
     - :class:`~pytorch_lightning.strategies.DDPSpawnStrategy`
     - Spawns processes using the :func:`torch.multiprocessing.spawn` method and joins processes after training finishes. :ref:`Learn more. `
   * - ddp
     - :class:`~pytorch_lightning.strategies.DDPStrategy`
     - Strategy for multi-process single-device training on one or multiple nodes. :ref:`Learn more. `
   * - dp
     - :class:`~pytorch_lightning.strategies.DataParallelStrategy`
     - Implements data-parallel training in a single process, i.e., the model gets replicated to each device and each gets a split of the data. :ref:`Learn more. `
   * - deepspeed
     - :class:`~pytorch_lightning.strategies.DeepSpeedStrategy`
     - Provides capabilities to run training using the DeepSpeed library, with training optimizations for large billion-parameter models. :ref:`Learn more. `
   * - hpu_parallel
     - :class:`~pytorch_lightning.strategies.HPUParallelStrategy`
     - Strategy for distributed training on multiple HPU devices. :doc:`Learn more. <../accelerators/hpu>`
   * - hpu_single
     - :class:`~pytorch_lightning.strategies.SingleHPUStrategy`
     - Strategy for training on a single HPU device. :doc:`Learn more. <../accelerators/hpu>`
   * - ipu_strategy
     - :class:`~pytorch_lightning.strategies.IPUStrategy`
     - Plugin for training on IPU devices. :doc:`Learn more. <../accelerators/ipu>`
   * - tpu_spawn
     - :class:`~pytorch_lightning.strategies.TPUSpawnStrategy`
     - Strategy for training on multiple TPU devices using the :func:`torch_xla.distributed.xla_multiprocessing.spawn` method. :doc:`Learn more. <../accelerators/tpu>`
   * - single_tpu
     - :class:`~pytorch_lightning.strategies.SingleTPUStrategy`
     - Strategy for training on a single TPU device. :doc:`Learn more. <../accelerators/tpu>`

----

************************
Create a Custom Strategy
************************

Every strategy in Lightning is a subclass of one of the main base classes: :class:`~pytorch_lightning.strategies.Strategy`, :class:`~pytorch_lightning.strategies.SingleDeviceStrategy` or :class:`~pytorch_lightning.strategies.ParallelStrategy`.

.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/strategies/hierarchy.jpeg
    :alt: Strategy base classes

As an expert user, you may choose to extend either an existing built-in Strategy or create a completely new one by subclassing the base classes.

.. code-block:: python

    from pytorch_lightning.strategies import DDPStrategy


    class CustomDDPStrategy(DDPStrategy):
        def configure_ddp(self):
            self.model = MyCustomDistributedDataParallel(
                self.model,
                device_ids=...,
            )

        def setup(self, trainer):
            # you can access the accelerator and plugins directly
            self.accelerator.setup()
            self.precision_plugin.connect(...)

The custom strategy can then be passed into the ``Trainer`` directly via the ``strategy`` parameter.

.. code-block:: python

    # custom strategy
    trainer = Trainer(strategy=CustomDDPStrategy())

Since the strategy also hosts the Accelerator and various plugins, you can customize all of them to work together as you like:

.. code-block:: python

    # custom strategy, with new accelerator and plugins
    accelerator = MyAccelerator()
    precision_plugin = MyPrecisionPlugin()
    strategy = CustomDDPStrategy(accelerator=accelerator, precision_plugin=precision_plugin)
    trainer = Trainer(strategy=strategy)
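Because the strategy also owns the CheckpointIO plugin mentioned at the top of this page, checkpoint I/O can be customized the same way. The snippet below is a minimal sketch that wraps the built-in :class:`~pytorch_lightning.plugins.io.TorchCheckpointIO`; the ``LoggingCheckpointIO`` class and its behavior are illustrative assumptions, not part of the library.

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.io import TorchCheckpointIO
    from pytorch_lightning.strategies import DDPStrategy


    class LoggingCheckpointIO(TorchCheckpointIO):
        # Hypothetical plugin: report the target path, then delegate to the default torch.save-based I/O
        def save_checkpoint(self, checkpoint, path, storage_options=None):
            print(f"Saving checkpoint to {path}")
            super().save_checkpoint(checkpoint, path, storage_options=storage_options)


    # the strategy hosts the CheckpointIO plugin, so it can be swapped just like the other components
    strategy = DDPStrategy(checkpoint_io=LoggingCheckpointIO())
    trainer = Trainer(strategy=strategy, accelerator="gpu", devices=2)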