Distributed Data Parallel (DDP)
###############################

Thunder has its own Distributed Data Parallel (DDP) transform, which we recommend using, although compiled modules also work with PyTorch's DDP wrapper. You can wrap a model in Thunder's ddp like this::

    import thunder
    from thunder.distributed import ddp

    model = MyModel()
    ddp_model = ddp(model)
    cmodel = thunder.jit(ddp_model)

Specifying which rank to broadcast from is optional: if ``broadcast_from`` is not given, ``ddp()`` broadcasts the parameters from the lowest rank in the process group. Thunder's ddp is compatible with PyTorch distributed runners like ``torchrun`` (https://pytorch.org/docs/stable/elastic/run.html).

When using PyTorch's DDP, call DDP on the jitted module::

    import thunder
    from torch.nn.parallel import DistributedDataParallel as DDP

    model = MyModel()
    jitted_model = thunder.jit(model)
    ddp_model = DDP(jitted_model)

The ability to express distributed algorithms like DDP as a simple transform on the trace is one of Thunder's strengths, and it is being leveraged to quickly implement more elaborate distributed strategies, like Fully Sharded Data Parallel (FSDP).
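
For instance, if your installed version of Thunder already exposes an ``fsdp`` transform in ``thunder.distributed``, it is expected to be applied the same way as ``ddp``. The snippet below is only a sketch under that assumption; the ``fsdp`` name and call signature shown here are assumptions, so check ``thunder.distributed`` in your installation::

    # Hypothetical sketch: assumes thunder.distributed.fsdp mirrors ddp()'s interface.
    import thunder
    from thunder.distributed import fsdp

    model = MyModel()
    sharded_model = fsdp(model)          # shards parameters across ranks instead of replicating them
    cmodel = thunder.jit(sharded_model)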
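
Putting the pieces together, here is a minimal sketch of a single-node training step using Thunder's ``ddp`` that can be launched with ``torchrun``. It assumes a CUDA/NCCL setup; the model, input, loss, and hyperparameters are placeholders::

    # Minimal sketch of a torchrun-launched training step with Thunder's ddp.
    # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
    import os

    import torch
    import torch.distributed as dist

    import thunder
    from thunder.distributed import ddp


    def main():
        # torchrun sets LOCAL_RANK (and RANK/WORLD_SIZE) for each process.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

        # Placeholder model; replace with your own module.
        model = torch.nn.Linear(64, 64, device=f"cuda:{local_rank}")

        # ddp() broadcasts parameters from the lowest rank by default;
        # pass broadcast_from=<rank> to choose a different source rank.
        ddp_model = ddp(model)
        cmodel = thunder.jit(ddp_model)

        optimizer = torch.optim.SGD(cmodel.parameters(), lr=1e-3)

        x = torch.randn(8, 64, device=f"cuda:{local_rank}")
        loss = cmodel(x).sum()    # placeholder loss
        loss.backward()           # gradients are all-reduced across ranks
        optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()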