:orphan:

.. _hpu_basics:

Accelerator: HPU training
=========================

**Audience:** Users looking to save money and run large models faster using single or multiple Gaudi devices.

----

What is an HPU?
---------------

`Habana® Gaudi® AI Processor (HPU) `__ training processors are built on a heterogeneous architecture with a cluster of fully programmable Tensor Processing Cores (TPC) along with its associated development tools and libraries, and a configurable Matrix Math engine.

The TPC core is a VLIW SIMD processor with an instruction set and hardware tailored to serve training workloads efficiently. The Gaudi memory architecture includes on-die SRAM and local memories in each TPC. Gaudi is also the first DL training processor with integrated RDMA over Converged Ethernet (RoCE v2) engines on-chip.

On the software side, the PyTorch Habana bridge interfaces between the framework and the SynapseAI software stack to enable the execution of deep learning models on the Habana Gaudi device.

Gaudi offers a substantial price/performance advantage, so you get to do more deep learning training while spending less.

For more information, check out `Gaudi Architecture `__ and `Gaudi Developer Docs `__.

----

Run on Gaudi
------------

To enable PyTorch Lightning to use the HPU accelerator, simply pass ``accelerator=HPUAccelerator()`` to the ``Trainer`` class.

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning_habana.pytorch.accelerator import HPUAccelerator

    # run on as many Gaudi devices as available by default
    trainer = Trainer(accelerator="auto", devices="auto", strategy="auto")
    # equivalent to
    trainer = Trainer()

    # run on one Gaudi device
    trainer = Trainer(accelerator=HPUAccelerator(), devices=1)
    # run on multiple Gaudi devices
    trainer = Trainer(accelerator=HPUAccelerator(), devices=8)
    # choose the number of devices automatically
    trainer = Trainer(accelerator=HPUAccelerator(), devices="auto")

The ``devices=1`` parameter with HPUs enables the Habana accelerator for single-card training. It uses :class:`~lightning_habana.pytorch.strategies.SingleHPUStrategy`.

The ``devices>1`` parameter with HPUs enables the Habana accelerator for distributed training. It uses :class:`~lightning_habana.pytorch.strategies.HPUParallelStrategy`, which is based on the DDP strategy with the addition of Habana's collective communication library (HCCL) to support scale-up within a node and scale-out across multiple nodes.
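
Putting this together, a complete single-card run looks the same as on any other accelerator; only the ``accelerator`` argument changes. The snippet below is a minimal sketch, assuming ``lightning`` and ``lightning_habana`` are installed and a Gaudi device is visible; the ``LitClassifier`` module and the random dataset are illustrative placeholders, not part of the Habana API.

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from lightning.pytorch import LightningModule, Trainer
    from lightning_habana.pytorch.accelerator import HPUAccelerator


    class LitClassifier(LightningModule):
        """Tiny placeholder model; any LightningModule runs unchanged on HPU."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.cross_entropy(self.layer(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        # random data stands in for a real dataset
        dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        train_loader = DataLoader(dataset, batch_size=32)

        # single Gaudi card; raise ``devices`` to use more cards in the node
        trainer = Trainer(accelerator=HPUAccelerator(), devices=1, max_epochs=1)
        trainer.fit(LitClassifier(), train_loader)

Running the same script with ``devices=8`` (or ``devices="auto"``) switches to the parallel strategy described above without any changes to the model code.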

----

Scale-out on Gaudis
-------------------

To train a Lightning model using multiple HPU nodes, set the ``num_nodes`` parameter to the number of available nodes in the ``Trainer`` class.

.. code-block:: python

    import torch

    from lightning.pytorch import Trainer
    from lightning_habana.pytorch.accelerator import HPUAccelerator
    from lightning_habana.pytorch.strategies import HPUParallelStrategy

    hpus = 8
    parallel_hpus = [torch.device("hpu")] * hpus
    trainer = Trainer(
        accelerator=HPUAccelerator(),
        devices=hpus,
        strategy=HPUParallelStrategy(parallel_devices=parallel_hpus),
        num_nodes=2,
    )

In addition to this, the following environment variables need to be set to establish communication across nodes:

- *MASTER_PORT* - required; has to be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except for NODE_RANK 0); address of the NODE_RANK 0 node
- *WORLD_SIZE* - required; total number of workers in the cluster
- *NODE_RANK* - required; id of the node in the cluster

The trainer needs to be instantiated on every node participating in the training.

On Node 1:

.. code-block:: bash

    MASTER_ADDR=<MASTER_ADDR> MASTER_PORT=<MASTER_PORT> NODE_RANK=0 WORLD_SIZE=16 python some_model_trainer.py (--arg1 ... train script args...)

On Node 2:

.. code-block:: bash

    MASTER_ADDR=<MASTER_ADDR> MASTER_PORT=<MASTER_PORT> NODE_RANK=1 WORLD_SIZE=16 python some_model_trainer.py (--arg1 ... train script args...)

----

How to access HPUs
------------------

To use HPUs, you must have access to a system with HPU devices.

AWS
^^^

You can use either `Gaudi-based AWS EC2 DL1 instances `__ or a `Supermicro X12 Gaudi server `__ to get access to HPUs.

Check out the `PyTorch Model on AWS DL1 Instance Quick Start `__.
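
Once you have access to such a system, you can confirm that Lightning detects the Gaudi devices before launching a full run. The check below is a minimal sketch, assuming ``lightning_habana`` is installed; it relies on ``is_available`` and ``auto_device_count``, which are assumed here to be exposed on ``HPUAccelerator`` as part of Lightning's standard accelerator interface.

.. code-block:: python

    from lightning_habana.pytorch.accelerator import HPUAccelerator

    # both helpers come from Lightning's accelerator interface
    if HPUAccelerator.is_available():
        print(f"Gaudi devices detected: {HPUAccelerator.auto_device_count()}")
    else:
        print("No Gaudi devices detected on this machine.")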