:orphan:

.. _profiler_intermediate:

############################################
Find bottlenecks in your code (intermediate)
############################################

**Audience**: Users who want to see more granular profiling information

----

**************************
Profile PyTorch operations
**************************

To understand the cost of each PyTorch operation, use the :class:`~lightning.pytorch.profilers.pytorch.PyTorchProfiler` built on top of the `PyTorch profiler <https://pytorch.org/docs/stable/profiler.html>`__.

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.profilers import PyTorchProfiler

    profiler = PyTorchProfiler()
    trainer = Trainer(profiler=profiler)

The profiler will generate an output like this:

.. code-block::

    Profiler Report

    Profile stats for: training_step
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    Name                   Self CPU total %  Self CPU total  CPU total %      CPU total        CPU time avg
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    t                      62.10%           1.044ms          62.77%           1.055ms          1.055ms
    addmm                  32.32%           543.135us        32.69%           549.362us        549.362us
    mse_loss               1.35%            22.657us         3.58%            60.105us         60.105us
    mean                   0.22%            3.694us          2.05%            34.523us         34.523us
    div_                   0.64%            10.756us         1.90%            32.001us         16.000us
    ones_like              0.21%            3.461us          0.81%            13.669us         13.669us
    sum_out                0.45%            7.638us          0.74%            12.432us         12.432us
    transpose              0.23%            3.786us          0.68%            11.393us         11.393us
    as_strided             0.60%            10.060us         0.60%            10.060us         3.353us
    to                     0.18%            3.059us          0.44%            7.464us          7.464us
    empty_like             0.14%            2.387us          0.41%            6.859us          6.859us
    empty_strided          0.38%            6.351us          0.38%            6.351us          3.175us
    fill_                  0.28%            4.782us          0.33%            5.566us          2.783us
    expand                 0.20%            3.336us          0.28%            4.743us          4.743us
    empty                  0.27%            4.456us          0.27%            4.456us          2.228us
    copy_                  0.15%            2.526us          0.15%            2.526us          2.526us
    broadcast_tensors      0.15%            2.492us          0.15%            2.492us          2.492us
    size                   0.06%            0.967us          0.06%            0.967us          0.484us
    is_complex             0.06%            0.961us          0.06%            0.961us          0.481us
    stride                 0.03%            0.517us          0.03%            0.517us          0.517us
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    Self CPU time total: 1.681ms

.. note::
    When using the PyTorch Profiler, wall clock time will not be representative of the true wall clock time.
    This is because profiled operations are forced to be measured synchronously, while many CUDA ops happen asynchronously.
    It is recommended to use this profiler to find bottlenecks and per-operation breakdowns; for end-to-end wall clock time, use the ``SimpleProfiler``.

----

***************************
Profile a distributed model
***************************

To profile a distributed model, use the :class:`~lightning.pytorch.profilers.pytorch.PyTorchProfiler` with the *filename* argument, which will save a report per rank.

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.profilers import PyTorchProfiler

    profiler = PyTorchProfiler(filename="perf-logs")
    trainer = Trainer(profiler=profiler)
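If you want the per-rank reports written to a specific directory rather than the trainer's default log directory, the profiler also accepts a ``dirpath`` argument. A minimal sketch; the directory and file names here are only examples:

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.profilers import PyTorchProfiler

    # Write one report per rank into ./profiler_logs/ instead of the default log directory
    profiler = PyTorchProfiler(dirpath="profiler_logs", filename="perf-logs")
    trainer = Trainer(profiler=profiler)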
With two ranks, it will generate a report like so:

.. code-block::

    Profiler Report: rank 0

    Profile stats for: training_step
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    Name                   Self CPU total %  Self CPU total  CPU total %      CPU total        CPU time avg
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    t                      62.10%           1.044ms          62.77%           1.055ms          1.055ms
    addmm                  32.32%           543.135us        32.69%           549.362us        549.362us
    mse_loss               1.35%            22.657us         3.58%            60.105us         60.105us
    mean                   0.22%            3.694us          2.05%            34.523us         34.523us
    div_                   0.64%            10.756us         1.90%            32.001us         16.000us
    ones_like              0.21%            3.461us          0.81%            13.669us         13.669us
    sum_out                0.45%            7.638us          0.74%            12.432us         12.432us
    transpose              0.23%            3.786us          0.68%            11.393us         11.393us
    as_strided             0.60%            10.060us         0.60%            10.060us         3.353us
    to                     0.18%            3.059us          0.44%            7.464us          7.464us
    empty_like             0.14%            2.387us          0.41%            6.859us          6.859us
    empty_strided          0.38%            6.351us          0.38%            6.351us          3.175us
    fill_                  0.28%            4.782us          0.33%            5.566us          2.783us
    expand                 0.20%            3.336us          0.28%            4.743us          4.743us
    empty                  0.27%            4.456us          0.27%            4.456us          2.228us
    copy_                  0.15%            2.526us          0.15%            2.526us          2.526us
    broadcast_tensors      0.15%            2.492us          0.15%            2.492us          2.492us
    size                   0.06%            0.967us          0.06%            0.967us          0.484us
    is_complex             0.06%            0.961us          0.06%            0.961us          0.481us
    stride                 0.03%            0.517us          0.03%            0.517us          0.517us
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    Self CPU time total: 1.681ms

.. code-block::

    Profiler Report: rank 1

    Profile stats for: training_step
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    Name                   Self CPU total %  Self CPU total  CPU total %      CPU total        CPU time avg
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    t                      42.10%           1.044ms          62.77%           1.055ms          1.055ms
    addmm                  32.32%           543.135us        32.69%           549.362us        549.362us
    mse_loss               1.35%            22.657us         3.58%            60.105us         60.105us
    mean                   0.22%            3.694us          2.05%            34.523us         34.523us
    div_                   0.64%            10.756us         1.90%            32.001us         16.000us
    ones_like              0.21%            3.461us          0.81%            13.669us         13.669us
    sum_out                0.45%            7.638us          0.74%            12.432us         12.432us
    transpose              0.23%            3.786us          0.68%            11.393us         11.393us
    as_strided             0.60%            10.060us         0.60%            10.060us         3.353us
    to                     0.18%            3.059us          0.44%            7.464us          7.464us
    empty_like             0.14%            2.387us          0.41%            6.859us          6.859us
    empty_strided          0.38%            6.351us          0.38%            6.351us          3.175us
    fill_                  0.28%            4.782us          0.33%            5.566us          2.783us
    expand                 0.20%            3.336us          0.28%            4.743us          4.743us
    empty                  0.27%            4.456us          0.27%            4.456us          2.228us
    copy_                  0.15%            2.526us          0.15%            2.526us          2.526us
    broadcast_tensors      0.15%            2.492us          0.15%            2.492us          2.492us
    size                   0.06%            0.967us          0.06%            0.967us          0.484us
    is_complex             0.06%            0.961us          0.06%            0.961us          0.481us
    stride                 0.03%            0.517us          0.03%            0.517us          0.517us
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------
    Self CPU time total: 1.681ms

This profiler will record ``training_step``, ``validation_step``, ``test_step``, and ``predict_step``. The output above shows the profiling for the action ``training_step``.

.. note::
    When using the PyTorch Profiler, wall clock time will not be representative of the true wall clock time.
    This is because profiled operations are forced to be measured synchronously, while many CUDA ops happen asynchronously.
    It is recommended to use this profiler to find bottlenecks and per-operation breakdowns; for end-to-end wall clock time, use the ``SimpleProfiler``.
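As the notes above suggest, if you mainly need end-to-end wall clock time rather than a per-operation breakdown, reach for the ``SimpleProfiler`` instead. A minimal sketch; the ``filename`` value is only an example:

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.profilers import SimpleProfiler

    # Reports wall clock time per hook (training_step, validation_step, ...) without per-op detail
    profiler = SimpleProfiler(filename="simple-perf-logs")
    trainer = Trainer(profiler=profiler)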
----

*****************************
Visualize profiled operations
*****************************

To visualize the profiled operations, enable **emit_nvtx** in the :class:`~lightning.pytorch.profilers.pytorch.PyTorchProfiler`.

.. code-block:: python

    from lightning.pytorch import Trainer
    from lightning.pytorch.profilers import PyTorchProfiler

    profiler = PyTorchProfiler(emit_nvtx=True)
    trainer = Trainer(profiler=profiler)

Then run as follows:

.. code-block::

    nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

To visualize the profiled operations, you can either use **nvvp**:

.. code-block::

    nvvp trace_name.prof

or Python:

.. code-block::

    python -c 'import torch; print(torch.autograd.profiler.load_nvprof("trace_name.prof"))'
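The one-liner above prints the raw event list. Since ``load_nvprof`` returns the recorded events as a Python object, you can also summarize them from a script. A small sketch, assuming the ``trace_name.prof`` file produced by the ``nvprof`` command above; the ``sort_by`` key is an assumption based on the standard autograd profiler columns:

.. code-block:: python

    import torch

    # Load the nvprof trace exported above and print a summary table of the recorded events,
    # sorted by total CUDA time (sort key assumed from the standard autograd profiler columns)
    events = torch.autograd.profiler.load_nvprof("trace_name.prof")
    print(events.table(sort_by="cuda_time_total"))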