gpu_stats_monitor

Classes

GPUStatsMonitor

Automatically monitors and logs GPU stats during the training stage.

GPU Stats Monitor

Monitors and logs GPU stats during training.

class pytorch_lightning.callbacks.gpu_stats_monitor.GPUStatsMonitor(memory_utilization=True, gpu_utilization=True, intra_step_time=False, inter_step_time=False, fan_speed=False, temperature=False)

Bases: pytorch_lightning.callbacks.base.Callback

Automatically monitors and logs GPU stats during the training stage. GPUStatsMonitor is a callback; to use it, the Trainer must have a logger assigned.

Parameters
  • memory_utilization (bool) – Set to True to monitor used memory, free memory, and the percentage of memory utilization at the start and end of each step. Default: True.

  • gpu_utilization (bool) – Set to True to monitor the percentage of GPU utilization at the start and end of each step. Default: True.

  • intra_step_time (bool) – Set to True to monitor the duration of each step. Default: False.

  • inter_step_time (bool) – Set to True to monitor the time between the end of one step and the start of the next step. Default: False.

  • fan_speed (bool) – Set to True to monitor the fan speed as a percentage. Default: False.

  • temperature (bool) – Set to True to monitor the memory and GPU temperature in degrees Celsius. Default: False.

Raises

MisconfigurationException – If the NVIDIA driver is not installed, training is not running on GPUs, or the Trainer has no logger.

Example:

>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GPUStatsMonitor
>>> gpu_stats = GPUStatsMonitor() 
>>> trainer = Trainer(callbacks=[gpu_stats]) 

GPU stats are mainly based on the nvidia-smi --query-gpu command. The queries are described as follows:

  • fan.speed – The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.

  • memory.used – Total memory allocated by active contexts.

  • memory.free – Total free memory.

  • utilization.gpu – Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

  • utilization.memory – Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.

  • temperature.gpu – Core GPU temperature, in degrees C.

  • temperature.memory – HBM memory temperature, in degrees C.
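The queries listed above can be requested from nvidia-smi directly. The following is a minimal sketch of how that might look, not the callback's actual implementation: the helper names (select_queries, build_command, parse_output, query_gpus) are hypothetical, and the mapping from constructor flags to query names is an assumption inferred from the parameter descriptions. Only the nvidia-smi options --query-gpu, --format and --id are taken from the tool itself.

```python
import shutil
import subprocess


def select_queries(memory_utilization=True, gpu_utilization=True,
                   fan_speed=False, temperature=False):
    """Map monitor flags to nvidia-smi query names (assumed mapping)."""
    queries = []
    if gpu_utilization:
        queries.append("utilization.gpu")
    if memory_utilization:
        queries += ["memory.used", "memory.free", "utilization.memory"]
    if fan_speed:
        queries.append("fan.speed")
    if temperature:
        queries += ["temperature.gpu", "temperature.memory"]
    return queries


def build_command(queries, gpu_ids):
    """Build the nvidia-smi argument list for the given queries and GPU ids."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(queries),
        "--format=csv,nounits,noheader",
        "--id=" + ",".join(str(i) for i in gpu_ids),
    ]


def _to_float(value):
    """Convert a field to float; unsupported fields such as '[N/A]' become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_output(text, queries):
    """Turn CSV output (one line per GPU) into a list of {query: value} dicts."""
    rows = []
    for line in text.strip().splitlines():
        values = [_to_float(v.strip()) for v in line.split(",")]
        rows.append(dict(zip(queries, values)))
    return rows


def query_gpus(queries, gpu_ids):
    """Run nvidia-smi and return the parsed stats; requires an NVIDIA driver."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi was not found on PATH")
    result = subprocess.run(
        build_command(queries, gpu_ids),
        capture_output=True, text=True, check=True,
    )
    return parse_output(result.stdout, queries)
```

On a machine with an NVIDIA driver, query_gpus(select_queries(), [0]) would return one dictionary for GPU 0. Fields that a device does not report (fan speed on many parts, for example) are mapped to None here rather than raising an error.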

on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx)

Called when the train batch ends.

Return type

None

on_train_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx)

Called when the train batch begins.

Return type

None

on_train_epoch_start(trainer, pl_module)

Called when the train epoch begins.

Return type

None
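The intra_step_time and inter_step_time options can be understood in terms of the batch and epoch hooks listed above. The sketch below illustrates one way such timings could be measured; it is an assumption about the general approach, not the callback's source. The StepTimer class, its method names, the returned dictionary keys, and the choice of seconds as the unit are all hypothetical.

```python
import time


class StepTimer:
    """Hypothetical helper that mirrors the batch/epoch hooks; not part of pytorch_lightning."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable clock, handy for testing
        self._batch_start = None     # time the current step started
        self._batch_end = None       # time the previous step ended

    def on_epoch_start(self):
        # Reset so the gap between epochs is not reported as inter-step time.
        self._batch_start = None
        self._batch_end = None

    def on_batch_start(self):
        now = self._clock()
        logs = {}
        if self._batch_end is not None:
            # Time between the end of the previous step and the start of this one.
            logs["inter_step_time"] = now - self._batch_end
        self._batch_start = now
        return logs

    def on_batch_end(self):
        now = self._clock()
        logs = {}
        if self._batch_start is not None:
            # Duration of the step that just finished.
            logs["intra_step_time"] = now - self._batch_start
        self._batch_end = now
        return logs
```

Calling on_batch_start and on_batch_end around each training step yields the step duration at the end of every step, and the idle time between steps from the second step of an epoch onward.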

setup(trainer, pl_module, stage=None)

Called when fit, validate, test, predict, or tune begins.

Return type

None