gpu_stats_monitor
Classes
GPUStatsMonitor – Automatically monitors and logs GPU stats during the training stage.
GPU Stats Monitor
Monitors and logs GPU stats during training.
- class pytorch_lightning.callbacks.gpu_stats_monitor.GPUStatsMonitor(memory_utilization=True, gpu_utilization=True, intra_step_time=False, inter_step_time=False, fan_speed=False, temperature=False)
Bases: pytorch_lightning.callbacks.base.Callback
Automatically monitors and logs GPU stats during training stage.
GPUStatsMonitor is a callback, and in order to use it you need to assign a logger in the Trainer.
- Parameters
memory_utilization (bool) – Set to True to monitor used memory, free memory, and percentage of memory utilization at the start and end of each step. Default: True.
gpu_utilization (bool) – Set to True to monitor percentage of GPU utilization at the start and end of each step. Default: True.
intra_step_time (bool) – Set to True to monitor the time of each step. Default: False.
inter_step_time (bool) – Set to True to monitor the time between the end of one step and the start of the next step. Default: False.
fan_speed (bool) – Set to True to monitor percentage of fan speed. Default: False.
temperature (bool) – Set to True to monitor the memory and GPU temperature in degrees Celsius. Default: False.
- Raises
MisconfigurationException – If the NVIDIA driver is not installed, training is not running on GPUs, or the Trainer has no logger.
Example:
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GPUStatsMonitor
>>> gpu_stats = GPUStatsMonitor()
>>> trainer = Trainer(callbacks=[gpu_stats])
GPU stats are mainly based on the nvidia-smi --query-gpu command. The description of the queries is as follows:
fan.speed – The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
memory.used – Total memory allocated by active contexts.
memory.free – Total free memory.
utilization.gpu – Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
utilization.memory – Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.
temperature.gpu – Core GPU temperature, in degrees C.
temperature.memory – HBM memory temperature, in degrees C.
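The query names above can be passed to nvidia-smi directly. The following is a minimal sketch of how such a query might be built and its CSV output parsed; it is not the callback's actual implementation. The helper names (select_fields, build_query_command, parse_query_output, query_gpus) and the grouping of query fields under each constructor flag are assumptions made for illustration, inferred from the parameter descriptions. Only the nvidia-smi options --query-gpu and --format=csv,nounits,noheader are taken as given.

```python
import subprocess
from typing import Dict, List

# Query names come from the list above. Which flag enables which
# query is an assumption based on the parameter descriptions.
FIELDS_BY_FLAG: Dict[str, List[str]] = {
    "memory_utilization": ["memory.used", "memory.free", "utilization.memory"],
    "gpu_utilization": ["utilization.gpu"],
    "fan_speed": ["fan.speed"],
    "temperature": ["temperature.gpu", "temperature.memory"],
}


def select_fields(**flags: bool) -> List[str]:
    """Collect the query fields for every flag that is set to True."""
    fields: List[str] = []
    for flag, names in FIELDS_BY_FLAG.items():
        if flags.get(flag, False):
            fields.extend(names)
    return fields


def build_query_command(fields: List[str]) -> List[str]:
    """Build the nvidia-smi argument list for the given query fields."""
    return [
        "nvidia-smi",
        f"--query-gpu={','.join(fields)}",
        "--format=csv,nounits,noheader",
    ]


def parse_query_output(text: str, fields: List[str]) -> List[Dict[str, str]]:
    """Turn the CSV output into one dict per GPU (one output line per GPU).

    Values are kept as strings because a device that does not support
    a field may report a placeholder instead of a number.
    """
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(fields, values)))
    return rows


def query_gpus(fields: List[str]) -> List[Dict[str, str]]:
    """Run nvidia-smi (requires an NVIDIA driver) and parse its output."""
    result = subprocess.run(
        build_query_command(fields), capture_output=True, text=True, check=True
    )
    return parse_query_output(result.stdout, fields)
```

On a machine with an NVIDIA driver, query_gpus(select_fields(memory_utilization=True, gpu_utilization=True)) would return one dictionary per GPU. The parsing step is kept separate from the subprocess call so it can be exercised without a GPU.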
- on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx)
Called when the train batch ends.
- Return type
None
- on_train_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx)
Called when the train batch begins.
- Return type
None
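The two hooks above bracket each training step, which is what makes the intra_step_time and inter_step_time measurements possible. The sketch below illustrates the difference between the two using a hypothetical StepTimer helper that is not part of pytorch_lightning. The method names, the log keys, and the choice of milliseconds are assumptions for illustration only.

```python
import time
from typing import Dict, Optional


class StepTimer:
    """Hypothetical helper showing intra-step vs. inter-step timing."""

    def __init__(self) -> None:
        self._batch_start: Optional[float] = None
        self._batch_end: Optional[float] = None

    def on_batch_start(self, now: Optional[float] = None) -> Dict[str, float]:
        """Call at the start of a step; reports the gap since the last step ended."""
        now = time.perf_counter() if now is None else now
        logs: Dict[str, float] = {}
        if self._batch_end is not None:
            # Inter-step time: end of the previous step to start of this one.
            logs["inter_step_time_ms"] = (now - self._batch_end) * 1000
        self._batch_start = now
        return logs

    def on_batch_end(self, now: Optional[float] = None) -> Dict[str, float]:
        """Call at the end of a step; reports how long the step took."""
        now = time.perf_counter() if now is None else now
        logs: Dict[str, float] = {}
        if self._batch_start is not None:
            # Intra-step time: start of this step to its end.
            logs["intra_step_time_ms"] = (now - self._batch_start) * 1000
        self._batch_end = now
        return logs
```

The optional now argument exists only so the timer can be driven with fixed timestamps. In real use, both methods would be called without arguments from the corresponding batch-start and batch-end hooks, and the returned values handed to a logger.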