Track and Visualize Experiments (advanced)¶
Audience: Users who want to do advanced speed optimizations by customizing the logging behavior.
Change progress bar defaults¶
To change the default values (e.g. the version number) shown in the progress bar, override the get_metrics()
method of your progress bar callback.
from lightning.pytorch.callbacks import TQDMProgressBar


class CustomProgressBar(TQDMProgressBar):
    def get_metrics(self, *args, **kwargs):
        # don't show the version number
        items = super().get_metrics(*args, **kwargs)
        items.pop("v_num", None)
        return items
Customize tracking to speed up model¶
Modify logging frequency¶
Logging a metric on every single batch can slow down training. By default, Lightning logs every 50 training steps.
To change this behaviour, set the log_every_n_steps Trainer flag.
k = 10
trainer = Trainer(log_every_n_steps=k)
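To make the effect of log_every_n_steps concrete, here is a minimal, framework-free sketch. It assumes that a step is logged when (step + 1) is divisible by the interval; the helper name logged_steps is hypothetical, and this is an illustration rather than Lightning's actual implementation.

```python
def logged_steps(total_steps: int, log_every_n_steps: int) -> list:
    """Return the zero-based step indices at which a log entry would be written.

    Assumption (for illustration only): a step is logged when
    (step + 1) % log_every_n_steps == 0.
    """
    if log_every_n_steps < 1:
        raise ValueError("log_every_n_steps must be a positive integer")
    return [s for s in range(total_steps) if (s + 1) % log_every_n_steps == 0]


# Over 200 training steps, an interval of 50 writes 4 entries,
# while an interval of 10 writes 20.
default_steps = logged_steps(200, 50)
frequent_steps = logged_steps(200, 10)
```

A smaller interval gives finer-grained curves at the cost of more logging work per epoch.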
Modify flushing frequency¶
Some loggers keep logged metrics in memory for N steps and only periodically flush them to disk to improve training efficiency. Every logger handles this a bit differently. For example, here is how to fine-tune flushing for the TensorBoard logger:
# Default used by TensorBoard: Write to disk after 10 logging events or every two minutes
logger = TensorBoardLogger(..., max_queue=10, flush_secs=120)
# Faster training, more memory used
logger = TensorBoardLogger(..., max_queue=100)
# Slower training, less memory used
logger = TensorBoardLogger(..., max_queue=1)
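The trade-off behind max_queue can be sketched with a toy in-memory buffer. BufferedWriter and count_flushes below are hypothetical names used only for illustration; this is not how TensorBoard is implemented, and the time-based flush (flush_secs) is left out.

```python
class BufferedWriter:
    """Hypothetical writer that keeps events in memory until the queue is full."""

    def __init__(self, max_queue: int = 10) -> None:
        self.max_queue = max_queue
        self.queue = []
        self.flush_count = 0  # number of simulated disk writes

    def add_scalar(self, name: str, value: float, step: int) -> None:
        self.queue.append((name, value, step))
        if len(self.queue) >= self.max_queue:
            self.flush()

    def flush(self) -> None:
        # A real logger would write the queued events to disk here.
        if self.queue:
            self.flush_count += 1
            self.queue.clear()


def count_flushes(max_queue: int, num_events: int = 100) -> int:
    """Log num_events scalars and return how many flushes were needed."""
    writer = BufferedWriter(max_queue=max_queue)
    for step in range(num_events):
        writer.add_scalar("loss", 0.0, step)
    writer.flush()  # write whatever is left at the end
    return writer.flush_count
```

In this sketch, a larger queue means fewer flushes for the same number of events, but more events held in memory between flushes.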
Customize self.log¶
The LightningModule self.log method offers many configurations to customize its behavior.
add_dataloader_idx¶
Default: True
If True, appends the index of the current dataloader to the name (when using multiple dataloaders). If False, you need to give the metrics of each dataloader unique names so that their values are not mixed.
self.log(..., add_dataloader_idx=True)
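The naming behavior can be sketched without Lightning. The helper metric_key is hypothetical, and the suffix format "/dataloader_idx_{i}" is an assumption made for illustration; check the keys your logger actually receives.

```python
from typing import Optional


def metric_key(name: str, dataloader_idx: Optional[int], add_dataloader_idx: bool = True) -> str:
    """Build the key under which a metric would be stored.

    Assumption: the dataloader index is appended as "/dataloader_idx_{i}".
    """
    if add_dataloader_idx and dataloader_idx is not None:
        return f"{name}/dataloader_idx_{dataloader_idx}"
    return name


# Two validation dataloaders logging the same metric name.
with_idx = [metric_key("val_loss", i) for i in range(2)]
without_idx = [metric_key("val_loss", i, add_dataloader_idx=False) for i in range(2)]
```

With the index appended, the two keys stay distinct; without it, both dataloaders write to the same key, which is why unique names are needed in that case.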
batch_size¶
Default: None
Current batch size, used for accumulating logs logged with on_epoch=True. It is inferred directly from the loaded batch, but for some data structures you might need to provide it explicitly.
self.log(..., batch_size=32)
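Why the batch size matters for epoch-level accumulation can be shown with plain arithmetic. The sketch below assumes a mean reduction weighted by batch size; epoch_mean is a hypothetical helper, not a Lightning function.

```python
def epoch_mean(step_values, batch_sizes):
    """Average per-step values, weighting each step by its batch size."""
    total = sum(batch_sizes)
    return sum(v * n for v, n in zip(step_values, batch_sizes)) / total


# The last batch of an epoch is often smaller than the others.
values = [1.0, 1.0, 4.0]
sizes = [32, 32, 16]

weighted = epoch_mean(values, sizes)    # (32 + 32 + 64) / 80
unweighted = sum(values) / len(values)  # (1 + 1 + 4) / 3
```

If the batch size is inferred incorrectly, the epoch value drifts toward the unweighted mean or some other wrong weighting, which is the case where passing batch_size explicitly helps.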
logger¶
Default: True
Send logs to the logger, such as TensorBoard or any other custom logger passed to the Trainer.
self.log(..., logger=True)
on_epoch¶
Default: It varies
If this is True, that specific self.log call accumulates the metric and reduces it at the end of the epoch.
self.log(..., on_epoch=True)
The default value depends on the function in which self.log is called:
def training_step(self, batch, batch_idx):
    # Default: False
    self.log(..., on_epoch=False)


def validation_step(self, batch, batch_idx):
    # Default: True
    self.log(..., on_epoch=True)


def test_step(self, batch, batch_idx):
    # Default: True
    self.log(..., on_epoch=True)
on_step¶
Default: It varies
If this is True, that specific self.log call will NOT accumulate metrics. Instead, it will generate a timeseries across steps.
self.log(..., on_step=True)
The default value depends on the function in which self.log is called:
def training_step(self, batch, batch_idx):
    # Default: True
    self.log(..., on_step=True)


def validation_step(self, batch, batch_idx):
    # Default: False
    self.log(..., on_step=False)


def test_step(self, batch, batch_idx):
    # Default: False
    self.log(..., on_step=False)
prog_bar¶
Default: False
If set to True, logs will be sent to the progress bar.
self.log(..., prog_bar=True)
rank_zero_only¶
Default: False
Set this to True when self.log is called only from rank 0. This prevents synchronization, which would otherwise produce a deadlock because not all processes would perform this log call.
self.log(..., rank_zero_only=True)
reduce_fx¶
Default: torch.mean
Reduction function applied over step values at the end of the epoch. Uses torch.mean by default and is not applied when a torchmetrics.Metric is logged.
self.log(..., reduce_fx=torch.mean)
sync_dist¶
Default: False
If True, reduces the metric across devices. Use with care as this may lead to a significant communication overhead.
self.log(..., sync_dist=False)
sync_dist_group¶
Default: None
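The DDP group to sync across. It only has an effect together with sync_dist=True.
import torch.distributed as dist

# new_group() returns a process group handle, whereas init_process_group() returns None.
# This assumes the default process group has already been initialized.
group = dist.new_group()
self.log(..., sync_dist=True, sync_dist_group=group)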
Enable metrics for distributed training¶
For certain types of metrics that need complex aggregation, we recommend building your metric with torchmetrics, which ensures that all the complexities of metric aggregation in distributed environments are handled.
First, implement your metric:
import torch
from torchmetrics import Metric


class MyAccuracy(Metric):
    def __init__(self, dist_sync_on_step=False):
        # call `self.add_state` for every internal state that is needed for the metric's computations
        # dist_reduce_fx indicates the function that should be used to reduce
        # state from multiple processes
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        # update metric states
        # convert scores of shape (N, C) to predicted class labels of shape (N,)
        if preds.ndim == target.ndim + 1:
            preds = preds.argmax(dim=-1)
        assert preds.shape == target.shape
        self.correct += torch.sum(preds == target)
        self.total += target.numel()

    def compute(self):
        # compute final result
        return self.correct.float() / self.total
To use the metric inside Lightning: 1) initialize it in __init__, 2) compute the metric, 3) pass it into self.log.
from lightning.pytorch import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        # 1. initialize the metric
        self.accuracy = MyAccuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        # 2. compute the metric
        self.accuracy(preds, y)
        # 3. log it
        self.log("train_acc_step", self.accuracy)
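The reason for keeping "correct" and "total" as sum-reduced states, rather than averaging per-process accuracies, can be illustrated with a small simulation. No real processes are involved here; the per-rank counts and the helper reduce_sum are hypothetical and only mimic what a "sum" reduction across processes would produce.

```python
def reduce_sum(per_rank_states):
    """Combine per-process states by summing each entry, like dist_reduce_fx="sum"."""
    return {
        "correct": sum(s["correct"] for s in per_rank_states),
        "total": sum(s["total"] for s in per_rank_states),
    }


# Hypothetical states on two processes that saw different amounts of data.
rank_states = [
    {"correct": 9, "total": 10},
    {"correct": 45, "total": 90},
]

merged = reduce_sum(rank_states)
global_accuracy = merged["correct"] / merged["total"]  # 54 / 100

# Naively averaging the per-process accuracies gives a different number.
mean_of_rank_accuracies = sum(s["correct"] / s["total"] for s in rank_states) / len(rank_states)
```

Reducing the raw counts first and computing the ratio afterwards yields the accuracy over all samples, whereas the naive average over-weights the process with fewer samples.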
Log to a custom cloud filesystem¶
Lightning supports local filesystems as well as several cloud storage providers such as S3 on AWS, GCS on Google Cloud, or ADL on Azure.
PyTorch Lightning uses fsspec internally to handle all filesystem operations.
To save logs to a remote filesystem, prepend a protocol like "s3://" to the directory path used for writing and reading logs (for example, the save_dir of the logger).
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir="s3://my_bucket/logs/")
trainer = Trainer(logger=logger)
trainer.fit(model)
Track both step and epoch metrics¶
To track the timeseries over steps (on_step) as well as the accumulated epoch metric (on_epoch), set both to True:
self.log(..., on_step=True, on_epoch=True)
Setting both to True generates two graphs: one with the suffix _step for the timeseries over steps, and one with the suffix _epoch for the epoch metric.
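As a rough illustration of the two resulting series, the sketch below simulates one epoch in plain Python. The helper log_both is hypothetical, and the epoch value is computed as a simple mean under the assumption that all batches have the same size.

```python
from collections import defaultdict


def log_both(name, step_values):
    """Simulate logging one epoch with on_step=True and on_epoch=True."""
    series = defaultdict(list)
    for value in step_values:
        # one point per step
        series[f"{name}_step"].append(value)
    # one accumulated point per epoch (plain mean, equal batch sizes assumed)
    series[f"{name}_epoch"].append(sum(step_values) / len(step_values))
    return dict(series)


curves = log_both("train_loss", [0.9, 0.7, 0.5])
```

Here curves contains a three-point "train_loss_step" series and a single-point "train_loss_epoch" series.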
Understand self.log automatic behavior¶
These tables show the default values of on_step and on_epoch depending on the LightningModule or Callback method.
In LightningModule¶
Method | on_step | on_epoch |
---|---|---|
on_after_backward, on_before_backward, on_before_optimizer_step, optimizer_step, configure_gradient_clipping, on_before_zero_grad, training_step | True | False |
test_step, validation_step | False | True |
In Callback¶
Method | on_step | on_epoch |
---|---|---|
on_after_backward, on_before_backward, on_before_optimizer_step, on_before_zero_grad, on_train_batch_start, on_train_batch_end | True | False |
on_train_epoch_start, on_train_epoch_end, on_train_start, on_validation_batch_start, on_validation_batch_end, on_validation_start, on_validation_epoch_start, on_validation_epoch_end | False | True |
Note
To add logging to an unsupported method, please open an issue with a clear description of why it is blocking you.