Permutation Invariant Training (PIT)

Module Interface

class torchmetrics.audio.PermutationInvariantTraining(metric_func, mode='speaker-wise', eval_func='max', **kwargs)[source]

Calculate the Permutation Invariant Training (PIT) metric.

This metric can evaluate models for speaker-independent multi-talker speech separation in a permutation-invariant way.
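Conceptually, the wrapped metric is evaluated under every possible assignment of predicted speakers to target speakers, and the best-scoring assignment is kept for each sample. A minimal, illustrative sketch of this idea (not the library's internal implementation, which uses more efficient search strategies):

>>> from itertools import permutations
>>> import torch
>>> def naive_pit(preds, target, metric_func):
...     # preds, target: (batch, num_speakers, time)
...     num_spk = preds.shape[1]
...     scores = []
...     for p in permutations(range(num_spk)):
...         # average the pairwise metric over this speaker assignment
...         pairwise = [metric_func(preds[:, pi, :], target[:, i, :])
...                     for i, pi in enumerate(p)]
...         scores.append(torch.stack(pairwise, dim=1).mean(dim=1))
...     # keep the best ('max') value per sample across all permutations
...     return torch.stack(scores, dim=1).max(dim=1).values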

As input to forward and update the metric accepts the following input:

  • preds (Tensor): float tensor with shape (batch_size,num_speakers,...)

  • target (Tensor): float tensor with shape (batch_size,num_speakers,...)

As output of forward and compute the metric returns the following output:

  • pit (Tensor): float scalar tensor with the average best metric value over samples

Parameters:
  • metric_func (Callable) –

    a metric function that accepts a batch of estimates and targets.

    If mode=='speaker-wise', then metric_func(preds[:, i, ...], target[:, j, ...]) is called for each pair of speaker indices (i, j) and is expected to return a batch of metric tensors of shape (batch,);

    if mode=='permutation-wise', then metric_func(preds[:, p, ...], target[:, :, ...]) is called, where p is one possible permutation, e.g. [0, 1] or [1, 0] for the 2-speaker case, and is expected to return a batch of metric tensors of shape (batch,). A sketch of both signatures follows this parameter list.

  • mode (Literal['speaker-wise', 'permutation-wise']) – can be ‘speaker-wise’ or ‘permutation-wise’.

  • eval_func (Literal['max', 'min']) – the function to find the best permutation, can be ‘min’ or ‘max’, i.e. the smaller the better or the larger the better.

  • kwargs (Any) – Additional keyword arguments for either the metric_func or distributed communication, see Advanced metric settings for more info.
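For illustration, here is a minimal sketch of the two metric_func signatures. The helper names pairwise_snr and permutation_snr are hypothetical, and signal_noise_ratio stands in for any metric of your choice:

>>> import torch
>>> from torchmetrics.functional.audio import signal_noise_ratio
>>> # 'speaker-wise': receives one predicted and one target speaker,
>>> # each of shape (batch, time), and returns a (batch,) tensor.
>>> def pairwise_snr(pred_spk, target_spk):
...     return signal_noise_ratio(pred_spk, target_spk)
>>> # 'permutation-wise': receives the permuted preds and the full target,
>>> # each of shape (batch, num_speakers, time); here the per-speaker
>>> # values are averaged to get one (batch,) tensor.
>>> def permutation_snr(preds_perm, target):
...     return signal_noise_ratio(preds_perm, target).mean(dim=-1)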

Example

>>> from torch import randn
>>> from torchmetrics.audio import PermutationInvariantTraining
>>> from torchmetrics.functional.audio import scale_invariant_signal_noise_ratio
>>> preds = randn(3, 2, 5) # [batch, spk, time]
>>> target = randn(3, 2, 5) # [batch, spk, time]
>>> pit = PermutationInvariantTraining(scale_invariant_signal_noise_ratio,
...     mode="speaker-wise", eval_func="max")
>>> pit(preds, target)
tensor(-2.1065)

plot(val=None, ax=None)[source]

Plot a single or multiple values from the metric.

Parameters:
  • val (Union[Tensor, Sequence[Tensor], None]) – Either a single result from calling metric.forward or metric.compute or a list of these results. If no value is provided, will automatically call metric.compute and plot that result.

  • ax (Optional[Axes]) – A matplotlib axis object. If provided, the plot will be added to that axis.

Return type:

tuple[Figure, Union[Axes, ndarray]]

Returns:

Figure and Axes object

Raises:

ModuleNotFoundError – If matplotlib is not installed

>>> # Example plotting a single value
>>> import torch
>>> from torchmetrics.audio import PermutationInvariantTraining
>>> from torchmetrics.functional.audio import scale_invariant_signal_noise_ratio
>>> preds = torch.randn(3, 2, 5) # [batch, spk, time]
>>> target = torch.randn(3, 2, 5) # [batch, spk, time]
>>> metric = PermutationInvariantTraining(scale_invariant_signal_noise_ratio,
...     mode="speaker-wise", eval_func="max")
>>> metric.update(preds, target)
>>> fig_, ax_ = metric.plot()
[image: permutation_invariant_training-1.png]
>>> # Example plotting multiple values
>>> import torch
>>> from torchmetrics.audio import PermutationInvariantTraining
>>> from torchmetrics.functional.audio import scale_invariant_signal_noise_ratio
>>> preds = torch.randn(3, 2, 5) # [batch, spk, time]
>>> target = torch.randn(3, 2, 5) # [batch, spk, time]
>>> metric = PermutationInvariantTraining(scale_invariant_signal_noise_ratio,
...     mode="speaker-wise", eval_func="max")
>>> values = []
>>> for _ in range(10):
...     values.append(metric(preds, target))
>>> fig_, ax_ = metric.plot(values)
[image: permutation_invariant_training-2.png]

Functional Interface

torchmetrics.functional.audio.permutation_invariant_training(preds, target, metric_func, mode='speaker-wise', eval_func='max', **kwargs)[source]

Calculate the Permutation Invariant Training (PIT) metric.

This metric can evaluate models for speaker-independent multi-talker speech separation in a permutation-invariant way.

Parameters:
  • preds (Tensor) – float tensor with shape (batch_size,num_speakers,...)

  • target (Tensor) – float tensor with shape (batch_size,num_speakers,...)

  • metric_func (Callable) –

    a metric function that accepts a batch of estimates and targets. If mode=='speaker-wise', then metric_func(preds[:, i, ...], target[:, j, ...]) is called for each pair of speaker indices (i, j) and is expected to return a batch of metric tensors of shape (batch,);

    if mode=='permutation-wise', then metric_func(preds[:, p, ...], target[:, :, ...]) is called, where p is one possible permutation, e.g. [0, 1] or [1, 0] for the 2-speaker case, and is expected to return a batch of metric tensors of shape (batch,). See the permutation-wise sketch after the example below.

  • mode (Literal['speaker-wise', 'permutation-wise']) – can be ‘speaker-wise’ or ‘permutation-wise’.

  • eval_func (Literal['max', 'min']) – the function to find the best permutation, can be 'min' or 'max', i.e. the smaller the better or the larger the better.

  • kwargs (Any) – Additional keyword arguments for metric_func

Return type:

tuple[Tensor, Tensor]

Returns:

Tuple of two tensors. The first, with shape (batch,), contains the best metric value for each sample; the second, with shape (batch, num_speakers), contains the corresponding best permutation.

Example

>>> import torch
>>> from torchmetrics.functional.audio import (permutation_invariant_training,
...     pit_permutate, scale_invariant_signal_distortion_ratio)
>>> # [batch, spk, time]
>>> preds = torch.tensor([[[-0.0579,  0.3560, -0.9604], [-0.1719,  0.3205,  0.2951]]])
>>> target = torch.tensor([[[ 1.0958, -0.1648,  0.5228], [-0.4100,  1.1942, -0.5103]]])
>>> best_metric, best_perm = permutation_invariant_training(
...     preds, target, scale_invariant_signal_distortion_ratio,
...     mode="speaker-wise", eval_func="max")
>>> best_metric
tensor([-5.1091])
>>> best_perm
tensor([[0, 1]])
>>> pit_permutate(preds, best_perm)
tensor([[[-0.0579,  0.3560, -0.9604],
         [-0.1719,  0.3205,  0.2951]]])
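
The same function can be used in 'permutation-wise' mode, where metric_func receives the fully permuted preds together with the full target. A minimal sketch, assuming mean_si_sdr is a user-defined wrapper (not part of the library):

>>> import torch
>>> from torchmetrics.functional.audio import permutation_invariant_training
>>> from torchmetrics.functional.audio import scale_invariant_signal_distortion_ratio
>>> def mean_si_sdr(preds_perm, target):
...     # both inputs: (batch, num_speakers, time); average over speakers
...     return scale_invariant_signal_distortion_ratio(preds_perm, target).mean(dim=-1)
>>> preds = torch.randn(3, 2, 5)   # [batch, spk, time]
>>> target = torch.randn(3, 2, 5)  # [batch, spk, time]
>>> best_metric, best_perm = permutation_invariant_training(
...     preds, target, mean_si_sdr, mode="permutation-wise", eval_func="max")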