CLIP Score

Module Interface

class torchmetrics.multimodal.clip_score.CLIPScore(model_name_or_path='openai/clip-vit-large-patch14', **kwargs)[source]

Calculates CLIP Score, a text-to-image similarity metric.

CLIP Score is a reference-free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image, as well as the similarity between texts or images. It has been found to be highly correlated with human judgement. The metric is defined as:

\[\text{CLIPScore}(I, C) = \max(100 \cdot \cos(E_I, E_C), 0)\]

which corresponds to the cosine similarity between the visual CLIP embedding \(E_I\) for an image \(I\) and the textual CLIP embedding \(E_C\) for a caption \(C\). The score is bounded between 0 and 100, and the closer to 100 the better.

Additionally, the CLIP Score can be calculated between inputs of the same modality:

\[\text{CLIPScore}(I_1, I_2) = \max(100 \cdot \cos(E_{I_1}, E_{I_2}), 0)\]

where \(E_{I_1}\) and \(E_{I_2}\) are the visual embeddings for images \(I_1\) and \(I_2\).

\[\text{CLIPScore}(T_1, T_2) = \max(100 \cdot \cos(E_{T_1}, E_{T_2}), 0)\]

where \(E_{T_1}\) and \(E_{T_2}\) are the textual embeddings for texts \(T_1\) and \(T_2\).
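In all three cases the score reduces to a clamped, scaled cosine similarity between two embedding vectors. A minimal sketch of that final step (the 512-dimensional random vectors below merely stand in for CLIP embeddings and are not produced by the model):

>>> import torch
>>> import torch.nn.functional as F
>>> e_1, e_2 = torch.randn(512), torch.randn(512)  # stand-ins for a pair of CLIP embeddings
>>> score = torch.clamp(100 * F.cosine_similarity(e_1, e_2, dim=0), min=0)  # max(100 * cos, 0)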

Caution

Metric is not scriptable

As input to forward and update the metric accepts the following input

  • source: Source input. This can be:
    • Images (Tensor or list of tensors): images fed to the feature extractor. If a single tensor, it should have shape (N, C, H, W); if a list of tensors, each tensor should have shape (C, H, W). C is the number of channels, H and W are the height and width of the image (a batched usage sketch follows the output description).

    • Text (str or list of str): text to compare with the images, one for each image.

  • target: Target input. This can be:
    • Images (Tensor or list of tensors): images fed to the feature extractor. If a single tensor, it should have shape (N, C, H, W); if a list of tensors, each tensor should have shape (C, H, W). C is the number of channels, H and W are the height and width of the image.

    • Text (str or list of str): text to compare with the images, one for each image.

As output of forward and compute the metric returns the following output

  • clip_score (Tensor): float scalar tensor with mean CLIP score over samples
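Inputs may also be batched. A hedged sketch passing a batch of two images together with one caption per image (the shapes, seed, and captions are illustrative, not taken from the examples below):

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> images = torch.randint(255, (2, 3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> captions = ["a photo of a cat", "a photo of a dog"]
>>> batch_score = metric(images, captions)  # mean CLIP score over the two image-caption pairs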

Parameters:
  • model_name_or_path (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) –

    string indicating the version of the CLIP model to use. Available models are:

    • "openai/clip-vit-base-patch16"

    • "openai/clip-vit-base-patch32"

    • "openai/clip-vit-large-patch14-336"

    • "openai/clip-vit-large-patch14"

  • kwargs (Any) – Additional keyword arguments, see Advanced metric settings for more info.

Raises:

ModuleNotFoundError – If the transformers package is not installed or its version is lower than 4.10.0

Example

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> image = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> score = metric(image, "a photo of a cat")
>>> score.detach().round()
tensor(24.)

Example

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> image1 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> image2 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(43))
>>> score = metric(image1, image2)
>>> score.detach().round()
tensor(99.)

Example

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> score = metric("28-year-old chef found dead in San Francisco mall",
...               "A 28-year-old chef who recently moved to San Francisco was found dead.")
>>> score.detach().round()
tensor(91.)
compute()[source]

Compute the accumulated CLIP score.

Return type:

Tensor
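Because the metric accumulates state, compute returns the mean score over everything passed to update so far. A hedged sketch of that accumulation (batch sizes, seed, and captions are illustrative):

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> gen = torch.Generator().manual_seed(42)
>>> for _ in range(3):
...     metric.update(torch.randint(255, (2, 3, 224, 224), generator=gen), ["a photo of a cat", "a photo of a dog"])
>>> running_score = metric.compute()  # mean score over all six image-caption pairs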

plot(val=None, ax=None)[source]

Plot a single or multiple values from the metric.

Parameters:
  • val (Union[Tensor, Sequence[Tensor], None]) – Either a single result from calling metric.forward or metric.compute or a list of these results. If no value is provided, will automatically call metric.compute and plot that result.

  • ax (Optional[Axes]) – A matplotlib axis object. If provided, the plot will be added to that axis.

Return type:

tuple[Figure, Union[Axes, ndarray]]

Returns:

Figure and Axes object

Raises:

ModuleNotFoundError – If matplotlib is not installed

>>> # Example plotting a single value
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> metric.update(torch.randint(255, (3, 224, 224)), "a photo of a cat")
>>> fig_, ax_ = metric.plot()
>>> # Example plotting multiple values
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> values = [ ]
>>> for _ in range(10):
...     values.append(metric(torch.randint(255, (3, 224, 224)), "a photo of a cat"))
>>> fig_, ax_ = metric.plot(values)
update(source, target)[source]

Update CLIP score on a batch of images and text.

Parameters:
  • source (Union[Tensor, List[Tensor], List[str], str]) – Source input. Either images (a single [N, C, H, W] tensor or a list of [C, H, W] tensors) or text (a single caption or a list of captions).

  • target (Union[Tensor, List[Tensor], List[str], str]) – Target input. Either images (a single [N, C, H, W] tensor or a list of [C, H, W] tensors) or text (a single caption or a list of captions).

Raises:
  • ValueError – If not all images have format [C, H, W]

  • ValueError – If the number of images and captions do not match

Return type:

None

Functional Interface

torchmetrics.functional.multimodal.clip_score.clip_score(source, target, model_name_or_path='openai/clip-vit-large-patch14')[source]

Calculates CLIP Score, a text-to-image similarity metric.

CLIP Score is a reference-free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image, as well as the similarity between texts or images. It has been found to be highly correlated with human judgement. The metric is defined as:

\[\text{CLIPScore}(I, C) = \max(100 \cdot \cos(E_I, E_C), 0)\]

which corresponds to the cosine similarity between the visual CLIP embedding \(E_I\) for an image \(I\) and the textual CLIP embedding \(E_C\) for a caption \(C\). The score is bounded between 0 and 100, and the closer to 100 the better.

Additionally, the CLIP Score can be calculated between inputs of the same modality:

\[\text{CLIPScore}(I_1, I_2) = \max(100 \cdot \cos(E_{I_1}, E_{I_2}), 0)\]

where \(E_{I_1}\) and \(E_{I_2}\) are the visual embeddings for images \(I_1\) and \(I_2\).

\[\text{CLIPScore}(T_1, T_2) = \max(100 \cdot \cos(E_{T_1}, E_{T_2}), 0)\]

where \(E_{T_1}\) and \(E_{T_2}\) are the textual embeddings for texts \(T_1\) and \(T_2\).

Note

Metric is not scriptable

Parameters:
  • source (Union[Tensor, List[Tensor], List[str], str]) – Source input. Either images (a single [N, C, H, W] tensor or a list of [C, H, W] tensors) or text (a single caption or a list of captions).

  • target (Union[Tensor, List[Tensor], List[str], str]) – Target input. Either images (a single [N, C, H, W] tensor or a list of [C, H, W] tensors) or text (a single caption or a list of captions).

  • model_name_or_path (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) – String indicating the version of the CLIP model to use. Available models are:

    • "openai/clip-vit-base-patch16"

    • "openai/clip-vit-base-patch32"

    • "openai/clip-vit-large-patch14-336"

    • "openai/clip-vit-large-patch14"

Raises:
  • ModuleNotFoundError – If the transformers package is not installed or its version is lower than 4.10.0

  • ValueError – If not all images have format [C, H, W]

  • ValueError – If the number of images and captions do not match

Return type:

Tensor

Example

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> image = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> score = clip_score(image, "a photo of a cat", "openai/clip-vit-base-patch16")
>>> score.detach()
tensor(24.4255)

Example

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> image1 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> image2 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(43))
>>> score = clip_score(image1, image2, "openai/clip-vit-base-patch16")
>>> score.detach()
tensor(99.4859)

Example

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> score = clip_score(
...     "28-year-old chef found dead in San Francisco mall",
...     "A 28-year-old chef who recently moved to San Francisco was found dead.",
...     "openai/clip-vit-base-patch16"
... )
>>> score.detach()
tensor(91.3950)
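
The functional interface accepts the same batched inputs as the module; a hedged sketch with a batch of two images and one caption per image (shapes, seed, and captions are illustrative):

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> images = torch.randint(255, (2, 3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> score = clip_score(images, ["a photo of a cat", "a photo of a dog"], "openai/clip-vit-base-patch16")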