CLIP Score

Module Interface

class torchmetrics.multimodal.clip_score.CLIPScore(model_name_or_path='openai/clip-vit-large-patch14', **kwargs)[source]

Calculates CLIP Score, which is a text-to-image similarity metric.

CLIP Score is a reference free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image, as well as the similarity between texts or images. It has been found to be highly correlated with human judgement. The metric is defined as:

CLIPScore(I, C) = max(100 * cos(E_I, E_C), 0)

which corresponds to the cosine similarity between the visual CLIP embedding E_I for an image I and the textual CLIP embedding E_C for a caption C. The score is bounded between 0 and 100, and the closer to 100 the better.

Additionally, the CLIP Score can be calculated between inputs of the same modality:

CLIPScore(I_1, I_2) = max(100 * cos(E_I1, E_I2), 0)

where E_I1 and E_I2 are the visual embeddings for images I_1 and I_2.

CLIPScore(T_1, T_2) = max(100 * cos(E_T1, E_T2), 0)

where E_T1 and E_T2 are the textual embeddings for texts T_1 and T_2.
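
The scoring rule itself can be written down directly. Below is a minimal sketch in plain PyTorch, using random placeholder vectors in place of the CLIP embeddings E_I and E_C; the metric itself obtains these embeddings from the CLIP image and text encoders, so this only illustrates the clamped, scaled cosine similarity, not the metric's preprocessing:

>>> import torch
>>> import torch.nn.functional as F
>>> # placeholder vectors standing in for the CLIP embeddings E_I and E_C
>>> e_i = torch.randn(1, 512)
>>> e_c = torch.randn(1, 512)
>>> # clamped, scaled cosine similarity: max(100 * cos(E_I, E_C), 0)
>>> score = (100 * F.cosine_similarity(e_i, e_c, dim=-1)).clamp(min=0)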

Caution

Metric is not scriptable

As input to forward and update the metric accepts the following input

  • source: Source input.

    This can be:

    • Images: Tensor or list of Tensor

      If a single tensor, it should have shape (N, C, H, W). If a list of tensors, each tensor should have shape (C, H, W). C is the number of channels, H and W are the height and width of the image.

    • Text: str or list of str

      Either a single caption or a list of captions.

  • target: Target input.

    This can be:

    • Images: Tensor or list of Tensor

      If a single tensor, it should have shape (N, C, H, W). If a list of tensors, each tensor should have shape (C, H, W). C is the number of channels, H and W are the height and width of the image.

    • Text: str or list of str

      Either a single caption or a list of captions.

As output of forward and compute the metric returns the following output

  • clip_score (Tensor): float scalar tensor with mean CLIP score over samples
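
For example, a batch of images can be paired with a list of captions in a single forward call, and the returned scalar is the mean score over the image-caption pairs. A minimal sketch (expected output omitted, as it depends on the model weights):

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> images = torch.randint(255, (2, 3, 224, 224))   # batch of 2 images in (N, C, H, W) format
>>> captions = ["a photo of a cat", "a photo of a dog"]
>>> mean_score = metric(images, captions)           # scalar tensor: mean CLIP score over the 2 pairs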

Parameters:
  • model_name_or_path (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) –

    string indicating the version of the CLIP model to use. Available models are:

    • "openai/clip-vit-base-patch16"

    • "openai/clip-vit-base-patch32"

    • "openai/clip-vit-large-patch14-336"

    • "openai/clip-vit-large-patch14"

  • kwargs (Any) – Additional keyword arguments, see Advanced metric settings for more info.

Raises:

ModuleNotFoundError – If the transformers package is not installed or its version is lower than 4.10.0

Example

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> image = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> score = metric(image, "a photo of a cat")
>>> score.detach().round()
tensor(24.)

Example

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> image1 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> image2 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(43))
>>> score = metric(image1, image2)
>>> score.detach().round()
tensor(99.)

Example

>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> score = metric("28-year-old chef found dead in San Francisco mall",
...               "A 28-year-old chef who recently moved to San Francisco was found dead.")
>>> score.detach().round()
tensor(91.)
compute()[source]

Compute accumulated CLIP score.

Return type:

Tensor
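
A minimal sketch of the update/compute lifecycle, accumulating over two batches before reading out the mean score (expected output omitted, as it depends on the model weights):

>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> gen = torch.Generator().manual_seed(42)
>>> metric.update(torch.randint(255, (3, 224, 224), generator=gen), "a photo of a cat")
>>> metric.update(torch.randint(255, (3, 224, 224), generator=gen), "a photo of a dog")
>>> accumulated = metric.compute()   # mean CLIP score over both update calls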

plot(val=None, ax=None)[source]

Plot a single or multiple values from the metric.

Parameters:
  • val (Union[Tensor, Sequence[Tensor], None]) – Either a single result from calling metric.forward or metric.compute or a list of these results. If no value is provided, will automatically call metric.compute and plot that result.

  • ax (Optional[Axes]) – A matplotlib axis object. If provided, will add the plot to that axis

Return type:

tuple[Figure, Union[Axes, ndarray]]

Returns:

Figure and Axes object

Raises:

ModuleNotFoundError – If matplotlib is not installed

>>> # Example plotting a single value
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> metric.update(torch.randint(255, (3, 224, 224)), "a photo of a cat")
>>> fig_, ax_ = metric.plot()
(figure: plot of a single CLIP score value)
>>> # Example plotting multiple values
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> values = [ ]
>>> for _ in range(10):
...     values.append(metric(torch.randint(255, (3, 224, 224)), "a photo of a cat"))
>>> fig_, ax_ = metric.plot(values)
(figure: plot of multiple CLIP score values)
update(source, target)[source]

Update CLIP score on a batch of images and text.

Parameters:
  • source (Union[Tensor, List[Tensor], List[str], str]) – Source input. This can be:

    • Images: Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors.

    • Text: Either a single caption or a list of captions.

  • target (Union[Tensor, List[Tensor], List[str], str]) – Target input. This can be:

    • Images: Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors.

    • Text: Either a single caption or a list of captions.

Raises:
  • ValueError – If not all images have format [C, H, W]

  • ValueError – If the number of images and captions do not match

Return type:

None

Functional Interface

torchmetrics.functional.multimodal.clip_score.clip_score(source, target, model_name_or_path='openai/clip-vit-large-patch14')[source]

Calculates CLIP Score, which is a text-to-image similarity metric.

CLIP Score is a reference free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image, as well as the similarity between texts or images. It has been found to be highly correlated with human judgement. The metric is defined as:

CLIPScore(I, C) = max(100 * cos(E_I, E_C), 0)

which corresponds to the cosine similarity between the visual CLIP embedding E_I for an image I and the textual CLIP embedding E_C for a caption C. The score is bounded between 0 and 100, and the closer to 100 the better.

Additionally, the CLIP Score can be calculated between inputs of the same modality:

CLIPScore(I_1, I_2) = max(100 * cos(E_I1, E_I2), 0)

where E_I1 and E_I2 are the visual embeddings for images I_1 and I_2.

CLIPScore(T_1, T_2) = max(100 * cos(E_T1, E_T2), 0)

where E_T1 and E_T2 are the textual embeddings for texts T_1 and T_2.

Note

Metric is not scriptable

Parameters:
  • source (Union[Tensor, List[Tensor], List[str], str]) – Source input. This can be:

    • Images: Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors.

    • Text: Either a single caption or a list of captions.

  • target (Union[Tensor, List[Tensor], List[str], str]) – Target input. This can be:

    • Images: Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors.

    • Text: Either a single caption or a list of captions.

  • model_name_or_path (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) – String indicating the version of the CLIP model to use. Available models are:

    • "openai/clip-vit-base-patch16"

    • "openai/clip-vit-base-patch32"

    • "openai/clip-vit-large-patch14-336"

    • "openai/clip-vit-large-patch14"

Raises:
  • ModuleNotFoundError – If the transformers package is not installed or its version is lower than 4.10.0

  • ValueError – If not all images have format [C, H, W]

  • ValueError – If the number of images and captions do not match

Return type:

Tensor

Example

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> image = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> score = clip_score(image, "a photo of a cat", "openai/clip-vit-base-patch16")
>>> score.detach()
tensor(24.4255)

Example

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> image1 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> image2 = torch.randint(255, (3, 224, 224), generator=torch.Generator().manual_seed(43))
>>> score = clip_score(image1, image2, "openai/clip-vit-base-patch16")
>>> score.detach()
tensor(99.4859)

Example

>>> from torchmetrics.functional.multimodal import clip_score
>>> score = clip_score(
...     "28-year-old chef found dead in San Francisco mall",
...     "A 28-year-old chef who recently moved to San Francisco was found dead.",
...     "openai/clip-vit-base-patch16"
... )
>>> score.detach()
tensor(91.3950)
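
The functional interface accepts the same batched inputs as the module interface. A minimal sketch with an (N, C, H, W) tensor and a matching list of N captions (expected output omitted, as it depends on the model weights):

>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> images = torch.randint(255, (2, 3, 224, 224), generator=torch.Generator().manual_seed(42))
>>> captions = ["a photo of a cat", "a photo of a dog"]
>>> score = clip_score(images, captions, "openai/clip-vit-base-patch16")   # mean score over the 2 pairs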
