CLIP Score¶
Module Interface¶
- class torchmetrics.multimodal.clip_score.CLIPScore(model_name_or_path='openai/clip-vit-large-patch14', **kwargs)[source]¶
Calculates CLIP Score which is a text-to-image similarity metric.
CLIP Score is a reference-free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image. It has been found to be highly correlated with human judgement. The metric is defined as:
\[\text{CLIPScore}(I, C) = \max(100 \cdot \cos(E_I, E_C), 0)\]
which corresponds to the cosine similarity between the visual CLIP embedding \(E_I\) for an image \(I\) and the textual CLIP embedding \(E_C\) for a caption \(C\). The score is bound between 0 and 100, and the closer to 100 the better.
Note
Metric is not scriptable
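To make the definition above concrete, the score can be reproduced (up to preprocessing details) from raw CLIP embeddings obtained with the transformers library. The following is a minimal sketch, not the metric's internal implementation; it assumes transformers is installed and uses an arbitrarily chosen checkpoint:
>>> import torch
>>> from transformers import CLIPModel, CLIPProcessor
>>> ckpt = "openai/clip-vit-base-patch16"  # illustrative choice of checkpoint
>>> model = CLIPModel.from_pretrained(ckpt)
>>> processor = CLIPProcessor.from_pretrained(ckpt)
>>> image = torch.randint(255, (3, 224, 224), dtype=torch.uint8)  # a random (C, H, W) image
>>> inputs = processor(text=["a photo of a cat"], images=[image.permute(1, 2, 0).numpy()], return_tensors="pt", padding=True)
>>> with torch.no_grad():
...     img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
...     txt_emb = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
>>> # cosine similarity between L2-normalized embeddings, scaled by 100 and clipped at 0
>>> img_emb = img_emb / img_emb.norm(p=2, dim=-1, keepdim=True)
>>> txt_emb = txt_emb / txt_emb.norm(p=2, dim=-1, keepdim=True)
>>> score = torch.clamp(100 * (img_emb * txt_emb).sum(dim=-1), min=0)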
As input to forward and update the metric accepts the following input:
- images (Tensor or list of tensors): tensor with images fed to the feature extractor. If a single tensor, it should have shape (N, C, H, W). If a list of tensors, each tensor should have shape (C, H, W). C is the number of channels, H and W are the height and width of the image.
- text (str or list of str): text to compare with the images, one caption for each image.
As output of forward and compute the metric returns the following output:
- clip_score (Tensor): float scalar tensor with mean CLIP score over samples
- Parameters:
  - model_name_or_path¶ (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) – string indicating the version of the CLIP model to use. Available models are:
    - "openai/clip-vit-base-patch16"
    - "openai/clip-vit-base-patch32"
    - "openai/clip-vit-large-patch14-336"
    - "openai/clip-vit-large-patch14"
  - kwargs¶ (Any) – Additional keyword arguments, see Advanced metric settings for more info.
- Raises:
ModuleNotFoundError – If transformers package is not installed or version is lower than 4.10.0
Example
>>> from torch import randint
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> score = metric(randint(255, (3, 224, 224)), "a photo of a cat")
>>> score.detach().round()
tensor(25.)
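The metric also accepts the list-based inputs described above: a list of (C, H, W) image tensors paired with one caption per image. A brief sketch (the resulting value depends on the random images and is therefore not shown):
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> images = [torch.randint(255, (3, 224, 224)) for _ in range(2)]  # two (C, H, W) images
>>> captions = ["a photo of a cat", "a photo of a dog"]  # one caption per image
>>> score = metric(images, captions)  # mean CLIP score over the two pairs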
- plot(val=None, ax=None)[source]¶
Plot a single or multiple values from the metric.
- Parameters:
  - val¶ (Union[Tensor, Sequence[Tensor], None]) – Either a single result from calling metric.forward or metric.compute or a list of these results. If no value is provided, will automatically call metric.compute and plot that result.
  - ax¶ (Optional[Axes]) – A matplotlib axis object. If provided, will add the plot to that axis.
- Return type:
- Returns:
Figure and Axes object
- Raises:
ModuleNotFoundError – If matplotlib is not installed
>>> # Example plotting a single value
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> metric.update(torch.randint(255, (3, 224, 224)), "a photo of a cat")
>>> fig_, ax_ = metric.plot()
>>> # Example plotting multiple values
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> values = []
>>> for _ in range(10):
...     values.append(metric(torch.randint(255, (3, 224, 224)), "a photo of a cat"))
>>> fig_, ax_ = metric.plot(values)
- update(images, text)[source]¶
Update CLIP score on a batch of images and text.
- Parameters:
  - images¶ (Union[Tensor, List[Tensor]]) – Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors
  - text¶ (Union[str, List[str]]) – Either a single caption or a list of captions
- Raises:
ValueError – If not all images have format [C, H, W]
ValueError – If the number of images and captions do not match
- Return type:
  None
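When data arrives in batches, update can be called once per batch and the accumulated mean score retrieved with compute afterwards. A minimal sketch with random images (the exact value is therefore not reproducible):
>>> import torch
>>> from torchmetrics.multimodal.clip_score import CLIPScore
>>> metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
>>> for _ in range(3):  # three batches of two images each
...     images = torch.randint(255, (2, 3, 224, 224))
...     captions = ["a photo of a cat", "a photo of a dog"]
...     metric.update(images, captions)
>>> score = metric.compute()  # mean CLIP score over all six image-caption pairs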
Functional Interface¶
- torchmetrics.functional.multimodal.clip_score.clip_score(images, text, model_name_or_path='openai/clip-vit-large-patch14')[source]¶
Calculate CLIP Score which is a text-to-image similarity metric.
CLIP Score is a reference-free metric that can be used to evaluate the correlation between a generated caption for an image and the actual content of the image. It has been found to be highly correlated with human judgement. The metric is defined as:
\[\text{CLIPScore}(I, C) = \max(100 \cdot \cos(E_I, E_C), 0)\]
which corresponds to the cosine similarity between the visual CLIP embedding \(E_I\) for an image \(I\) and the textual CLIP embedding \(E_C\) for a caption \(C\). The score is bound between 0 and 100, and the closer to 100 the better.
Note
Metric is not scriptable
- Parameters:
  - images¶ (Union[Tensor, List[Tensor]]) – Either a single [N, C, H, W] tensor or a list of [C, H, W] tensors
  - text¶ (Union[str, List[str]]) – Either a single caption or a list of captions
  - model_name_or_path¶ (Literal['openai/clip-vit-base-patch16', 'openai/clip-vit-base-patch32', 'openai/clip-vit-large-patch14-336', 'openai/clip-vit-large-patch14']) – string indicating the version of the CLIP model to use. Available models are "openai/clip-vit-base-patch16", "openai/clip-vit-base-patch32", "openai/clip-vit-large-patch14-336" and "openai/clip-vit-large-patch14".
- Raises:
ModuleNotFoundError – If transformers package is not installed or version is lower than 4.10.0
ValueError – If not all images have format [C, H, W]
ValueError – If the number of images and captions do not match
- Return type:
  Tensor
Example
>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> score = clip_score(torch.randint(255, (3, 224, 224)), "a photo of a cat", "openai/clip-vit-base-patch16")
>>> score.detach()
tensor(24.4255)
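The functional interface accepts the same list-based inputs as the module. A short sketch using a list of images with matching captions (values depend on the random inputs and are therefore not shown):
>>> import torch
>>> from torchmetrics.functional.multimodal import clip_score
>>> images = [torch.randint(255, (3, 224, 224)) for _ in range(2)]  # two (C, H, W) images
>>> captions = ["a photo of a cat", "a photo of a dog"]  # one caption per image
>>> score = clip_score(images, captions, model_name_or_path="openai/clip-vit-base-patch32")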