Perplexity

Perplexity is a measure of how well a probabilistic model predicts a sample.

In the context of language modeling, perplexity equals the exponential of the cross-entropy loss. A lower perplexity indicates that the model is more confident in its predictions. Because perplexity is computed from token probabilities, it is not suitable for evaluating the decoded output of tasks like text generation or machine translation; instead, it is commonly applied to the logits of generative language models.
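As a minimal illustration of that definition (a toy sketch with made-up logits, independent of the full example below), perplexity can be computed directly from the cross-entropy loss in plain PyTorch:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy values: a "model" scoring a 5-token sequence
# over a 10-token vocabulary.
torch.manual_seed(0)
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))

# Perplexity is the exponential of the cross-entropy loss.
perplexity = F.cross_entropy(logits, targets).exp()
print(perplexity)

# Sanity check: a model that is uniform over V tokens has
# cross-entropy log(V), hence perplexity exp(log(V)) == V.
uniform_ppl = F.cross_entropy(torch.zeros(5, 10), targets).exp()
print(uniform_ppl)
```

The uniform case gives a useful intuition: a perplexity of V means the model is, on average, as uncertain as if it were choosing among V equally likely tokens.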

Here’s a hypothetical Python example demonstrating how to use Perplexity to evaluate a generative language model.

```python
import torch
from torchmetrics.text import Perplexity
from transformers import AutoModelForCausalLM, AutoTokenizer
```

Load the GPT-2 model and tokenizer

```python
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```

Generate token logits for a sample text

```python
sample_text = "The quick brown fox jumps over the lazy dog"
sample_input_ids = tokenizer.encode(sample_text, return_tensors="pt")

with torch.no_grad():
    sample_outputs = model(sample_input_ids, labels=sample_input_ids)
logits = sample_outputs.logits
```

We can now calculate the perplexity of the logits

```python
perplexity = Perplexity()
score = perplexity(preds=logits, target=sample_input_ids)
print(f"Perplexity, unshifted: {score.item()}")
```
Perplexity, unshifted: 1929.9822998046875

This perplexity score is suspiciously high. The cause is that the logits at position i predict the token at position i + 1, so the predictions and targets must be shifted relative to each other. We can fix this by removing the last token from the logits and the first token from the target

```python
score = perplexity(preds=logits[:, :-1], target=sample_input_ids[:, 1:])
print(f"Perplexity, shifted: {score.item()}")
```
Perplexity, shifted: 227.27783203125

Since perplexity equals the exponential of the cross-entropy loss, we can verify the metric by exponentiating the loss returned by the model, which applies the same shift internally

```python
metric_perplexity = score
loss_perplexity = sample_outputs.loss.exp()
print(torch.allclose(metric_perplexity, loss_perplexity))
```
True

Be aware that sequences are often padded to ensure equal length. In such cases, the padding tokens should be ignored when calculating perplexity. This can be achieved by specifying the ignore_index argument of the Perplexity metric

```python
tokenizer.pad_token_id = tokenizer.eos_token_id
sample_input_ids = tokenizer.encode(sample_text, return_tensors="pt", padding="max_length", max_length=20)
with torch.no_grad():
    sample_outputs = model(sample_input_ids, labels=sample_input_ids)
logits = sample_outputs.logits

perplexity = Perplexity(ignore_index=None)
score = perplexity(preds=logits[:, :-1], target=sample_input_ids[:, 1:])
print(f"Perplexity, including padding: {score.item()}")

perplexity = Perplexity(ignore_index=tokenizer.pad_token_id)
score = perplexity(preds=logits[:, :-1], target=sample_input_ids[:, 1:])
print(f"Perplexity, ignoring padding: {score.item()}")
```
Perplexity, including padding: 24400.68359375
Perplexity, ignoring padding: 227.27783203125
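To see why ignoring padding restores the original score, here is a small self-contained sketch (toy logits and a hypothetical pad id, not the GPT-2 example above) showing that the ignore_index behavior is equivalent to masking the padded positions out by hand before averaging:

```python
import torch
import torch.nn.functional as F

pad_id = 0  # hypothetical pad token id
torch.manual_seed(0)
logits = torch.randn(6, 10)  # 6 positions, vocabulary of 10
target = torch.tensor([3, 7, 2, pad_id, pad_id, pad_id])

# Cross-entropy that skips padded positions via ignore_index...
ppl_ignored = F.cross_entropy(logits, target, ignore_index=pad_id).exp()

# ...matches manually dropping those positions before averaging.
mask = target != pad_id
ppl_manual = F.cross_entropy(logits[mask], target[mask]).exp()

print(torch.allclose(ppl_ignored, ppl_manual))  # True
```

Without the mask, every padded position contributes a (typically very unlikely) pad token to the average, which is why the unmasked score above balloons.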

Total running time of the script: (0 minutes 5.775 seconds)
