ROUGE

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric is used to evaluate the quality of generated text against a reference text. It does so by computing the n-gram overlap between the two texts, from which precision and recall values can be derived. The ROUGE score is often used in the context of generative tasks such as text summarization and machine translation.
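As a concrete illustration, unigram overlap and the resulting precision, recall, and F-score can be sketched in a few lines of plain Python. This is a simplified, hypothetical implementation; the standard ROUGE implementation additionally applies normalization steps such as punctuation stripping and optional stemming:

```python
from collections import Counter


def unigram_rouge(pred: str, target: str) -> dict:
    # Count the word-level overlap between prediction and reference (case-insensitive).
    pred_counts = Counter(pred.lower().split())
    target_counts = Counter(target.lower().split())
    overlap = sum((pred_counts & target_counts).values())
    precision = overlap / sum(pred_counts.values())  # overlapping words / predicted words
    recall = overlap / sum(target_counts.values())  # overlapping words / reference words
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


print(unigram_rouge("the cat sat", "the cat sat down"))
# precision 1.0 (all 3 predicted words appear in the reference),
# recall 0.75 (3 of the 4 reference words are covered)
```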

A major difference from Perplexity is that ROUGE evaluates the generated text itself, whereas Perplexity is computed from the model's logits.
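To make that distinction concrete: perplexity is derived from the probabilities the model assigns to the reference tokens (obtained from its logits via softmax), with no generated string involved. A minimal sketch, using hypothetical per-token probabilities:

```python
import math

# Hypothetical probabilities a model assigned to each reference token.
token_probs = [0.5, 0.25, 0.125]

# Perplexity is the exponential of the average negative log-likelihood.
avg_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # 4.0 for these probabilities
```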

Here’s a hypothetical Python example demonstrating the usage of unigram ROUGE F-score to evaluate a generative language model:

from torchmetrics.text import ROUGEScore
from transformers import AutoTokenizer, pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

Define the prompt and target texts

prompt = "The quick brown fox"
target_text = "The quick brown fox jumps over the lazy dog."

Generate a sample text using the GPT-2 model

sample_text = pipe(prompt, max_length=20, do_sample=True, temperature=0.1, pad_token_id=tokenizer.eos_token_id)[0][
    "generated_text"
]
print(sample_text)
The quick brown foxes are the most common species of foxes in the United States. They are

Calculate the ROUGE of the generated text

rouge = ROUGEScore()
rouge(preds=[sample_text], target=[target_text])
{'rouge1_fmeasure': tensor(0.3077), 'rouge1_precision': tensor(0.2353), 'rouge1_recall': tensor(0.4444), 'rouge2_fmeasure': tensor(0.1667), 'rouge2_precision': tensor(0.1250), 'rouge2_recall': tensor(0.2500), 'rougeL_fmeasure': tensor(0.3077), 'rougeL_precision': tensor(0.2353), 'rougeL_recall': tensor(0.4444), 'rougeLsum_fmeasure': tensor(0.3077), 'rougeLsum_precision': tensor(0.2353), 'rougeLsum_recall': tensor(0.4444)}
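The rouge1 values above can be checked by hand: after lowercasing and stripping punctuation, the generated text contains 17 words and the target 9, of which 4 overlap ("the" twice, "quick", "brown"). A quick sketch, using a regex normalization that approximates what the underlying ROUGE implementation does:

```python
import re
from collections import Counter

tokenize = lambda text: re.findall(r"[a-z0-9]+", text.lower())

pred = Counter(tokenize("The quick brown foxes are the most common species of foxes in the United States. They are"))
target = Counter(tokenize("The quick brown fox jumps over the lazy dog."))
overlap = sum((pred & target).values())

print(round(overlap / sum(pred.values()), 4))  # precision: 4 / 17 = 0.2353
print(round(overlap / sum(target.values()), 4))  # recall:    4 / 9  = 0.4444
```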

By default, the ROUGE score is calculated using a whitespace tokenizer. You can also calculate the ROUGE for the tokens directly:

token_rouge = ROUGEScore(tokenizer=lambda text: tokenizer.tokenize(text))
token_rouge(preds=[sample_text], target=[target_text])
{'rouge1_fmeasure': tensor(0.3448), 'rouge1_precision': tensor(0.2632), 'rouge1_recall': tensor(0.5000), 'rouge2_fmeasure': tensor(0.2222), 'rouge2_precision': tensor(0.1667), 'rouge2_recall': tensor(0.3333), 'rougeL_fmeasure': tensor(0.3448), 'rougeL_precision': tensor(0.2632), 'rougeL_recall': tensor(0.5000), 'rougeLsum_fmeasure': tensor(0.4000), 'rougeLsum_precision': tensor(0.3000), 'rougeLsum_recall': tensor(0.6000)}

Since ROUGE is a text-based metric, it can be used to benchmark decoding strategies. For example, you can compare temperature settings:

import matplotlib.pyplot as plt

temperatures = [x * 0.1 for x in range(1, 10)]  # Temperature values from 0.1 to 0.9 with a step of 0.1
n_samples = 100  # Note that a real benchmark typically requires more data

average_scores = []

for temperature in temperatures:
    # Draw a fresh sample for each scoring run; scoring a single fixed sample
    # repeatedly would just produce n_samples identical values.
    scores = []
    for _ in range(n_samples):
        sample_text = pipe(
            prompt, max_length=20, do_sample=True, temperature=temperature, pad_token_id=tokenizer.eos_token_id
        )[0]["generated_text"]
        scores.append(rouge(preds=[sample_text], target=[target_text])["rouge1_fmeasure"])
    average_scores.append(sum(scores) / n_samples)

# Plot the average ROUGE score for each temperature
plt.plot(temperatures, average_scores)
plt.xlabel("Generation temperature")
plt.ylabel("Average unigram ROUGE F-Score")
plt.title("ROUGE for varying temperature settings")
plt.show()
