ROUGE
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric is used to evaluate the quality of generated text against a reference text. It does so by computing the overlap between the two texts, from which precision and recall values can be derived. The ROUGE score is often used for generative tasks such as text summarization and machine translation.
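For unigrams, the overlap is the clipped count of shared tokens: precision divides it by the number of tokens in the prediction, recall divides it by the number of tokens in the reference, and the F-score is their harmonic mean. The sketch below is a hand-rolled illustration of that computation, not the torchmetrics implementation:

```python
from collections import Counter

pred = "the cat sat on the mat".split()
target = "the cat is on the mat".split()

# Clipped unigram overlap: each shared token counts at most as often
# as it appears in either text.
overlap = sum((Counter(pred) & Counter(target)).values())

precision = overlap / len(pred)  # fraction of predicted tokens that match
recall = overlap / len(target)   # fraction of reference tokens that are covered
fmeasure = 2 * precision * recall / (precision + recall)

print(overlap, precision, recall, fmeasure)  # 5 shared tokens out of 6 on each side
```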
A major difference from Perplexity is that ROUGE evaluates the generated text itself, whereas Perplexity is computed from the model's logits.
Here’s a hypothetical Python example demonstrating the usage of unigram ROUGE F-score to evaluate a generative language model:
from torchmetrics.text import ROUGEScore
from transformers import AutoTokenizer, pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
Define the prompt and target texts
prompt = "The quick brown fox"
target_text = "The quick brown fox jumps over the lazy dog."
Generate a sample text using the GPT-2 model
sample_text = pipe(prompt, max_length=20, do_sample=True, temperature=0.1, pad_token_id=tokenizer.eos_token_id)[0][
    "generated_text"
]
print(sample_text)
The quick brown foxes are the most common species of foxes in the United States. They are
Calculate the ROUGE score of the generated text
rouge = ROUGEScore()
rouge(preds=[sample_text], target=[target_text])
{'rouge1_fmeasure': tensor(0.3077), 'rouge1_precision': tensor(0.2353), 'rouge1_recall': tensor(0.4444), 'rouge2_fmeasure': tensor(0.1667), 'rouge2_precision': tensor(0.1250), 'rouge2_recall': tensor(0.2500), 'rougeL_fmeasure': tensor(0.3077), 'rougeL_precision': tensor(0.2353), 'rougeL_recall': tensor(0.4444), 'rougeLsum_fmeasure': tensor(0.3077), 'rougeLsum_precision': tensor(0.2353), 'rougeLsum_recall': tensor(0.4444)}
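These numbers can be verified by hand. Using the sampled text shown above, the prediction has 17 word tokens and the target has 9, sharing 4 unigrams and 2 bigrams. The helper functions below are a hand-rolled sanity check, not the torchmetrics implementation, and the regex tokenizer only approximates the default normalization:

```python
import re
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(pred_tokens, target_tokens, n):
    pred_ngrams = ngram_counts(pred_tokens, n)
    target_ngrams = ngram_counts(target_tokens, n)
    overlap = sum((pred_ngrams & target_ngrams).values())  # clipped match count
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(target_ngrams.values())
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Rough stand-in for the default tokenizer: lowercase, keep alphanumeric runs
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

sample_text = "The quick brown foxes are the most common species of foxes in the United States. They are"
target_text = "The quick brown fox jumps over the lazy dog."

print(rouge_n(tokenize(sample_text), tokenize(target_text), 1))  # ~ (0.2353, 0.4444, 0.3077)
print(rouge_n(tokenize(sample_text), tokenize(target_text), 2))  # ~ (0.1250, 0.2500, 0.1667)
```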
By default, the ROUGE score is calculated using a whitespace tokenizer. You can also calculate ROUGE over the model's own tokens by passing a custom tokenizer:
token_rouge = ROUGEScore(tokenizer=lambda text: tokenizer.tokenize(text))
token_rouge(preds=[sample_text], target=[target_text])
{'rouge1_fmeasure': tensor(0.3448), 'rouge1_precision': tensor(0.2632), 'rouge1_recall': tensor(0.5000), 'rouge2_fmeasure': tensor(0.2222), 'rouge2_precision': tensor(0.1667), 'rouge2_recall': tensor(0.3333), 'rougeL_fmeasure': tensor(0.3448), 'rougeL_precision': tensor(0.2632), 'rougeL_recall': tensor(0.5000), 'rougeLsum_fmeasure': tensor(0.4000), 'rougeLsum_precision': tensor(0.3000), 'rougeLsum_recall': tensor(0.6000)}
Since ROUGE is a text-based metric, it can be used to benchmark decoding strategies. For example, you can compare temperature settings:
import matplotlib.pyplot as plt  # noqa: E402

temperatures = [x * 0.1 for x in range(1, 10)]  # Temperature values from 0.1 to 0.9 in steps of 0.1
n_samples = 100  # Note that a real benchmark typically requires more data

average_scores = []

for temperature in temperatures:
    # Draw a fresh sample for each trial; scoring one fixed sample repeatedly
    # would just repeat the same deterministic score n_samples times.
    scores = []
    for _ in range(n_samples):
        sample_text = pipe(
            prompt, max_length=20, do_sample=True, temperature=temperature, pad_token_id=tokenizer.eos_token_id
        )[0]["generated_text"]
        scores.append(rouge(preds=[sample_text], target=[target_text])["rouge1_fmeasure"])
    average_scores.append(sum(scores) / n_samples)

# Plot the average ROUGE score for each temperature
plt.plot(temperatures, average_scores)
plt.xlabel("Generation temperature")
plt.ylabel("Average unigram ROUGE F-Score")
plt.title("ROUGE for varying temperature settings")
plt.show()
Total running time of the script: (0 minutes 28.023 seconds)