.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gallery/text/bertscore.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_gallery_text_bertscore.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gallery_text_bertscore.py:

BERTScore
===============================

BERTScore is a text generation metric that computes the similarity between a generated
text and a reference text using a pre-trained BERT model. Instead of relying on exact
token matches, BERTScore leverages contextual embeddings to capture the semantic
similarity between the texts. This makes BERTScore robust to paraphrasing and word
order variations. BERTScore has been shown to correlate well with human judgments and
is widely used for evaluating text generation models.

Let's consider a use case in natural language processing where BERTScore is used to
evaluate the quality of a text generation model. Imagine we are developing an
automated news summarization system. The goal is to create concise summaries of news
articles that accurately capture their key points. To evaluate the performance of the
summarization system, we need a metric that can compare the generated summaries to
human-written summaries. This is where BERTScore can be used.

.. GENERATED FROM PYTHON SOURCE LINES 8-16

.. code-block:: Python
    :lineno-start: 9

    from transformers import AutoTokenizer, pipeline

    from torchmetrics.text import BERTScore, ROUGEScore

    pipe = pipeline("text-generation", model="openai-community/gpt2")
    tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

.. GENERATED FROM PYTHON SOURCE LINES 17-18

Define the prompt and target texts

.. GENERATED FROM PYTHON SOURCE LINES 18-22

.. code-block:: Python
    :lineno-start: 19

    prompt = "Economic recovery is underway with a 3.5% GDP growth and a decrease in unemployment. Experts forecast continued improvement with boosts from consumer spending and government projects. In summary: "
    target_summary = "the recession is ending."

.. GENERATED FROM PYTHON SOURCE LINES 23-24

Generate a sample text using the GPT-2 model

.. GENERATED FROM PYTHON SOURCE LINES 24-29

.. code-block:: Python
    :lineno-start: 25

    generated_summary = pipe(prompt, max_new_tokens=20, do_sample=False, pad_token_id=tokenizer.eos_token_id)[0][
        "generated_text"
    ][len(prompt) :].strip()

.. GENERATED FROM PYTHON SOURCE LINES 30-31

Calculate the BERTScore of the generated text

.. GENERATED FROM PYTHON SOURCE LINES 31-40

.. code-block:: Python
    :lineno-start: 32

    bertscore = BERTScore(model_name_or_path="roberta-base")
    score = bertscore(preds=[generated_summary], target=[target_summary])
    print(f"Prompt: {prompt}")
    print(f"Target summary: {target_summary}")
    print(f"Generated summary: {generated_summary}")
    print(f"BERTScore: {score['f1']:.4f}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/torch/nn/modules/module.py:1784: FutureWarning: `encoder_attention_mask` is deprecated and will be removed in version 4.55.0 for `RobertaSdpaSelfAttention.forward`.
      return forward_call(*args, **kwargs)
    Prompt: Economic recovery is underway with a 3.5% GDP growth and a decrease in unemployment. Experts forecast continued improvement with boosts from consumer spending and government projects. In summary:
    Target summary: the recession is ending.
    Generated summary: The economy is growing at a 3.5% GDP growth and a decrease in unemployment.
    BERTScore: 0.9075

.. GENERATED FROM PYTHON SOURCE LINES 41-42

In addition, to illustrate BERTScore's robustness to paraphrasing, let's consider two
candidate texts that are variations of the reference text.

.. GENERATED FROM PYTHON SOURCE LINES 42-46

.. code-block:: Python
    :lineno-start: 42

    reference = "the weather is freezing"
    candidate_good = "it is cold today"
    candidate_bad = "it is warm outside"

.. GENERATED FROM PYTHON SOURCE LINES 47-48

Here we see that BERTScore differentiates between the candidate texts based on their
semantic similarity to the reference text, whereas the ROUGE scores for the same text
pairs are identical.

.. GENERATED FROM PYTHON SOURCE LINES 48-54

.. code-block:: Python
    :lineno-start: 48

    rouge = ROUGEScore()
    print(f"ROUGE for candidate_good: {rouge(preds=[candidate_good], target=[reference])['rouge1_fmeasure'].item()}")
    print(f"ROUGE for candidate_bad: {rouge(preds=[candidate_bad], target=[reference])['rouge1_fmeasure'].item()}")
    print(f"BERTScore for candidate_good: {bertscore(preds=[candidate_good], target=[reference])['f1'].item():.4f}")
    print(f"BERTScore for candidate_bad: {bertscore(preds=[candidate_bad], target=[reference])['f1'].item():.4f}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ROUGE for candidate_good: 0.25
    ROUGE for candidate_bad: 0.25
    BERTScore for candidate_good: 0.9254
    BERTScore for candidate_bad: 0.9145

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 10.593 seconds)

.. _sphx_glr_download_gallery_text_bertscore.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: bertscore.ipynb <bertscore.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: bertscore.py <bertscore.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: bertscore.zip <bertscore.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
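To build intuition for why BERTScore behaves this way, here is a minimal sketch of its
core matching step: each token embedding in the candidate is greedily paired with its
most cosine-similar token embedding in the reference, and the matches are aggregated
into precision, recall, and F1. This toy version uses random NumPy vectors in place of
real contextual BERT embeddings and omits details of the actual torchmetrics
implementation (such as IDF weighting and baseline rescaling); the function name
`greedy_bertscore` is our own, not part of any library.

```python
import numpy as np

def greedy_bertscore(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """F1-style score from greedy cosine matching of token embeddings.

    A toy illustration of BERTScore's matching step, not the real metric.
    """
    # Normalize rows so plain dot products become cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # shape: (num_candidate_tokens, num_reference_tokens)
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    return 2 * precision * recall / (precision + recall)

# Stand-ins for contextual embeddings: a "paraphrase" is a small perturbation
# of the reference embeddings, an unrelated text is a fresh random draw.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8))               # 4 reference tokens, 8-dim embeddings
paraphrase = ref + 0.1 * rng.normal(size=ref.shape)
unrelated = rng.normal(size=(4, 8))

print(greedy_bertscore(paraphrase, ref) > greedy_bertscore(unrelated, ref))
```

Because the score is built from cosine similarities in embedding space rather than
surface n-gram overlap, near-paraphrase embeddings score close to 1 while unrelated
ones do not, which mirrors the candidate_good / candidate_bad contrast above.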