
8.5 Understanding Self-Attention

References

What we covered in this video lecture

To understand transformer-based large language models, it is essential to understand self-attention, the underlying mechanism that powers these models. Self-attention can be understood as a way to create context-aware text embedding vectors.

In this lecture, we explain self-attention from the ground up. We start with a simple, parameter-free version of self-attention to illustrate the underlying principles. Then, we cover the parameterized self-attention mechanism used in transformers: self-attention with learnable weights.
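To make the parameter-free version concrete, here is a minimal sketch in PyTorch (the input values and the 4-dimensional embedding size are illustrative assumptions, not taken from the lecture): compute pairwise dot-product scores, normalize them with a softmax into attention weights, and form each context vector as a weighted sum of all inputs.

```python
import torch

# Toy example: a sequence of 3 "word" embeddings, each 4-dimensional
# (the values below are made up for illustration).
inputs = torch.tensor(
    [[0.4, 0.1, 0.8, 0.6],   # word 1
     [0.2, 0.9, 0.3, 0.5],   # word 2
     [0.7, 0.3, 0.1, 0.2]]   # word 3
)

# 1) Unnormalized attention scores: pairwise dot products between the inputs.
scores = inputs @ inputs.T                    # shape: (3, 3)

# 2) Attention weights: softmax-normalize each row so it sums to 1.
attn_weights = torch.softmax(scores, dim=-1)  # shape: (3, 3)

# 3) Context vectors: each output is a weighted sum of all input vectors.
context_vectors = attn_weights @ inputs       # shape: (3, 4)

print(attn_weights.shape)     # torch.Size([3, 3])
print(context_vectors.shape)  # torch.Size([3, 4]) -- one output per input word
```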

Additional resources if you want to learn more

This lecture introduced the attention mechanism with conceptual illustrations. If you prefer a coding-based approach, also check out my article Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch.
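The article builds this up step by step; as a quick preview, here is a minimal sketch of self-attention with learnable weights (scaled dot-product attention). The dimensions, random initialization, and variable names below are illustrative assumptions for this sketch.

```python
import torch

torch.manual_seed(123)

d_in, d_out = 4, 3               # illustrative embedding and projection sizes
inputs = torch.rand(3, d_in)     # 3 input word embeddings

# Learnable projection matrices for queries, keys, and values.
W_query = torch.nn.Parameter(torch.rand(d_in, d_out))
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out))
W_value = torch.nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_query       # shape: (3, d_out)
keys    = inputs @ W_key         # shape: (3, d_out)
values  = inputs @ W_value       # shape: (3, d_out)

# Scaled dot-product attention: scores -> softmax weights -> weighted sum.
scores = queries @ keys.T                                  # shape: (3, 3)
attn_weights = torch.softmax(scores / d_out**0.5, dim=-1)  # shape: (3, 3)
context_vectors = attn_weights @ values                    # shape: (3, d_out)

print(context_vectors.shape)  # torch.Size([3, 3])
```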


Quiz: 8.5 Understanding Self-Attention (Part 1)

If we have an input text with 3 words, how many output vectors does the attention mechanism yield?

Answer: 3. The number of input and output vectors is always the same, so an input text with 3 words yields 3 output vectors.


Quiz: 8.5 Understanding Self-Attention (Part 2)

If we have 3 input words, how many attention weights α are computed in total?

Answer: 9. For each of the 3 input words, there is an attention weight for each output word. Since we have 3 output words, this gives 3 * 3 = 9 attention weights in total.
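To verify this count in code, here is a small sketch (the random input values and the 4-dimensional embedding size are arbitrary): the attention-weight matrix has one entry per input–output pair, and the number of context vectors matches the number of inputs.

```python
import torch

inputs = torch.rand(3, 4)                     # 3 input words, 4-dim embeddings
scores = inputs @ inputs.T                    # pairwise attention scores
attn_weights = torch.softmax(scores, dim=-1)  # one weight per (input, output) pair

print(attn_weights.shape)    # torch.Size([3, 3])
print(attn_weights.numel())  # 9 attention weights in total

context_vectors = attn_weights @ inputs
print(context_vectors.shape)  # torch.Size([3, 4]) -- 3 outputs for 3 inputs
```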


Quiz: 8.5 Understanding Self-Attention (Part 3)

If we have a multi-head attention layer with 8 heads, how many weight matrices does this include?

Answer: 32. In a multi-head attention layer, there are four sets of weight matrices: $$U_Q$$ (query), $$U_K$$ (key), $$U_V$$ (value), and the linear layer $$U_O$$ (output) used after concatenation. Each head has its own set of these matrices, so for 8 heads the total is 8 heads * 4 matrices per head = 32 weight matrices.
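If you want to check the count programmatically, the sketch below instantiates the matrices exactly as counted above (the embedding size is an illustrative assumption).

```python
import torch.nn as nn

embed_dim = 64    # illustrative embedding size
num_heads = 8

# One query, key, value, and output matrix per head,
# following the counting in the answer above.
heads = nn.ModuleList([
    nn.ModuleDict({
        "U_Q": nn.Linear(embed_dim, embed_dim, bias=False),
        "U_K": nn.Linear(embed_dim, embed_dim, bias=False),
        "U_V": nn.Linear(embed_dim, embed_dim, bias=False),
        "U_O": nn.Linear(embed_dim, embed_dim, bias=False),
    })
    for _ in range(num_heads)
])

print(sum(len(head) for head in heads))  # 32 weight matrices
```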

Quiz: 8.5 Understanding Self-Attention (Part 4)

By design, self-attention is not aware of word position (word order). This is because it processes all input tokens simultaneously and focuses on finding relationships between them, but it does not have a built-in way to account for the order in which those tokens appear. So, transformers require a positional encoding to encode ordering information. These positional encodings can be either learned or fixed, depending on the approach. In the transformer architecture, what is the size of the positional encoding vector relative to the word embedding vector size?

Answer: The positional encoding vector has the same size as the word embedding vector; the two sizes have to match so that we can add them together.
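As a small illustration, here is a sketch that adds learned positional embeddings to word embeddings (the vocabulary size, context length, and embedding size are illustrative assumptions); the element-wise addition only works because both embedding vectors have the same size.

```python
import torch

vocab_size, context_length, embed_dim = 1000, 16, 8   # illustrative sizes

token_emb = torch.nn.Embedding(vocab_size, embed_dim)
pos_emb   = torch.nn.Embedding(context_length, embed_dim)  # same embedding size

token_ids = torch.tensor([[5, 42, 7]])         # a batch with one 3-token text
positions = torch.arange(token_ids.shape[1])   # tensor([0, 1, 2])

word_vectors = token_emb(token_ids)            # shape: (1, 3, 8)
pos_vectors  = pos_emb(positions)              # shape: (3, 8), broadcast over batch

# Element-wise addition only works because both embedding sizes match.
input_embeddings = word_vectors + pos_vectors  # shape: (1, 3, 8)
print(input_embeddings.shape)
```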
