Deep Learning Fundamentals

8.5 Understanding Self-Attention

Slides

Part 1: A Basic Attention Mechanism
Part 2: Self-Attention with Learnable Weights
Part 3: From Self-Attention to Multi-Head Attention
Part 4: Masked Attention And Positional Encoding

References

Attention Is All You Need (2017), the original transformer paper

What we covered in this video lecture

To understand transformer-based large language models, it is essential to understand self-attention, the underlying mechanism that powers these models. Self-attention can be understood as a way to create context-aware text embedding vectors.

In this lecture, we explain self-attention from the ground up. We start with a simple, parameter-free version of self-attention to explain the underlying principles. Then, we cover the parameterized self-attention mechanism used in transformers: self-attention with learnable weights.
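As a rough illustration of the parameter-free version, the sketch below assumes the common formulation: dot products between embeddings as scores, a softmax to turn scores into attention weights, and a weighted sum of the embeddings as the context-aware output. The embedding values and the helper names (`softmax`, `dot`, `basic_self_attention`) are made up for this example and are not taken from the lecture.

```python
import math


def softmax(scores):
    """Normalize a list of scores so they are positive and sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def basic_self_attention(embeddings):
    """Parameter-free self-attention: dot-product scores, softmax, weighted sum."""
    outputs, all_weights = [], []
    for x_i in embeddings:
        # Score the current word against every word in the input (including itself).
        scores = [dot(x_i, x_j) for x_j in embeddings]
        weights = softmax(scores)
        # The context vector is the attention-weighted sum of all input embeddings.
        context = [
            sum(w * x_j[d] for w, x_j in zip(weights, embeddings))
            for d in range(len(x_i))
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional embeddings for a 3-word input (illustration only).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = basic_self_attention(embeddings)
print(len(outputs))  # one context vector per input word
```

Note that nothing in this version is learned: the outputs depend only on the input embeddings themselves.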

Additional resources if you want to learn more

This lecture introduced the attention mechanism with conceptual illustrations. If you prefer a coding-based approach, also check out my article "Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch".

Quiz: 8.5 Understanding Self-Attention (Part 1)

If we have an input text with 3 words, how many output vectors does the attention mechanism yield?

Answer: 3. The number of input and output vectors is always the same.

Quiz: 8.5 Understanding Self-Attention (Part 2)

If we have 3 input words, how many attention weights α are computed in total?

Answer: 9. For each of the 3 input words, there is an attention weight for each output word. Since we have 3 output words, we have 3*3 = 9 attention weights.
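The count of 9 can be checked with a small sketch of self-attention with learnable weights. It assumes the scaled dot-product form from the original transformer paper and uses randomly initialized matrices named `U_Q`, `U_K`, and `U_V` to match the notation of this unit. The dimensions and helper functions are arbitrary choices for illustration, not the lecture's code.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Normalize one row of scores into weights that sum to 1."""
    max_value = max(row)
    exps = [math.exp(v - max_value) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for a learnable weight matrix (random, untrained values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def self_attention(X, U_Q, U_K, U_V):
    """Scaled dot-product self-attention over the rows (tokens) of X."""
    Q, K, V = matmul(X, U_Q), matmul(X, U_K), matmul(X, U_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # shape: (num_tokens, num_tokens)
    alpha = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    Z = matmul(alpha, V)  # one context vector per input token
    return Z, alpha


rng = random.Random(0)
X = random_matrix(3, 4, rng)  # 3 input words, embedding size 4 (assumed)
U_Q, U_K, U_V = (random_matrix(4, 2, rng) for _ in range(3))

Z, alpha = self_attention(X, U_Q, U_K, U_V)
num_weights = sum(len(row) for row in alpha)
print(num_weights)  # 3 * 3 = 9
```

Each row of `alpha` holds the weights one output vector places on the 3 inputs, which is why the matrix is 3 by 3.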

Quiz: 8.5 Understanding Self-Attention (Part 3)

If we have a multi-head attention layer with 8 heads, how many weight matrices does this include?

Answer: 32. In a multi-head attention layer with 8 heads, there are four kinds of weight matrices: $$U_Q$$ (query), $$U_K$$ (key), $$U_V$$ (value), and $$U_O$$ (the output linear layer applied after concatenation). In this counting, each head has its own set of these matrices, so the total is 8 heads * 4 matrices per head = 32 weight matrices.
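The sketch below mirrors the counting in the quiz answer by giving each head its own `U_Q`, `U_K`, `U_V`, and `U_O` block. This is an assumption made for illustration. Many implementations instead store the output projection as one matrix applied to the concatenated heads; splitting that matrix row-wise into one block per head and summing the per-head projections gives the same result. All sizes and helper names here are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Normalize one row of scores into weights that sum to 1."""
    max_value = max(row)
    exps = [math.exp(v - max_value) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for a learnable weight matrix (random, untrained values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def head_output(X, head):
    """Scaled dot-product attention for a single head."""
    Q, K, V = (matmul(X, head[name]) for name in ("U_Q", "U_K", "U_V"))
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    alpha = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(alpha, V)


def multi_head_attention(X, heads):
    """Sum of per-head outputs, each projected by that head's U_O block."""
    d_model = len(heads[0]["U_O"][0])
    out = [[0.0] * d_model for _ in X]
    for head in heads:
        projected = matmul(head_output(X, head), head["U_O"])
        out = [[o + p for o, p in zip(o_row, p_row)] for o_row, p_row in zip(out, projected)]
    return out


rng = random.Random(0)
n_heads, d_model, d_head = 8, 16, 2  # assumed sizes, with d_head = d_model / n_heads
heads = [
    {
        "U_Q": random_matrix(d_model, d_head, rng),
        "U_K": random_matrix(d_model, d_head, rng),
        "U_V": random_matrix(d_model, d_head, rng),
        "U_O": random_matrix(d_head, d_model, rng),
    }
    for _ in range(n_heads)
]

X = random_matrix(3, d_model, rng)  # 3 input tokens
out = multi_head_attention(X, heads)
num_matrices = sum(len(head) for head in heads)
print(num_matrices)  # 8 heads * 4 matrices per head = 32
```

The output keeps the input shape (3 tokens, each of size `d_model`), so multi-head layers can be stacked.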

Quiz: 8.5 Understanding Self-Attention (Part 4)

By design, self-attention is not aware of the word position (word order). This is because it processes all input tokens simultaneously and focuses on finding relationships between them, but has no built-in way to account for the order in which those tokens appear. Transformers therefore require a positional encoding to represent ordering information. These positional encodings can be either learned or fixed, depending on the approach. In the transformer architecture, what is the size of the positional encoding vector relative to the word embedding vector size?

Larger than the word embedding vector size.

Smaller than the word embedding vector size.

Equal to the word embedding vector size.

Not related to the word embedding vector size.

Answer: Equal to the word embedding vector size. The word embedding and positional embedding sizes have to match when we want to add them together.
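To make the size requirement concrete, here is a minimal sketch using the fixed sinusoidal encoding from the original transformer paper as one possible choice (learned positional embeddings would be added in the same way). The word embedding values and the function name are made up for this example.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Fixed encoding: sine on even dimensions, cosine on odd dimensions."""
    encoding = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


# Made-up word embeddings: 3 tokens, embedding size 4 (illustration only).
word_embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
d_model = len(word_embeddings[0])
pos_enc = sinusoidal_positional_encoding(len(word_embeddings), d_model)

# Element-wise addition only works because both vectors have size d_model.
inputs = [
    [w + p for w, p in zip(word_row, pos_row)]
    for word_row, pos_row in zip(word_embeddings, pos_enc)
]
print(len(inputs), len(inputs[0]))  # same shape as the word embeddings
```

Because the two vectors are summed rather than concatenated, the model's input size stays equal to the word embedding size.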

Follow along in a Lightning Studio

DL Fundamentals 8: Large Language Models

Sebastian
