
Attention

How attention lets tokens communicate using Query, Key, and Value.

The previous chapter gave each token both meaning (from embeddings) and position (from positional encoding). But each token still has a single, fixed representation. The word "bright" produces the same vector whether it appears in "bright student" or "bright light."

Language depends on context. When we read "bright student," we interpret "bright" as intelligent. When we read "bright light," we interpret it as luminous. For the model to make this distinction, tokens need to look at each other. This chapter introduces that mechanism.

Query, Key, Value

We want each token to gather relevant information from other tokens. The word "bright" should check whether "student" or "light" appears nearby. Some neighbors will be more useful than others, so the mechanism needs to do two things: identify relevant tokens, and retrieve useful content from them.

Attention does this with three vectors per token. The Query represents what this token is looking for. The Key represents what this token has to offer. When we compare a Query against a Key, we get a relevance score that tells us whether these two tokens should interact. The third vector, Value, contains the information that gets passed along based on that score.

Think of it this way. Each token asks "what do I need?" (Query) while also announcing "here's what I contain" (Key). When a token's question matches what another token offers, the interaction is strong. The querying token then receives content from the matching token's Value.

The Query and Key determine the strength of attention between tokens. The Value carries the content that flows through those connections.

Where Do Q, K, V Come From?

Each token's representation (x) produces all three by projection through learned weight matrices:

Q = x × W_Q
K = x × W_K
V = x × W_V

Initially, these weight matrices contain random values, so Q, K, and V are meaningless. But during training, the model adjusts these matrices to produce useful representations. It learns, for example, that pronouns should generate Queries that match Keys from nouns, or that adjectives should attend to the nouns they modify.

No rules are programmed about which tokens should attend to which. We provide the structure, and training discovers what each token should look for and what it should offer.
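
As a minimal sketch in NumPy (the dimensions, random inputs, and weight values here are purely illustrative, not anything from the chapter), the three projections are just matrix multiplications:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                        # illustrative sizes

x = rng.normal(size=(5, d_model))          # 5 tokens, one d_model-dim vector each

# Learned weight matrices; random here, as they would be before training
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = x @ W_Q    # what each token is looking for
K = x @ W_K    # what each token has to offer
V = x @ W_V    # the content each token can pass along

print(Q.shape, K.shape, V.shape)           # (5, 4) (5, 4) (5, 4)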

Measuring Relevance

Each token has a Query expressing what it looks for and a Key advertising what it contains. To determine how strongly token i should attend to token j, we compute the dot product of i's Query with j's Key. The more closely what i is looking for matches what j offers, the higher their dot product is.

We compute this for every pair of tokens in the sequence:

score_ij = Q_i · K_j

The result is a score for each pair, indicating how relevant token j is to token i.

Key →            The    bright    student    reads
Query ↓
The              0.2    0.1       0.3        0.1
bright           0.1    0.2       0.8        0.4
student          0.2    0.7       0.3        0.1
reads            0.1    0.2       0.6        0.2

Higher scores = stronger relevance
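
In code, all of these pairwise scores can be computed at once with a single matrix product. A rough NumPy sketch (the Query and Key values are random stand-ins, not the numbers in the table above):

import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 4))    # one Query per token: 4 tokens, dimension 4 (illustrative)
K = rng.normal(size=(4, 4))    # one Key per token

# scores[i, j] = Q[i] · K[j]: how relevant token j is to token i
scores = Q @ K.T               # shape (4, 4), one score per token pair
print(scores.round(2))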

Combining Values

We now have a score for how relevant each token is to every other. The next step is to use these scores to gather information from the Value vectors. A natural approach is to combine all Values, giving more influence to tokens with higher scores.

But raw scores are just numbers on an arbitrary scale. To combine Values in a way that makes sense, we want to treat the scores as weights in a weighted average. For a weighted average to work, the weights must sum to 1 so that each weight represents a share of the total influence.

The function that does this is softmax. It takes the raw scores, exponentiates them, and normalizes so they sum to 1. Higher scores become larger weights, lower scores become smaller weights, and the relative differences are preserved.

As the dimension of the Query and Key vectors grows, dot products tend to produce larger numbers, because each additional dimension adds another term to the sum. When softmax receives these large values, it concentrates nearly all of the weight on whichever token has the highest score: the output becomes essentially that one token's Value, and the model loses the ability to blend information from multiple source tokens. Scaling the scores down by √d before softmax keeps the values moderate, allowing weight to spread across several tokens when that is useful:

attention weights = softmax( scores / √d )
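
As a small worked sketch in NumPy, using the "bright" row of the score table above and an assumed dimension d = 4:

import numpy as np

d = 4                                        # Query/Key dimension (assumed for illustration)
scores = np.array([0.1, 0.2, 0.8, 0.4])      # raw scores for "bright" against each Key

scaled = scores / np.sqrt(d)                 # keep magnitudes moderate
weights = np.exp(scaled) / np.exp(scaled).sum()   # softmax: positive weights summing to 1

print(weights.round(2))   # ≈ [0.22 0.23 0.31 0.25], with "student" getting the largest share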

With weights computed, we take the weighted sum of Values:

output = Σ_j (weight_j × V_j)

If token i gives weight 0.7 to token j and weight 0.1 to token k, the output is dominated by j's Value. This output is the contextualized representation of token i, its original meaning now enriched by information gathered from the tokens it attended to.
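
A tiny numeric sketch of that weighted sum, using made-up two-dimensional Values:

import numpy as np

V_j = np.array([1.0, 0.0])        # Value of token j (illustrative)
V_k = np.array([0.0, 1.0])        # Value of token k (illustrative)

# Weights 0.7 and 0.1 as in the example; other tokens' contributions omitted
output = 0.7 * V_j + 0.1 * V_k
print(output)                     # [0.7 0.1] — dominated by j's Value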

Putting it all together, the complete attention operation is:

Attention(Q, K, V) = softmax(Q Kᵀ / √d) × V
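
Collecting the steps into one function gives a minimal NumPy sketch of the full operation (sizes and inputs are illustrative):

import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) × V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # scaled pairwise relevance
    scores = scores - scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                                       # weighted sum of Values per token

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                                  # 5 tokens, 8-dim representations
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = attention(x @ W_Q, x @ W_K, x @ W_V)
print(out.shape)                                             # (5, 4): one contextualized vector per token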

Causal Masking

Recall from the introduction chapter that the model learns by predicting the next word. Given "The cat sat on the", it predicts "mat" and adjusts its weights based on the error. During training, however, we feed the model entire sequences at once for efficiency. If position 5 can attend to position 6, it could simply copy the answer instead of learning to predict it. This is information leakage.

To prevent this, we hide future tokens from each position by setting their attention scores to negative infinity before softmax. Softmax converts negative infinity to zero weight, so those positions contribute nothing to the output. As a result, position 3 can attend to positions 1, 2, and 3, but positions 4 and beyond are invisible to it.

This creates a triangular pattern. The first token sees only itself, the second sees the first two, the third sees the first three, and so on. This technique is called causal masking because it respects causality: the present can depend on the past, but not on the future.

Key position (attending to) →      1     2     3     4     5
Query position (attending from) ↓
1                                  ✓     −∞    −∞    −∞    −∞
2                                  ✓     ✓     −∞    −∞    −∞
3                                  ✓     ✓     ✓     −∞    −∞
4                                  ✓     ✓     ✓     ✓     −∞
5                                  ✓     ✓     ✓     ✓     ✓

✓ = can attend    −∞ = masked
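
A minimal sketch of building and applying such a mask in NumPy (sequence length and scores are illustrative):

import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))    # raw attention scores

# True wherever the Key position lies in the future of the Query position
future = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)               # hide future tokens

# Softmax turns the -inf entries into exactly zero weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # row i has nonzero weights only for positions 1..i
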
Summary
  • Each token produces a Query (what it looks for), Key (what it offers), and Value (content to pass along)
  • These are learned projections that the model discovers during training
  • The dot product of Query and Key measures relevance between token pairs
  • Softmax converts scores to weights that sum to 1; scaling by √d prevents extreme distributions
  • The weighted sum of Values produces a contextualized representation for each token
  • Causal masking hides future positions, ensuring tokens only attend to the past

This chapter covered single-head attention, where one set of Q, K, V projections learns to relate tokens. In the next chapter, we'll see how multi-head attention runs several of these in parallel, letting the model capture different types of relationships simultaneously.