In the previous chapters, attention allowed each token to gather relevant information from other positions in the sequence, producing a context-aware representation. This chapter introduces the feed-forward network (FFN), which takes each token's enriched vector and rewrites its features through a learned transformation applied independently at every position. While attention handles the flow of information between positions, the FFN processes that information within each position.
The FFN
The FFN takes a token's vector of size d_model and passes it through two linear transformations with a nonlinear activation function in between. The first linear projection expands the vector from d_model to a larger intermediate size called d_ff. Then a nonlinear activation function is applied element-wise. Finally, a second linear projection contracts the result back from d_ff to d_model.
In equation form, the complete operation is:
FFN(x) = W2 · activation(W1 · x + b1) + b2
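To make the pipeline concrete, here is a minimal sketch of the forward pass in NumPy, using ReLU as a stand-in activation; the toy sizes, random weights, and the ffn helper are illustrative assumptions, not GPT-2's actual parameters or code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 32            # toy sizes; GPT-2 uses 768 and 3072

# Weight shapes follow the column-vector convention of the equation above:
# W1 maps d_model -> d_ff, W2 maps d_ff -> d_model.
W1 = rng.standard_normal((d_ff, d_model)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_model, d_ff)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = W2 · activation(W1 · x + b1) + b2, with ReLU as the activation."""
    h = W1 @ x + b1              # expand: d_model -> d_ff
    h = np.maximum(0.0, h)       # activate: ReLU zeroes out negative values
    return W2 @ h + b2           # contract: d_ff -> d_model

x = rng.standard_normal(d_model)   # one token's vector after attention
y = ffn(x)
print(x.shape, y.shape)            # (8,) (8,) — same size in and out
```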
The expansion through W1 produces a vector of d_ff values, where each value is a dot product between the token's vector and one of W1's rows, a learned weight vector. Each dot product measures how much the token's representation aligns with what that weight vector has learned to respond to. A large positive result means strong alignment, a value near zero means weak alignment, and a negative value means the representation points away from what that weight vector has learned.
The activation function then filters the expanded vector element-wise. With ReLU for example, every negative value becomes zero and every positive value passes through unchanged. Since different tokens produce different values after W1, the activation zeros out or suppresses different neurons for each one, so each token passes a different combination of active and suppressed neurons forward to W2.
The contraction through W2 then maps these filtered values back from d_ff to d_model dimensions. Each of the d_ff neurons has a corresponding column in W2 that defines how much that neuron contributes to each of the d_model output dimensions. Since different tokens activate different neurons, a different set of W2 columns drives the output for each token, so the same weight matrices effectively apply a different transformation depending on the token's representation. This is what allows the FFN to capture nonlinear relationships rather than applying a single fixed transformation to every token.
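To see this per-token filtering in numbers, the sketch below (same toy convention as before, hypothetical random weights) checks which neurons survive the ReLU for two different token vectors and confirms that the output is built only from the W2 columns of the surviving neurons.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 8, 32

W1 = rng.standard_normal((d_ff, d_model))
W2 = rng.standard_normal((d_model, d_ff))

token_a = rng.standard_normal(d_model)
token_b = rng.standard_normal(d_model)

# Which neurons survive the ReLU for each token?
active_a = (W1 @ token_a) > 0
active_b = (W1 @ token_b) > 0
print("neurons active for token a:", active_a.sum())
print("neurons active for token b:", active_b.sum())
print("neurons active for both:  ", (active_a & active_b).sum())

# The output is a sum of W2 columns weighted by the surviving activations,
# so each token is effectively transformed by a different subset of columns.
h_a = np.maximum(0.0, W1 @ token_a)
out_a = W2[:, active_a] @ h_a[active_a]   # identical to W2 @ h_a
print(np.allclose(out_a, W2 @ h_a))       # True
```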
In GPT-2's base configuration, d_model = 768 and d_ff = 4 × d_model = 3072. Following the column-vector convention of the equation above, W1 has the shape 3072 × 768 (d_ff × d_model) and W2 has the shape 768 × 3072 (d_model × d_ff). This 4× expansion ratio is a common default across many Transformer architectures, based on empirical findings.
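As a quick sanity check on the scale these shapes imply, the arithmetic below counts the FFN parameters in GPT-2's base configuration, weights plus biases; the total follows directly from the numbers above.

```python
d_model, d_ff = 768, 3072

w1_params = d_ff * d_model      # W1: 3072 × 768
b1_params = d_ff
w2_params = d_model * d_ff      # W2: 768 × 3072
b2_params = d_model

total = w1_params + b1_params + w2_params + b2_params
print(total)                     # 4,722,432 FFN parameters per layer
```

At roughly 4.7 million parameters, the FFN is the largest single component of each layer, larger than the attention projections.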
Why the Activation Matters
Without the activation, the FFN is just two linear projections applied in sequence, and since composing two linear transformations always produces another single linear transformation, the product W2·W1 collapses into one matrix. That reduces the entire FFN to a single d_model × d_model multiplication where the expansion to d_ff adds nothing and the network can no longer capture nonlinear relationships in the data.
The activation function prevents this collapse by inserting a nonlinear operation between the two projections. As covered above, it selectively zeros out or suppresses different neurons for each token, so W2 receives a different filtered signal each time rather than a plain linear combination of W1's output. This is what keeps the two matrices from collapsing into a single linear operation and what makes the expansion to d_ff meaningful.
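A small numeric check of this collapse, using hypothetical random matrices: without the activation, applying W1 then W2 gives exactly the same result as multiplying by the single merged matrix W2·W1, while inserting ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32

W1 = rng.standard_normal((d_ff, d_model))
W2 = rng.standard_normal((d_model, d_ff))
x = rng.standard_normal(d_model)

# Without an activation, the two projections collapse into one matrix.
merged = W2 @ W1                                 # a single d_model × d_model matrix
print(np.allclose(W2 @ (W1 @ x), merged @ x))    # True: the expansion to d_ff adds nothing

# With ReLU in between, no single matrix reproduces the computation.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(with_relu, merged @ x))        # False
```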
Activation Functions
We used ReLU as the example earlier because its behavior is straightforward, but the choice of activation function affects how gradients flow during training and whether certain neurons can become permanently inactive. The original Transformer used ReLU, GPT-2 switched to GELU, and models such as DeepSeek and Mistral use gated variants like SwiGLU.
ReLU (Rectified Linear Unit) is the simplest option. It zeros out any negative value and passes positive values through unchanged, making the filtering binary. During training, weight initialization and normalization keep most pre-activations near the zero boundary, so a large weight update can push a neuron's pre-activation below zero for every input. Once there, the neuron outputs zero on everything, which means it receives zero gradient during backpropagation, so its weights never update and it is permanently dead. In large networks, a significant fraction of neurons can die this way, silently reducing the model's capacity.
ReLU(x) = max(0, x)
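The sketch below illustrates this failure mode with a single hypothetical neuron, using a large negative bias as a stand-in for the bad weight update described above: once the pre-activation is negative for every input, both the output and the ReLU gradient are zero everywhere, so no update can revive it.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 where z > 0, 0 elsewhere.
    return (z > 0).astype(float)

w = rng.standard_normal(8) * 0.02
inputs = rng.standard_normal((1000, 8))

# A healthy neuron: the pre-activation is positive for some inputs,
# so gradients flow on those examples and the weights keep learning.
z = inputs @ w
print("healthy, fraction of inputs with gradient:", relu_grad(z).mean())

# A dead neuron: a bias pushed far negative makes the pre-activation
# negative for every input, so output and gradient are zero everywhere.
z_dead = inputs @ w - 5.0
print("dead, output is always zero:", np.all(relu(z_dead) == 0.0))
print("dead, fraction of inputs with gradient:", relu_grad(z_dead).mean())
```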
GELU (Gaussian Error Linear Unit) is what GPT-2 uses. Instead of the hard cutoff at zero, GELU curves smoothly through the transition region, gradually suppressing small negative values rather than forcing them to zero. Because of this smooth transition, neurons near the zero boundary still receive meaningful gradient updates even when their output is slightly negative, so they can continue learning, which prevents them from becoming permanently dead.
GELU(x) = x · Φ(x)
Φ(x) is the standard normal cumulative distribution function. In practice, it is approximated using a tanh function.
Φ(x) ≈ 0.5(1 + tanh(√(2/π) · (x + 0.044715x³)))
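For a quick comparison, the snippet below evaluates the exact GELU, written with the error function, against the tanh approximation; over a typical range of pre-activation values the two agree closely.

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_exact(x):
    # GELU(x) = x · Φ(x), with Φ the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation of Φ(x) from the formula above.
    return x * 0.5 * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x**3)))

for x in np.linspace(-4.0, 4.0, 9):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.4f}  tanh={gelu_tanh(x):+.4f}")
```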
- The FFN processes each token independently through an expand, activate, contract pipeline. W1 expands the vector from d_model to d_ff, the activation filters the expanded values element-wise, and W2 contracts the result back to d_model
- The nonlinear activation between the two projections is what prevents W2·W1 from collapsing into a single linear transformation, enabling the network to capture nonlinear relationships in the data
- ReLU zeros out all negative values and passes positive values unchanged, which can cause neurons that drift negative for every input to stop receiving gradients and become permanently dead
- GELU, used by GPT-2, replaces the hard cutoff with a smooth curve that gradually suppresses small negative values, so neurons near the zero boundary keep receiving gradient updates and continue learning
With the FFN in place, each Transformer layer combines two complementary operations, attention and the FFN. To build a capable model, we need to stack many of these layers so that representations are refined progressively. But deep stacks introduce training stability problems. The next chapter covers residual connections and layer normalization, the two mechanisms that keep gradients and activations well-behaved across depth.