Module 5: Transformers
These notes summarize the CS224N lecture on self-attention and Transformer architectures.
Neural Architectures and Their Properties
Progress in NLP has been driven by general-purpose techniques like Hidden Markov Models, Conditional Random Fields, RNNs, CNNs, and Support Vector Machines. The limitations of recurrent models, especially in parallelization and capturing long-term dependencies, motivated the development of self-attention and Transformer architectures.
Limitations of RNNs
- Parallelization Issues: RNNs process sequentially, where each step depends on the previous one. This dependency hinders efficient parallel computation (see the loop sketch after this list). For instance, the hidden state \( h_t \) is computed as: \[ h_t = \sigma(W h_{t-1} + U x_t), \] where \( \sigma \) is a nonlinearity, \( W \) and \( U \) are weight matrices, and \( x_t \) is the input at time \( t \).
- Linear Interaction Distance: Information between distant tokens must pass through every intermediate time step, so the interaction distance grows linearly with their separation and long-term dependencies are difficult to learn.
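A minimal loop sketch of the recurrence above (not the lecture's code; the dimensions, weight scales, and tanh nonlinearity are illustrative assumptions) makes the sequential dependency concrete: step \( t \) cannot begin until step \( t-1 \) has produced \( h_{t-1} \).

```python
import torch

d_hidden, d_input, seq_len = 16, 8, 10      # illustrative sizes
W = torch.randn(d_hidden, d_hidden) * 0.1   # recurrent weight matrix W
U = torch.randn(d_hidden, d_input) * 0.1    # input weight matrix U
x = torch.randn(seq_len, d_input)           # toy input sequence x_1, ..., x_T

h = torch.zeros(d_hidden)
hidden_states = []
for t in range(seq_len):                    # strictly sequential: step t needs h_{t-1}
    h = torch.tanh(W @ h + U @ x[t])
    hidden_states.append(h)
```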
A Minimal Self-Attention Architecture
Self-attention replaces recurrence with attention mechanisms, enabling parallelization and better contextual representations.
Key-Query-Value Mechanism
Self-attention computes a weighted sum of values based on the similarity between queries and keys:
- Definitions:
- Query: \( q_i = Q x_i \),
- Key: \( k_j = K x_j \),
- Value: \( v_j = V x_j \), where \( Q, K, V \) are learned weight matrices.
- Contextual Representation: \[ h_i = \sum_{j=1}^n \alpha_{ij} v_j, \] where the weights \( \alpha_{ij} \) are computed using the softmax function: \[ \alpha_{ij} = \frac{\exp(q_i^\top k_j)}{\sum_{j'=1}^n \exp(q_i^\top k_{j'})}. \]
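A minimal single-head sketch of this computation (not the lecture's code; the random weights, dimensions, and the row-vector convention \( q_i = x_i Q \) rather than \( Q x_i \) are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

n, d = 10, 16                      # sequence length, model dimension (assumed)
x = torch.randn(n, d)              # token embeddings x_1, ..., x_n as rows
Q = torch.randn(d, d) * 0.1        # query matrix
K = torch.randn(d, d) * 0.1        # key matrix
V = torch.randn(d, d) * 0.1        # value matrix

q, k, v = x @ Q, x @ K, x @ V      # queries, keys, values for every token
scores = q @ k.T                   # entry (i, j) holds q_i^T k_j
alpha = F.softmax(scores, dim=-1)  # attention weights alpha_ij; each row sums to 1
h = alpha @ v                      # h_i = sum_j alpha_ij v_j
```

Because every row of `scores` is computed by one matrix multiplication, all positions are processed in parallel, unlike the RNN loop above.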
Position Representations
Self-attention lacks an inherent sense of order. Position embeddings are added to the token embeddings: \[ \tilde{x}_i = x_i + P_i, \] where \( P_i \) encodes positional information.
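One common way to realize \( P_i \) is a table of learned absolute position embeddings; a small sketch under that assumption (vocabulary size, maximum length, and dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d = 1000, 512, 16         # illustrative assumptions
tok_emb = nn.Embedding(vocab_size, d)          # token embedding table
pos_emb = nn.Embedding(max_len, d)             # learned position embedding table P

tokens = torch.randint(0, vocab_size, (10,))   # a toy token-id sequence
positions = torch.arange(tokens.size(0))       # 0, 1, ..., n-1
x_tilde = tok_emb(tokens) + pos_emb(positions) # x~_i = x_i + P_i
```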
Feed-Forward Layers
Nonlinear transformations are applied after self-attention: \[ h_{\text{FF}} = W_2 \text{ReLU}(W_1 h + b_1) + b_2, \] with \( W_1, W_2 \) being weight matrices and \( b_1, b_2 \) biases.
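A sketch of this position-wise feed-forward layer (the hidden width \( 4d \) follows a common Transformer convention and is an assumption here, not something fixed by the formula):

```python
import torch
import torch.nn as nn

d = 16
ff = nn.Sequential(
    nn.Linear(d, 4 * d),   # W_1, b_1
    nn.ReLU(),
    nn.Linear(4 * d, d),   # W_2, b_2
)
h = torch.randn(10, d)     # one hidden vector per token
h_ff = ff(h)               # applied independently at every position
```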
Future Masking
For autoregressive tasks, masking ensures each token attends only to positions at or before its own:
\[
\alpha_{ij} =
\begin{cases}
\alpha_{ij} & j \leq i, \\
0 & j > i.
\end{cases}
\]
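In practice the zeroing of \( \alpha_{ij} \) for \( j > i \) is typically implemented by setting the corresponding scores to \( -\infty \) before the softmax; a small sketch under that assumption (dimensions and random queries/keys are illustrative):

```python
import torch
import torch.nn.functional as F

n, d = 10, 16
q = torch.randn(n, d)       # stand-ins for the queries q_i
k = torch.randn(n, d)       # stand-ins for the keys k_j

scores = q @ k.T                                         # raw scores q_i^T k_j
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # True where j <= i
scores = scores.masked_fill(~causal, float("-inf"))      # forbid attending to j > i
alpha = F.softmax(scores, dim=-1)    # alpha_ij = 0 for j > i; each row still sums to 1
```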
The Transformer Architecture
Transformers stack attention and feed-forward layers with additional components.
Multi-Head Attention
Instead of a single attention mechanism, multiple heads independently compute: \[ h_i^{(\ell)} = \sum_{j=1}^n \alpha_{ij}^{(\ell)} v_j^{(\ell)}, \] and their outputs are concatenated and linearly transformed: \[ h_i = O \cdot [h_i^{(1)}; h_i^{(2)}; \ldots; h_i^{(k)}], \] where \( O \) is a learned projection matrix.
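A common way to implement this is to compute one full-width projection per queries/keys/values, reshape into heads, attend per head, then concatenate and project with \( O \). The sketch below follows that convention (head count, dimensions, and the per-head \( \sqrt{d_{\text{head}}} \) scaling, which anticipates the scaled dot-product described further down, are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d, n_heads = 10, 16, 4          # sequence length, model dim, head count (assumed)
d_head = d // n_heads
x = torch.randn(n, d)
W_q, W_k, W_v, W_o = (nn.Linear(d, d, bias=False) for _ in range(4))

def split_heads(t):                # (n, d) -> (n_heads, n, d_head)
    return t.view(n, n_heads, d_head).transpose(0, 1)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # per-head scaled scores
alpha = F.softmax(scores, dim=-1)                  # per-head attention weights
heads = alpha @ v                                  # (n_heads, n, d_head)
h = W_o(heads.transpose(0, 1).reshape(n, d))       # concatenate heads, project with O
```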
Layer Normalization
Layer norm stabilizes activations by normalizing each token’s hidden states: \[ \text{LN}(h_i) = \frac{h_i - \mu_i}{\sigma_i}, \] where \( \mu_i \) and \( \sigma_i \) are the mean and standard deviation.
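Computed by hand for a single token's hidden state, matching the formula above (in practice a small epsilon and learned gain/bias parameters are also used, as in `torch.nn.LayerNorm`; they are omitted here to mirror the formula):

```python
import torch

h_i = torch.randn(16)               # one token's hidden state
mu = h_i.mean()
sigma = h_i.std(unbiased=False)     # population standard deviation
ln_h_i = (h_i - mu) / sigma         # LN(h_i) as defined above
```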
Residual Connections
Residual connections enhance gradient flow: \[ f_{\text{residual}}(h) = f(h) + h. \]
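A one-line sketch of wrapping an arbitrary sublayer \( f \) in a residual connection (the linear layer stands in for an attention or feed-forward sublayer and is purely illustrative):

```python
import torch
import torch.nn as nn

d = 16
f = nn.Linear(d, d)     # stand-in for an attention or feed-forward sublayer
h = torch.randn(10, d)
out = f(h) + h          # f_residual(h) = f(h) + h; the identity path carries gradients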
Scaled Dot-Product Attention
As the dimensionality \( d \) grows, dot products grow in magnitude, pushing the softmax toward peaked distributions with small gradients. To mitigate this, attention scores are scaled by \( \sqrt{d} \): \[ \alpha = \text{softmax}\left(\frac{x_{1:n} Q K^\top x_{1:n}^\top}{\sqrt{d}}\right). \]
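The same scaled expression written as a single matrix computation (a sketch with random weights; \( x_{1:n} \) is stacked as rows):

```python
import torch
import torch.nn.functional as F

n, d = 10, 16
x = torch.randn(n, d)                       # x_{1:n} stacked as rows
Q, K, V = (torch.randn(d, d) * 0.1 for _ in range(3))

scores = (x @ Q) @ (x @ K).T / d ** 0.5     # (x_{1:n} Q)(x_{1:n} K)^T / sqrt(d)
alpha = F.softmax(scores, dim=-1)
h = alpha @ (x @ V)
```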
Encoder and Decoder Structures
- Encoder: Processes input sequences with no masking.
- Decoder: Uses future masking for autoregressive tasks. Decoders also incorporate cross-attention with encoder outputs.
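A sketch of cross-attention, in which queries come from decoder states while keys and values come from the encoder's outputs (all dimensions and the shared scaling by \( \sqrt{d} \) are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_src, n_tgt = 16, 12, 10                 # illustrative sizes
enc_out = torch.randn(n_src, d)              # encoder outputs for the source sequence
dec_h = torch.randn(n_tgt, d)                # decoder hidden states
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

q = W_q(dec_h)                               # queries come from the decoder
k, v = W_k(enc_out), W_v(enc_out)            # keys and values come from the encoder
alpha = F.softmax(q @ k.T / d ** 0.5, dim=-1)
cross = alpha @ v                            # each target position attends over the source
```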
Applications
- Encoders are used in tasks requiring bidirectional context, like BERT.
- Decoders are used for generation tasks, like GPT.
- Encoder-decoder architectures are ideal for tasks like machine translation.