Module 5: Transformers
These notes summarize the CS224N lecture on self-attention and Transformer architectures.
Neural Architectures and Their Properties
Progress in NLP has been driven by general-purpose techniques like Hidden Markov Models, Conditional Random Fields, RNNs, CNNs, and Support Vector Machines. The limitations of recurrent models, especially in parallelization and capturing long-term dependencies, motivated the development of self-attention and Transformer architectures.
Limitations of RNNs
- Parallelization Issues: RNNs process sequentially, where each step depends on the previous one. This dependency hinders efficient parallel computation (see the loop sketch after this list). For instance, the hidden state \( h_t \) is computed as: \[ h_t = \sigma(W h_{t-1} + U x_t), \] where \( \sigma \) is a nonlinearity, \( W \) and \( U \) are weight matrices, and \( x_t \) is the input at time \( t \).
- Linear Interaction Distance: Information between distant tokens must pass through every intermediate time step, so the interaction distance grows linearly with their separation and long-term dependencies are difficult to learn.
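A minimal loop sketch of the recurrence above (not the lecture's code; the dimensions, weight scales, and tanh nonlinearity are illustrative assumptions) makes the sequential dependency concrete: step \( t \) cannot begin until step \( t-1 \) has produced \( h_{t-1} \).

```python
import torch

d_hidden, d_input, seq_len = 16, 8, 10      # illustrative sizes
W = torch.randn(d_hidden, d_hidden) * 0.1   # recurrent weight matrix W
U = torch.randn(d_hidden, d_input) * 0.1    # input weight matrix U
x = torch.randn(seq_len, d_input)           # toy input sequence x_1, ..., x_T

h = torch.zeros(d_hidden)
hidden_states = []
for t in range(seq_len):                    # strictly sequential: step t needs h_{t-1}
    h = torch.tanh(W @ h + U @ x[t])
    hidden_states.append(h)
```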
A Minimal Self-Attention Architecture
Self-attention replaces recurrence with attention mechanisms, enabling parallelization and better contextual representations.
Key-Query-Value Mechanism
Self-attention computes a weighted sum of values based on the similarity between queries and keys:
- Definitions:
- Query: \( q_i = Q x_i \),
- Key: \( k_j = K x_j \),
- Value: \( v_j = V x_j \), where \( Q, K, V \) are learned weight matrices.
- Contextual Representation: \[ h_i = \sum_{j=1}^n \alpha_{ij} v_j, \] where the weights \( \alpha_{ij} \) are computed using the softmax function: \[ \alpha_{ij} = \frac{\exp(q_i^\top k_j)}{\sum_{j'=1}^n \exp(q_i^\top k_{j'})}. \]
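A minimal single-head sketch of this computation (not the lecture's code; the random weights, dimensions, and the row-vector convention \( q_i = x_i Q \) rather than \( Q x_i \) are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

n, d = 10, 16                      # sequence length, model dimension (assumed)
x = torch.randn(n, d)              # token embeddings x_1, ..., x_n as rows
Q = torch.randn(d, d) * 0.1        # query matrix
K = torch.randn(d, d) * 0.1        # key matrix
V = torch.randn(d, d) * 0.1        # value matrix

q, k, v = x @ Q, x @ K, x @ V      # queries, keys, values for every token
scores = q @ k.T                   # entry (i, j) holds q_i^T k_j
alpha = F.softmax(scores, dim=-1)  # attention weights alpha_ij; each row sums to 1
h = alpha @ v                      # h_i = sum_j alpha_ij v_j
```

Because every row of `scores` is computed by one matrix multiplication, all positions are processed in parallel, unlike the RNN loop above.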
Position Representations
Self-attention lacks an inherent sense of order. Position embeddings are added to the token embeddings: \[ \tilde{x}_i = x_i + P_i, \] where \( P_i \) encodes positional information.
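One common way to realize \( P_i \) is a table of learned absolute position embeddings; a small sketch under that assumption (vocabulary size, maximum length, and dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d = 1000, 512, 16         # illustrative assumptions
tok_emb = nn.Embedding(vocab_size, d)          # token embedding table
pos_emb = nn.Embedding(max_len, d)             # learned position embedding table P

tokens = torch.randint(0, vocab_size, (10,))   # a toy token-id sequence
positions = torch.arange(tokens.size(0))       # 0, 1, ..., n-1
x_tilde = tok_emb(tokens) + pos_emb(positions) # x~_i = x_i + P_i
```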
Feed-Forward Layers
Nonlinear transformations are applied after self-attention: \[ h_{\text{FF}} = W_2 \text{ReLU}(W_1 h + b_1) + b_2, \] with \( W_1, W_2 \) being weight matrices and \( b_1, b_2 \) biases.
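A sketch of this position-wise feed-forward layer (the hidden width \( 4d \) follows a common Transformer convention and is an assumption here, not something fixed by the formula):

```python
import torch
import torch.nn as nn

d = 16
ff = nn.Sequential(
    nn.Linear(d, 4 * d),   # W_1, b_1
    nn.ReLU(),
    nn.Linear(4 * d, d),   # W_2, b_2
)
h = torch.randn(10, d)     # one hidden vector per token
h_ff = ff(h)               # applied independently at every position
```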
Future Masking
For autoregressive tasks, masking ensures each token attends only to positions at or before its own:
\[
\alpha_{ij} =
\begin{cases}
\alpha_{ij} & j \leq i, \\
0 & j > i.
\end{cases}
\]
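In practice the zeroing of \( \alpha_{ij} \) for \( j > i \) is typically implemented by setting the corresponding scores to \( -\infty \) before the softmax; a small sketch under that assumption (dimensions and random queries/keys are illustrative):

```python
import torch
import torch.nn.functional as F

n, d = 10, 16
q = torch.randn(n, d)       # stand-ins for the queries q_i
k = torch.randn(n, d)       # stand-ins for the keys k_j

scores = q @ k.T                                         # raw scores q_i^T k_j
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # True where j <= i
scores = scores.masked_fill(~causal, float("-inf"))      # forbid attending to j > i
alpha = F.softmax(scores, dim=-1)    # alpha_ij = 0 for j > i; each row still sums to 1
```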
The Transformer Architecture
Transformers stack attention and feed-forward layers with additional components.
Multi-Head Attention
Instead of a single attention mechanism, multiple heads independently compute: \[ h_i^{(\ell)} = \sum_{j=1}^n \alpha_{ij}^{(\ell)} v_j^{(\ell)}, \] and their outputs are concatenated and linearly transformed: \[ h_i = O \cdot [h_i^{(1)}; h_i^{(2)}; \ldots; h_i^{(k)}], \] where \( O \) is a learned projection matrix.
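A common way to implement this is to compute one full-width projection per queries/keys/values, reshape into heads, attend per head, then concatenate and project with \( O \). The sketch below follows that convention (head count, dimensions, and the per-head \( \sqrt{d_{\text{head}}} \) scaling, which anticipates the scaled dot-product described further down, are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d, n_heads = 10, 16, 4          # sequence length, model dim, head count (assumed)
d_head = d // n_heads
x = torch.randn(n, d)
W_q, W_k, W_v, W_o = (nn.Linear(d, d, bias=False) for _ in range(4))

def split_heads(t):                # (n, d) -> (n_heads, n, d_head)
    return t.view(n, n_heads, d_head).transpose(0, 1)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # per-head scaled scores
alpha = F.softmax(scores, dim=-1)                  # per-head attention weights
heads = alpha @ v                                  # (n_heads, n, d_head)
h = W_o(heads.transpose(0, 1).reshape(n, d))       # concatenate heads, project with O
```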
Layer Normalization
Layer norm stabilizes activations by normalizing each token’s hidden states: \[ \text{LN}(h_i) = \frac{h_i - \mu_i}{\sigma_i}, \] where \( \mu_i \) and \( \sigma_i \) are the mean and standard deviation.
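Computed by hand for a single token's hidden state, matching the formula above (in practice a small epsilon and learned gain/bias parameters are also used, as in `torch.nn.LayerNorm`; they are omitted here to mirror the formula):

```python
import torch

h_i = torch.randn(16)               # one token's hidden state
mu = h_i.mean()
sigma = h_i.std(unbiased=False)     # population standard deviation
ln_h_i = (h_i - mu) / sigma         # LN(h_i) as defined above
```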
Residual Connections
Residual connections enhance gradient flow: \[ f_{\text{residual}}(h) = f(h) + h. \]
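A one-line sketch of wrapping an arbitrary sublayer \( f \) in a residual connection (the linear layer stands in for an attention or feed-forward sublayer and is purely illustrative):

```python
import torch
import torch.nn as nn

d = 16
f = nn.Linear(d, d)     # stand-in for an attention or feed-forward sublayer
h = torch.randn(10, d)
out = f(h) + h          # f_residual(h) = f(h) + h; the identity path carries gradients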
Scaled Dot-Product Attention
As the dimensionality \( d \) grows, dot products grow in magnitude, pushing the softmax toward peaked distributions with small gradients. To mitigate this, attention scores are scaled by \( \sqrt{d} \): \[ \alpha = \text{softmax}\left(\frac{x_{1:n} Q K^\top x_{1:n}^\top}{\sqrt{d}}\right). \]
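The same scaled expression written as a single matrix computation (a sketch with random weights; \( x_{1:n} \) is stacked as rows):

```python
import torch
import torch.nn.functional as F

n, d = 10, 16
x = torch.randn(n, d)                       # x_{1:n} stacked as rows
Q, K, V = (torch.randn(d, d) * 0.1 for _ in range(3))

scores = (x @ Q) @ (x @ K).T / d ** 0.5     # (x_{1:n} Q)(x_{1:n} K)^T / sqrt(d)
alpha = F.softmax(scores, dim=-1)
h = alpha @ (x @ V)
```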
Encoder and Decoder Structures
- Encoder: Processes input sequences with no masking.
- Decoder: Uses future masking for autoregressive tasks. Decoders also incorporate cross-attention with encoder outputs.
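A sketch of cross-attention, in which queries come from decoder states while keys and values come from the encoder's outputs (all dimensions and the shared scaling by \( \sqrt{d} \) are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_src, n_tgt = 16, 12, 10                 # illustrative sizes
enc_out = torch.randn(n_src, d)              # encoder outputs for the source sequence
dec_h = torch.randn(n_tgt, d)                # decoder hidden states
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

q = W_q(dec_h)                               # queries come from the decoder
k, v = W_k(enc_out), W_v(enc_out)            # keys and values come from the encoder
alpha = F.softmax(q @ k.T / d ** 0.5, dim=-1)
cross = alpha @ v                            # each target position attends over the source
```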
Applications
- Encoders are used in tasks requiring bidirectional context, like BERT.
- Decoders are used for generation tasks, like GPT.
- Encoder-decoder architectures are ideal for tasks like machine translation.