# Summary of *Attention in Transformers, Visually Explained*
I watched and summarized the video *Attention in Transformers, Visually Explained*. The video breaks down the attention mechanism, a critical component of the transformer architecture used throughout modern AI systems.
## What is Attention?
Transformers, introduced in the paper *Attention Is All You Need*, are designed to predict the next token in a sequence. The key innovation is the attention mechanism, which adjusts each word's embedding based on its surrounding context. Initially, each token is mapped to a high-dimensional vector known as an embedding, but without attention these embeddings carry no information about context.
For example, the word “mole” means something different in each of these phrases: “American shrew mole”, “one mole of carbon dioxide”, and “take a biopsy of the mole”. The attention mechanism refines the embedding of “mole” by pulling in the relevant context from its neighbors.
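To make the problem concrete, here is a minimal sketch (toy vocabulary and dimensions, not from the video) showing that a plain embedding lookup assigns “mole” the same vector no matter which sentence it appears in:

```python
import numpy as np

# Toy embedding lookup: every occurrence of "mole" gets the same vector,
# because a plain lookup table knows nothing about surrounding words.
rng = np.random.default_rng(0)
vocab = {"american": 0, "shrew": 1, "mole": 2, "one": 3, "carbon": 4}
E = rng.normal(size=(len(vocab), 4))   # toy embedding matrix, 4 dims per token

vec_a = E[vocab["mole"]]               # "mole" in "American shrew mole"
vec_b = E[vocab["mole"]]               # "mole" in "one mole of carbon dioxide"
assert np.array_equal(vec_a, vec_b)    # identical vectors: no context yet
```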
## Queries, Keys, and Values
Attention operates using three primary matrices:
- **Query (Q):** Represents the current word's request for context (“what am I looking for?”).
- **Key (K):** Describes what each surrounding word has to offer; a key that aligns with a query marks that word as relevant.
- **Value (V):** Holds the information that is actually transferred to the query's embedding when its key matches.
Each word's embedding $x_i$ is multiplied by the query matrix to produce a query vector, and by the key matrix to produce a key vector:

$$q_i = W_Q x_i, \qquad k_j = W_K x_j$$

The alignment between queries and keys is then computed through the dot product:

$$\text{score}_{ij} = q_i \cdot k_j$$

A large score means the word at position $j$ is relevant to the word at position $i$.
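As a rough NumPy sketch of these two steps (toy dimensions, with random weights standing in for learned parameters):

```python
import numpy as np

# Toy dimensions; real models are far larger.
d_embed, d_key, seq_len = 8, 4, 5

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_embed))   # one embedding per token
W_Q = rng.normal(size=(d_embed, d_key))   # learned query projection (random here)
W_K = rng.normal(size=(d_embed, d_key))   # learned key projection (random here)

Q = X @ W_Q                               # a query vector per token
K = X @ W_K                               # a key vector per token

scores = Q @ K.T                          # scores[i, j] = q_i . k_j
print(scores.shape)                       # (5, 5): one score per query/key pair
```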
## The Attention Pattern
These dot products are normalized with a softmax so that, for each query, the attention weights across all positions sum to 1:

$$A_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\!\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}$$

The resulting grid $A$ is the attention pattern: entry $A_{ij}$ says how strongly position $j$ should influence position $i$.
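A minimal, self-contained sketch of this normalization step, using random scores as a stand-in for the $QK^T$ grid:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_key = 5, 4
scores = rng.normal(size=(seq_len, seq_len))     # stand-in for the Q @ K.T grid

def softmax_rows(s):
    # Subtract each row's max before exponentiating, for numerical stability.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

weights = softmax_rows(scores / np.sqrt(d_key))  # scale by sqrt(d_k), then softmax
assert np.allclose(weights.sum(axis=1), 1.0)     # each row of weights sums to 1
```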
## Masking and Scaling
During training, transformers predict the next token at every position in parallel, so future tokens must be masked out to avoid “leaking” information from later in the sequence. This is done by setting the corresponding entries of the attention grid to negative infinity before applying softmax, which maps them to exactly zero and ensures the future cannot influence the past. The division by $\sqrt{d_k}$ in the formula above is the scaling step: it keeps the dot products from growing with the key dimension, which stabilizes training.
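A sketch of causal masking under the same toy setup; everything above the diagonal (each position's future) is set to negative infinity before the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))

# Causal mask: position i may only attend to positions j <= i,
# so every entry above the diagonal is sent to negative infinity.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

e = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = e / e.sum(axis=1, keepdims=True)
assert np.all(weights[mask] == 0.0)    # masked entries become exactly zero
```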
## Multi-Headed Attention
Rather than applying a single attention mechanism, transformers use multi-headed attention, which runs multiple attention mechanisms in parallel. Each head has its own query, key, and value matrices, allowing the model to focus on different aspects of the context simultaneously. GPT-3, for instance, uses 96 attention heads per block.
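A toy sketch of multi-headed attention (two heads, random weights): each head computes its own attention pattern with its own matrices, and the head outputs are concatenated, as in the standard transformer formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, n_heads = 5, 8, 2
d_head = d_embed // n_heads            # per-head query/key/value dimension

X = rng.normal(size=(seq_len, d_embed))
head_outputs = []
for _ in range(n_heads):               # each head has its own W_Q, W_K, W_V
    W_Q = rng.normal(size=(d_embed, d_head))
    W_K = rng.normal(size=(d_embed, d_head))
    W_V = rng.normal(size=(d_embed, d_head))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = (Q @ K.T) / np.sqrt(d_head)    # this head's attention pattern ...
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    head_outputs.append(w @ V)         # ... applied to this head's values

# Heads are concatenated; a real transformer then applies an output matrix
# to project the result back into the embedding space.
combined = np.concatenate(head_outputs, axis=1)   # shape (5, 8)
```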
## Counting Parameters
The video provides a parameter count for the attention mechanism. In GPT-3, the query and key matrices in a single attention head each have dimensions of 12,288 × 128, mapping the 12,288-dimensional embedding space down to a 128-dimensional query/key space, for roughly 1.6 million parameters apiece.
In practice, the value matrix is factored into two smaller matrices:
- **Value down matrix:** Projects the embedding into the same smaller (128-dimensional) space.
- **Value up matrix:** Maps the result back to the full embedding space.
This reduces the number of parameters while maintaining flexibility.
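As a quick arithmetic sketch, assuming GPT-3's sizes as quoted in the video (12,288-dimensional embeddings, a 128-dimensional query/key space):

```python
# Per-head parameter count, assuming GPT-3's sizes from the video.
d_embed, d_key = 12_288, 128

query_params = d_embed * d_key    # W_Q: 12,288 x 128
key_params   = d_embed * d_key    # W_K: 12,288 x 128
value_down   = d_embed * d_key    # projects embeddings down to 128 dims
value_up     = d_key * d_embed    # maps the 128-dim result back up

per_head = query_params + key_params + value_down + value_up
print(f"{per_head:,}")            # 6,291,456 parameters per attention head
```

Without the factorization, a full 12,288 × 12,288 value matrix alone would need about 151 million parameters per head; the down/up pair gets the count to the same order as the query and key matrices.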