NLP with Deep Learning: Lecture 1 - Introduction & Word Vectors
These notes cover key concepts from XCS224N, including human language, word meaning, and the word2vec algorithm, along with the mathematical foundations behind these models.
Human Language and Word Meaning
Human language is inherently complex due to its social nature. People interpret and construct language based on context, making it challenging for computers to understand and generate. Despite this complexity, deep learning has enabled impressive advancements in modeling language, specifically in representing word meaning using vectors.
Word2Vec Algorithm
One key breakthrough in NLP is the word2vec algorithm, which learns word meaning through distributional semantics. Instead of relying on formal, dictionary-style definitions, word2vec assumes that a word’s meaning is captured by the context in which it frequently appears. This idea follows J.R. Firth’s well-known principle: “You shall know a word by the company it keeps.”
Vector Representation of Words
Each word in a corpus is assigned a dense, real-valued vector that represents its meaning. These vectors are typically around 300-dimensional, though simplified examples may use fewer dimensions. Word vectors are also known as neural word representations or word embeddings, because each word is embedded as a point in a continuous vector space; unlike sparse one-hot encodings with one dimension per vocabulary word, these dense vectors place semantically similar words close together.
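As a minimal sketch (with made-up toy values and only four dimensions rather than the roughly 300 used in practice), dense word vectors can be stored as arrays and compared by cosine similarity:

```python
import numpy as np

# Toy 4-dimensional embeddings (values are made up for illustration;
# real word2vec vectors are typically ~300-dimensional and learned from data).
embeddings = {
    "king":  np.array([0.50, 0.68, -0.59, 0.10]),
    "queen": np.array([0.52, 0.71, -0.55, 0.12]),
    "apple": np.array([-0.40, 0.05, 0.80, -0.31]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean similar."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```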
Word2vec uses a predictive model to learn word embeddings. The model scans through a text corpus, identifying a central word \( C \) and context words \( O \) within a fixed window around \( C \). The objective is to adjust the word vectors so that the model assigns high probabilities to context words that frequently occur around the center word.
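The windowing step can be sketched as follows; the function name `center_context_pairs` and the toy sentence are illustrative, not part of any reference implementation:

```python
def center_context_pairs(tokens, window=2):
    """Yield (center, context) pairs for a fixed-size window around each position."""
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

corpus = "the quick brown fox jumps over the lazy dog".split()
for center, context in center_context_pairs(corpus, window=2):
    print(center, "->", context)
```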
Objective Function
The objective function in word2vec seeks to maximize the likelihood of predicting context words given a center word. Specifically, the model minimizes the negative log-likelihood of the observed data:
\[J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-M \le j \le M \\ j \neq 0}} \log P(O_{t+j} \mid C_t)\]where \( T \) is the number of words in the corpus, \( C_t \) is the center word at position \( t \), \( O_{t+j} \) is the context word at offset \( j \), and \( M \) defines the size of the context window.
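A minimal sketch of evaluating this objective over a toy corpus, assuming a hypothetical `log_prob(context, center)` callable supplied by the model (here a uniform placeholder) and averaging over (center, context) pairs rather than strictly over positions \( T \):

```python
import numpy as np

def negative_log_likelihood(tokens, log_prob, window=2):
    """Average negative log-likelihood over all (center, context) pairs."""
    total, count = 0.0, 0
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                total += -log_prob(tokens[t + j], center)
                count += 1
    return total / count  # note: averaged over pairs rather than corpus positions T

# Placeholder model: a uniform distribution over a toy vocabulary,
# just to show the objective can be evaluated before any training.
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
uniform_log_prob = lambda context, center: np.log(1.0 / len(vocab))

print(negative_log_likelihood(corpus, uniform_log_prob))  # equals log |V| for the uniform model
```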
Calculating Probabilities with Softmax
To predict the probability of a context word given a center word, word2vec uses the softmax function. For a center word \( C \) and a context word \( O \), the probability is calculated as follows:
\[P(O \mid C) = \frac{\exp(\mathbf{u}_O \cdot \mathbf{v}_C)}{\sum_{w \in V} \exp(\mathbf{u}_w \cdot \mathbf{v}_C)}\]Here, \( \mathbf{u}_O \) is the context (outside) vector of the context word, \( \mathbf{v}_C \) is the center vector of the center word (see “Two Vectors Per Word” below), and \( V \) is the vocabulary. The dot product \( \mathbf{u}_O \cdot \mathbf{v}_C \) measures the similarity of the two words, exponentiating ensures the scores are non-negative, and the softmax normalizes them so that they sum to 1.
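A possible numpy sketch of this softmax, assuming randomly initialized matrices `U` (context vectors) and `V` (center vectors) with one row per vocabulary word; the toy vocabulary and dimension are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "jumps"]
dim = 8  # tiny dimension for illustration

# Randomly initialized center vectors V and context ("outside") vectors U,
# one row per vocabulary word, as in the two-vectors-per-word setup below.
V = rng.normal(scale=0.1, size=(len(vocab), dim))
U = rng.normal(scale=0.1, size=(len(vocab), dim))
word_to_id = {w: i for i, w in enumerate(vocab)}

def softmax_prob(context, center):
    """P(O | C): softmax over dot products between U and the center vector."""
    scores = U @ V[word_to_id[center]]           # one score per vocabulary word
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[word_to_id[context]]

print(softmax_prob("quick", "the"))
```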
Gradient Descent and Optimization
The word vectors are optimized using gradient descent. The gradient of the loss function is computed with respect to the word vectors, guiding the model to update the vectors in the direction that increases the likelihood of the observed context words. The gradient of the log-probability of a context word with respect to the center vector \( \mathbf{v}_C \) is:
\[\frac{\partial}{\partial \mathbf{v}_C} \log P(O \mid C) = \mathbf{u}_O - \sum_{w \in V} P(w \mid C)\, \mathbf{u}_w\]This is the difference between the observed context vector \( \mathbf{u}_O \) and the expected context vector under the model’s current predictions. Since \( J(\theta) \) is the negative log-likelihood, gradient descent moves \( \mathbf{v}_C \) toward the observed context vector and away from that expectation.
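The same quantity can be sketched in numpy. The code below computes the gradient of \( -\log P(O \mid C) \) (the per-pair loss), which is the expected context vector minus the observed one, and takes a single descent step; the random matrices and learning rate are illustrative assumptions:

```python
import numpy as np

def grad_center_vector(context_id, center_id, U, V):
    """Gradient of -log P(O|C) with respect to the center vector v_C."""
    scores = U @ V[center_id]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # P(w | C) for every w
    expected = probs @ U                            # sum_w P(w|C) * u_w
    return expected - U[context_id]                 # expected minus observed

# Tiny example with random vectors (illustrative only).
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(5, 8))
V = rng.normal(scale=0.1, size=(5, 8))
grad = grad_center_vector(context_id=1, center_id=0, U=U, V=V)
V[0] -= 0.05 * grad   # one gradient-descent step on v_C with learning rate 0.05
```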
Two Vectors Per Word
In word2vec, each word is assigned two separate vectors: one for when the word is used as a center word, and another for when it is used as a context word. Although this might seem redundant, it simplifies optimization and gradient computation. At the end of training, these vectors are often averaged to form a single word vector used in downstream tasks.
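A toy illustration of that averaging convention, assuming matrices `U` (context vectors) and `V` (center vectors) with one row per word:

```python
import numpy as np

# U holds context ("outside") vectors, V holds center vectors, one row per word.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

# A common convention after training: average (or sum) the two vectors per word.
final_embeddings = (U + V) / 2
```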
Skip-Gram Model and CBOW
Word2Vec has two architectures: Skip-Gram and Continuous Bag of Words (CBOW).
- Skip-Gram: Given a center word, predicts the surrounding context words.
- CBOW: Given the surrounding context words, predicts the center word.
The Skip-Gram model iterates over each position in a sentence and maximizes the probability of the surrounding context words given the center word; CBOW instead combines (typically averages) the context-word vectors and maximizes the probability of the center word. Both are sketched below.
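As an illustration, both architectures can be trained with the gensim library by toggling its `sg` flag (parameter names here assume gensim ≥ 4.0; the toy sentences are made up):

```python
# sg=1 selects the Skip-Gram architecture, sg=0 (the default) selects CBOW.
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "king", "and", "the", "queen", "rule", "the", "kingdom"],
]

skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["king"].shape)   # (50,)
```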
Negative Sampling and Hierarchical Softmax
Training word2vec models can be computationally expensive, especially with large vocabularies. Word2Vec addresses this with two techniques: Negative Sampling and Hierarchical Softmax.
Negative Sampling replaces the full softmax with a set of binary classification problems. For each observed (center, context) pair, a few “negative” words (words that did not appear in the context) are sampled, and the model is trained to score the true pair higher than the sampled negatives. Because probabilities no longer need to be normalized over the entire vocabulary, this drastically reduces the computational overhead and allows much faster training.
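A rough numpy sketch of the per-pair negative-sampling loss, \( -\log \sigma(\mathbf{u}_O \cdot \mathbf{v}_C) - \sum_k \log \sigma(-\mathbf{u}_k \cdot \mathbf{v}_C) \). The random vectors and uniformly sampled negatives are simplifying assumptions (word2vec draws negatives from a smoothed unigram distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, context_id, negative_ids, U, V):
    """Loss for one (center, context) pair with K sampled negative words:
    -log sigma(u_O . v_C) - sum_k log sigma(-u_k . v_C)."""
    v_c = V[center_id]
    positive = -np.log(sigmoid(U[context_id] @ v_c))
    negatives = -np.log(sigmoid(-U[negative_ids] @ v_c)).sum()
    return positive + negatives

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(1000, 50))
V = rng.normal(scale=0.1, size=(1000, 50))
negative_ids = rng.integers(0, 1000, size=5)   # in practice drawn from a unigram^(3/4) distribution
print(negative_sampling_loss(center_id=3, context_id=7, negative_ids=negative_ids, U=U, V=V))
```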
Hierarchical Softmax organizes the vocabulary into a binary tree in which each word is a leaf node, and a word’s probability is computed by traversing the tree from the root to that leaf. This reduces the cost of computing a single word’s probability from \( O(|V|) \) to \( O(\log |V|) \), making it far more efficient for large vocabularies.
Hierarchical Softmax: Binary Tree Example
The hierarchical softmax uses a binary tree structure where the leaf nodes represent words, and the probability of a word is the probability of a path from the root to the word’s leaf. Let \( n(w, i) \) be the i-th node on the path from root to word \( w \), and \( \mathbf{v}_{n(w, i)} \) the vector associated with node \( n(w, i) \). The probability of word \( w \) given input vector \( \mathbf{v}_i \) is:
\[P(w \mid \mathbf{v}_i) = \prod_{j=1}^{L(w)-1} \sigma \left( [n(w, j+1) = \text{ch}(n(w, j))] \cdot \mathbf{v}_{n(w, j)}^\top \mathbf{v}_i \right)\]Here, \( \sigma(x) \) is the sigmoid function, \( L(w) \) is the number of nodes on the path from the root to \( w \), and \( [n(w, j+1) = \text{ch}(n(w, j))] \) is 1 if the path continues to the left child of \( n(w, j) \), and -1 otherwise. Because \( \sigma(x) + \sigma(-x) = 1 \), the resulting leaf probabilities sum to 1 over the vocabulary.
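A small sketch of this product of sigmoid decisions, where the path is represented as a list of hypothetical (node vector, sign) pairs rather than an actual tree structure:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(path, v_input):
    """Probability of a word as a product of sigmoid decisions along its tree path.
    `path` is a list of (node_vector, sign) pairs, sign = +1 for a left turn, -1 for right."""
    prob = 1.0
    for node_vector, sign in path:
        prob *= sigmoid(sign * (node_vector @ v_input))
    return prob

# Toy example: a word sitting two levels deep (left turn, then right turn).
rng = np.random.default_rng(0)
dim = 8
path = [(rng.normal(size=dim), +1), (rng.normal(size=dim), -1)]
v_input = rng.normal(size=dim)
print(hierarchical_softmax_prob(path, v_input))
```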
Applications of Word Vectors
Word vectors learned by word2vec capture rich semantic information. For example, similar words (like “king” and “queen”) cluster together in the vector space, and analogies (such as king − man + woman ≈ queen) can be solved with simple vector arithmetic by finding the nearest neighbor of the resulting vector.
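A toy demonstration of the analogy arithmetic, with made-up vectors chosen so the relation holds exactly; with real trained vectors the nearest neighbour is only approximately “queen”:

```python
import numpy as np

# Toy embeddings (made-up values) just to show the arithmetic; with trained
# word2vec vectors the nearest neighbour of king - man + woman is typically queen.
embeddings = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(target, exclude):
    """Return the word whose vector has the highest cosine similarity to `target`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], target))

analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # -> "queen" with these toy vectors
```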
The model’s ability to understand analogies and semantic relationships in this way was a major milestone in NLP, leading to more sophisticated language models, such as GPT-3.