Dependency Parsing
I watched Module 3 - Dependency Parsing and am summarizing its content here. The focus is on the concepts and techniques involved in parsing the syntactic dependency structure of sentences.
Dependency Grammar and Structure
In dependency grammar, parse trees represent the syntactic structure of sentences. Unlike constituency structures, which use nested constituents, dependency structures emphasize binary, asymmetric relationships between words. These relationships are called dependencies. Dependencies typically form tree structures where:
- Head (Governor): The superior word in the dependency relationship.
- Dependent (Modifier): The subordinate word.
Example
A dependency tree represents a sentence like:
Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.
Here, arrows point from heads to dependents, and a fake ROOT node is added so that every word is the dependent of exactly one head.
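As a concrete illustration, a dependency parse can be stored as a set of (head, relation, dependent) triples. The following is a minimal Python sketch showing a few arcs for the example sentence; the relation labels are illustrative placeholders rather than a verified gold annotation.

```python
# Minimal sketch: a dependency parse stored as (head, relation, dependent) triples.
# The relation labels below are illustrative placeholders, not a verified gold parse.
arcs = {
    ("ROOT", "root", "submitted"),          # fake ROOT points to the main predicate
    ("submitted", "nsubj:pass", "Bills"),   # "Bills" is the (passive) subject of "submitted"
    ("Bills", "nmod", "ports"),             # "on ports" modifies "Bills"
    ("ports", "case", "on"),
    ("ports", "conj", "immigration"),       # "ports and immigration" coordination
}

# Print each arc; every word ends up with exactly one head,
# which is what adding the fake ROOT guarantees for the main predicate.
for head, rel, dep in sorted(arcs):
    print(f"{head} --{rel}--> {dep}")
```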
Dependency Parsing
Dependency parsing involves analyzing the syntactic dependency structure of a sentence, \( S \), to produce a dependency tree. Formally, it maps the input sentence \( S = w_0 w_1 \ldots w_n \) (where \( w_0 \) is the ROOT) to a dependency graph \( G \). This process has two main subproblems:
- Learning: Train a parsing model \( M \) using a dataset \( D \) of annotated dependency graphs.
- Parsing: Use \( M \) to find the optimal dependency graph \( G \) for a new sentence \( S \).
Transition-Based Dependency Parsing
Transition-based dependency parsing uses a state machine to map sentences into dependency trees. A state \( c \) is represented as:
- A stack \( \sigma \) of words \( w_i \) from \( S \).
- A buffer \( \beta \) of remaining words.
- A set of dependency arcs \( A \) where each arc is \( (w_i, r, w_j) \), with \( r \) as the dependency relation.
States
- Initial State: \( ([w_0]_{\sigma}, [w_1, \ldots, w_n]_{\beta}, \emptyset) \), where the stack contains only the ROOT.
- Terminal State: \((\sigma, []_{\beta}, A)\), where the buffer is empty, and \( A \) contains all dependencies.
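A minimal Python sketch of how these states might be represented, using a short made-up sentence ("I ate fish"); the tuple layout (stack, buffer, arcs) mirrors \( (\sigma, \beta, A) \) above, and the relation labels are illustrative.

```python
# Minimal sketch: a parser state as a (stack, buffer, arcs) tuple mirroring (sigma, beta, A).
sentence = ["ROOT", "I", "ate", "fish"]   # w0 = ROOT, then w1..wn

# Initial state: only ROOT on the stack, all real words in the buffer, no arcs yet.
initial_state = (["ROOT"], ["I", "ate", "fish"], set())

# A terminal state: empty buffer; the arc set holds the complete parse.
terminal_state = (
    ["ROOT"],
    [],
    {("ROOT", "root", "ate"), ("ate", "nsubj", "I"), ("ate", "obj", "fish")},
)
```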
Transitions
Three types of transitions govern the state machine (a code sketch follows this list):
- Shift: Moves the first word of the buffer to the top of the stack.
- Left-Arc\(_r\): Adds an arc \( (w_j, r, w_i) \), where \( w_j \) is at the top of the stack and \( w_i \) is second-to-top, then removes \( w_i \) from the stack.
- Right-Arc\(_r\): Adds an arc \( (w_i, r, w_j) \), where \( w_i \) is second-to-top of the stack and \( w_j \) is at the top, then removes \( w_j \) from the stack.
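The following is a minimal, self-contained Python sketch of the three transitions operating on a (stack, buffer, arcs) state; the function names and the hand-chosen transition sequence at the end are illustrative, not the exact formulation from the module.

```python
# Minimal sketch of the arc-standard transitions over a (stack, buffer, arcs) state.

def shift(stack, buffer, arcs):
    # Move the first word of the buffer onto the top of the stack.
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, r):
    # Top of stack w_j is the head; second-to-top w_i becomes its dependent and is removed.
    w_j = stack[-1]
    w_i = stack.pop(-2)
    arcs.add((w_j, r, w_i))

def right_arc(stack, buffer, arcs, r):
    # Second-to-top w_i is the head; top of stack w_j becomes its dependent and is removed.
    w_j = stack.pop(-1)
    w_i = stack[-1]
    arcs.add((w_i, r, w_j))

# Hand-chosen transition sequence that parses "I ate fish".
stack, buffer, arcs = ["ROOT"], ["I", "ate", "fish"], set()
shift(stack, buffer, arcs)                 # stack: ROOT I
shift(stack, buffer, arcs)                 # stack: ROOT I ate
left_arc(stack, buffer, arcs, "nsubj")     # ate -> I
shift(stack, buffer, arcs)                 # stack: ROOT ate fish
right_arc(stack, buffer, arcs, "obj")      # ate -> fish
right_arc(stack, buffer, arcs, "root")     # ROOT -> ate
print(arcs)   # {('ate', 'nsubj', 'I'), ('ate', 'obj', 'fish'), ('ROOT', 'root', 'ate')}
```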
Neural Dependency Parsing
Neural dependency parsers rely on dense feature representations, improving efficiency and accuracy over traditional sparse-feature methods. The parser is greedy and transition-based, with the arc-standard system governing its transitions.
Features
The parser considers dense embeddings for:
- Words (\( S^{word} \)): Top 3 stack words, top 3 buffer words, and their dependents (e.g., leftmost/rightmost children).
- Part-of-Speech Tags (\( S^{tag} \)): POS tags of the relevant words.
- Arc Labels (\( S^{label} \)): Dependency relations of the words.
Each feature type has an embedding matrix:
- \( E_w \in \mathbb{R}^{d \times N_w} \): Word embeddings.
- \( E_t \in \mathbb{R}^{d \times N_t} \): POS tag embeddings.
- \( E_l \in \mathbb{R}^{d \times N_l} \): Arc label embeddings.
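A minimal PyTorch sketch of the three embedding matrices and a feature lookup follows; the vocabulary sizes, embedding dimension, and number of feature positions are illustrative assumptions (the feature counts follow a Chen and Manning-style setup).

```python
import torch
import torch.nn as nn

# Illustrative sizes: N_w, N_t, N_l and the embedding dimension d are assumptions.
N_w, N_t, N_l, d = 20000, 45, 40, 50
n_word_feats, n_tag_feats, n_label_feats = 18, 18, 12  # assumed feature positions per configuration

E_w = nn.Embedding(N_w, d)   # word embeddings (stored row-wise; corresponds to E_w above)
E_t = nn.Embedding(N_t, d)   # POS tag embeddings
E_l = nn.Embedding(N_l, d)   # arc label embeddings

# Integer indices of the selected words/tags/labels for one configuration (dummy values here).
word_idx  = torch.randint(0, N_w, (n_word_feats,))
tag_idx   = torch.randint(0, N_t, (n_tag_feats,))
label_idx = torch.randint(0, N_l, (n_label_feats,))

# Look up and flatten each group into a single dense feature vector.
x_w = E_w(word_idx).reshape(-1)    # shape: (n_word_feats * d,)
x_t = E_t(tag_idx).reshape(-1)
x_l = E_l(label_idx).reshape(-1)
x = torch.cat([x_w, x_t, x_l])     # input layer: concatenation [x_w, x_t, x_l]
print(x.shape)
```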
Feedforward Neural Network
The model processes the input embeddings through:
- Input Layer: Concatenation of selected features, \( [x_w, x_t, x_l] \).
- Hidden Layer: Computes \( h = \tanh(W_w x_w + W_t x_t + W_l x_l + b_1) \).
- Softmax Layer: Produces probabilities for transitions \( p = \text{softmax}(W_2 h) \).
The network minimizes cross-entropy loss, and backpropagation updates both embedding matrices and network parameters.
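Below is a minimal PyTorch sketch of this feedforward classifier, wired as in the equations above with the module's tanh formulation; the layer sizes and the number of transition classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; the real model's dimensions are hyperparameters.
d_w, d_t, d_l = 18 * 50, 18 * 50, 12 * 50    # flattened sizes of x_w, x_t, x_l
hidden, n_transitions = 200, 3               # e.g. Shift / Left-Arc / Right-Arc (unlabeled case)

class TransitionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.W_w = nn.Linear(d_w, hidden, bias=False)
        self.W_t = nn.Linear(d_t, hidden, bias=False)
        self.W_l = nn.Linear(d_l, hidden, bias=False)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.W2 = nn.Linear(hidden, n_transitions, bias=False)

    def forward(self, x_w, x_t, x_l):
        # h = tanh(W_w x_w + W_t x_t + W_l x_l + b_1)
        h = torch.tanh(self.W_w(x_w) + self.W_t(x_t) + self.W_l(x_l) + self.b1)
        return self.W2(h)                    # logits; softmax over these gives p = softmax(W2 h)

model = TransitionClassifier()
x_w, x_t, x_l = torch.randn(1, d_w), torch.randn(1, d_t), torch.randn(1, d_l)
logits = model(x_w, x_t, x_l)
gold_transition = torch.tensor([0])               # dummy gold label
loss = F.cross_entropy(logits, gold_transition)   # cross-entropy over transition classes
loss.backward()                                   # backpropagation through W_w, W_t, W_l, b1, W2
```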
Summary of Greedy Approach
The parser predicts transitions sequentially to build a parse tree, relying on dense features extracted from the current configuration at each step; a sketch of this decoding loop follows.
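A minimal sketch of greedy decoding; `extract_features` and `score_transitions` are hypothetical stand-ins for the feature extraction and neural scoring described above.

```python
# Minimal sketch of greedy transition-based decoding.
# `extract_features` and `score_transitions` are hypothetical stand-ins;
# a real parser would restrict the choice to legal transitions.

def greedy_parse(sentence, extract_features, score_transitions):
    stack, buffer, arcs = ["ROOT"], list(sentence), set()
    # Arc-standard terminates when the buffer is empty and only ROOT remains on the stack.
    while buffer or len(stack) > 1:
        features = extract_features(stack, buffer, arcs)
        # Pick the highest-scoring legal transition for the current configuration.
        transition = score_transitions(features, stack, buffer)
        stack, buffer, arcs = transition(stack, buffer, arcs)
    return arcs
```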
For more details on neural dependency parsers, see:
“A Fast and Accurate Dependency Parser using Neural Networks” by Chen and Manning (2014).