Neural Machine Translation, Seq2Seq, and Attention

This summary captures my understanding of the lecture on neural machine translation, focusing on Seq2Seq models and the attention mechanism.

Neural Machine Translation (NMT)

NMT aims to produce sequential outputs like translations, conversations, or summaries. Traditional approaches relied on separate models for translation and language representation, which struggled with syntax and long-term dependencies. Seq2Seq models revolutionized this field by using neural networks to handle these limitations.

Sequence-to-Sequence (Seq2Seq)

Seq2Seq is an end-to-end framework using two recurrent neural networks (RNNs):

  1. Encoder: Processes the input sequence into a fixed-size context vector \(C\), often utilizing stacked LSTMs for better context representation.
  2. Decoder: Generates the output sequence using \(C\) as an initial hidden state. It employs an autoregressive process where each generated token serves as input for subsequent predictions.

Key features include:

  • Reversed Input Processing: Improves translation by aligning the first output token with the last input token.
  • Bidirectional RNNs: Encode input sequences by concatenating forward and backward hidden states, capturing dependencies in both directions.

Loss and Training

The model uses cross-entropy loss to evaluate predictions, optimized through backpropagation. Both encoder and decoder weights are updated jointly to align input and output contexts.

Attention Mechanism

The Seq2Seq architecture’s fixed-size context vector limits its performance on long sequences. The attention mechanism addresses this by dynamically weighting input tokens based on their relevance to each decoding step.

Bahdanau Attention

For a target word \(y_i\), the context vector \(c_i\) is computed as: \[ c_i = \sum_{j=1}^n \alpha_{i,j} h_j \] where \(h_j\) are encoder outputs, and \(\alpha_{i,j}\) are attention weights derived from: \[ \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^n \exp(e_{i,k})} \] The score \(e_{i,j}\) is typically computed via a feedforward network: \[ e_{i,j} = a(s_{i-1}, h_j) \]

Attention provides alignment between source and target sequences, improving translation quality, especially for longer inputs.

Variants of Attention

  • Global Attention: Focuses on the entire input sequence.
  • Local Attention: Limits the focus to a subset of the input sequence for efficiency.
  • Input-Feeding: Uses previous attention vectors as inputs to the decoder for better context propagation.

Decoding Methods

Decoding translates a source sentence \(s\) into a target sentence \(s’\) by optimizing: \[ s^* = \arg\max_{s’} P(s’|s) \]

Common strategies include:

  • Greedy Search: Selects the most probable token at each step.
  • Beam Search: Maintains multiple candidate sequences, balancing efficiency and precision.

Evaluation Metrics

Evaluating translations is complex due to subjectivity. Common metrics include:

  • Human Evaluation: The gold standard but expensive.
  • BLEU: Measures n-gram overlap between candidate and reference translations: \[ BLEU = \text{BP} \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right) \] where \(p_n\) is the precision for n-grams, \(w_n\) are weights, and BP is a brevity penalty.

Handling Large Vocabularies

Large vocabularies challenge Seq2Seq models due to the high computational cost of softmax over all tokens. Solutions include:

  1. Noise-Contrastive Estimation (NCE): Approximates softmax by contrasting with negative samples.
  2. Hierarchical Softmax: Structures vocabulary into a binary tree, reducing complexity to \(O(\log V )\).
  3. Reduced Vocabulary: Limits vocabulary size, replacing rare words with \(\\).

Subword and Character-Level Models

To address unknown or rare words:

  • Byte Pair Encoding (BPE): Breaks words into frequent subword units.
  • Character-Based Models: Builds embeddings from character sequences using bi-LSTMs.

Hybrid models combine word- and character-level representations to handle rare words without sacrificing efficiency.

Conclusion

The integration of attention mechanisms and innovations like hybrid models has greatly advanced NMT systems, making them more accurate and robust to real-world linguistic challenges.