- Important Blog on Attention
- Attention and Augmented RNNs (Distill)
- Transformer Implementation TF G-Colab
- A Guide to Attention Mechanisms and Memory Networks (skymind)
- Soft and Hard Attention
- Attention (WildML)
- A Critical Review of Neural Attention Models in Natural Language Processing
- Attention Mechanism(s) (d2l)
- Intuitive Understanding of Attention Mechanism in Deep Learning (medium + code)
- What is the rationale behind self-attention equation and how did they came up with the concept query, key and value? (Reddit!)
- Stanford CS224N: NLP with Deep Learning | Winter 2020 | BERT and Other Pre-trained Language Models - by Jacob Devlin (Lec!)
- How to get meaning from text with language model BERT | AI Explained (Vid!!)
- All about attention in neural networks described as colab notebooks (Code!)
Introduction
- Motivation:
In vanilla Seq2Seq models, the only representation of the input is the fixed-dimensional vector produced by the encoder, which we must carry through the entire decoding process. This is a bottleneck: all of the information in the input sequence has to be condensed into a single fixed-length vector.
- Attention:
Attention is a mechanism that allows DNNs to focus on certain local or global features of the input sequence, in whole or in part. It focuses on certain parts of the input while keeping a low-resolution view of the rest, similar to human attention in vision and audio.
An attention unit takes all sub-regions and the context as input and outputs the weighted arithmetic mean of these regions, i.e. the inner product of the values and their probabilities.
Each sub-region \(x_i\) is first scored against the context:
$$m_i = \tanh (x_iW_x + CW_C)$$
These scores are turned into probabilities using the context: \(C\) represents everything the RNN has output so far, and the probability assigned to \(x_i\) is interpreted as the relevance of that sub-region given \(C\).
The difference between using the hyperbolic tangent and a plain dot product lies in the granularity of the regions of interest: tanh is more fine-grained, selecting smoother, less choppy sub-regions.
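A minimal NumPy sketch of this unit, using weight vectors \(w_x\) and \(w_C\) so that each score \(m_i\) is a scalar (a simplification of the general matrix form; all names and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def attention_unit(X, C, w_x, w_C):
    # X: (n, d) matrix of sub-region vectors x_i; C: (d_c,) context vector.
    scores = np.tanh(X @ w_x + C @ w_C)      # m_i = tanh(x_i w_x + C w_C), shape (n,)
    probs = softmax(scores)                  # relevance of each sub-region given the context
    return probs @ X                         # weighted arithmetic mean of the sub-regions
```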
- Types of Attention:
- Soft Attention: we consider all subregions, each weighted by how relevant it is
- Soft Attention is deterministic
- Hard Attention: we consider only one subregion
- Hard Attention is a stochastic process
- Strategy:
- Encode each word in the sentence into a vector (representation)
- When decoding, perform a linear combination of these vectors, weighted by attention weights
- Use this combination in picking the next word (subregion)
- Calculating Attention:
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
- Use query vector (decoder state) and key vectors (all encoder states)
- For each query-key pair, calculate a weight
- Normalize the weights to sum to one using a softmax
- Combine the value vectors (usually the encoder states, like the key vectors) by taking their weighted sum
- Use this in any part of the model
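A minimal NumPy sketch of this procedure, assuming dot-product scoring and illustrative shapes:

```python
import numpy as np

def attend(query, keys, values):
    # query: (d,); keys, values: (n, d) -- e.g. decoder state vs. encoder states.
    scores = keys @ query                    # compatibility of the query with each key
    weights = np.exp(scores - scores.max())  # softmax: normalize the weights to sum to one
    weights = weights / weights.sum()
    return weights @ values                  # weighted sum of the value vectors
```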
- Attention Score Functions:
\(q\) is the query, \(k\) is the key (a code sketch of all four functions follows this list):
- Multi-Layer Perceptron (Bahdanau et al. 2015):
- Flexible, often very good with large data
$$a(q,k) = w_2^T \tanh (W_1[q;k])$$
- Bilinear (Luong et al. 2015):
- Not used widely in Seq2Seq models
- Results are inconsistent
$$a(q,k) = q^TWk$$
- Dot Product (Luong et al. 2015):
- No parameters
- Requires the sizes to be the same
$$a(q,k) = q^Tk$$
- Scaled Dot Product (Vaswani et al. 2017):
- Solves the scale problem of the dot-product: the scale of the dot product increases as dimensions get larger
$$a(q,k) = \dfrac{q^Tk}{\sqrt{\vert k \vert}}$$
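A minimal NumPy sketch of the four score functions; parameter shapes are illustrative assumptions:

```python
import numpy as np

def mlp_score(q, k, W1, w2):
    # Additive / MLP attention (Bahdanau et al. 2015): w2^T tanh(W1 [q; k])
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))

def bilinear_score(q, k, W):
    # Bilinear attention (Luong et al. 2015): q^T W k
    return q @ W @ k

def dot_score(q, k):
    # Dot product (Luong et al. 2015): q^T k -- q and k must be the same size
    return q @ k

def scaled_dot_score(q, k):
    # Scaled dot product (Vaswani et al. 2017): divide by sqrt(|k|) to control the scale
    return (q @ k) / np.sqrt(len(k))
```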
- What to Attend to?
- Input Sentence:
- A word in the source sentence to be translated - Neural Machine Translation
- Copying Mechanism - Gu et al. 2016
- Lexicon bias - Arthur et al. 2016
- Previously Generated Things:
- In language modeling: attend to the previous words - Merity et al. 2016
Attend to the previous words that you generated and decide whether to use them again (copy)
- In translation: attend to either input or previous output - Vaswani et al. 2017
- Modalities:
- Images (Xu et al. 2015)
- Speech (Chan et al. 2015)
- Hierarchical Structures (Yang et al. 2016):
- Encode with attention over the words in each sentence, then attention over the sentences in the document
- Multiple Sources:
- Attend to multiple sentences in different languages to be translated to one target language (Zoph et al. 2015)
- Attend to a sentence and an image (Huang et al. 2016)
- Intra-Attention/Self-Attention:
Each element in the sentence attends to the other elements, giving context-sensitive encodings. It behaves similarly to a Bi-LSTM in that it tries to encode information about the context (the words around the current input) into the representation of each word.
It differs, however:
- Intra-attention is much more direct, as it takes in the context directly without it being filtered through many steps inside the RNN
- It is much faster, as it is only a dot/matrix product
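A minimal NumPy sketch of (scaled dot-product) self-attention over a sentence matrix, with learned projections omitted for brevity:

```python
import numpy as np

def self_attention(X):
    # X: (n, d) matrix, one row per word; every word attends to every word.
    scores = X @ X.T / np.sqrt(X.shape[1])                   # scaled dot products, (n, n)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ X                                       # context-sensitive encoding per word
```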
- Improvements to Attention:
- The Coverage Problem: Neural models tend to drop or repeat content when tested on data not very similar to the training set
- Solution: Model how many times words have been covered
- Impose a penalty if attention is not \(\approx 1\) for each word (Cohn et al. 2015)
This forces the system to translate each word at least once.
- Add embeddings indicating coverage (Mi et al. 2016)
- Incorporating Markov Properties (Cohn et al. 2015)
- Intuition: attention from last time tends to be correlated with attention this time
- Strategy: Add information about the last attention when making the next decision
- Bidirectional Training (Cohn et al. 2015):
- Intuition: Our attention should be roughly similar in forward and backward directions
- Method: Train so that we get a bonus based on the trace of the product of the two directions' attention matrices (a sketch follows this list):
\(\mathrm{Tr} (A_{X \rightarrow Y}A^T_{Y \rightarrow X})\)
- Supervised Training (Mi et al. 2016):
- Sometimes we can get “gold standard” alignments a priori:
- Manual alignments
- Pre-trained with strong alignment model
- Train the model to match these strong alignments (bias the model)
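A minimal NumPy sketch of the bidirectional-training bonus above; it assumes both attention matrices are stored with the same shape so that the trace is defined:

```python
import numpy as np

def trace_bonus(A_xy, A_yx):
    # tr(A_xy @ A_yx.T) equals the sum of the elementwise product of the two
    # matrices, so the bonus is large when attention in the two translation
    # directions falls on roughly the same word pairs.
    return np.trace(A_xy @ A_yx.T)
```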
- Attention is not Alignment (Koehn and Knowles 2017):
- Attention is often blurred
- Attention is often off by one:
Since the DNN has already seen parts of the information required to generate previous outputs, it might not need all of the information from the word that is actually matched with its current output.
Thus, even if Supervised training is used to increase alignment accuracy, the overall error rate of the task might not actually decrease.
Specialized Attention Varieties
- Hard Attention (Xu et al. 2015):
- Instead of a soft interpolation, make a Zero-One decision about where to attend (Xu et al. 2015)
- Harder to train - requires reinforcement learning methods
- It helps interpretability (Lei et al. 2016)
- Monotonic Attention (Yu et al. 2016):
- In some cases, we might know the output will be the same order as the input:
- Speech Recognition
- Incremental Translation
- Morphological Inflection - sometimes
- Summarization - sometimes
- Hard decisions about whether to read more
- Convolutional Attention (Allamanis et al. 2016):
- Intuition: we might want to be able to attend to “the word after ‘Mr.’”
- Multi-headed Attention:
- Idea: multiple attention heads focus on different parts of the sentence
- Different heads for “copy” vs regular (Allamanis et al. 2016)
- Multiple independently learned heads (Vaswani et al. 2017)
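A minimal NumPy sketch of multi-headed (dot-product) self-attention with independently learned heads; the per-head projection matrices are illustrative assumptions, and the output projection over the concatenated heads used in Vaswani et al. 2017 is omitted:

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv):
    # X: (n, d); Wq, Wk, Wv: lists of per-head projection matrices, each (d, d_head).
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))  # each head can focus on different parts
        heads.append(A @ V)
    return np.concatenate(heads, axis=1)                 # concatenate the heads' outputs
```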
- Tips:
- Don’t use attention with very long sequences - especially those you want to summarize and process efficiently, since the cost of attention grows with sequence length
- Fertility: impose the heuristic that it is bad to pay attention to the same subregion many times
- Notes:
- Attention is a mean-field approximation of sampling from a categorical distribution over source word embeddings (or the RNN state aligned with a source word, etc.)
- Additive vs. Multiplicative Attention:
Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
Multiplicative attention uses the dot-product.
While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
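The speed argument is easy to see in code: multiplicative scores for all query-key pairs are a single matrix multiplication, while additive scores need a hidden layer per pair. A small sketch with illustrative sizes:

```python
import numpy as np

Q = np.random.randn(5, 64)      # 5 queries (e.g. decoder states)
K = np.random.randn(9, 64)      # 9 keys (e.g. encoder states)

# Multiplicative (dot-product) attention: one optimized matmul for all pairs.
dot_scores = Q @ K.T / np.sqrt(64)                     # (5, 9)

# Additive attention: a small feed-forward network applied to every [q; k] pair.
W1 = np.random.randn(32, 128)
w2 = np.random.randn(32)
add_scores = np.array([[w2 @ np.tanh(W1 @ np.concatenate([q, k])) for k in K] for q in Q])
```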
Representing Sentences: Solving the Vector Problem
- The Problem: Conditioning with Vectors:
We are compressing a lot of information into a finite-sized vector.
Moreover, gradients have to flow over a very long time/distance, which makes even LSTMs forget. Sentences come in different sizes but the vector is of a fixed size, so the compression is inherently very lossy.
- The Solution: Representing Sentences as Matrices:
We represent a source sentence as a matrix and generate the target sentence from that matrix:
- Fixed number of rows, but the number of columns depends on the number of words.
This will:
- Solve the capacity problem
- Solve the gradient flow problem
- How to build the Matrices?
- Concatenation:
- Each word type is represented by an n-dimensional vector.
- Take all the vectors for the sentence and concatenate them into a matrix
- This is the simplest possible model - so simple that there are no published results on it…
- Convolutional Networks:
- Apply CNNs to transform the naive concatenated matrix to obtain a context-dependent matrix
- Remove the pooling layer at the end to keep the output variable-sized
- BiRNNs:
- Most widely used in NMT (Bahdanau et al. 2015)
- One column per word
- Each column (word) has two halves concatenated together:
- A “forward representation” (word and its LEFT context)
- A “reverse representation” (word and its RIGHT context)
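A minimal PyTorch sketch of building the sentence matrix with a BiRNN; the use of nn.LSTM and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 64, 128      # illustrative sizes
embed = nn.Embedding(vocab_size, emb_dim)
birnn = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))         # a 7-word "sentence"
states, _ = birnn(embed(tokens))                      # (1, 7, 2 * hidden_dim)

# One column per word; each column concatenates the forward (left-context)
# and backward (right-context) representations of that word.
sentence_matrix = states.squeeze(0).T                 # (2 * hidden_dim, 7)
```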