- Important Blog on Attention
- Attention and Augmented RNNs (Distill)
- Transformer Implementation TF G-Colab
- A Guide to Attention Mechanisms and Memory Networks (skymind)
- Soft and Hard Attention
- Attention (WildML)
- A Critical Review of Neural Attention Models in Natural Language Processing
- Attention Mechanism(s) (d2l)
- Intuitive Understanding of Attention Mechanism in Deep Learning (medium + code)
- What is the rationale behind self-attention equation and how did they came up with the concept query, key and value? (Reddit!)
- Stanford CS224N: NLP with Deep Learning | Winter 2020 | BERT and Other Pre-trained Language Models - by Jacob Devlin (Lec!)
- How to get meaning from text with language model BERT | AI Explained (Vid!!)
- All about attention in neural networks described as colab notebooks (Code!)
Introduction
- Motivation:
In vanilla Seq2Seq models, the only representation of the input is the fixed-dimensional vector produced by the encoder, which we must carry through the entire decoding process. This is a bottleneck: all of the information in the input sequence has to be condensed into a single fixed-length vector.
- Attention:
Attention is a mechanism that allows DNNs to focus on certain local or global features of the input sequence, in whole or in part. It focuses on certain parts of the input while keeping a low-resolution view of the rest, similar to human attention in vision and audio.
An attention unit takes all sub-regions and the context as input and outputs the weighted arithmetic mean of these regions, i.e. the inner product of the values and their probabilities.
Each sub-region \(x_i\) is first scored against the context:
$$m_i = \tanh (x_iW_x + CW_C)$$
These scores are turned into probabilities using the context: \(C\) represents everything the RNN has output so far, and the probability assigned to \(x_i\) is interpreted as the relevance of that sub-region given \(C\).
The difference between using the hyperbolic tangent and a plain dot product lies in the granularity of the regions of interest: tanh is more fine-grained, selecting smoother, less choppy sub-regions.
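A minimal NumPy sketch of this unit, using weight vectors \(w_x\) and \(w_C\) so that each score \(m_i\) is a scalar (a simplification of the general matrix form; all names and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def attention_unit(X, C, w_x, w_C):
    # X: (n, d) matrix of sub-region vectors x_i; C: (d_c,) context vector.
    scores = np.tanh(X @ w_x + C @ w_C)      # m_i = tanh(x_i w_x + C w_C), shape (n,)
    probs = softmax(scores)                  # relevance of each sub-region given the context
    return probs @ X                         # weighted arithmetic mean of the sub-regions
```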
- Types of Attention:
- Soft Attention: we consider all subregions, each weighted by how relevant it is
- Soft Attention is deterministic
- Hard Attention: we consider only one subregion
- Hard Attention is a stochastic process
- Strategy:
- Encode each word in the sentence into a vector (representation)
- When decoding, perform a linear combination of these vectors, weighted by attention weights
- Use this combination in picking the next word (subregion)
- Calculating Attention:
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
- Use query vector (decoder state) and key vectors (all encoder states)
- For each query-key pair, calculate a weight
- Normalize the weights to sum to one using a softmax
- Combine the value vectors (usually the encoder states, like the key vectors) by taking their weighted sum
- Use this in any part of the model
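A minimal NumPy sketch of this procedure, assuming dot-product scoring and illustrative shapes:

```python
import numpy as np

def attend(query, keys, values):
    # query: (d,); keys, values: (n, d) -- e.g. decoder state vs. encoder states.
    scores = keys @ query                    # compatibility of the query with each key
    weights = np.exp(scores - scores.max())  # softmax: normalize the weights to sum to one
    weights = weights / weights.sum()
    return weights @ values                  # weighted sum of the value vectors
```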
- Attention Score Functions:
\(q\) is the query, \(k\) is the key (a code sketch of all four functions follows this list):
- Multi-Layer Perceptron (Bahdanau et al. 2015):
- Flexible, often very good with large data
$$a(q,k) = w_2^T \tanh (W_1[q;k])$$
- Bilinear (Luong et al. 2015):
- Not used widely in Seq2Seq models
- Results are inconsistent
$$a(q,k) = q^TWk$$
- Dot Product (Luong et al. 2015):
- No parameters
- Requires the sizes to be the same
$$a(q,k) = q^Tk$$
- Scaled Dot Product (Vaswani et al. 2017):
- Solves the scale problem of the dot-product: the scale of the dot product increases as dimensions get larger
$$a(q,k) = \dfrac{q^Tk}{\sqrt{\vert k \vert}}$$
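A minimal NumPy sketch of the four score functions; parameter shapes are illustrative assumptions:

```python
import numpy as np

def mlp_score(q, k, W1, w2):
    # Additive / MLP attention (Bahdanau et al. 2015): w2^T tanh(W1 [q; k])
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))

def bilinear_score(q, k, W):
    # Bilinear attention (Luong et al. 2015): q^T W k
    return q @ W @ k

def dot_score(q, k):
    # Dot product (Luong et al. 2015): q^T k -- q and k must be the same size
    return q @ k

def scaled_dot_score(q, k):
    # Scaled dot product (Vaswani et al. 2017): divide by sqrt(|k|) to control the scale
    return (q @ k) / np.sqrt(len(k))
```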
- What to Attend to?
- Input Sentence:
- A word in the source sentence to be translated - Neural Machine Translation
- Copying Mechanism - Gu et al. 2016
- Lexicon bias - Arthur et al. 2016
- Previously Generated Things:
- In language modeling: attend to the previous words - Merity et al. 2016
Attend to the previous words that you generated and decide whether to use them again (copy)
- In translation: attend to either input or previous output - Vaswani et al. 2017
- Modalities:
- Images (Xu et al. 2015)
- Speech (Chan et al. 2015)
- Hierarchical Structures (Yang et al. 2016):
- Encode with attention over the words in each sentence, then attention over the sentences in the document
- Multiple Sources:
- Attend to multiple sentences in different languages to be translated to one target language (Zoph et al. 2015)
- Attend to a sentence and an image (Huang et al. 2016)
- Intra-Attention/Self-Attention:
Each element in the sentence attends to the other elements, giving context-sensitive encodings. It behaves similarly to a Bi-LSTM in that it tries to encode information about the context (the words around the current input) into the representation of each word.
It differs, however:
- Intra-attention is much more direct, as it takes in the context directly without it being filtered through many steps inside the RNN
- It is much faster, as it is only a dot/matrix product
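A minimal NumPy sketch of (scaled dot-product) self-attention over a sentence matrix, with learned projections omitted for brevity:

```python
import numpy as np

def self_attention(X):
    # X: (n, d) matrix, one row per word; every word attends to every word.
    scores = X @ X.T / np.sqrt(X.shape[1])                   # scaled dot products, (n, n)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ X                                       # context-sensitive encoding per word
```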
- Improvements to Attention:
- The Coverage Problem: Neural models tend to drop or repeat content when tested on data not very similar to the training set
- Solution: Model how many times words have been covered
- Impose a penalty if attention is not \(\approx 1\) for each word (Cohn et al. 2015)
This forces the system to translate each word at least once.
- Add embeddings indicating coverage (Mi et al. 2016)
- Incorporating Markov Properties (Cohn et al. 2015)
- Intuition: attention from last time tends to be correlated with attention this time
- Strategy: Add information about the last attention when making the next decision
- Bidirectional Training (Cohn et al. 2015):
- Intuition: Our attention should be roughly similar in forward and backward directions
- Method: Train so that we get a bonus based on the trace of the product of the two directions' attention matrices (a sketch follows this list):
\(\mathrm{Tr} (A_{X \rightarrow Y}A^T_{Y \rightarrow X})\)
- Supervised Training (Mi et al. 2016):
- Sometimes we can get “gold standard” alignments a priori:
- Manual alignments
- Pre-trained with strong alignment model
- Train the model to match these strong alignments (bias the model)
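A minimal NumPy sketch of the bidirectional-training bonus above; it assumes both attention matrices are stored with the same shape so that the trace is defined:

```python
import numpy as np

def trace_bonus(A_xy, A_yx):
    # tr(A_xy @ A_yx.T) equals the sum of the elementwise product of the two
    # matrices, so the bonus is large when attention in the two translation
    # directions falls on roughly the same word pairs.
    return np.trace(A_xy @ A_yx.T)
```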
- Attention is not Alignment (Koehn and Knowles 2017):
- Attention is often blurred
- Attention is often off by one:
Since the DNN has already seen parts of the information required to generate previous outputs, it might not need all of the information from the word that is actually matched with its current output.
Thus, even if Supervised training is used to increase alignment accuracy, the overall error rate of the task might not actually decrease.
Specialized Attention Varieties
- Hard Attention (Xu et al. 2015):
- Instead of a soft interpolation, make a Zero-One decision about where to attend (Xu et al. 2015)
- Harder to train - requires reinforcement learning methods
- It helps interpretability (Lei et al. 2016)
- Monotonic Attention (Yu et al. 2016):
- In some cases, we might know the output will be the same order as the input:
- Speech Recognition
- Incremental Translation
- Morphological Inflection - sometimes
- Summarization - sometimes
- Hard decisions about whether to read more
- Convolutional Attention (Allamanis et al. 2016):
- Intuition: we might want to be able to attend to “the word after ‘Mr.’”
- Multi-headed Attention:
- Idea: multiple attention heads focus on different parts of the sentence
- Different heads for “copy” vs regular (Allamanis et al. 2016)
- Multiple independently learned heads (Vaswani et al. 2017)
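A minimal NumPy sketch of multi-headed (dot-product) self-attention with independently learned heads; the per-head projection matrices are illustrative assumptions, and the output projection over the concatenated heads used in Vaswani et al. 2017 is omitted:

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv):
    # X: (n, d); Wq, Wk, Wv: lists of per-head projection matrices, each (d, d_head).
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))  # each head can focus on different parts
        heads.append(A @ V)
    return np.concatenate(heads, axis=1)                 # concatenate the heads' outputs
```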
- Tips:
- Don’t use attention with very long sequences - especially those you want to summarize and process efficiently, since the cost of attention grows with sequence length
- Fertility: impose the heuristic that it is bad to pay attention to the same subregion many times
- Notes:
- Attention is a mean-field approximation of sampling from a categorical distribution over source word embeddings (or the RNN state aligned with a source word, etc.)
- Additive vs. Multiplicative Attention:
Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
Multiplicative attention uses the dot-product.
While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
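The speed argument is easy to see in code: multiplicative scores for all query-key pairs are a single matrix multiplication, while additive scores need a hidden layer per pair. A small sketch with illustrative sizes:

```python
import numpy as np

Q = np.random.randn(5, 64)      # 5 queries (e.g. decoder states)
K = np.random.randn(9, 64)      # 9 keys (e.g. encoder states)

# Multiplicative (dot-product) attention: one optimized matmul for all pairs.
dot_scores = Q @ K.T / np.sqrt(64)                     # (5, 9)

# Additive attention: a small feed-forward network applied to every [q; k] pair.
W1 = np.random.randn(32, 128)
w2 = np.random.randn(32)
add_scores = np.array([[w2 @ np.tanh(W1 @ np.concatenate([q, k])) for k in K] for q in Q])
```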
Representing Sentences: Solving the Vector Problem
- The Problem: Conditioning with Vectors:
We are compressing a lot of information into a finite-sized vector.
Moreover, gradients have to flow over a very long time/distance, which makes even LSTMs forget. Sentences come in different sizes but the vector is of a fixed size, so the compression is inherently very lossy.
- The Solution: Representing Sentences as Matrices:
We represent a source sentence as a matrix and generate the target sentence from that matrix:
- Fixed number of rows, but the number of columns depends on the number of words.
This will:
- Solve the capacity problem
- Solve the gradient flow problem
- How to build the Matrices?
- Concatenation:
- Each word type is represented by an n-dimensional vector.
- Take all the vectors for the sentence and concatenate them into a matrix
- This is the simplest possible model - so simple that there are no published results on it…
- Convolutional Networks:
- Apply CNNs to transform the naive concatenated matrix to obtain a context-dependent matrix
- Remove the pooling layer at the end to keep the output variable-sized
- BiRNNs:
- Most widely used in NMT (Bahdanau et al. 2015)
- One column per word
- Each column (word) has two halves concatenated together:
- A “forward representation” (word and its LEFT context)
- A “reverse representation” (word and its RIGHT context)
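A minimal PyTorch sketch of building the sentence matrix with a BiRNN; the use of nn.LSTM and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 64, 128      # illustrative sizes
embed = nn.Embedding(vocab_size, emb_dim)
birnn = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))         # a 7-word "sentence"
states, _ = birnn(embed(tokens))                      # (1, 7, 2 * hidden_dim)

# One column per word; each column concatenates the forward (left-context)
# and backward (right-context) representations of that word.
sentence_matrix = states.squeeze(0).T                 # (2 * hidden_dim, 7)
```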