Sentence Representations
- What?: Sentence representation/embedding learning focuses on producing a single feature vector that represents a sentence in a latent (semantic) space, such that geometric properties (distances, angles) reflect semantic relationships (see the word-averaging sketch after the task list below).
- Tasks:
- Sentence Classification
- Paraphrase Identification
- Semantic Similarity
- Textual Entailment (i.e. Natural Language Inference)
- Retrieval
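To make this concrete, here is a minimal sketch of the word-averaging baseline that reappears later in these notes: embed a sentence as the average of its word vectors and compare sentences by cosine similarity. The tiny word-vector table is made up for illustration; real systems use pretrained embeddings.

```python
# Toy sketch: sentence embedding by word averaging, compared with cosine similarity.
import numpy as np

# Made-up word vectors; in practice these come from pretrained embeddings.
word_vectors = {
    "the": np.array([0.1, 0.3, 0.0]),
    "cat": np.array([0.9, 0.2, 0.1]),
    "dog": np.array([0.8, 0.3, 0.2]),
    "sat": np.array([0.2, 0.7, 0.5]),
}

def embed(sentence: str) -> np.ndarray:
    """Average the vectors of the known words in the sentence."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u, v = embed("the cat sat"), embed("the dog sat")
print(cosine(u, v))  # close to 1: the two sentences are near each other in the space
```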
- Methods:
- Multi-Task Learning: In particular, a common recipe is to pre-train on other tasks, then take the pre-trained weights and fine-tune them on a new task (a minimal sketch of this pattern follows the Language Model Transfer entry below).
- End-To-End vs. Pre-Training: We can always use end-to-end objectives; however, two problems arise that can be mitigated by pre-training:
- Paucity of training data
- Weak feedback: for text classification, the only training signal is a single label at the end of the sentence, which gives little feedback for learning rich representations
Training Sentence Representations
- Language Model Transfer (Dai and Le 2015):
- Model: LSTM
- Objective: Language modeling objective
- Data: Classification data itself, or Amazon reviews
- Downstream: On text classification, initialize weights and continue training
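The pre-train-then-fine-tune pattern above can be sketched as follows. This is a minimal illustration, not Dai and Le's exact setup: the dimensions and the random batch are made up, and optimizer steps are omitted.

```python
# Sketch: pre-train an LSTM language model, then reuse its weights for classification.
import torch
import torch.nn as nn

VOCAB, EMB, HID, NUM_CLASSES = 10_000, 128, 256, 2

class LSTMEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        out, (h, _) = self.lstm(self.embed(tokens))
        return out, h[-1]                       # per-step states, final hidden state

encoder = LSTMEncoder()
x = torch.randint(0, VOCAB, (4, 12))            # toy batch of token ids

# 1) Pre-training: language modeling objective (predict the next word).
lm_head = nn.Linear(HID, VOCAB)
out, _ = encoder(x[:, :-1])
lm_loss = nn.functional.cross_entropy(
    lm_head(out).reshape(-1, VOCAB), x[:, 1:].reshape(-1)
)

# 2) Downstream: keep the pre-trained encoder weights, add a classifier head, fine-tune.
clf_head = nn.Linear(HID, NUM_CLASSES)
_, sent_vec = encoder(x)                        # final state as the sentence vector
labels = torch.randint(0, NUM_CLASSES, (4,))
clf_loss = nn.functional.cross_entropy(clf_head(sent_vec), labels)
```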
- Unidirectional Training + Transformer - OpenAI GPT (Radford et al. 2018):
- Model: Transformer with masked (causal) self-attention
- Objective: Predict the next word left->right
- Data: BooksCorpus
- Downstream: Fine-tune directly on some tasks; for multi-sentence tasks, concatenate the inputs (with delimiter tokens) into one sequence and fine-tune on that
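The "masked self-attention" above can be illustrated with a single-head toy implementation: a causal mask hides future positions so the model can only predict left-to-right. This is a sketch, not GPT's actual code.

```python
# Sketch of causal (masked) self-attention: position i attends only to positions <= i.
import torch

def causal_self_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim); returns (batch, seq_len, dim)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5                     # (batch, seq, seq)
    seq_len = q.size(1)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))                 # hide the future
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 5, 16)                       # toy hidden states
print(causal_self_attention(x, x, x).shape)     # torch.Size([2, 5, 16])
```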
- Auto-encoder Transfer (Dai and Le 2015):
- Model: LSTM
- Objective: From a single sentence vector, reconstruct the original sentence
- Data: Classification data itself, or Amazon reviews
- Downstream: On text classification, initialize weights and continue training
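A minimal sketch of the auto-encoding objective (again not the paper's exact setup, and with made-up dimensions): the encoder's final state is the single sentence vector, and a decoder is trained to reconstruct the same tokens from it.

```python
# Sketch: LSTM auto-encoder that reconstructs a sentence from one vector (the final state).
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256
embed = nn.Embedding(VOCAB, EMB)
encoder = nn.LSTM(EMB, HID, batch_first=True)
decoder = nn.LSTM(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, VOCAB)

x = torch.randint(0, VOCAB, (4, 12))               # toy batch of token ids
_, (h, c) = encoder(embed(x))                      # h: the single sentence vector per example
dec_out, _ = decoder(embed(x[:, :-1]), (h, c))     # teacher-forced reconstruction
recon_loss = nn.functional.cross_entropy(
    out_proj(dec_out).reshape(-1, VOCAB), x[:, 1:].reshape(-1)
)
```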
- Context Prediction Transfer - SkipThought Vectors (Kiros et al. 2015):
- Model: LSTM
- Objective: Predict the surrounding sentences
- Data: Books, important because contiguous sentences provide the surrounding context
- Downstream: Train logistic regression on \([|u - v|; u * v]\) (component-wise absolute difference and product of the two sentence vectors)
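A small sketch of this downstream step, with random vectors standing in for the SkipThought sentence embeddings: build the component-wise features \([|u - v|; u * v]\) for each pair and fit a logistic regression on them.

```python
# Sketch: pair features for a linear classifier on top of fixed sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Component-wise absolute difference and product: [|u - v|; u * v]
    return np.concatenate([np.abs(u - v), u * v])

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 50))                  # toy "sentence embeddings" for sentence 1
V = rng.normal(size=(100, 50))                  # toy "sentence embeddings" for sentence 2
X = np.stack([pair_features(u, v) for u, v in zip(U, V)])
y = rng.integers(0, 2, size=100)                # toy pair labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
```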
- Paraphrase ID Transfer (Wieting et al. 2015):
- Model: Try many different ones
- Objective: Predict whether two phrases are paraphrases or not from the similarity of their embeddings
- Data: Paraphrase database (http://paraphrase.org), created from bilingual data
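As a hedged sketch (the exact model and loss vary across these papers), one common way to train on paraphrase pairs is a margin loss on the cosine similarity of the two embeddings, pulling paraphrases together and pushing a sampled negative away:

```python
# Hedged sketch: margin loss on cosine similarity for paraphrase pairs.
import torch
import torch.nn.functional as F

def paraphrase_margin_loss(s1, s2, neg, margin=0.4):
    """s1, s2: embeddings of a paraphrase pair; neg: embedding of a sampled non-paraphrase."""
    pos_sim = F.cosine_similarity(s1, s2, dim=-1)
    neg_sim = F.cosine_similarity(s1, neg, dim=-1)
    return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()

s1, s2, neg = torch.randn(8, 100), torch.randn(8, 100), torch.randn(8, 100)
print(paraphrase_margin_loss(s1, s2, neg))
```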
- Large Scale Paraphrase Data - ParaNMT-50M (Wieting and Gimpel 2018):
- Automatic construction of large paraphrase DB:
- Get large parallel corpus (English-Czech)
- Translate the Czech side using a SOTA NMT system
- Get automatic quality scores and hand-annotate a sample to validate them
- The corpus is huge but noisy: 50M sentences (about 30M are high quality)
- Trained representations work quite well and generalize
- Downstream Usage: Sentence similarity, classification, etc.
- Result: Interestingly, LSTMs work well on in-domain data, but word averaging generalizes better
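The construction recipe above can be sketched as a simple pipeline. This is a hedged illustration: the `translate_cs_to_en` callable is assumed to wrap a trained Czech-to-English NMT system, and the filtering shown (dropping verbatim copies) is a simplification of the paper's quality scoring.

```python
# Hedged sketch of ParaNMT-style paraphrase construction via back-translation.
from typing import Callable, Iterable, Iterator, Tuple

def build_paraphrase_pairs(
    parallel_corpus: Iterable[Tuple[str, str]],        # (english, czech) sentence pairs
    translate_cs_to_en: Callable[[str], str],          # assumed: a trained NMT system
) -> Iterator[Tuple[str, str]]:
    for english, czech in parallel_corpus:
        back_translation = translate_cs_to_en(czech)
        # Keep only pairs where the back-translation is not a verbatim copy of the reference.
        if back_translation.strip().lower() != english.strip().lower():
            yield english, back_translation
```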
- Entailment Transfer - InferSent (Conneau et al. 2017):
- Question: Previous objectives use no human labels, but what if we use supervised training on a task such as entailment? Does that learn more generalizable embeddings?
- The task is more difficult and requires capturing nuance → yes? But the data is much smaller → no?
- Model: Bi-LSTM + max pooling
- Data: Stanford NLI, MultiNLI
- Results: Tends to be better than unsupervised objectives such as SkipThought
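For illustration, a minimal InferSent-style sketch with simplifications (a linear classifier instead of the paper's MLP, toy dimensions, random token ids): a Bi-LSTM with max pooling over time encodes each sentence, and the premise/hypothesis vectors are combined as \([u; v; |u - v|; u * v]\) for 3-way NLI classification.

```python
# Sketch: Bi-LSTM + max-pooling sentence encoder with NLI pair features.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256

class BiLSTMMaxPool(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True, bidirectional=True)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        out, _ = self.lstm(self.embed(tokens))        # (batch, seq_len, 2 * HID)
        return out.max(dim=1).values                  # max pooling over time

encoder = BiLSTMMaxPool()
classifier = nn.Linear(8 * HID, 3)                    # entailment / neutral / contradiction

premise = torch.randint(0, VOCAB, (4, 15))
hypothesis = torch.randint(0, VOCAB, (4, 12))
u, v = encoder(premise), encoder(hypothesis)
features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)    # [u; v; |u-v|; u*v]
logits = classifier(features)
```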