Sentence Representations

  1. What?:
    Sentence representation/embedding learning focuses on producing a single feature vector that represents a sentence in a latent (semantic) space, while preserving geometric properties such as distances and angles (so that, e.g., semantically similar sentences end up close together).

  2. Tasks:
    • Sentence Classification
    • Paraphrase Identification
    • Semantic Similarity
    • Textual Entailment (i.e. Natural Language Inference)
    • Retrieval

  3. Methods:
    Multi-Task Learning:
    In particular, a common recipe is pre-training on other tasks, then initializing with the pre-trained weights and fine-tuning them on the new task (a minimal sketch of this recipe follows this list).

  4. End-To-End VS Pre-Training:
    We can always use end-to-end objectives; however, two problems arise that pre-training helps mitigate:
    • Paucity of training data for the target task
    • Weak feedback: in text classification, the only training signal is a single label at the end of the sentence, which is weak supervision for learning rich sentence representations
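
A minimal sketch of this pre-train-then-fine-tune recipe (the same pattern used by several of the methods in the next section), assuming PyTorch; the module names `SentenceEncoder` and `SentenceClassifier` are hypothetical placeholders, not any paper's actual code:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Embeds a token-id sequence and encodes it with an LSTM; the final
    hidden state is used as the sentence representation."""
    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, token_ids):                   # (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                              # (batch, hid_dim)

class SentenceClassifier(nn.Module):
    """Target-task model: a (possibly pre-trained) encoder plus a new head."""
    def __init__(self, encoder, hid_dim=256, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids))

# 1) Pre-training phase: train `encoder` on an auxiliary objective
#    (language modeling, auto-encoding, ...) over unlabeled text, then save it.
encoder = SentenceEncoder()
# ... pre-training loop over the auxiliary objective would go here ...
torch.save(encoder.state_dict(), "encoder_pretrained.pt")

# 2) Fine-tuning phase: initialize from the pre-trained weights and continue
#    training end-to-end on the (smaller) labeled target-task data.
encoder = SentenceEncoder()
encoder.load_state_dict(torch.load("encoder_pretrained.pt"))
model = SentenceClassifier(encoder, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fine-tune everything
```

The key design point is that only the encoder weights are transferred; the task-specific head is freshly initialized, and fine-tuning often uses a smaller learning rate than pre-training.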

Training Sentence Representations

  1. Language Model Transfer (Dai and Le 2015):
    • Model: LSTM
    • Objective: Language modeling objective
    • Data: Classification data itself, or Amazon reviews
    • Downstream: On text classification, initialize weights and continue training
  2. Unidirectional Training + Transformer - OpenAI GPT (Radford et al. 2018):
    • Model: Transformer with masked (causal) self-attention
    • Objective: Predict the next word, left to right
    • Data: BooksCorpus
    • Downstream: Fine-tune directly on some tasks; tasks with multi-sentence inputs require additional input transformations (concatenating the sentences with delimiter tokens) before fine-tuning
  3. Auto-encoder Transfer (Dai and Le 2015):
    • Model: LSTM
    • Objective: From a single sentence vector, reconstruct the original sentence
    • Data: Classification data itself, or Amazon reviews
    • Downstream: On text classification, initialize weights and continue training
  4. Context Prediction Transfer - SkipThought Vectors (Kiros et al. 2015):
    • Model: LSTM
    • Objective: Predict the surrounding sentences
    • Data: Books (important because they provide coherent surrounding-sentence context)
    • Downstream: Train logistic regression on the pair features \([|u - v|;\, u \odot v]\), i.e. the element-wise absolute difference and product of the two sentence vectors (see the logistic-regression sketch after this list)
  5. Paraphrase ID Transfer (Wieting et al. 2015):
    • Model: Try many different ones
    • Objective: Predict whether two phrases are paraphrases or not
    • Data: Paraphrase database (http://paraphrase.org), created from bilingual data
      • Large Scale Paraphrase Data - ParaNMT-50M (Wieting and Gimpel 2018):
        • Automatic construction of large paraphrase DB:
          • Get large parallel corpus (English-Czech)
          • Translate the Czech side using a SOTA NMT system
          • Score the resulting pairs automatically, and manually annotate a sample to validate quality
        • The corpus is huge but noisy: 50M sentence pairs, of which about 30M are high quality
        • Trained representations work quite well and generalize
    • Downstream Usage: Sentence similarity, classification, etc.
    • Result: Interestingly, LSTMs work well on in-domain data, but word averaging generalizes better
  6. Entailment Transfer - InferSent (Conneau et al. 2017):
    • Previous objectives use no human labels, but what if we use human-labeled data instead?
    • Question: can supervised training on a task such as entailment learn generalizable embeddings?
      • The task is more difficult and requires capturing nuance → yes?; but the labeled data is much smaller → no?
    • Model: Bi-LSTM + max pooling over time (see the Bi-LSTM sketch after this list)
    • Data: Stanford NLI, MultiNLI
    • Results: Tends to be better than unsupervised objectives such as SkipThought
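
A minimal sketch of the SkipThought-style downstream protocol from item 4, assuming scikit-learn and NumPy; the random arrays stand in for sentence vectors \(u, v\) that would really come from a frozen pre-trained encoder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u, v):
    """Featurize a sentence pair as [|u - v|; u * v] (element-wise)."""
    return np.concatenate([np.abs(u - v), u * v], axis=-1)

# Placeholder vectors: in practice u and v come from the frozen pre-trained
# encoder (e.g. SkipThought vectors), and y is a pair-level label such as
# "paraphrase / not a paraphrase" or a binarized similarity judgment.
rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 300))    # first sentence of each pair
V = rng.normal(size=(1000, 300))    # second sentence of each pair
y = rng.integers(0, 2, size=1000)   # pair labels

X = pair_features(U, V)             # shape (1000, 600)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

Because the encoder stays frozen, only the logistic regression is trained, which makes this a cheap probe of how much pairwise semantic information the sentence vectors capture.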
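And a minimal sketch of an InferSent-style encoder from item 6 (Bi-LSTM with max pooling over time), again assuming PyTorch; this is an illustrative sketch under those assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BiLSTMMaxPoolEncoder(nn.Module):
    """Bi-LSTM sentence encoder with max pooling over the time dimension."""
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):                       # (batch, seq_len)
        states, _ = self.bilstm(self.embed(token_ids))  # (batch, seq_len, 2*hid_dim)
        sent_vec, _ = states.max(dim=1)                 # max over the time axis
        return sent_vec                                 # (batch, 2*hid_dim)

# For NLI training, premise/hypothesis vectors u and v are typically combined
# as [u; v; |u - v|; u * v] and fed to a small MLP classifier over the labels.
enc = BiLSTMMaxPoolEncoder()
u = enc(torch.randint(0, 10000, (8, 20)))   # batch of 8 premises, length 20
v = enc(torch.randint(0, 10000, (8, 20)))   # batch of 8 hypotheses
features = torch.cat([u, v, (u - v).abs(), u * v], dim=1)  # (8, 8*hid_dim)
```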

Contextualized Word Representations