Sentence Representations

  1. What?:
    Sentence representation/embedding learning focuses on producing a single feature vector that represents a sentence in a latent (semantic) space, while preserving geometric properties such as distances and angles (so that, e.g., semantically similar sentences end up close together).

  2. Tasks:
    • Sentence Classification
    • Paraphrase Identification
    • Semantic Similarity
    • Textual Entailment (i.e. Natural Language Inference)
    • Retrieval

  3. Methods:
    Multi-Task Learning:
    In particular, a common recipe is pre-training on other tasks, then initializing with the pre-trained weights and fine-tuning them on the new task (a minimal sketch of this recipe follows this list).

  4. End-To-End VS Pre-Training:
    We can always use end-to-end objectives; however, two problems arise that pre-training helps mitigate:
    • Paucity of training data for the target task
    • Weak feedback: in text classification, the only training signal is a single label at the end of the sentence, which is weak supervision for learning rich sentence representations
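
A minimal sketch of this pre-train-then-fine-tune recipe (the same pattern used by several of the methods in the next section), assuming PyTorch; the module names `SentenceEncoder` and `SentenceClassifier` are hypothetical placeholders, not any paper's actual code:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Embeds a token-id sequence and encodes it with an LSTM; the final
    hidden state is used as the sentence representation."""
    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, token_ids):                   # (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                              # (batch, hid_dim)

class SentenceClassifier(nn.Module):
    """Target-task model: a (possibly pre-trained) encoder plus a new head."""
    def __init__(self, encoder, hid_dim=256, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids))

# 1) Pre-training phase: train `encoder` on an auxiliary objective
#    (language modeling, auto-encoding, ...) over unlabeled text, then save it.
encoder = SentenceEncoder()
# ... pre-training loop over the auxiliary objective would go here ...
torch.save(encoder.state_dict(), "encoder_pretrained.pt")

# 2) Fine-tuning phase: initialize from the pre-trained weights and continue
#    training end-to-end on the (smaller) labeled target-task data.
encoder = SentenceEncoder()
encoder.load_state_dict(torch.load("encoder_pretrained.pt"))
model = SentenceClassifier(encoder, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fine-tune everything
```

The key design point is that only the encoder weights are transferred; the task-specific head is freshly initialized, and fine-tuning often uses a smaller learning rate than pre-training.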

Training Sentence Representations

  1. Language Model Transfer (Dai and Le 2015):
    • Model: LSTM
    • Objective: Language modeling objective
    • Data: Classification data itself, or Amazon reviews
    • Downstream: On text classification, initialize weights and continue training
  2. Unidirectional Training + Transformer - OpenAI GPT (Radford et al. 2018):
    • Model: Transformer with masked (causal) self-attention
    • Objective: Predict the next word, left to right
    • Data: BooksCorpus
    • Downstream: Fine-tune directly on some tasks; tasks with multi-sentence inputs require additional input transformations (concatenating the sentences with delimiter tokens) before fine-tuning
  3. Auto-encoder Transfer (Dai and Le 2015):
    • Model: LSTM
    • Objective: From a single sentence vector, reconstruct the original sentence
    • Data: Classification data itself, or Amazon reviews
    • Downstream: On text classification, initialize weights and continue training
  4. Context Prediction Transfer - SkipThought Vectors (Kiros et al. 2015):
    • Model: LSTM
    • Objective: Predict the surrounding sentences
    • Data: Books (important because they provide coherent surrounding-sentence context)
    • Downstream: Train logistic regression on the pair features \([|u - v|;\, u \odot v]\), i.e. the element-wise absolute difference and product of the two sentence vectors (see the logistic-regression sketch after this list)
  5. Paraphrase ID Transfer (Wieting et al. 2015):
    • Model: Try many different ones
    • Objective: Predict whether two phrases are paraphrases or not
    • Data: Paraphrase database (http://paraphrase.org), created from bilingual data
      • Large Scale Paraphrase Data - ParaNMT-50M (Wieting and Gimpel 2018):
        • Automatic construction of large paraphrase DB:
          • Get large parallel corpus (English-Czech)
          • Translate the Czech side using a SOTA NMT system
          • Score the resulting pairs automatically, and manually annotate a sample to validate quality
        • The corpus is huge but noisy: 50M sentence pairs, of which about 30M are high quality
        • Trained representations work quite well and generalize
    • Downstream Usage: Sentence similarity, classification, etc.
    • Result: Interestingly, LSTMs work well on in-domain data, but word averaging generalizes better
  6. Entailment Transfer - InferSent (Conneau et al. 2017):
    • Previous objectives use no human labels, but what if we use human-labeled data instead?
    • Question: can supervised training on a task such as entailment learn generalizable embeddings?
      • The task is more difficult and requires capturing nuance → yes?; but the labeled data is much smaller → no?
    • Model: Bi-LSTM + max pooling over time (see the Bi-LSTM sketch after this list)
    • Data: Stanford NLI, MultiNLI
    • Results: Tends to be better than unsupervised objectives such as SkipThought
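
A minimal sketch of the SkipThought-style downstream protocol from item 4, assuming scikit-learn and NumPy; the random arrays stand in for sentence vectors \(u, v\) that would really come from a frozen pre-trained encoder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u, v):
    """Featurize a sentence pair as [|u - v|; u * v] (element-wise)."""
    return np.concatenate([np.abs(u - v), u * v], axis=-1)

# Placeholder vectors: in practice u and v come from the frozen pre-trained
# encoder (e.g. SkipThought vectors), and y is a pair-level label such as
# "paraphrase / not a paraphrase" or a binarized similarity judgment.
rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 300))    # first sentence of each pair
V = rng.normal(size=(1000, 300))    # second sentence of each pair
y = rng.integers(0, 2, size=1000)   # pair labels

X = pair_features(U, V)             # shape (1000, 600)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

Because the encoder stays frozen, only the logistic regression is trained, which makes this a cheap probe of how much pairwise semantic information the sentence vectors capture.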
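And a minimal sketch of an InferSent-style encoder from item 6 (Bi-LSTM with max pooling over time), again assuming PyTorch; this is an illustrative sketch under those assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BiLSTMMaxPoolEncoder(nn.Module):
    """Bi-LSTM sentence encoder with max pooling over the time dimension."""
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):                       # (batch, seq_len)
        states, _ = self.bilstm(self.embed(token_ids))  # (batch, seq_len, 2*hid_dim)
        sent_vec, _ = states.max(dim=1)                 # max over the time axis
        return sent_vec                                 # (batch, 2*hid_dim)

# For NLI training, premise/hypothesis vectors u and v are typically combined
# as [u; v; |u - v|; u * v] and fed to a small MLP classifier over the labels.
enc = BiLSTMMaxPoolEncoder()
u = enc(torch.randint(0, 10000, (8, 20)))   # batch of 8 premises, length 20
v = enc(torch.randint(0, 10000, (8, 20)))   # batch of 8 hypotheses
features = torch.cat([u, v, (u - v).abs(), u * v], dim=1)  # (8, 8*hid_dim)
```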

Contextualized Word Representations