
Representation Learning

  1. Representation Learning:
    Representation Learning (Feature Learning) is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.
    This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

    Hypothesis - Main Idea:
    The core hypothesis for representation learning is that unlabeled data can be used to learn a good representation.

    Types:
    Representation learning can be either supervised or unsupervised.

    Representation Learning Approaches:
    There are various ways of learning different representations:

    • Probabilistic Models: the goal is to learn a representation that captures the probability distribution of the underlying explanatory features for the observed input. Such a learnt representation can then be used for prediction.
    • Deep Learning: the representations are formed by composition of multiple non-linear transformations of the input data with the goal of yielding abstract and useful representations for tasks like classification, prediction etc.

    Representation Learning Tradeoff:
    Most representation learning problems face a tradeoff between preserving as much information about the input as possible and attaining nice properties (such as independence).

    The Problem of Data (Semi-Supervised Learning):
    We often have very large amounts of unlabeled training data and relatively little labeled training data. Training with supervised learning techniques on the labeled subset often results in severe overfitting. Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data. Specifically, we can learn good representations for the unlabeled data, and then use these representations to solve the supervised learning task.

    Learning from Limited Data:
    Humans and animals are able to learn from very few labeled examples.
    Many factors could explain improved human performance — for example, the brain may use very large ensembles of classifiers or Bayesian inference techniques.
    One popular hypothesis is that the brain is able to leverage unsupervised or semi-supervised learning.

    Motivation/Applications:

    1. ML tasks such as classification often require input that is mathematically and computationally convenient to process.
      However, real-world data such as images, video, and sensor data has not yielded to attempts to algorithmically define specific features.
    2. Learning good representations enables us to perform certain (specific) tasks in a more optimal manner.
      • E.g. linked lists \(\implies\) \(\mathcal{O}(n)\) insertion | red-black tree \(\implies\) \(\mathcal{O}(\log n)\) insertion.
    3. Representation Learning is particularly interesting because it provides (one) way to perform unsupervised and semi-supervised learning.
    4. Feature Engineering is hard. Representation Learning allows us to avoid having to engineer features manually.
    5. In general, representation learning can allow us to achieve multi-task learning, transfer learning, and domain adaptation through shared representations.

    The Quality of Representations:
    Generally speaking, a good representation is one that makes a subsequent learning task easier.
    The choice of representation will usually depend on the choice of the subsequent learning task.

    Success of Representation Learning:
    The success of representation learning can be attributed to many factors, including:

    • Theoretical advantages of distributed representations (Hinton et al., 1986)
    • Theoretical advantages of deep representations (Hinton et al., 1986)
    • The Causal Factors Hypothesis: a general idea of underlying assumptions about the data generating process, in particular about underlying causes of the observed data.

    Representation Learning Domain Applications:

    • Computer Vision: CNNs.
    • Natural Language Processing: Word-Embeddings.
    • Speech Recognition: Speech-Embeddings.

    Notes:

    • Representation Learning can be done with both generative and discriminative models.
    • In DL, representation learning uses a composition of transformations of the input data (features) to create learned features.

  2. Distributed Representation:
    Distributed Representations of concepts are representations composed of many elements that can be set separately from each other.

    Distributed representations of concepts are one of the most important tools for representation learning:

    • Distributed representations are powerful because they can use \(n\) features with \(k\) values to describe \(k^{n}\) different concepts.
    • Both neural networks with multiple hidden units and probabilistic models with multiple latent variables make use of the strategy of distributed representation.
    • Motivation for using Distributed Representations:
      Many deep learning algorithms are motivated by the assumption that the hidden units can learn to represent the underlying causal factors that explain the data.
      Distributed representations are natural for this approach, because each direction in representation space can correspond to the value of a different underlying configuration variable.
    • Distributed vs Symbolic Representations:
      • Number of “Representable” Configurations - by example:
        • An example of a distributed representation is a vector of \(n\) binary features.
          It can take \(2^{n}\) configurations, each potentially corresponding to a different region in input space.
        • An example of a symbolic representation is one where the input is associated with a single symbol or category¹ (a counting sketch follows this list).
          If there are \(n\) symbols in the dictionary, one can imagine \(n\) feature detectors, each corresponding to the detection of the presence of the associated category.
          In that case only \(n\) different configurations of the representation space are possible, carving \(n\) different regions in input space.

          A symbolic representation is a specific example of the broader class of non-distributed representations, which are representations that may contain many entries but without significant meaningful separate control over each entry.
      • Generalization:
        An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts.
        • Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.
          [Analysis: Generalization of Distributed Representations]

    • For some non-distributed algorithms, the output is not piecewise constant but instead interpolates between neighboring regions; even so, the relationship between the number of parameters (or examples) and the number of regions they can define remains linear.
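    To make the counting contrast concrete, here is a minimal Python sketch (illustrative only; \(n = 8\) is an arbitrary choice) comparing the configurations available to a one-hot code and to a binary distributed code of the same length:

    ```python
    from itertools import product

    n = 8  # number of features / symbols (arbitrary illustrative choice)

    # Symbolic (one-hot) code: exactly one of the n entries is active,
    # so only n configurations are representable.
    one_hot_configs = [tuple(int(i == j) for i in range(n)) for j in range(n)]

    # Distributed code: each of the n binary features varies independently,
    # so 2**n configurations are representable.
    distributed_configs = list(product([0, 1], repeat=n))

    print(len(one_hot_configs))      # 8
    print(len(distributed_configs))  # 256 == 2**8
    ```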

    Generalization of Distributed Representations:
    We know that for distributed representations, Generalization arises due to shared attributes between different concepts.

    But an important question is:
    “When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm?”

    • Distributed representations can have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters.
    • Some traditional nondistributed learning algorithms generalize only due to the smoothness assumption, which states that if \(u \approx v,\) then the target function \(f\) to be learned has the property that \(f(u) \approx f(v),\) in general.
      There are many ways of formalizing such an assumption, but the end result is that if we have an example \((x, y)\) for which we know that \(f(x) \approx y,\) then we choose an estimator \(\hat{f}\) that approximately satisfies these constraints while changing as little as possible when we move to a nearby input \(x+\epsilon\).
      • This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that increases and decreases many times in many different regions, we may need a number of examples that is at least as large as the number of distinguishable regions.
        One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary decoder mapping from symbol to value.
        However, this does not allow us to generalize to new symbols for new regions.
    • If we are lucky, there may be some regularity in the target function, besides being smooth.
      For example, a convolutional network with max-pooling can recognize an object regardless of its location in the image, even though spatial translation of the object may not correspond to smooth transformations in the input space.

    Justifying Generalization in distributed representations:

    • Geometric justification (by analyzing binary, linear feature extractors (units)):
      Consider a special case of a distributed representation learning algorithm, one that extracts binary features by thresholding linear functions of the input; a computational sketch of the resulting region count is given after this list.
    • VC-Theory justification - Fixed Capacity:
    • Experimental justification:
      Though the above ideas are abstract, they may be validated experimentally.
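    To make the geometric justification concrete: assuming the hyperplanes are in general position (an assumption of this sketch, not a claim from the text), \(n\) binary threshold units in \(d\)-dimensional input space carve it into at most \(\sum_{j=0}^{d}\binom{n}{j} = O(n^{d})\) regions, each with its own binary code — far more than the \(n\) regions of a one-hot scheme. The sketch below computes this bound and checks it empirically:

    ```python
    import numpy as np
    from math import comb

    def max_regions(n_units: int, d: int) -> int:
        """Max regions that n hyperplanes in general position carve out of R^d."""
        return sum(comb(n_units, j) for j in range(d + 1))

    n, d = 10, 3
    print(max_regions(n, d))  # 176 distinct binary codes from only 10 units

    # Empirical check: count distinct sign patterns of random threshold units.
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(n, d)), rng.normal(size=n)
    x = rng.normal(size=(200_000, d))                  # dense sample of inputs
    codes = {tuple(row) for row in (x @ W.T + b > 0)}  # binary feature vectors
    print(len(codes))  # approaches (but never exceeds) the bound above
    ```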

    Notes:

    • As a counter-example, recent research from DeepMind (Morcos et al., 2018) suggests that while some hidden units might appear to learn an interpretable feature, ‘these interpretable neurons are no more important than confusing neurons with difficult-to-interpret activity’.
      Moreover, ‘networks which generalise well are much less reliant on single directions [i.e., hidden units] than those which memorise’. See more in the DeepMind blog post.
    • Distributed representations based on latent variables can obtain all of the advantages of representation learning that we have seen with deep feedforward and recurrent networks.
    • Food for Thought (F2T):
      “since feature engineering was made obsolete by deep learning, algorithm engineering will be made obsolete by meta-learning” - Sohl-Dickstein

  3. Deep Representations - Exponential Gain from Depth:

    Exponential Gain in MLPs:
    We have seen in (section 6.4.1) that multilayer perceptrons are universal approximators, and that some functions can be represented by exponentially smaller deep networks compared to shallow networks.
    This decrease in model size leads to improved statistical efficiency.

    Similar results apply, more generally, to other kinds of models with distributed hidden representations.

    Justification/Motivation:
    In many AI tasks, the factors that can be chosen almost independently from each other, yet still correspond to meaningful inputs, are more likely to be very high-level and related in highly nonlinear ways to the input.
    Goodfellow et al. argue that this demands deep distributed representations, where the higher level features (seen as functions of the input) or factors (seen as generative causes) are obtained through the composition of many nonlinearities.

    Consider, e.g., a generative model that learned about the explanatory factors underlying images of faces, including the person’s gender and whether they are wearing glasses.
    It would not be reasonable to expect a shallow network, such as a linear network, to learn the complicated relationship between these abstract explanatory factors and the pixels in the image.

    Universal Approximation property in Models (from Depth):

    • It has been proven in many different settings that organizing computation through the composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a distributed representation.
    • Many kinds of networks (e.g., with saturating nonlinearities, Boolean gates, sum/products, or RBF units) with a single hidden layer can be shown to be universal approximators.
      A model family that is a universal approximator can approximate a large class of functions (including all continuous functions) up to any non-zero tolerance level, given enough hidden units.
      However, the required number of hidden units may be very large.
    • Theoretical results concerning the expressive power of deep architectures state that there are families of functions that can be represented efficiently by an architecture of depth \(k\), but would require an exponential number of hidden units (wrt. input size) with insufficient depth (depth \(2\) or depth \(k − 1\)).

    Exponential Gains in Structured Probabilistic Models:

    • PGMs as Universal Approximators:
      • Just as deterministic feedforward networks are universal approximators of functions, many structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks, are universal approximators of probability distributions (Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al., 2013).
    • Exponential Gain from Depth in PGMs:
      • Just as a sufficiently deep feedforward network can have an exponential advantage over one that is too shallow, similar results can be obtained for other models, such as probabilistic models.
        • E.g. The sum-product network (SPN) (Poon and Domingos, 2011).
          These models use polynomial circuits to compute the probability distribution over a set of random variables.
          • Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum depth of SPN is required to avoid needing an exponentially large model.
          • Later, Martens and Medabalimi (2014) showed that there are significant differences between every two finite depths of SPN, and that some of the constraints used to make SPNs tractable may limit their representational power.

    Expressiveness of Convolutional Networks:
    Another interesting development is a set of theoretical results for the expressive power of families of deep circuits related to convolutional nets:
    They highlight an exponential advantage for the deep circuit even when the shallow circuit is allowed to only approximate the function computed by the deep circuit (Cohen et al., 2015).
    By comparison, previous theoretical work made claims regarding only the case where the shallow circuit must exactly replicate particular functions.

    Notes:


Unsupervised Representation Learning

  1. Unsupervised Representation Learning:
    In Unsupervised feature learning, features are learned with unlabeled data.

    The Goal of unsupervised feature learning is often to discover low-dimensional features that capture some structure underlying the high-dimensional input data.

    Learning:
    Unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect.

    Unsupervised Learning for Semi-Supervised Learning:
    When the feature learning is performed in an unsupervised way, it enables a form of semi-supervised learning where features learned from an unlabeled dataset are then employed to improve performance in a supervised setting with labeled data.

  2. Greedy Layer-Wise Unsupervised Pretraining:
    (Greedy Layer-Wise) Unsupervised Pretraining

    • Greedy: it is a greedy algorithm.
      It optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces.
    • Layer-Wise: the independent pieces are the layers of the network².
    • Unsupervised: each layer is trained with an unsupervised representation learning algorithm.
    • Pretraining³: it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together.

    This procedure is a canonical example of how a representation learned for one task (unsupervised learning, trying to capture the shape of the input distribution) can sometimes be useful for another task (supervised learning with the same input domain).

    Algorithm/Procedure:

    • Pretraining Phase:
      Each layer is trained in turn with an unsupervised representation learning algorithm, keeping the previously trained layers fixed (a sketch follows this list).
    • Supervised Learning Phase:
      It may involve:
      1. Training a simple classifier on top of the features learned in the pretraining phase.
      2. Supervised fine-tuning of the entire network learned in the pretraining phase.
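    Below is a minimal sketch of the pretraining phase (an illustrative construction using tied-weight sigmoid autoencoders in NumPy; the layer sizes and data are stand-ins, and real systems would use RBMs, autoencoders, or other unsupervised learners):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(H, n_hidden, lr=0.5, epochs=300):
        """Train a tied-weight sigmoid autoencoder on data H; return encoder params."""
        n_in = H.shape[1]
        W = rng.normal(scale=0.1, size=(n_in, n_hidden))
        b, c = np.zeros(n_hidden), np.zeros(n_in)
        for _ in range(epochs):
            Z = sigmoid(H @ W + b)        # encode
            R = sigmoid(Z @ W.T + c)      # decode with tied weights
            dR = (R - H) * R * (1 - R)    # gradient of MSE wrt decoder pre-activation
            dZ = (dR @ W) * Z * (1 - Z)   # backprop into encoder pre-activation
            W -= lr * (H.T @ dZ + dR.T @ Z) / len(H)   # both uses of W contribute
            b -= lr * dZ.mean(axis=0)
            c -= lr * dR.mean(axis=0)
        return W, b

    # Greedy layer-wise pretraining: train layer k on the codes of layer k-1,
    # keeping the earlier layers fixed.
    X = rng.random((512, 32))             # stand-in for unlabeled training data
    stack, H = [], X
    for n_hidden in (16, 8):
        W, b = train_autoencoder(H, n_hidden)
        stack.append((W, b))
        H = sigmoid(H @ W + b)            # codes that feed the next layer

    # H now holds pretrained features: train a simple classifier on top of it,
    # or fine-tune the whole stack jointly with labels (the supervised phase).
    ```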

    Interpretation in Supervised Settings:
    In the context of a supervised learning task, the procedure can be viewed as:

    • A Regularizer.
      In some experiments, pretraining decreases test error without decreasing training error.
    • A form of Parameter Initialization.

    Applications:

    • Training Deep Models:
      Greedy layer-wise training procedures based on unsupervised criteria have long been used to sidestep the difficulty of jointly training the layers of a deep neural net for a supervised task.
      The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures.
      Prior to this discovery, only convolutional deep networks or networks whose depth resulted from recurrence were regarded as feasible to train.
    • Parameter Initialization:
      They can also be used as initialization for other unsupervised learning algorithms, such as:
      • Deep Autoencoders (Hinton and Salakhutdinov, 2006)
      • Probabilistic models with many layers of latent variables:
        E.g. deep belief networks (DBNs) (Hinton et al., 2006) and deep Boltzmann machines (DBMs) (Salakhutdinov and Hinton, 2009a).


  3. Clustering | K-Means:
  4. Local Linear Embeddings:
  5. Principal Components Analysis (PCA):
  6. Independent Components Analysis (ICA):
  7. (Unsupervised) Dictionary Learning:

Supervised Representation Learning

  1. Supervised Representation Learning:
    In Supervised feature learning, features are learned using labeled data.

    Learning:
    The data label allows the system to compute an error term, the degree to which the system fails to produce the label, which can then be used as feedback to correct the learning process (reduce/minimize the error).

    Examples:

    • Supervised Neural Networks
    • Supervised Dictionary Learning

    FFNs as Representation Learning Algorithms:

  2. Greedy Layer-Wise Supervised Pretraining:
    As discussed in section 8.7.4, it is also possible to have greedy layer-wise supervised pretraining.
    This builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts (Erhan et al., 2010).

  3. Neural Networks:
  4. Supervised Dictionary Learning:

Transfer Learning and Domain Adaptation


  1. Introduction - Transfer Learning and Domain Adaptation:
    Transfer Learning and Domain Adaptation refer to the situation where what has been learned in one setting (i.e., distribution \(P_{1}\)) is exploited to improve generalization in another setting (say distribution \(P_{2}\)).

    This is a generalization of unsupervised pretraining, where we transferred representations between an unsupervised learning task and a supervised learning task.

    In Supervised Learning: transfer learning, domain adaptation, and concept drift can be viewed as particular forms of Multi-Task Learning.

    However, Transfer Learning is a more general term that applies to both Supervised and Unsupervised Learning, as well as Reinforcement Learning.

    Goal/Objective and Relation to Representation Learning:
    In the cases of Transfer Learning, Multi-Task Learning, and Domain Adaptation: The Objective/Goal is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting.

    The core idea of Representation Learning is that the same representation may be useful in both settings.

    Thus, we can use shared representations to accomplish Transfer Learning etc.
    Shared Representations are useful to handle multiple modalities or domains, or to transfer learned knowledge to tasks for which few or no examples are given but a task representation exists.


  2. Transfer Learning:
    Transfer Learning (in ML) is the problem of storing knowledge gained while solving one problem and applying it to a different but related problem.

    Definition:
    Formally, the definition of transfer learning is given in terms of:

    • A Domain \(\mathcal{D}=\{\mathcal{X}, P(X)\}\), \(\:\:\) consisting of:
      • Feature Space \(\mathcal{X}\)
      • Marginal Probability Distribution \(P(X)\),
        where \(X=\left\{x_{1}, \ldots, x_{n}\right\} \in \mathcal{X}\).
    • A Task \(\mathcal{T}=\{\mathcal{Y}, f(\cdot)\}\),
      (given a specific domain \(\mathcal{D}=\{\mathcal{X}, P(X)\}\)) consisting of:
      • A label space \(\mathcal{Y}\)
      • An objective predictive function \(f(\cdot)\)
        It is learned from the training data, which consist of pairs \(\left\{x_ {i}, y_{i}\right\}\), where \(x_{i} \in X\) and \(y_{i} \in \mathcal{Y}\).
        It can be used to predict the corresponding label, \(f(x)\), of a new instance \(x\).

    Given a source domain \(\mathcal{D}_ {S}\) and learning task \(\mathcal{T}_ {S}\), a target domain \(\mathcal{D}_ {T}\) and learning task \(\mathcal{T}_ {T}\), transfer learning aims to help improve the learning of the target predictive function \(f_ {T}(\cdot)\) in \(\mathcal{D}_ {T}\) using the knowledge in \(\mathcal{D}_ {S}\) and \(\mathcal{T}_ {S}\), where \(\mathcal{D}_ {S} \neq \mathcal{D}_ {T}\), or \(\mathcal{T}_ {S} \neq \mathcal{T}_ {T}\).

    In Transfer Learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in \(P_1\) are relevant to the variations that need to be captured for learning \(P_2\). This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature.
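    In deep learning practice, the knowledge in \(\mathcal{D}_ {S}\) and \(\mathcal{T}_ {S}\) typically enters \(f_ {T}(\cdot)\) through pretrained weights. A hedged PyTorch sketch (the layer sizes and class counts are invented for illustration):

    ```python
    import torch
    import torch.nn as nn

    # Source model: trained on the source task T_S (training loop omitted).
    backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                             nn.Linear(64, 64), nn.ReLU())
    source_head = nn.Linear(64, 10)   # e.g. 10 source classes

    # Target model: reuse the backbone's representation, new head for T_T.
    target_head = nn.Linear(64, 3)    # e.g. 3 target classes
    for p in backbone.parameters():
        p.requires_grad = False       # freeze: pure feature transfer ...
    # ... or leave them trainable with a small lr for full fine-tuning.

    optimizer = torch.optim.Adam(target_head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x, y = torch.randn(32, 128), torch.randint(0, 3, (32,))  # stand-in target batch
    optimizer.zero_grad()
    loss = loss_fn(target_head(backbone(x)), y)
    loss.backward()
    optimizer.step()
    ```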

    Types of Transfer Learning:

    • Inductive Transfer Learning:
      \(\mathcal{D}_ {S} = \mathcal{D}_ {T} \:\:\: \text{ and }\:\:\: \mathcal{T}_ {S} \neq \mathcal{T}_ {T}\)
      e.g. \(\left(\mathcal{D}_ {S} = \text{ Wikipedia } = \mathcal{D}_ {T}\right) \:\: \text{ and } \:\: \left(\mathcal{T}_ {S} = \text{ Skip-Gram }\right) \neq \left(\mathcal{T}_ {T} = \text{ Classification }\right)\)
    • Transductive Transfer Learning (Domain Adaptation):
      \(\mathcal{D}_ {S} \neq \mathcal{D}_ {T} \:\:\: \text{ and }\:\:\: \mathcal{T}_ {S} = \mathcal{T}_ {T}\)
      e.g. \(\left(\mathcal{D}_ {S} = \text{ Reviews }\right) \neq \left(\mathcal{D}_ {T} = \text{ Tweets }\right) \:\: \text{ and } \:\: \left(\mathcal{T}_ {S} = \text{ Sentiment Analysis } = \mathcal{T}_ {T}\right)\)
    • Unsupervised Transfer Learning:
      \(\mathcal{D}_ {S} \neq \mathcal{D}_ {T} \:\:\: \text{ and }\:\:\: \mathcal{T}_ {S} \neq \mathcal{T}_ {T}\)
      e.g. \(\left(\mathcal{D}_ {S} = \text{ Animals}\right) \neq \left(\mathcal{D}_ {T} = \text{ Cars}\right) \: \text{ and } \: \left(\mathcal{T}_ {S} = \text{ Recog.}\right) \neq \left(\mathcal{T}_ {T} = \text{ Detection}\right)\)

    Concept Drift:
    Concept Drift is a phenomenon where the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.

    It can be viewed as a form of transfer learning due to gradual changes in the data distribution over time.

    Unsupervised Deep Learning for Transfer Learning:


  3. Domain Adaptation:
    Domain Adaptation is a form of transfer learning in which we aim to learn, from a source data distribution, a model that performs well on a different (but related) target data distribution.

    It is a sequential process.

    In domain adaptation, the task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different.

  4. Multitask Learning:
    Multitask Learning is a form of transfer learning in which multiple learning tasks are solved at the same time, exploiting commonalities and differences across the tasks.

    In particular, it is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.

    It is a parallel process.

    Multitask vs Transfer Learning:

    1. Multi-Task Learning: a general term for training on multiple tasks.
      1. Joint Learning: choose mini-batches from the different tasks simultaneously/alternately.
      2. Pre-Training: first train on one task, then train on another; widely used for word embeddings.
    2. Transfer Learning:
      a type of multi-task learning where we focus on one task, by first learning on another task and then applying those models to our main task.

  5. Representation Learning for the Transfer of Knowledge:
    We can use Representation Learning to achieve Multi-Task Learning, Transfer Learning, and Domain Adaptation.

    In general, Representation Learning can be used to achieve Multi-Task Learning, Transfer Learning, and Domain Adaptation, when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting.
    This applies in two cases:

    • Shared Input Semantics:

      In this case, we share the lower layers and have task-dependent upper layers (see the sketch after this list).

    • Shared Output Semantics:

      In cases like these, it makes more sense to share the upper layers (near the output) of the neural network and have task-specific preprocessing.
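    For the shared-input-semantics case, a minimal PyTorch sketch (sizes and task names invented for illustration): one shared trunk computes the representation, and each task owns its upper layers.

    ```python
    import torch
    import torch.nn as nn

    # Shared lower layers: one representation h reused by every task.
    shared = nn.Sequential(nn.Linear(64, 32), nn.ReLU())

    # Task-dependent upper layers.
    heads = nn.ModuleDict({
        "task_a": nn.Linear(32, 5),   # e.g. a 5-way classification head
        "task_b": nn.Linear(32, 1),   # e.g. a scalar regression head
    })

    x = torch.randn(16, 64)           # one batch; both tasks share input semantics
    h = shared(x)                     # the shared representation
    out_a = heads["task_a"](h)
    out_b = heads["task_b"](h)
    # Gradients from both task losses flow into `shared`,
    # pooling statistical strength across the tasks.
    ```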

  6. K-Shot Learning:
    K-Shot (Few-Shot) Learning is a supervised learning setting (problem) where the goal is to learn from an extremely small number \(k\) of labeled examples (called shots).

    General Setting:
    We first train a model on a large dataset \(\widetilde{\mathcal{D}}=\left\{\widetilde{\mathbf{x}}_ {i}, \widetilde{y}_ {i}\right\}_ {i=1}^{\widetilde{N}}\) of inputs \(\widetilde{\mathbf{x}}_ {i}\) and labels \(\widetilde{y}_ {i} \in\{1, \ldots, \widetilde{C}\}\) that indicate which of the \(\widetilde{C}\) classes each input belongs to.
    Then, using knowledge from the model trained on the large dataset, we perform k-shot learning with a small dataset \(\mathcal{D}=\left\{\mathbf{x}_ {i}, y_ {i}\right\}_ {i=1}^{N}\) with \(C\) new classes, labels \(y_ {i} \in\{\widetilde{C}+1, \ldots, \widetilde{C}+C\}\), and \(k\) examples (inputs) from each new class.
    During test time we classify unseen examples (inputs) \(\mathbf{x}^{* }\) from the new classes \(C\) and evaluate the predictions against ground truth labels \(y^{* }\).

    Comparison to alternative Learning Paradigms:

    As Transfer Learning:
    Two extreme forms of transfer learning are One-Shot Learning and Zero-Shot Learning; they provide only one and zero labeled examples of the transfer task, respectively.

    One-Shot Learning:
    One-Shot Learning (Fei-Fei et al., 2006) is a form of k-shot learning where \(k=1\).

    It is possible because the representation learns to cleanly separate the underlying classes during the first stage.
    During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space.
    This works to the extent that the factors of variation corresponding to these invariances have been cleanly separated from the other factors, in the learned representation space, and we have somehow learned which factors do and do not matter when discriminating objects of certain categories.
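    This intuition fits in a few lines of code (a nearest-centroid sketch in the spirit of the description above; the embed function below is a stand-in for the representation learned in the first stage):

    ```python
    import numpy as np

    def embed(x):
        """Stand-in for a representation learned in the first stage."""
        return x  # assume inputs are already cleanly separated in this space

    # k = 1: a single labeled example ("shot") per new class.
    shots = {"cat": embed(np.array([0.0, 1.0])),
             "dog": embed(np.array([1.0, 0.0]))}

    def classify(x_star):
        z = embed(x_star)
        return min(shots, key=lambda c: np.linalg.norm(z - shots[c]))

    print(classify(np.array([0.1, 0.9])))  # "cat": nearest to the cat shot
    ```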

    Zero-Shot Learning:
    Zero-Shot Learning (Palatucci et al., 2009; Socher et al., 2013b) or Zero-data learning (Larochelle et al., 2008) is a form of k-shot learning where \(k=0\).

    Example: Zero-Shot Learning Setting
    Consider the problem of having a learner read a large collection of text and then solve object recognition problems.
    It may be possible to recognize a specific object class even without having seen an image of that object, if the text describes the object well enough.
    For example, having read that a cat has four legs and pointy ears, the learner might be able to guess that an image is a cat, without having seen a cat before.

    Justification and Interpretation:
    Zero-Shot Learning is only possible because additional information has been exploited during training.

    We can think of the zero-data learning scenario as including three random variables:

    1. (Traditional) Inputs \(x\)
    2. (Traditional) Outputs or Targets \(\boldsymbol{y}\)
    3. (Additional) Random Variable describing the task, \(T\)

    The model is trained to estimate the conditional distribution \(p(\boldsymbol{y} \vert \boldsymbol{x}, T)\).

    Representing the task \(T\):
    Zero-shot learning requires \(T\) to be represented in a way that allows some sort of generalization.
    For example, \(T\) cannot be just a one-hot code indicating an object category.

    Socher et al. (2013b) instead provide a distributed representation of object categories, by using a learned word embedding for the word associated with each category.

    Representation Learning for Zero-Shot Learning:
    The principle underlying zero-shot learning as a form of transfer learning is to capture a representation in one modality, a representation in another modality, and the relationship (in general a joint distribution) between pairs \((\boldsymbol{x}, \boldsymbol{y})\) consisting of one observation \(\boldsymbol{x}\) in one modality and another observation \(\boldsymbol{y}\) in the other modality (Srivastava and Salakhutdinov, 2012).
    By learning all three sets of parameters (from \(\boldsymbol{x}\) to its representation, from \(\boldsymbol{y}\) to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs.

    In particular, Transfer learning between two domains \(x\) and \(y\) enables zero-shot learning.
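    A toy sketch of this principle (all vectors and the image-to-semantic map are fabricated for illustration; a real system would learn both): an input is mapped into the word-embedding space, and an unseen class can be predicted because its name already has a representation there.

    ```python
    import numpy as np

    # Learned word embeddings for class names (toy 3-d vectors, illustrative only).
    word_emb = {"cat": np.array([1.0, 0.0, 0.2]),
                "truck": np.array([0.0, 1.0, 0.1]),
                "horse": np.array([0.9, 0.1, 0.8])}  # unseen during image training

    def image_to_semantic(x):
        """Stand-in for a learned map from image space to word-embedding space."""
        return x  # a real system would use a trained network here

    def zero_shot_classify(x, classes):
        z = image_to_semantic(x)
        scores = {c: z @ word_emb[c] for c in classes}  # similarity in shared space
        return max(scores, key=scores.get)

    # A horse image can be labeled "horse" with zero horse training images,
    # because "horse" already has a representation on the text side.
    print(zero_shot_classify(np.array([0.85, 0.05, 0.75]),
                             classes=["cat", "truck", "horse"]))
    ```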

    Zero-Shot Learning in Machine Translation:

    Relation to Multi-modal Learning:
    Zero-Shot Learning can be performed using Multi-modal Learning, and vice-versa.
    The same principle of transfer learning with representation learning explains how one can perform either task.

    Notes:

    • K-Shot Learning (Thesis!)
    • One Shot Learning and Siamese Networks in Keras (Code - Tutorial)
    • Zero-Shot Learning: a form of extending supervised learning to the setting of solving, for example, a classification problem when not enough labeled examples are available for all classes.

      “Zero-shot learning is being able to solve a task despite not having received any training examples of that task.” - Goodfellow

    • Detecting Gravitational Waves is a form of Zero-Shot Learning
    • Few-shot, one-shot, and zero-shot learning are encompassed by a recently emerging field known as meta-learning.
      While traditionally including mainly classification, recent works in meta-learning have included regression and reinforcement learning (Vinyals et al., 2016; Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Duan et al., 2017; Finn et al., 2017).
      Work in this area seems to be primarily motivated by the notion of human-level AI, since humans appear to require far less training data than most deep learning models.

  7. Multi-Modal Learning:

    Representation Learning for Multi-modal Learning:
    The same principle underlying zero-shot learning as a form of transfer learning explains how one can perform multi-modal learning: capturing a representation in one modality, a representation in the other, and the relationship (in general a joint distribution) between pairs \((\boldsymbol{x}, \boldsymbol{y})\) consisting of one observation \(\boldsymbol{x}\) in one modality and another observation \(\boldsymbol{y}\) in the other modality (Srivastava and Salakhutdinov, 2012).
    By learning all three sets of parameters (from \(\boldsymbol{x}\) to its representation, from \(\boldsymbol{y}\) to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs.


Causal Factor Learning

  1. Semi-Supervised Disentangling of Causal Factors:

    Quality of Representations:
    An important question in Representation Learning is:
    “what makes one representation better than another?”

    1. One answer to that is the Causal Factors Hypothesis:
      An ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another.
      • This hypothesis motivates approaches in which we first seek a good representation for \(p(\boldsymbol{x})\).
        This representation may also be a good representation for computing \(p(\boldsymbol{y} \vert \boldsymbol{x})\) if \(\boldsymbol{y}\) is among the most salient causes of \(\boldsymbol{x}\)⁴ ⁵.
    2. Ease of Modeling:
      In many approaches to representation learning, we are often concerned with a representation that is easy to model (e.g. sparse entries, independent entries etc.).
      It is not immediately obvious, however, that a representation that cleanly separates the underlying causal factors is also one that is easy to model.
      The answer comes from an extension of the Causal Factors Hypothesis:
      For many AI tasks the two properties coincide: once we are able to obtain the underlying explanations for the observations, it generally becomes easy to isolate individual attributes from the others.
      Specifically, if a representation \(\boldsymbol{h}\) represents many of the underlying causes of the observed \(\boldsymbol{x}\), and the outputs \(\boldsymbol{y}\) are among the most salient causes, then it is easy to predict \(\boldsymbol{y}\) from \(\boldsymbol{h}\).

    The complete Causal Factors Hypothesis motivates Semi-Supervised Learning via Unsupervised Representation Learning.

    Analysis - When does Semi-Supervised Learning Work:

    Justifying the setting where Semi-Supervised Learning Works:

    • Semi-Supervised Learning⁶ Works when: \(p(\mathbf{y} \vert \mathbf{x})\) and \(p(\mathbf{x})\) are tied together.
    • \(p(\mathbf{y} \vert \mathbf{x})\) and \(p(\mathbf{x})\) are Tied when: \(\mathbf{y}\) is closely associated with one of the causal factors of \(\mathbf{x}\), or it is a causal factor itself.
      • Let \(\mathbf{h}\) represent all the causal factors of \(\mathbf{x}\), and let \(\mathbf{y} \in \mathbf{h}\) (be a causal factor of \(\mathbf{x}\)), then:
        The “true” generative process can be conceived as structured according to this directed graphical model, with \(\mathbf{h}\) as the parent of \(\mathbf{x}\):

        $$p(\mathbf{h}, \mathbf{x})=p(\mathbf{x} \vert \mathbf{h}) p(\mathbf{h})$$

        • Thus, the Marginal Probability of the Data \(p(\mathbf{x})\) is:
          1. Tied to the conditional \(p(\mathbf{x} \vert \mathbf{h})\) as:

            $$p(\boldsymbol{x})=\mathbb{E}_ {\mathbf{h}} p(\boldsymbol{x} \vert \boldsymbol{h})$$

            \(\implies\)

            • The best possible model of \(\mathbf{x}\) (wrt. generalization) is the one that uncovers the above “true” structure, with \(\boldsymbol{h}\) as a latent variable that explains the observed variations in \(\boldsymbol{x}\).
              i.e., the “ideal” representation learning discussed above should thus recover these latent factors.
          2. (intimately) Tied to the conditional \(p(\mathbf{y} \vert \mathbf{x})\) (by Bayes’ rule) as:

            $$p(\mathbf{y} \vert \mathbf{x})=\frac{p(\mathbf{x} \vert \mathbf{y}) p(\mathbf{y})}{p(\mathbf{x})}$$

    Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance.
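    A toy demonstration of these assumptions (a sketch using scikit-learn; the generative process is fabricated so that \(\mathbf{y}\) is itself the causal factor \(\mathbf{h}\)): modeling \(p(\mathbf{x})\) alone recovers the latent structure, after which a single label per class suffices to read off \(p(\mathbf{y} \vert \mathbf{x})\).

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # "True" generative process: h ~ p(h), x ~ p(x | h), with y = h.
    h = rng.integers(0, 2, size=500)                          # hidden causal factor
    x = rng.normal(loc=np.where(h == 0, -3.0, 3.0))[:, None]  # x caused by h

    # Unsupervised stage: model p(x) only; the mixture recovers the latent clusters.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

    # Supervised stage: one labeled example per class maps clusters to labels.
    labeled_x, labeled_y = np.array([[-3.0], [3.0]]), np.array([0, 1])
    cluster_to_y = dict(zip(gmm.predict(labeled_x), labeled_y))

    y_pred = np.array([cluster_to_y[c] for c in gmm.predict(x)])
    print((y_pred == h).mean())  # high accuracy with only two labels
    ```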

    Encoding/Learning Causal Factors:

    • Problem - Number of Causal Factors:
      An important research problem regards the fact that most observations are formed by an extremely large number of underlying causes.
      • Suppose \(\mathbf{y}=\mathrm{h}_ {i}\), but the unsupervised learner does not know which \(\mathrm{h}_ {i}\):
        • The brute force solution is for an unsupervised learner to learn a representation that captures all the reasonably salient generative factors \(\mathrm{h}_ {j}\) and disentangles them from each other, thus making it easy to predict \(\mathbf{y}\) from \(\mathbf{h}\), regardless of which \(\mathrm{h}_ {i}\) is associated with \(\mathbf{y}\).
          • In practice, the brute force solution is not feasible because it is not possible to capture all or most of the factors of variation that influence an observation.
            For example, in a visual scene, should the representation always encode all of the smallest objects in the background?
            It is a well-documented psychological phenomenon that human beings fail to perceive changes in their environment that are not immediately relevant to the task they are performing (Simons and Levin, 1998).
    • Solution - Determining which causal factor to encode/learn:
      An important research frontier in semi-supervised learning is determining “what to encode in each situation”.
      • Currently, there are two main strategies for dealing with a large number of underlying causes:
        1. Use a supervised learning signal together with the unsupervised learning signal,
          so that the model will choose to capture the most relevant factors of variation.
        2. Use much larger representations if using purely unsupervised learning.
      • New (Emerging) Strategy for unsupervised learning:
        Redefining the definition of “salient” factors.

    The definition of “Salient”:

    • The current definition of “salient” factors:
      In practice, we encode the definition of “salient” by using the objective criterion (e.g. MSE).

      Historically, autoencoders and generative models have been trained to optimize a fixed criterion, often similar to MSE.

      • Problem with the current definition:
        Since these fixed criteria determine which causes are considered salient, they emphasize different factors depending on, e.g., each factor’s effect on the error:
        • E.g. MSE applied to the pixels of an image implicitly specifies that an underlying cause is only salient if it significantly changes the brightness of a large number of pixels.
          This can be problematic if the task we wish to solve involves interacting with small objects (a numeric sketch follows this discussion).
    • Learned (pattern-based) “Saliency”:
      Certain factors could be considered “salient” if they follow a highly recognizable pattern.
      E.g. if a group of pixels follow a highly recognizable pattern, even if that pattern does not involve extreme brightness or darkness, then that pattern could be considered extremely salient.

      • This definition is implemented by Generative Adversarial Networks (GANs).
        In this approach, a generative model is trained to fool a feedforward classifier. The feedforward classifier attempts to recognize all samples from the generative model as being fake, and all samples from the training set as being real.
        In this framework, any structured pattern that the feedforward network can recognize is highly salient.
        They learn how to determine what is salient.

    Generative adversarial networks are only one step toward determining which factors should be represented.
    We expect that future research will discover better ways of determining which factors to represent, and develop mechanisms for representing different factors depending on the task.
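    A back-of-the-envelope version of the MSE point above (toy numbers, not from the text): deleting a small but task-critical object barely moves the pixel MSE, while a bland global brightness error dominates it.

    ```python
    import numpy as np

    img = np.zeros((64, 64))
    img[30:34, 30:34] = 1.0       # a small, task-critical bright object (16 px)

    missing_object = np.zeros_like(img)  # reconstruction that drops the object
    brightness_shift = img + 0.1         # reconstruction off by 0.1 everywhere

    mse = lambda a, b: ((a - b) ** 2).mean()
    print(mse(img, missing_object))    # 16 / 4096 ~ 0.0039
    print(mse(img, brightness_shift))  # 0.01: the bland global error "wins"
    ```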

    Robustness to Change - Causal Invariance:
    A benefit of learning the underlying causal factors (Schölkopf et al., 2012) is that:
    if the true generative process has \(\mathbf{x}\) as an effect and \(\mathbf{y}\) as a cause, then modeling \(p(\mathbf{x} \vert \mathbf{y})\) is robust to changes in \(p(\mathbf{y})\).

    If the cause-effect relationship were reversed, this would not be true, since by Bayes’ rule, \(p(\mathbf{x} \vert \mathbf{y})\) would be sensitive to changes in \(p(\mathbf{y})\).

    Very often, when we consider changes in distribution due to different domains, temporal non-stationarity, or changes in the nature of the task, the causal mechanisms remain invariant (the laws of the universe are constant) while the marginal distribution over the underlying causes can change.
    Hence, better generalization and robustness to all kinds of changes can be expected via learning a generative model that attempts to recover the causal factors \(\mathbf{h}\) and \(p(\mathbf{x} \vert \mathbf{h})\).

  2. Providing Clues to Discover Underlying Causes:
    Quality of Representations:
    The answer to the following question:
    “what makes one representation better than another?”
    was the Causal Factors Hypothesis:
    An ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another, especially those factors that are relevant to our applications.

    Clues for Finding the Causal Factors of Variation:
    Most strategies for representation learning are based on introducing clues that help the learning algorithm find the underlying factors of variation.
    The clues can help the learner separate these observed factors from the others.

    Supervised learning provides a very strong clue: a label \(\boldsymbol{y},\) presented with each \(\boldsymbol{x},\) that usually specifies the value of at least one of the factors of variation directly.

    More generally, to make use of abundant unlabeled data, representation learning makes use of other, less direct, hints about the underlying factors.
    These hints take the form of implicit prior beliefs that we, the designers of the learning algorithm, impose in order to guide the learner.

    Clues in the form of Regularization:
    Results such as the no free lunch theorem show that regularization strategies are necessary to obtain good generalization.
    While it is impossible to find a universally superior regularization strategy, one goal of deep learning is to find a set of fairly generic regularization strategies that are applicable to a wide variety of AI tasks, similar to the tasks that people and animals are able to solve.

    We can use generic regularization strategies to encourage learning algorithms to discover features that correspond to underlying factors, e.g. (Bengio et al., 2013d):

    • Smoothness: This is the assumption that \(f(\boldsymbol{x}+\epsilon \boldsymbol{d}) \approx f(\boldsymbol{x})\) for unit \(\boldsymbol{d}\) and small \(\epsilon\). This assumption allows the learner to generalize from training examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it is insufficient to overcome the curse of dimensionality.
    • Linearity: Many learning algorithms assume that relationships between some variables are linear. This allows the algorithm to make predictions even very far from the observed data, but can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that do not make the smoothness assumption instead make the linearity assumption. These are in fact different assumptions: linear functions with large weights applied to high-dimensional spaces may not be very smooth⁷.
    • Multiple explanatory factors: Many representation learning algorithms are motivated by the assumption that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors. Section 15.3 describes how this view motivates semisupervised learning via representation learning. Learning the structure of \(p(\boldsymbol{x})\) requires learning some of the same features that are useful for modeling \(p(\boldsymbol{y} \vert \boldsymbol{x})\) because both refer to the same underlying explanatory factors. Section 15.4 describes how this view motivates the use of distributed representations, with separate directions in representation space corresponding to separate factors of variation.
    • Causal factors: the model is constructed in such a way that it treats the factors of variation described by the learned representation \(\boldsymbol{h}\) as the causes of the observed data \(\boldsymbol{x}\), and not vice-versa. As discussed in section 15.3, this is advantageous for semi-supervised learning and makes the learned model more robust when the distribution over the underlying causes changes or when we use the model for a new task.
    • Depth, or a hierarchical organization of explanatory factors: High-level, abstract concepts can be defined in terms of simple concepts, forming a hierarchy. From another point of view, the use of a deep architecture expresses our belief that the task should be accomplished via a multi-step program, with each step referring back to the output of the processing accomplished via previous steps.
    • Shared factors across tasks: In the context where we have many tasks, corresponding to different \(y_{i}\) variables sharing the same input \(\mathbf{x}\) or where each task is associated with a subset or a function \(f^{(i)}(\mathbf{x})\) of a global input \(\mathbf{x},\) the assumption is that each \(\mathbf{y}_ {i}\) is associated with a different subset from a common pool of relevant factors \(\mathbf{h}\). Because these subsets overlap, learning all the \(P\left(y_{i} \vert \mathbf{x}\right)\) via a shared intermediate representation \(P(\mathbf{h} \vert \mathbf{x})\) allows sharing of statistical strength between the tasks.
    • Manifolds: Probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous case, these regions can be approximated by low-dimensional manifolds with a much smaller dimensionality than the original space where the data lives. Many machine learning algorithms behave sensibly only on this manifold (Goodfellow et al., 2014b). Some machine learning algorithms, especially autoencoders, attempt to explicitly learn the structure of the manifold.
    • Natural clustering: Many machine learning algorithms assume that each connected manifold in the input space may be assigned to a single class. The data may lie on many disconnected manifolds, but the class remains constant within each one of these. This assumption motivates a variety of learning algorithms, including tangent propagation, double backprop, the manifold tangent classifier and adversarial training.
    • Temporal and spatial coherence: Slow feature analysis and related algorithms make the assumption that the most important explanatory factors change slowly over time, or at least that it is easier to predict the true underlying explanatory factors than to predict raw observations such as pixel values. See section 13.3 for further description of this approach.
    • Sparsity: Most features should presumably not be relevant to describing most inputs—there is no need to use a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as “present” or “absent” should be absent most of the time.
    • Simplicity of Factor Dependencies: In good high-level representations, the factors are related to each other through simple dependencies. The simplest possible is marginal independence, \(P(\mathbf{h})=\prod_{i} P\left(\mathbf{h}_ {i}\right)\), but linear dependencies or those captured by a shallow autoencoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.
      • Consciousness Prior:
    • Causal/Mechanism Independence:
      • Controllable Factors.

    The concept of representation learning ties together all of the many forms of deep learning.
    Feedforward and recurrent networks, autoencoders and deep probabilistic models all learn and exploit representations. Learning the best possible representation remains an exciting avenue of research.

  3. Distribution Shift:


  1. It is also called a one-hot representation, since it can be captured by a binary vector with \(n\) bits that are mutually exclusive (only one of them can be active). 

  2. It proceeds one layer at a time, training the \(k\)-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced.

  3. Commonly, “pretraining” refers not only to the pretraining stage itself but to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase.

  4. This idea has guided a large amount of deep learning research since at least the 1990s (Becker and Hinton, 1992; Hinton and Sejnowski, 1999).

  5. For other arguments about when semi-supervised learning can outperform pure supervised learning, we refer the reader to section 1.2 of Chapelle et al. (2006).

  6. Using unsupervised representation learning that tries to disentangle the underlying factors of variation.

  7. See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption.