Table of Contents
- From Deep Learning of Disentangled Representations to Higher-level Cognition (Bengio Lec)
- Representation Learning (CMU Lec!)
- Representation Learning and Deep Learning (Bengio Talk)
- Deep Learning and Representation Learning (Hinton Talk)
- Goals and Principles of Representation Learning (inFERENCe!)
- DALI Goals and Principles of Representation Learning (vids!)
- Deep Learning of Representations: Looking Forward (Bengio paper!)
- On Learning Invariant Representations for Domain Adaptation (blog!)
- Contrastive Unsupervised Learning of Semantic Representations: A Theoretical Framework (Blog!)
Representation Learning
-
Representation Learning:
Representation Learning (Feature Learning) is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.
This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.
Hypothesis - Main Idea:
The core hypothesis for representation learning is that the unlabeled data can be used to learn a good representation.
Types:
Representation learning can be either supervised or unsupervised.
Representation Learning Approaches:
There are various ways of learning different representations:
- Probabilistic Models: the goal is to learn a representation that captures the probability distribution of the underlying explanatory features for the observed input. Such a learnt representation can then be used for prediction.
- Deep Learning: the representations are formed by composition of multiple non-linear transformations of the input data with the goal of yielding abstract and useful representations for tasks like classification, prediction etc.
Representation Learning Tradeoff:
Most representation learning problems face a tradeoff between preserving as much information about the input as possible and attaining nice properties (such as independence).
The Problem of Data (Semi-Supervised Learning):
We often have very large amounts of unlabeled training data and relatively little labeled training data. Training with supervised learning techniques on the labeled subset often results in severe overfitting. Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data. Specifically, we can learn good representations for the unlabeled data, and then use these representations to solve the supervised learning task.
Learning from Limited Data:
Humans and animals are able to learn from very few labeled examples.
Many factors could explain improved human performance — for example, the brain may use very large ensembles of classifiers or Bayesian inference techniques.
One popular hypothesis is that the brain is able to leverage unsupervised or semi-supervised learning.
Motivation/Applications:
- ML tasks such as classification often require input that is mathematically and computationally convenient to process.
However, real-world data such as images, video, and sensor data has not yielded to attempts to algorithmically define specific features.
- Learning good representations enables us to perform certain (specific) tasks in a more optimal manner.
- E.g. for maintaining a sorted collection: linked lists \(\implies\) \(\mathcal{O}(n)\) insertion | red-black trees \(\implies\) \(\mathcal{O}(\log n)\) insertion. The same data, represented differently, makes the task easier or harder.
-
- Goal: Learn Portuguese
- For 1 month you listen to Portuguese on the radio (this is unlabeled data)
- You develop an intuition for the language, phrases, and grammar (a model in your head)
- It is easier to learn from a tutor now because you have a better (higher-level) representation of the data/language
- Representation Learning is particularly interesting because it provides (one) way to perform unsupervised and semi-supervised learning.
- Feature Engineering is hard. Representation Learning allows us to avoid having to engineer features, manually.
- In general, representation learning can allow us to achieve multi-task learning, transfer learning, and domain adaptation through shared representations.
The Quality of Representations:
Generally speaking, a good representation is one that makes a subsequent learning task easier.
The choice of representation will usually depend on the choice of the subsequent learning task.
Success of Representation Learning:
The success of representation learning can be attributed to many factors, including:
- Theoretical advantages of distributed representations (Hinton et al., 1986)
- Theoretical advantages of deep representations (Hinton et al., 1986)
- The Causal Factors Hypothesis: a general idea of underlying assumptions about the data generating process, in particular about underlying causes of the observed data.
Representation Learning Domain Applications:
- Computer Vision: CNNs.
- Natural Language Processing: Word-Embeddings.
- Speech Recognition: Speech-Embeddings.
- “What is a good representation?”
- Generally speaking, a good representation is one that makes a subsequent learning task easier.
The choice of representation will usually depend on the choice of the subsequent learning task.
- “What makes one representation better than another?”
- Causal Factors Hypothesis:
An ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another.
- Why:
- Ease of Modeling: A representation that cleanly separates the underlying causal factors is, also, one that is easy to model.
- For many AI tasks the two properties coincide: once we are able to obtain the underlying explanations for the observations, it generally becomes easy to isolate individual attributes from the others.
- Specifically, if a representation \(\boldsymbol{h}\) represents many of the underlying causes of the observed \(\boldsymbol{x}\), and the outputs \(\boldsymbol{y}\) are among the most salient causes, then it is easy to predict \(\boldsymbol{y}\) from \(\boldsymbol{h}\).
- Summary of the Causal Factors Hypothesis:
An ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another, especially those factors that are relevant to our applications.
- “What is a “salient factor”?”
- A “salient factor” is a causal factor (latent variable) that explains the observed variations in \(X\) well.
- What makes a feature “salient” for humans?
It could be something really simple like correlation or predictive power.
Ears are a salient feature of humans because, in a majority of cases, the presence of one implies the presence of the other.
- Discriminative features as salient features:
Note that in object detection case, the predictive power is only measured in:
(ear \(\rightarrow\) person) direction, not (person \(\rightarrow\) ear) direction.
E.g. if your task was to discriminate between males and females, presence of ears would not be a useful feature even though all humans have ears. Compare this to the pimples case: in human vs dog classification, pimples are a really good predictor of ‘human’, even though they are not a salient feature of Humans.
Basically, I think discriminative \(\neq\) salient.
Notes:
- Representation Learning can be done with both, generative and discriminative models.
- In DL, representation learning uses a composition of transformations of the input data (features) to create learned features.
-
Distributed Representation:
Distributed Representations of concepts are representations composed of many elements that can be set separately from each other.
Distributed representations of concepts are one of the most important tools for representation learning:
- Distributed representations are powerful because they can use \(n\) features with \(k\) values to describe \(k^{n}\) different concepts.
- Both neural networks with multiple hidden units and probabilistic models with multiple latent variables make use of the strategy of distributed representation.
- Motivation for using Distributed Representations:
Many deep learning algorithms are motivated by the assumption that the hidden units can learn to represent the underlying causal factors that explain the data.
Distributed representations are natural for this approach, because each direction in representation space can correspond to the value of a different underlying configuration variable.
- Distributed vs Symbolic Representations:
- Number of “Representable” Configurations - by example:
- An example of a distributed representation is a vector of \(n\) binary features.
It can take \(2^{n}\) configurations, each potentially corresponding to a different region in input space.
- An example of a symbolic representation is the one-hot representation, where the input is associated with a single symbol or category.
If there are \(n\) symbols in the dictionary, one can imagine \(n\) feature detectors, each corresponding to the detection of the presence of the associated category.
In that case only \(n\) different configurations of the representation space are possible, carving \(n\) different regions in input space.
A symbolic representation is a specific example of the broader class of non-distributed representations, which are representations that may contain many entries but without significant meaningful separate control over each entry.
- Generalization:
An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts.
- As pure symbols, “cat” and “dog” are as far from each other as any other two symbols.
However, if one associates them with a meaningful distributed representation, then many of the things that can be said about cats can generalize to dogs and vice-versa.
- For example, our distributed representation may contain entries such as “has_fur” or “number_of_legs” that have the same value for the embedding of both “cat” and “dog.”
Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words (section 12.4).
Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.
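To make the contrast concrete, here is a minimal numpy sketch (not from the source; the attribute values are made up purely for illustration) comparing the similarity structure of one-hot symbols with that of a small hand-written distributed representation:

```python
# The attribute values below are made up purely for illustration.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Symbolic (one-hot) representation: every pair of distinct symbols is orthogonal.
one_hot = {"cat": np.array([1., 0., 0.]),
           "dog": np.array([0., 1., 0.]),
           "car": np.array([0., 0., 1.])}

# Distributed representation: entries like "has_fur" and "number_of_legs" are
# shared between related concepts.
#                        has_fur  number_of_legs  has_wheels
dist = {"cat": np.array([1.0,     4.0,            0.0]),
        "dog": np.array([1.0,     4.0,            0.0]),
        "car": np.array([0.0,     0.0,            4.0])}

print("one-hot     cat~dog:", cosine(one_hot["cat"], one_hot["dog"]))  # 0.0: no shared structure
print("distributed cat~dog:", cosine(dist["cat"], dist["dog"]))        # 1.0: cats and dogs are close
print("distributed cat~car:", cosine(dist["cat"], dist["car"]))        # 0.0: cats and cars are far
```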
- Examples of learning algorithms based on non-distributed (symbolic) representations:
- Clustering methods, including the \(k\)-means algorithm: each input point is assigned to exactly one cluster.
- k-nearest neighbors algorithms: one or a few templates or prototype examples are associated with a given input. In the case of \(k>1\), there are multiple values describing each input, but they can not be controlled separately from each other, so this does not qualify as a true distributed representation.
- Decision trees: only one leaf (and the nodes on the path from root to leaf) is activated when an input is given.
- Gaussian mixtures and mixtures of experts: the templates (cluster centers) or experts are now associated with a degree of activation. As with the k-nearest neighbors algorithm, each input is represented with multiple values, but those values cannot readily be controlled separately from each other.
- Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each “support vector” or template example is now continuous-valued, the same issue arises as with Gaussian mixtures.
- Language or translation models based on n-grams: The set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes. A leaf may correspond to the last two words being \(w_1\) and \(w_2\), for example. Separate parameters are estimated for each leaf of the tree (with some sharing being possible).
For some of these non-distributed algorithms, the output is not piecewise constant but instead interpolates between neighboring regions. The relationship between the number of parameters (or examples) and the number of regions they can define remains linear.
Generalization of Distributed Representations:
We know that for distributed representations, generalization arises due to shared attributes between different concepts. But an important question is:
“When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm?”
- Distributed representations can have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters.
- Some traditional nondistributed learning algorithms generalize only due to the smoothness assumption, which states that if \(u \approx v,\) then the target function \(f\) to be learned has the property that \(f(u) \approx f(v),\) in general.
There are many ways of formalizing such an assumption, but the end result is that if we have an example \((x, y)\) for which we know that \(f(x) \approx y,\) then we choose an estimator \(\hat{f}\) that approximately satisfies these constraints while changing as little as possible when we move to a nearby input \(x+\epsilon\).
- This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that increases and decreases many times in many different regions, we may need a number of examples that is at least as large as the number of distinguishable regions.
One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary decoder mapping from symbol to value.
However, this does not allow us to generalize to new symbols for new regions.
- If we are lucky, there may be some regularity in the target function, besides being smooth.
For example, a convolutional network with max-pooling can recognize an object regardless of its location in the image, even though spatial translation of the object may not correspond to smooth transformations in the input space.
Justifying Generalization in distributed representations:
- Geometric justification (by analyzing binary, linear feature extractors (units)):
Let us examine a special case of a distributed representation learning algorithm that extracts binary features by thresholding linear functions of the input:
- Each binary feature in this representation divides \(\mathbb{R}^{d}\) into a pair of half-spaces.
- The exponentially large number of intersections of \(n\) of the corresponding half-spaces determines the number of regions this distributed representation learner can distinguish.
- The number of regions generated by an arrangement of \(n\) hyperplanes in \(\mathbb{R}^{d}\):
By applying a general result concerning the intersection of hyperplanes (Zaslavsky, 1975), one can show (Pascanu et al., 2014b) that the number of regions this binary feature representation can distinguish is:
$$\sum_{j=0}^{d}\binom{n}{j}=O\left(n^{d}\right)$$
- Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.
- This provides a geometric argument to explain the generalization power of distributed representation:
with \(\mathcal{O}(n d)\) parameters (for \(n\) linear-threshold features in \(\mathbb{R}^{d}\)) we can distinctly represent \(\mathcal{O}\left(n^{d}\right)\) regions in input space.
- If instead we made no assumption at all about the data, and used a representation with a unique symbol for each region, and separate parameters for each symbol to recognize its corresponding portion of \(\mathbb{R}^{d}\), then specifying \(\mathcal{O}\left(n^{d}\right)\) regions would require \(\mathcal{O}\left(n^{d}\right)\) examples.
- More generally, the argument in favor of the distributed representation could be extended to the case where instead of using linear threshold units we use nonlinear, possibly continuous, feature extractors for each of the attributes in the distributed representation.
The argument in this case is that if a parametric transformation with \(k\) parameters can learn about \(r\) regions in input space, with \(k \ll r,\) and if obtaining such a representation was useful to the task of interest, then we could potentially generalize much better in this way than in a non-distributed setting where we would need \(\mathcal{O}(r)\) examples to obtain the same features and associated partitioning of the input space into \(r\) regions.
Using fewer parameters to represent the model means that we have fewer parameters to fit, and thus require far fewer training examples to generalize well.
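As a quick sanity check of the formula above, here is a short Python sketch (illustrative numbers only, not from the source) comparing the \(O\left(n^{d}\right)\) regions distinguishable by \(n\) linear-threshold features in \(\mathbb{R}^{d}\) against the \(n\) regions of a one-symbol-per-region representation with the same number of units:

```python
# Illustrative numbers only.
from math import comb

def regions(n, d):
    """Max number of regions carved out by n hyperplanes in R^d (Zaslavsky, 1975)."""
    return sum(comb(n, j) for j in range(d + 1))

n, d = 20, 5   # 20 binary linear-threshold features on 5-dimensional inputs
print("distributed representation:", regions(n, d), "regions with O(n*d) parameters")  # 21700
print("symbolic representation:   ", n, "regions with O(n) symbols")                   # 20
```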
- VC-Theory justification - Fixed Capacity:
The capacity remains limited despite being able to distinctly encode so many different regions.
For example, the VC-dimension of a neural network of linear threshold units is only \(\mathcal{O}(w \log w),\) where \(w\) is the number of weights (Sontag, 1998).
This limitation arises because, while we can assign very many unique codes to representation space, we cannot:
- Use absolutely all of the code space
- Learn arbitrary functions mapping from the representation space \(h\) to the output \(y\) using a linear classifier.
The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by \(h\).
We will typically want to learn categories such as the set of all images of all green objects or the set of all images of cars, but not categories that require nonlinear, \(\mathrm{XOR}\) logic. For example, we typically do not want to partition the data into the set of all red cars and green trucks as one class and the set of all green cars and red trucks as another class.
- Experimental justification:
Though the above ideas have been abstract, they may be experimentally validated:
- Zhou et al. (2015) find that hidden units in a deep convolutional network trained on the ImageNet and Places benchmark datasets learn features that are very often interpretable, corresponding to a label that humans would naturally assign.
In practice it is certainly not always the case that hidden units learn something that has a simple linguistic name, but it is interesting to see this emerge near the top levels of the best computer vision deep networks. What such features have in common is that one could imagine learning about each of them without having to see all the configurations of all the others.
- Radford et al. (2015) demonstrated that a generative model can learn a representation of images of faces, with separate directions in representation space capturing different underlying factors of variation.
For example, one direction in representation space corresponds to whether the person is male or female, while another corresponds to whether the person is wearing glasses.
These features were discovered automatically, not fixed a priori.
There is no need to have labels for the hidden unit classifiers: gradient descent on an objective function of interest naturally learns semantically interesting features, so long as the task requires such features.
We can learn about the distinction between male and female, or about the presence or absence of glasses, without having to characterize all of the configurations of the \(n − 1\) other features by examples covering all of these combinations of values.
This form of statistical separability is what allows one to generalize to new configurations of a person’s features that have never been seen during training.
Notes:
page 542
As a counter-example, recent research from DeepMind (Morcos et al., 2018) suggests that while some hidden units might appear to learn an interpretable feature, ‘these interpretable neurons are no more important than confusing neurons with difficult-to-interpret activity’.
Moreover, ‘networks which generalise well are much less reliant on single directions [ie. hidden units] than those which memorise’. See more in the DeepMind blog post.
- Distributed representations based on latent variables can obtain all of the advantages of representation learning that we have seen with deep feedforward and recurrent networks.
- Food for Thought (F2T):
“since feature engineering was made obsolete by deep learning, algorithm engineering will be made obsolete by meta-learning” - Sohl-Dickstein
-
Deep Representations - Exponential Gain from Depth:
Exponential Gain in MLPs:
We have seen in (section 6.4.1) that multilayer perceptrons are universal approximators, and that some functions can be represented by exponentially smaller deep networks compared to shallow networks.
This decrease in model size leads to improved statistical efficiency.
Similar results apply, more generally, to other kinds of models with distributed hidden representations.
Justification/Motivation:
In this and other AI tasks, the factors that can be chosen almost independently from each other yet still correspond to meaningful inputs are more likely to be very high-level and related in highly nonlinear ways to the input.
Goodfellow et al. argue that this demands deep distributed representations, where the higher-level features (seen as functions of the input) or factors (seen as generative causes) are obtained through the composition of many nonlinearities.
E.g. the example of a generative model that learned about the explanatory factors underlying images of faces, including the person’s gender and whether they are wearing glasses.
It would not be reasonable to expect a shallow network, such as a linear network, to learn the complicated relationship between these abstract explanatory factors and the pixels in the image.
Universal Approximation property in Models (from Depth):
- It has been proven in many different settings that organizing computation through the composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a distributed representation.
- Many kinds of networks (e.g., with saturating nonlinearities, Boolean gates, sum/products, or RBF units) with a single hidden layer can be shown to be universal approximators.
A model family that is a universal approximator can approximate a large class of functions (including all continuous functions) up to any non-zero tolerance level, given enough hidden units.
However, the required number of hidden units may be very large.
- Theoretical results concerning the expressive power of deep architectures state that there are families of functions that can be represented efficiently by an architecture of depth \(k\), but would require an exponential number of hidden units (wrt. input size) with insufficient depth (depth \(2\) or depth \(k − 1\)).
Exponential Gains in Structured Probabilistic Models:
- PGMs as Universal Approximators:
- Just as deterministic feedforward networks are universal approximators of functions, many structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks, are universal approximators of probability distributions (Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al., 2013).
- Exponential Gain from Depth in PGMs:
- Just as a sufficiently deep feedforward network can have an exponential advantage over a network that is too shallow, such results can also be obtained for other models, such as probabilistic models.
- E.g. The sum-product network (SPN) (Poon and Domingos, 2011).
These models use polynomial circuits to compute the probability distribution over a set of random variables.
- Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum depth of SPN is required to avoid needing an exponentially large model.
- Later, Martens and Medabalimi (2014) showed that there are significant differences between every two finite depths of SPN, and that some of the constraints used to make SPNs tractable may limit their representational power.
Expressiveness of Convolutional Networks:
Another interesting development is a set of theoretical results for the expressive power of families of deep circuits related to convolutional nets:
They highlight an exponential advantage for the deep circuit even when the shallow circuit is allowed to only approximate the function computed by the deep circuit (Cohen et al., 2015).
By comparison, previous theoretical work made claims regarding only the case where the shallow circuit must exactly replicate particular functions.
Unsupervised Representation Learning
-
Unsupervised Representation Learning:
In Unsupervised feature learning, features are learned with unlabeled data.
The Goal of unsupervised feature learning is often to discover low-dimensional features that capture some structure underlying the high-dimensional input data.
Examples:
- (Unsupervised) Dictionary Learning
- ICA/PCA
- AutoEncoders
- Matrix Factorization
- Clustering Algorithms
Learning:
Unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect.
Unsupervised Learning for Semi-supervised Learning:
When the feature learning is performed in an unsupervised way, it enables a form of semi-supervised learning where features learned from an unlabeled dataset are then employed to improve performance in a supervised setting with labeled data.
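A minimal sketch of this semi-supervised recipe, assuming scikit-learn is available; the digits dataset stands in for "a large unlabeled pool plus a small labeled subset", and PCA stands in for any of the unsupervised methods listed above:

```python
# Assumes scikit-learn; PCA is used purely for illustration, and any other
# unsupervised method from the list above would play the same role.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_lab, X_test, y_lab, y_test = train_test_split(X, y, train_size=100, random_state=0)

# Unsupervised phase: the representation is learned from inputs alone (labels ignored).
pca = PCA(n_components=30).fit(X)
h_lab, h_test = pca.transform(X_lab), pca.transform(X_test)

# Supervised phase: a simple classifier on top of the learned representation,
# using only the small labeled subset.
clf = LogisticRegression(max_iter=1000).fit(h_lab, y_lab)
print("test accuracy with 100 labels:", clf.score(h_test, y_test))
```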
-
Greedy Layer-Wise Unsupervised Pretraining:
(Greedy Layer-Wise) Unsupervised Pretraining:
- Greedy: it is a greedy algorithm.
It optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces.
- Layer-Wise: the independent pieces are the layers of the network.
- Unsupervised: each layer is trained with an unsupervised representation learning algorithm.
- Pretraining: it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together.
This procedure is a canonical example of how a representation learned for one task (unsupervised learning, trying to capture the shape of the input distribution) can sometimes be useful for another task (supervised learning with the same input domain).
Algorithm/Procedure:
- (Unsupervised) Pretraining Phase:
Each layer is trained greedily, one at a time, with an unsupervised representation learning algorithm, taking the output of the previously trained layers as its input.
- Supervised Learning Phase:
It may involve:
- Training a simple classifier on top of the features learned in the pretraining phase.
- Supervised fine-tuning of the entire network learned in the pretraining phase.
Interpretation in Supervised Settings:
In the context of a supervised learning task, the procedure can be viewed as:- A Regularizer.
In some experiments, pretraining decreases test error without decreasing training error.
- A form of Parameter Initialization.
Applications:
- Training Deep Models:
Greedy layer-wise training procedures based on unsupervised criteria have long been used to sidestep the difficulty of jointly training the layers of a deep neural net for a supervised task.
The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures.
Prior to this discovery, only convolutional deep networks or networks whose depth resulted from recurrence were regarded as feasible to train.
- Parameter Initialization:
They can also be used as initialization for other unsupervised learning algorithms, such as:
- Deep Autoencoders (Hinton and Salakhutdinov, 2006)
- Probabilistic Models with many layers of latent variables:
E.g. deep belief networks (DBNs) (Hinton et al., 2006) and deep Boltzmann machines (DBMs) (Salakhutdinov and Hinton, 2009a).
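A schematic PyTorch sketch of the procedure; sizes, data, and the training schedule are placeholders, and it illustrates the greedy layer-wise idea rather than any specific paper's recipe:

```python
# A schematic sketch of greedy layer-wise unsupervised pretraining with
# single-layer autoencoders, followed by supervised fine-tuning.
# Sizes, data, and the training schedule below are placeholders.
import torch
import torch.nn as nn

def pretrain_layer(layer, batches, epochs=5, lr=1e-3):
    """Train `layer` as the encoder of a one-hidden-layer autoencoder."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for x in batches:                          # unlabeled mini-batches
            recon = decoder(torch.relu(layer(x)))  # encode, then reconstruct
            loss = nn.functional.mse_loss(recon, x)
            opt.zero_grad(); loss.backward(); opt.step()
    return layer

def greedy_pretrain(sizes, batches):
    """Pretrain a stack of layers, one at a time, on the codes of the previous layer."""
    layers = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        layer = pretrain_layer(nn.Linear(d_in, d_out), batches)
        layers.append(layer)
        batches = [torch.relu(layer(x)).detach() for x in batches]  # feed codes forward
    return layers

unlabeled = [torch.randn(32, 784) for _ in range(10)]   # placeholder unlabeled data
stack = greedy_pretrain([784, 256, 64], unlabeled)

# Fine-tuning phase: use the pretrained layers to initialize a supervised network.
model = nn.Sequential(stack[0], nn.ReLU(), stack[1], nn.ReLU(), nn.Linear(64, 10))
# ... then jointly train `model` with a supervised loss (e.g. cross-entropy) on labeled data.
```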
- Clustering | K-Means:
- Locally Linear Embedding (LLE):
- Principal Components Analysis (PCA):
- Independent Components Analysis (ICA):
- (Unsupervised) Dictionary Learning:
Supervised Representation Learning
-
Supervised Representation Learning:
In Supervised feature learning, features are learned using labeled data.
Learning:
The data label allows the system to compute an error term, the degree to which the system fails to produce the label, which can then be used as feedback to correct the learning process (reduce/minimize the error).
Examples:
- Supervised Neural Networks
- Supervised Dictionary Learning
FFNs as Representation Learning Algorithms:
- We can think of Feed-Forward Neural Networks trained by supervised learning as performing a kind of representation learning.
- All the layers except the last layer (usually a linear classifier), are basically producing representations (featurizing) of the input.
- Training with a supervised criterion naturally leads to the representation at every hidden layer (but more so near the top hidden layer) taking on properties that make the classification task easier:
E.g. Making classes linearly separable in the latent space.
- The features in the penultimate layer should learn different properties depending on the type of the last layer.
- Supervised training of feedforward networks does not involve explicitly imposing any condition on the learned intermediate features.
- We can, however, explicitly impose certain desirable conditions.
Suppose we want to learn a representation that makes density estimation easier. Distributions with more independences are easier to model, so we could design an objective function that encourages the elements of the representation vector \(\boldsymbol{h}\) to be independent.
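A minimal PyTorch sketch of this view, with arbitrary layer sizes: everything up to the last layer is a featurizer \(x \mapsto h\), and the last layer is a linear classifier on \(h\):

```python
# Arbitrary sizes; the point is the split between "featurizer" and "last layer".
import torch
import torch.nn as nn

featurizer = nn.Sequential(            # all layers except the last: x -> h
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
)
classifier = nn.Linear(64, 10)         # the last layer: a linear classifier on h

x = torch.randn(32, 784)               # placeholder batch
h = featurizer(x)                      # learned representation (penultimate features)
logits = classifier(h)                 # class scores; the whole stack is trained end-to-end

# After supervised training, `h` can be reused as features for other tasks
# (a different classifier, clustering, retrieval, ...).
```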
-
Greedy Layer-Wise Supervised Pretraining:
As discussed in section 8.7.4, it is also possible to have greedy layer-wise supervised pretraining.
This builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts (Erhan et al., 2010).
- Neural Networks:
- Supervised Dictionary Learning:
Transfer Learning and Domain Adaptation
- ICML 2018: Advances in transfer, multitask, and semi-supervised learning (blog)
- Transfer Learning (Ruder Blog!)
- Multi-Task Learning Objectives for Natural Language Processing (Ruder Blog)
- An Overview of Multi-Task Learning in Deep Neural Networks (Ruder Blog!)
- Transfer Learning Overview (paper!)
- Transfer Learning for Deep Learning (Blog! - Resources!)
- How transferable are features in deep neural networks? (paper)
- On Learning Invariant Representations for Domain Adaptation (blog!)
- A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning (blog)
-
Introduction - Transfer Learning and Domain Adaptation:
Transfer Learning and Domain Adaptation refer to the situation where what has been learned in one setting (i.e., distribution \(P_{1}\)) is exploited to improve generalization in another setting (say distribution \(P_{2}\)).
This is a generalization of unsupervised pretraining, where we transferred representations between an unsupervised learning task and a supervised learning task.
In Supervised Learning: transfer learning, domain adaptation, and concept drift can be viewed as particular forms of Multi-Task Learning.
However, Transfer Learning is a more general term that applies to both Supervised and Unsupervised Learning, as well as, Reinforcement Learning.
Goal/Objective and Relation to Representation Learning:
In the cases of Transfer Learning, Multi-Task Learning, and Domain Adaptation: The Objective/Goal is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting.
The core idea of Representation Learning is that the same representation may be useful in both settings.
Thus, we can use shared representations to accomplish Transfer Learning etc.
Shared Representations are useful to handle multiple modalities or domains, or to transfer learned knowledge to tasks for which few or no examples are given but a task representation exists.
-
Transfer Learning:
Transfer Learning (in ML) is the problem of storing knowledge gained while solving one problem and applying it to a different but related problem.
Definition:
Formally, the definition of transfer learning is given in terms of:
- A Domain \(\mathcal{D}=\{\mathcal{X}, P(X)\}\), consisting of:
- Feature Space \(\mathcal{X}\)
- Marginal Probability Distribution \(P(X)\),
where \(X=\left\{x_{1}, \ldots, x_{n}\right\} \in \mathcal{X}\).
- A Task \(\mathcal{T}=\{\mathcal{Y}, f(\cdot)\}\),
(given a specific domain \(\mathcal{D}=\{\mathcal{X}, P(X)\}\)) consisting of:
- A label space \(\mathcal{Y}\)
- An objective predictive function \(f(\cdot)\)
It is learned from the training data, which consist of pairs \(\left\{x_ {i}, y_{i}\right\}\), where \(x_{i} \in X\) and \(y_{i} \in \mathcal{Y}\).
It can be used to predict the corresponding label, \(f(x)\), of a new instance \(x\).
Given a source domain \(\mathcal{D}_ {S}\) and learning task \(\mathcal{T}_ {S}\), a target domain \(\mathcal{D}_ {T}\) and learning task \(\mathcal{T}_ {T}\), transfer learning aims to help improve the learning of the target predictive function \(f_ {T}(\cdot)\) in \(\mathcal{D}_ {T}\) using the knowledge in \(\mathcal{D}_ {S}\) and \(\mathcal{T}_ {S}\), where \(\mathcal{D}_ {S} \neq \mathcal{D}_ {T}\), or \(\mathcal{T}_ {S} \neq \mathcal{T}_ {T}\).
In Transfer Learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in \(P_1\) are relevant to the variations that need to be captured for learning \(P_2\). This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature.
We may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting.
If there is significantly more data in the first setting (sampled from \(P_1\)), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from \(P_2\).
Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.
Types of Transfer Learning:
- Inductive Transfer Learning:
\(\mathcal{D}_ {S} = \mathcal{D}_ {T} \:\:\: \text{ and }\:\:\: \mathcal{T}_ {S} \neq \mathcal{T}_ {T}\)
e.g. \(\left(\mathcal{D}_ {S} = \text{ Wikipedia } = \mathcal{D}_ {T}\right) \:\: \text{ and } \:\: \left(\mathcal{T}_ {S} = \text{ Skip-Gram }\right) \neq \left(\mathcal{T}_ {T} = \text{ Classification }\right)\)
- Transductive Transfer Learning (Domain Adaptation):
\(\mathcal{D}_ {S} \neq \mathcal{D}_ {T} \:\:\: \text{ and }\:\:\: \mathcal{T}_ {S} = \mathcal{T}_ {T}\)
e.g. \(\left(\mathcal{D}_ {S} = \text{ Reviews }\right) \neq \left(\mathcal{D}_ {T} = \text{ Tweets }\right) \:\: \text{ and } \:\: \left(\mathcal{T}_ {S} = \text{ Sentiment Analysis } = \mathcal{T}_ {T}\right)\)
- Unsupervised Transfer Learning:
\(\mathcal{D}_ {S} \neq \mathcal{D}_ {T} \:\:\: \text{ and }\:\:\: \mathcal{T}_ {S} \neq \mathcal{T}_ {T}\)
e.g. \(\left(\mathcal{D}_ {S} = \text{ Animals}\right) \neq \left(\mathcal{D}_ {T} = \text{ Cars}\right) \: \text{ and } \: \left(\mathcal{T}_ {S} = \text{ Recog.}\right) \neq \left(\mathcal{T}_ {T} = \text{ Detection}\right)\)
Concept Drift:
Concept Drift is a phenomenon where the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.
It can be viewed as a form of transfer learning due to gradual changes in the data distribution over time.
Another example is in reinforcement learning. Since the agent’s policy affects the environment, the agent learning and updating its policy directly results in a changing environment with shifting data distribution.
Unsupervised Deep Learning for Transfer Learning:
Unsupervised Deep Learning for Transfer Learning has seen success in some machine learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011).
In the first of these competitions, the experimental setup is the following:
- Each participant is first given a dataset from the first setting (from distribution \(P_1\)), illustrating examples of some set of categories.
- The participants must use this to learn a good feature space (mapping the raw input to some representation), such that when we apply this learned transformation to inputs from the transfer setting (distribution \(P_2\) ), a linear classifier can be trained and generalize well from very few labeled examples.
One of the most striking results found in this competition is that as an architecture makes use of deeper and deeper representations (learned in a purely unsupervised way from data collected in the first setting, \(P_1\)), the learning curve on the new categories of the second (transfer) setting \(P_2\) becomes much better.
For deep representations, fewer labeled examples of the transfer tasks are necessary to achieve the apparently asymptotic generalization performance.
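A schematic PyTorch sketch of transfer through a shared representation; the `pretrained_encoder` below is a placeholder for any model previously trained on the source setting \(P_1\), and the few labeled target examples are synthetic:

```python
# A schematic sketch: freeze the transferred representation and fit only a
# small task-specific head on the few labeled examples from the target setting P2.
import torch
import torch.nn as nn

pretrained_encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())  # stand-in for a P1-trained model

for p in pretrained_encoder.parameters():   # freeze the transferred representation
    p.requires_grad = False

head = nn.Linear(128, 5)                    # new head for 5 target classes
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x_t = torch.randn(20, 784)                  # a handful of labeled P2 examples (placeholder)
y_t = torch.randint(0, 5, (20,))
for _ in range(100):
    logits = head(pretrained_encoder(x_t))  # reuse the shared representation
    loss = nn.functional.cross_entropy(logits, y_t)
    opt.zero_grad(); loss.backward(); opt.step()
```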
-
Domain Adaptation:
Domain Adaptation is a form of transfer learning where we aim to learn, from a source data distribution, a well-performing model for a different (but related) target data distribution.
It is a sequential process.
In domain adaptation, the task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different.
Consider the task of sentiment analysis, which consists of determining whether a comment expresses positive or negative sentiment. Comments posted on the web come from many categories. A domain adaptation scenario can arise when a sentiment predictor trained on customer reviews of media content such as books, videos and music is later used to analyze comments about consumer electronics such as televisions or smartphones.
One can imagine that there is an underlying function that tells whether any statement is positive, neutral or negative, but of course the vocabulary and style may vary from one domain to another, making it more difficult to generalize across domains.
Simple unsupervised pretraining (with denoising autoencoders) has been found to be very successful for sentiment analysis with domain adaptation (Glorot et al., 2011b).
-
Multitask Learning:
Multitask Learning is a form of transfer learning where multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks.
In particular, it is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.
It is a parallel process.
Multitask vs Transfer Learning:
- Multi-Task Learning: general term for training on multiple tasks
- Joint Learning: by choosing mini-batches from two different tasks simultaneously/alternately
- Pre-Training: first train on one task, then train on another
widely used for word embeddings.
- Transfer Learning:
a type of multi-task learning where we are focused on one task; by learning on another task then applying those models to our main task
-
Representation Learning for the Transfer of Knowledge:
In general, Representation Learning can be used to achieve Multi-Task Learning, Transfer Learning, and Domain Adaptation when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting.
This applies in two cases:
- Shared Input Semantics:
We may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting.
If there is significantly more data in the first setting (sampled from \(P_1\)), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from \(P_2\).
Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.
In this case, we share the lower layers and have task-dependent upper layers.
- Shared Output Semantics:
A speech recognition system needs to produce valid sentences at the output layer, but the earlier layers near the input may need to recognize very different versions of the same phonemes or sub-phonemic vocalizations depending on which person is speaking.
In cases like these, it makes more sense to share the upper layers (near the output) of the neural network, and have a task-specific preprocessing.
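A schematic PyTorch sketch of the shared-input-semantics case above; sizes and the two tasks are placeholders:

```python
# A schematic sketch: lower layers are shared across tasks; each task has its own head.
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # shared lower layers
head_a = nn.Linear(256, 10)                              # head for task A (a 10-class problem)
head_b = nn.Linear(256, 2)                               # head for task B (a 2-class problem)

params = list(shared.parameters()) + list(head_a.parameters()) + list(head_b.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x_a, y_a = torch.randn(16, 784), torch.randint(0, 10, (16,))   # task A mini-batch
x_b, y_b = torch.randn(16, 784), torch.randint(0, 2, (16,))    # task B mini-batch

# One joint step: both tasks' losses back-propagate into the shared layers.
loss = (nn.functional.cross_entropy(head_a(shared(x_a)), y_a)
        + nn.functional.cross_entropy(head_b(shared(x_b)), y_b))
opt.zero_grad(); loss.backward(); opt.step()
```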
-
K-Shot Learning:
K-Shot (Few-Shot) Learning is a supervised learning setting (problem) where the goal is to learn from an extremely small number \(k\) of labeled examples (called shots).
General Setting:
We first train a model on a large dataset \(\widetilde{\mathcal{D}}=\left\{\widetilde{\mathbf{x}}_ {i}, \widetilde{y}_ {i}\right\}_ {i=1}^{\widetilde{N}}\) of inputs \(\widetilde{\mathbf{x}}_ {i}\) and labels \(\widetilde{y}_ {i} \in\{1, \ldots, \widetilde{C}\}\) that indicate which of the \(\widetilde{C}\) classes each input belongs to.
Then, using knowledge from the model trained on the large dataset, we perform \(\mathrm{k}\)-shot learning with a small dataset \(\mathcal{D}=\left\{\mathbf{x}_ {i}, y_ {i}\right\}_ {i=1}^{N}\) with \(C\) new classes, labels \(y_ {i} \in\{\widetilde{C}+1, \ldots, \widetilde{C}+C\}\), and \(k\) examples (inputs) from each new class.
During test time we classify unseen examples (inputs) \(\mathbf{x}^{* }\) from the new classes \(C\) and evaluate the predictions against ground truth labels \(y^{* }\).
Comparison to alternative Learning Paradigms:
As Transfer Learning:
Two extreme forms of transfer learning are One-Shot Learning and Zero-Shot Learning; they provide only one and zero labeled examples of the transfer task, respectively.
One-Shot Learning:
One-Shot Learning (Fei-Fei et al., 2006) is a form of k-shot learning where \(k=1\).
It is possible because the representation learns to cleanly separate the underlying classes during the first stage.
During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space.
This works to the extent that the factors of variation corresponding to these invariances have been cleanly separated from the other factors, in the learned representation space, and we have somehow learned which factors do and do not matter when discriminating objects of certain categories.
Zero-Shot Learning:
Zero-Shot Learning (Palatucci et al., 2009; Socher et al., 2013b) or Zero-data learning (Larochelle et al., 2008) is a form of k-shot learning where \(k=0\).
Example: Zero-Shot Learning Setting
Consider the problem of having a learner read a large collection of text and then solve object recognition problems.
It may be possible to recognize a specific object class even without having seen an image of that object, if the text describes the object well enough.
For example, having read that a cat has four legs and pointy ears, the learner might be able to guess that an image is a cat, without having seen a cat before.
Justification and Interpretation:
Zero-Shot Learning is only possible because additional information has been exploited during training.
We can think of the zero-data learning scenario as including three random variables:
- (Traditional) Inputs \(x\)
- (Traditional) Outputs or Targets \(\boldsymbol{y}\)
- (Additional) Random Variable describing the task, \(T\)
The model is trained to estimate the conditional distribution \(p(\boldsymbol{y} \vert \boldsymbol{x}, T)\).
-
In the example of recognizing cats after having read about cats, the output is a binary variable \(y\) with \(y=1\) indicating “yes” and \(y=0\) indicating “no”.
The task variable \(T\) then represents questions to be answered such as “Is there a cat in this image?”.
If we have a training set containing unsupervised examples of objects that live in the same space as \(T\), we may be able to infer the meaning of unseen instances of \(T\).
In our example of recognizing cats without having seen an image of the cat, it is important that we have had unlabeled text data containing sentences such as “cats have four legs” or “cats have pointy ears”.
Representing the task \(T\):
Zero-shot learning requires \(T\) to be represented in a way that allows some sort of generalization.
For example, \(T\) cannot be just a one-hot code indicating an object category.
Socher et al. (2013b) provide instead a distributed representation of object categories by using a learned word embedding for the word associated with each category.
Representation Learning for Zero-Shot Learning:
The principle underlying zero-shot learning as a form of transfer learning is to capture a representation in one modality, a representation in another modality, and the relationship (in general a joint distribution) between pairs \((\boldsymbol{x}, \boldsymbol{y})\) consisting of one observation \(\boldsymbol{x}\) in one modality and another observation \(\boldsymbol{y}\) in the other modality (Srivastava and Salakhutdinov, 2012).
By learning all three sets of parameters (from \(\boldsymbol{x}\) to its representation, from \(\boldsymbol{y}\) to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs.
In particular, transfer learning between two domains \(x\) and \(y\) enables zero-shot learning.
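A minimal numpy sketch of this idea; the class embeddings and the map `W` below are random placeholders rather than learned values, and the point is only the mechanics of prediction in a shared space: class names live in a word-embedding space, inputs are mapped into the same space, and an unseen class is predicted by nearest class embedding.

```python
# The class embeddings and the map `W` are random placeholders, not learned values.
import numpy as np

rng = np.random.default_rng(0)
class_emb = {"cat": rng.normal(size=50),     # stand-ins for learned word embeddings
             "dog": rng.normal(size=50),
             "truck": rng.normal(size=50)}   # "truck" has no labeled images (zero-shot class)

W = rng.normal(size=(50, 512))               # stand-in for a learned map: input features -> word space

def predict(features):
    z = W @ features                         # project the input into the shared space
    scores = {c: z @ e / (np.linalg.norm(z) * np.linalg.norm(e))   # cosine similarity
              for c, e in class_emb.items()}
    return max(scores, key=scores.get)       # nearest class embedding

print(predict(rng.normal(size=512)))         # can output "truck" despite zero labeled truck images
```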
Zero-Shot Learning in Machine Translation:
A similar phenomenon happens in machine translation (Klementiev et al., 2012; Mikolov et al., 2013b; Gouws et al., 2014):
we have words in one language, and the relationships between words can be learned from unilingual corpora; on the other hand, we have translated sentences which relate words in one language with words in the other. Even though we may not have labeled examples translating word \(A\) in language \(X\) to word \(B\) in language \(Y\), we can generalize and guess a translation for word \(A\) because we have learned a distributed representation for words in language \(X\), a distributed representation for words in language \(Y\), and created a link (possibly two-way) relating the two spaces, via training examples consisting of matched pairs of sentences in both languages.
This transfer will be most successful if all three ingredients (the two representations and the relations between them) are learned jointly.
Relation to Multi-modal Learning:
Zero-Shot Learning can be performed using Multi-modal Learning, and vice-versa.
The same principle of transfer learning with representation learning explains how one can perform either task.
Notes:
- K-Shot Learning (Thesis!)
- One Shot Learning and Siamese Networks in Keras (Code - Tutorial)
- Zero-Shot Learning extends supervised learning to settings where, for example, a classification problem must be solved even though no labeled examples are available for some of the classes.
“Zero-shot learning is being able to solve a task despite not having received any training examples of that task.” - Goodfellow
- Detecting Gravitational Waves is a form of Zero-Shot Learning
- Few-shot, one-shot or zero-shot learning are encompassed in a recently emerging field known as meta-learning.
While traditionally including mainly classification, recent works in meta-learning have included regression and reinforcement learning (Vinyals et al., 2016) (Andrychowicz et al., 2016) (Ravi & Larochelle, 2017) (Duan et al., 2017) (Finn et al., 2017).
Work in this area seems to be primarily motivated by the notion of human-level AI, since humans appear to require far less training data than most deep learning models.
-
Multi-Modal Learning:
Multi-Modal Learning:
Representation Learning for Multi-modal Learning:
The same principle, underlying zero-shot learning as a form of transfer learning, explains how one can perform multi-modal learning; capturing a representation in one modality, a representation in the other, and the relationship (in general a joint distribution) between pairs \((\boldsymbol{x}, \boldsymbol{y})\) consisting of one observation \(\boldsymbol{x}\) in one modality and another observation \(\boldsymbol{y}\) in the other modality (Srivastava and Salakhutdinov, 2012).
By learning all three sets of parameters (from \(\boldsymbol{x}\) to its representation, from \(\boldsymbol{y}\) to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs.
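A schematic PyTorch sketch of the three ingredients named above: an encoder per modality plus a learned relationship between the two representations. Note that it uses a simple contrastive alignment over matched pairs rather than the joint generative model of Srivastava and Salakhutdinov (2012), and all dimensions and data are placeholders:

```python
# A schematic sketch: an encoder for modality x, an encoder for modality y, and a
# relationship between the two representations learned from matched (x, y) pairs.
import torch
import torch.nn as nn

enc_x = nn.Linear(512, 64)    # x -> shared representation space (e.g. image features)
enc_y = nn.Linear(300, 64)    # y -> shared representation space (e.g. text features)
opt = torch.optim.Adam(list(enc_x.parameters()) + list(enc_y.parameters()), lr=1e-3)

x, y = torch.randn(32, 512), torch.randn(32, 300)   # 32 matched (x, y) pairs
targets = torch.arange(32)                          # the i-th x matches the i-th y

for _ in range(100):
    zx = nn.functional.normalize(enc_x(x), dim=1)
    zy = nn.functional.normalize(enc_y(y), dim=1)
    logits = zx @ zy.t() / 0.1                      # similarity of every x to every y
    loss = nn.functional.cross_entropy(logits, targets)
    opt.zero_grad(); loss.backward(); opt.step()

# After training, either encoder maps its modality into the shared space, so
# nearest-neighbour lookup across modalities (x -> y or y -> x) becomes possible.
```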
Causal Factor Learning
-
Semi-Supervised Disentangling of Causal Factors:
Quality of Representations:
- An important question in Representation Learning is:
“what makes one representation better than another?”
- One answer to that is the Causal Factors Hypothesis:
An ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another.
- This hypothesis motivates approaches in which we first seek a good representation for \(p(\boldsymbol{x})\).
This representation may also be a good representation for computing \(p(\boldsymbol{y} \vert \boldsymbol{x})\) if \(\boldsymbol{y}\) is among the most salient causes of \(\boldsymbol{x}\).
- Ease of Modeling:
In many approaches to representation learning, we are often concerned with a representation that is easy to model (e.g. sparse entries, independent entries etc.).
It is not immediately obvious, however, that a representation that cleanly separates the underlying causal factors is also one that is easy to model.
The answer to that is an extension of the Causal Factor Hypothesis:
For many AI tasks the two properties coincide: once we are able to obtain the underlying explanations for the observations, it generally becomes easy to isolate individual attributes from the others.
Specifically, if a representation \(\boldsymbol{h}\) represents many of the underlying causes of the observed \(\boldsymbol{x}\), and the outputs \(\boldsymbol{y}\) are among the most salient causes, then it is easy to predict \(\boldsymbol{y}\) from \(\boldsymbol{h}\).
The complete Causal Factors Hypothesis motivates Semi-Supervised Learning via Unsupervised Representation Learning.
Analysis - When does Semi-Supervised Learning Work:
- When does Semi-Supervised Disentangling of Causal Factors Work?
Let’s start by considering two scenarios where Semi-Supervised Learning via Unsupervised Representation Learning fails and succeeds (a small sketch of the second scenario follows the two cases):
-
Let us see how semi-supervised learning can fail because unsupervised learning of \(p(\mathbf{x})\) is of no help to learn \(p(\mathbf{y} \vert \mathbf{x})\).
Consider the case where \(p(\mathbf{x})\) is uniformly distributed and we want to learn \(f(\boldsymbol{x})=\mathbb{E}[\mathbf{y} \vert \boldsymbol{x}]\).
Clearly, observing a training set of \(\boldsymbol{x}\) values alone gives us no information about \(p(\mathbf{y} \vert \mathbf{x})\). -
Consider the case where \(x\) arises from a mixture, with one mixture component per value of \(y\).
If the mixture components are well-separated, then modeling \(p(x)\) reveals precisely where each component is, and a single labeled example of each class will then be enough to perfectly learn \(p(\mathbf{y} \vert \mathbf{x})\).
Thus, we conclude that semi-supervised learning works when \(p(\mathbf{y} \vert \mathbf{x})\) and \(p(\mathbf{x})\) are tied together.
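A minimal sketch of the second (success) scenario, assuming scikit-learn and synthetic data: fit a two-component Gaussian mixture to unlabeled \(x\), then use a single labeled example per class to name the components.

```python
# Assumes scikit-learn; the data are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x0 = rng.normal(loc=-5.0, size=(500, 1))     # unlabeled samples generated by class y=0
x1 = rng.normal(loc=+5.0, size=(500, 1))     # unlabeled samples generated by class y=1
X = np.vstack([x0, x1])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # unsupervised: models p(x) only

# A single labeled example per class is enough to attach labels to the components.
label_of_comp = {gmm.predict([[-5.0]])[0]: 0,
                 gmm.predict([[+5.0]])[0]: 1}

x_new = np.array([[4.2], [-3.7]])
print([label_of_comp[c] for c in gmm.predict(x_new)])          # -> [1, 0]
```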
-
- When are \(p(\mathbf{y} \vert \mathbf{x})\) and \(p(\mathbf{x})\) tied?
If \(\mathbf{y}\) is closely associated with one of the causal factors of \(\mathbf{x}\), then \(p(\mathbf{x})\) and \(p(\mathbf{y} \vert \mathbf{x})\) will be strongly tied.
- Thus, unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy.
Now, Consider the assumption that \(\mathbf{y}\) is one of the causal factors of \(\mathbf{x}\), and let \(\mathbf{h}\) represent all those factors:
- The “true” generative process can be conceived as structured according to this directed graphical model, with \(\mathbf{h}\) as the parent of \(\mathbf{x}\):
$$p(\mathbf{h}, \mathbf{x})=p(\mathbf{x} \vert \mathbf{h}) p(\mathbf{h})$$
- As a consequence, the data has marginal probability:
$$p(\boldsymbol{x})=\mathbb{E}_ {\mathbf{h}} p(\boldsymbol{x} \vert \boldsymbol{h})$$
From this straightforward observation, we conclude that:
The best possible model of \(\mathbf{x}\) (wrt. generalization) is the one that uncovers the above “true” structure, with \(\boldsymbol{h}\) as a latent variable that explains the observed variations in \(\boldsymbol{x}\).
I.E. The “ideal” representation learning discussed above should thus recover these latent factors.
If \(\mathbf{y}\) is one of these (or closely related to one of them), then it will be very easy to learn to predict \(\mathbf{y}\) from such a representation.
- We also see that the conditional distribution of \(\mathbf{y}\) given \(\mathbf{x}\) is tied by Bayes’ rule to the components in the above equation:
$$p(\mathbf{y} \vert \mathbf{x})=\frac{p(\mathbf{x} \vert \mathbf{y}) p(\mathbf{y})}{p(\mathbf{x})}$$
Thus the marginal \(p(\mathbf{x})\) is intimately tied to the conditional \(p(\mathbf{y} \vert \mathbf{x})\), and knowledge of the structure of the former should be helpful to learn the latter.
Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance.
Justifying the setting where Semi-Supervised Learning Works:
- Semi-Supervised Learning6 Works when: \(p(\mathbf{y} \vert \mathbf{x})\) and \(p(\mathbf{x})\) are tied together.
- \(p(\mathbf{y} \vert \mathbf{x})\) and \(p(\mathbf{x})\) are Tied when: \(\mathbf{y}\) is closely associated with one of the causal factors of \(\mathbf{x}\), or it is a causal factor itself.
- Let \(\mathbf{h}\) represent all the causal factors of \(\mathbf{x}\), and let \(\mathbf{y} \in \mathbf{h}\) (be a causal factor of \(\mathbf{x}\)), then:
The “true” generative process can be conceived as structured according to this directed graphical model, with \(\mathbf{h}\) as the parent of \(\mathbf{x}\):
$$p(\mathbf{h}, \mathbf{x})=p(\mathbf{x} \vert \mathbf{h}) p(\mathbf{h})$$
- Thus, the Marginal Probability of the Data \(p(\mathbf{x})\) is:
- Tied to the conditional \(p(\mathbf{x} \vert \mathbf{h})\) as:
$$p(\boldsymbol{x})=\mathbb{E}_ {\mathbf{h}} p(\boldsymbol{x} \vert \boldsymbol{h})$$
\(\implies\)
- The best possible model of \(\mathbf{x}\) (wrt. generalization) is the one that uncovers the above “true” structure, with \(\boldsymbol{h}\) as a latent variable that explains the observed variations in \(\boldsymbol{x}\).
I.E. The “ideal” representation learning discussed above should thus recover these latent factors.
- (intimately) Tied to the conditional \(p(\mathbf{y} \vert \mathbf{x})\) (by Bayes’ rule) as:
$$p(\mathbf{y} \vert \mathbf{x})=\frac{p(\mathbf{x} \vert \mathbf{y}) p(\mathbf{y})}{p(\mathbf{x})}$$
Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance.
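As a concrete (hypothetical) instance of these ties, suppose the only causal factor is the label itself, \(\mathbf{h}=\mathrm{y} \in\{0,1\}\), with equal prior and unit-variance Gaussian class-conditionals centered at \(-5\) and \(+5\). Then the marginal and the conditional are built from the same pieces:

$$p(x)=\tfrac{1}{2}\, \mathcal{N}(x ;-5,1)+\tfrac{1}{2}\, \mathcal{N}(x ;+5,1), \qquad p(y=1 \vert x)=\frac{\tfrac{1}{2}\, \mathcal{N}(x ;+5,1)}{p(x)}$$

An unsupervised learner that recovers the two components has therefore already done most of the work needed to compute \(p(y \vert x)\); only the component-to-label assignment remains, which a single labeled example per class provides.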
Encoding/Learning Causal Factors:
- Problem - Number of Causal Factors:
An important research problem regards the fact that most observations are formed by an extremely large number of underlying causes.
- Suppose \(\mathbf{y}=\mathrm{h}_ {i}\), but the unsupervised learner does not know which \(\mathrm{h}_ {i}\):
- The brute force solution is for an unsupervised learner to learn a representation that captures all the reasonably salient generative factors \(\mathrm{h}_ {j}\) and disentangles them from each other, thus making it easy to predict \(\mathbf{y}\) from \(\mathbf{h}\), regardless of which \(\mathrm{h}_ {i}\) is associated with \(\mathbf{y}\).
- In practice, the brute force solution is not feasible because it is not possible to capture all or most of the factors of variation that influence an observation.
For example, in a visual scene, should the representation always encode all of the smallest objects in the background?
It is a well-documented psychological phenomenon that human beings fail to perceive changes in their environment that are not immediately relevant to the task they are performing (Simons and Levin, 1998).
- Solution - Determining which causal factor to encode/learn:
An important research frontier in semi-supervised learning is determining “what to encode in each situation”.
- Currently, there are two main strategies for dealing with a large number of underlying causes:
- Use a supervised learning signal at the same time as the (“plus”) unsupervised learning signal,
so that the model will choose to capture the most relevant factors of variation (a minimal sketch of this joint objective follows this list).
- Use much larger representations if using purely unsupervised learning.
- New (Emerging) Strategy for unsupervised learning:
Redefining the definition of “salient” factors.
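A minimal sketch of the first strategy, assuming PyTorch; the architecture, dimensions, and loss weighting are illustrative assumptions, not a prescribed recipe. A shared encoder feeds both an unsupervised reconstruction head and a supervised classification head, so the representation is pushed toward the factors relevant to \(\mathbf{y}\) while still modeling \(\mathbf{x}\).

```python
# Illustrative joint supervised + unsupervised objective (all sizes/weights are arbitrary).
import torch
import torch.nn as nn

class SemiSupervisedAE(nn.Module):
    def __init__(self, x_dim=32, h_dim=8, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.decoder = nn.Linear(h_dim, x_dim)         # unsupervised path: reconstruct x
        self.classifier = nn.Linear(h_dim, n_classes)  # supervised path: predict y

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), self.classifier(h)

model = SemiSupervisedAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse, xent = nn.MSELoss(), nn.CrossEntropyLoss()

# One training step on a mixed batch (random stand-in data).
x_unlab = torch.randn(64, 32)                                   # unlabeled inputs
x_lab, y_lab = torch.randn(16, 32), torch.randint(0, 2, (16,))  # small labeled subset

x_unlab_rec, _ = model(x_unlab)
x_lab_rec, y_logits = model(x_lab)

recon_loss = mse(x_unlab_rec, x_unlab) + mse(x_lab_rec, x_lab)  # unsupervised signal
sup_loss = xent(y_logits, y_lab)                                # supervised signal
loss = recon_loss + 1.0 * sup_loss                              # trade-off weight is a free choice

opt.zero_grad(); loss.backward(); opt.step()
```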
The definition of “Salient”:
- The current definition of “salient” factors:
In practice, we encode the definition of “salient” through the training criterion (e.g. MSE).
Historically, autoencoders and generative models have been trained to optimize a fixed criterion, often similar to MSE.
- Problem with current definition:
Since these fixed criteria determine which causes are considered salient, they emphasize different factors depending on, e.g., their effect on the error:
- E.g. MSE applied to the pixels of an image implicitly specifies that an underlying cause is only salient if it significantly changes the brightness of a large number of pixels.
This can be problematic if the task we wish to solve involves interacting with small objects.
Learned (pattern-based) “Saliency”:
Certain factors could be considered “salient” if they follow a highly recognizable pattern.
E.g. if a group of pixels follows a highly recognizable pattern, even if that pattern does not involve extreme brightness or darkness, then that pattern could be considered extremely salient.
- This definition is implemented by Generative Adversarial Networks (GANs) (a minimal loss sketch follows below).
In this approach, a generative model is trained to fool a feedforward classifier. The feedforward classifier attempts to recognize all samples from the generative model as being fake, and all samples from the training set as being real.
In this framework, any structured pattern that the feedforward network can recognize is highly salient.
In this way, GANs learn how to determine what is salient.
Lotter et al. (2015) showed that models trained to generate images of human heads will often neglect to generate the ears when trained with mean squared error, but will successfully generate the ears when trained with the adversarial framework.
Because the ears are not extremely bright or dark compared to the surrounding skin, they are not especially salient according to mean squared error loss, but their highly recognizable shape and consistent position means that a feedforward network can easily learn to detect them, making them highly salient under the generative adversarial framework.
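A minimal sketch of the adversarial notion of saliency, assuming PyTorch; the MLP shapes and hyperparameters are illustrative assumptions. The discriminator tries to separate real samples from generated ones, so any structured pattern it can recognize (e.g. ears in a consistent position) becomes salient to the generator, regardless of how little it changes pixel-wise brightness.

```python
# Illustrative GAN losses (vector data as a stand-in for images; all sizes are arbitrary).
import torch
import torch.nn as nn

z_dim, x_dim = 16, 64
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))  # generator
D = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))      # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, x_dim)      # stand-in batch of real data
z = torch.randn(32, z_dim)

# Discriminator step: push real -> 1, fake -> 0.
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool D into labeling generated samples as real.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```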
Generative adversarial networks are only one step toward determining which factors should be represented.
We expect that future research will discover better ways of determining which factors to represent, and develop mechanisms for representing different factors depending on the task.
Robustness to Change - Causal Invariance:
A benefit of learning the underlying causal factors (Schölkopf et al., 2012) is that:
if the true generative process has \(\mathbf{x}\) as an effect and \(\mathbf{y}\) as a cause, then modeling \(p(\mathbf{x} \vert \mathbf{y})\) is robust to changes in \(p(\mathbf{y})\).
If the cause-effect relationship were reversed, this would not be true, since by Bayes’ rule, \(p(\mathbf{x} \vert \mathbf{y})\) would be sensitive to changes in \(p(\mathbf{y})\).
Very often, when we consider changes in distribution due to different domains, temporal non-stationarity, or changes in the nature of the task, the causal mechanisms remain invariant (the laws of the universe are constant) while the marginal distribution over the underlying causes can change.
Hence, better generalization and robustness to all kinds of changes can be expected via learning a generative model that attempts to recover the causal factors \(\mathbf{h}\) and \(p(\mathbf{x} \vert \mathbf{h})\).
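A small numeric sketch of this invariance, assuming SciPy; the Gaussians, the test point, and the shifted prior are made-up illustrative values. Because \(\mathbf{y}\) is the cause, the mechanism \(p(\mathbf{x} \vert \mathbf{y})\) is estimated once and reused unchanged; only the marginal \(p(\mathbf{y})\) is swapped inside Bayes’ rule when the domain changes.

```python
# Illustrative robustness of the causal direction: p(x|y) stays fixed while p(y) shifts.
from scipy.stats import norm

# "Causal mechanism" p(x|y): class-conditional Gaussians, estimated once and reused.
mechanism = {0: norm(loc=-2.0, scale=1.0), 1: norm(loc=+2.0, scale=1.0)}

def posterior_y1(x, prior):
    """p(y=1 | x) via Bayes' rule, for a given label marginal p(y)."""
    lik0, lik1 = mechanism[0].pdf(x), mechanism[1].pdf(x)
    return prior[1] * lik1 / (prior[0] * lik0 + prior[1] * lik1)

x = 0.5
print(posterior_y1(x, prior={0: 0.5, 1: 0.5}))  # source domain: balanced labels
print(posterior_y1(x, prior={0: 0.9, 1: 0.1}))  # target domain: p(y) changed, p(x|y) reused
```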
Providing Clues to Discover Underlying Causes:
Quality of Representations:
The answer to the following question:
“what makes one representation better than another?”
was the Causal Factors Hypothesis:
An ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another, especially those factors that are relevant to our applications.
Clues for Finding the Causal Factors of Variation:
Most strategies for representation learning are based on:
Introducing clues that help the learner find these underlying factors of variation.
The clues can help the learner separate these observed factors from the others.
Supervised learning provides a very strong clue: a label \(\boldsymbol{y},\) presented with each \(\boldsymbol{x},\) that usually specifies the value of at least one of the factors of variation directly.
More generally, to make use of abundant unlabeled data, representation learning makes use of other, less direct, hints about the underlying factors.
These hints take the form of implicit prior beliefs that we, the designers of the learning algorithm, impose in order to guide the learner.
Clues in the form of Regularization:
Results such as the no free lunch theorem show that regularization strategies are necessary to obtain good generalization.
While it is impossible to find a universally superior regularization strategy, one goal of deep learning is to find a set of fairly generic regularization strategies that are applicable to a wide variety of AI tasks, similar to the tasks that people and animals are able to solve.
We can use generic regularization strategies to encourage learning algorithms to discover features that correspond to underlying factors, e.g. (Bengio et al., 2013d):
- Smoothness: This is the assumption that \(f(\boldsymbol{x}+\epsilon \boldsymbol{d}) \approx f(\boldsymbol{x})\) for unit \(\boldsymbol{d}\) and small \(\epsilon\). This assumption allows the learner to generalize from training examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it is insufficient to overcome the curse of dimensionality.
- Linearity: Many learning algorithms assume that relationships between some variables are linear. This allows the algorithm to make predictions even very far from the observed data, but can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that do not make the smoothness assumption instead make the linearity assumption. These are in fact different assumptions—linear functions with large weights applied to high-dimensional spaces may not be very smooth7.
- Multiple explanatory factors: Many representation learning algorithms are motivated by the assumption that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors. Section 15.3 describes how this view motivates semisupervised learning via representation learning. Learning the structure of \(p(\boldsymbol{x})\) requires learning some of the same features that are useful for modeling \(p(\boldsymbol{y} \vert \boldsymbol{x})\) because both refer to the same underlying explanatory factors. Section 15.4 describes how this view motivates the use of distributed representations, with separate directions in representation space corresponding to separate factors of variation.
- Causal factors: the model is constructed in such a way that it treats the factors of variation described by the learned representation \(\boldsymbol{h}\) as the causes of the observed data \(\boldsymbol{x}\), and not vice-versa. As discussed in section 15.3, this is advantageous for semi-supervised learning and makes the learned model more robust when the distribution over the underlying causes changes or when we use the model for a new task.
- Depth, or a hierarchical organization of explanatory factors: High-level, abstract concepts can be defined in terms of simple concepts, forming a hierarchy. From another point of view, the use of a deep architecture expresses our belief that the task should be accomplished via a multi-step program, with each step referring back to the output of the processing accomplished via previous steps.
- Shared factors across tasks: In the context where we have many tasks, corresponding to different \(y_{i}\) variables sharing the same input \(\mathbf{x}\) or where each task is associated with a subset or a function \(f^{(i)}(\mathbf{x})\) of a global input \(\mathbf{x},\) the assumption is that each \(\mathbf{y}_ {i}\) is associated with a different subset from a common pool of relevant factors \(\mathbf{h}\). Because these subsets overlap, learning all the \(P\left(y_{i} \vert \mathbf{x}\right)\) via a shared intermediate representation \(P(\mathbf{h} \vert \mathbf{x})\) allows sharing of statistical strength between the tasks.
- Manifolds: Probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous case, these regions can be approximated by low-dimensional manifolds with a much smaller dimensionality than the original space where the data lives. Many machine learning algorithms behave sensibly only on this manifold (Goodfellow et al., 2014b). Some machine learning algorithms, especially autoencoders, attempt to explicitly learn the structure of the manifold.
- Natural clustering: Many machine learning algorithms assume that each connected manifold in the input space may be assigned to a single class. The data may lie on many disconnected manifolds, but the class remains constant within each one of these. This assumption motivates a variety of learning algorithms, including tangent propagation, double backprop, the manifold tangent classifier and adversarial training.
- Temporal and spatial coherence: Slow feature analysis and related algorithms make the assumption that the most important explanatory factors change slowly over time, or at least that it is easier to predict the true underlying explanatory factors than to predict raw observations such as pixel values. See section 13.3 for further description of this approach.
- Sparsity: Most features should presumably not be relevant to describing most inputs—there is no need to use a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as “present” or “absent” should be absent most of the time (a minimal sketch of this prior follows the list).
- Simplicity of Factor Dependencies: In good high-level representations, the factors are related to each other through simple dependencies. The simplest possible is marginal independence, \(P(\mathbf{h})=\prod_{i} P\left(\mathbf{h}_ {i}\right)\), but linear dependencies or those captured by a shallow autoencoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.
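As a concrete illustration of one prior from this list, the sparsity assumption can be imposed as an L1 penalty on the learned representation. This is a minimal sketch assuming PyTorch; the dimensions and the penalty weight are arbitrary illustrative choices.

```python
# Sparse-autoencoder-style objective: reconstruction loss + L1 penalty on h.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
decoder = nn.Linear(64, 32)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(128, 32)   # stand-in batch of inputs
h = encoder(x)             # representation; the prior says most units should be near zero
recon = decoder(h)

loss = F.mse_loss(recon, x) + 1e-3 * h.abs().mean()  # reconstruction + sparsity penalty
opt.zero_grad(); loss.backward(); opt.step()
```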
- Consciousness Prior:
- Key Ideas:
(1) Seek Objective Functions defined purely in abstract space (no decoders)
(2) “Conscious” thoughts are low-dimensional.
- Conscious thoughts are very low-dimensional objects compared to the full state of the (unconscious) brain
- Yet they have unexpected predictive value or usefulness
\(\rightarrow\) strong constraint or prior on the underlying representation.
E.g. we can plan our lives by thinking of only simple/short sentences at a time, that can be expressed with few variables (words); short-term memory holds only about 7 items (underutilization? no, rather, a prior).
- Thought: composition of few selected factors / concepts (key/value) at the highest level of abstraction of our brain.
- Richer than, but closely associated with, a short verbal expression such as a sentence or phrase, a rule or fact (link to classical symbolic AI & knowledge representation)
- Thus, true statements about the very complex world, could be conveyed with very low-dimensional representations.
- How to select a few relevant abstract concepts making a thought:
Content-based Attention.
- Thus, Abstraction is related to Attention.
- Two Levels of Representations:
- High-dimensional abstract representation space (all known concepts and factors) \(h\)
- Low-dimensional conscious thought \(c,\) extracted from \(h\)
- \(c\) includes names (keys) and values of factors
- The Goal of using attention on the unconscious states:
is to put pressure (a constraint) on the encoder that maps the input to the unconscious representation \(h\): the encoder is encouraged to learn representations with the property that, by picking just a few elements of \(h\), one can make a true (or highly probable) statement about the world, e.g. a highly probable prediction (a minimal content-based attention sketch follows this list).
- Causal/Mechanism Independence:
- Controllable Factors.
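A minimal sketch of the content-based attention step referenced above, assuming PyTorch; the number of factors, the dimensions, and the hard top-\(k\) selection are illustrative assumptions rather than the exact mechanism described in the talk. Every factor in the high-dimensional state \(h\) is scored against a query, and only the few highest-scoring (key, value) pairs form the low-dimensional conscious state \(c\).

```python
# Illustrative top-k content-based attention: from a large h to a tiny c.
import torch
import torch.nn.functional as F

n_factors, d, k = 64, 16, 4            # many abstract factors; keep only k of them
keys = torch.randn(n_factors, d)       # "names" of the factors
values = torch.randn(n_factors, d)     # current values of the factors (together: h)
query = torch.randn(d)                 # what the current thought is "about"

scores = keys @ query                        # content-based relevance of each factor
top = torch.topk(scores, k).indices          # select the few most relevant factors
attn = F.softmax(scores[top], dim=0)         # normalize over the selected factors only
c = torch.cat([keys[top].flatten(), (attn[:, None] * values[top]).flatten()])
# c holds the selected names (keys) and their attended values: 2*k*d numbers,
# far fewer than the n_factors*d numbers in the full state h.
print(c.shape)  # torch.Size([128])
```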
The concept of representation learning ties together all of the many forms of deep learning.
Feedforward and recurrent networks, autoencoders and deep probabilistic models all learn and exploit representations. Learning the best possible representation remains an exciting avenue of research.
Distribution Shift:
- Data Drift - Types, causes and measures. (Blog)
- Data Distribution Shifts and Monitoring (Blog!! - Mecca)
- https://d2l.ai/chapter_linear-classification/environment-and-distribution-shift.html (Book!! - Mecca)
- It is also called a one-hot representation, since it can be captured by a binary vector with \(n\) bits that are mutually exclusive (only one of them can be active). ↩
- It proceeds one layer at a time, training the \(k\)-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced. ↩
- Commonly, “pretraining” refers not only to the pretraining stage itself but to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase. ↩
- This idea has guided a large amount of deep learning research since at least the 1990s (Becker and Hinton, 1992; Hinton and Sejnowski, 1999). ↩
- For other arguments about when semi-supervised learning can outperform pure supervised learning, we refer the reader to section 1.2 of Chapelle et al. (2006). ↩
- Using unsupervised representation learning that tries to disentangle the underlying factors of variation. ↩
- See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption. ↩