
Latent Variable Models

  1. Latent Variable Models:
    Latent Variable Models are statistical models that relate a set of observable variables (so-called manifest variables) to a set of latent variables.

    Core Assumption - Local Independence:
    The observed items are conditionally independent of each other, given an individual's score on the latent variable(s). This means that the latent variable explains why the observed items are related to one another.

    In other words, an individual's values on the observed variables result from their position on the latent variable(s), and the observations have nothing in common after controlling for the latent variable(s).

    $$p(A,B\vert z) = p(A\vert z) \times p(B\vert z)$$
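
    As a toy numerical illustration (my own example, not from the notes): two observed variables driven by a shared latent variable are correlated marginally but become (nearly) uncorrelated once the latent variable is controlled for.

    ```python
    # Toy check of local independence: A and B share a latent z plus independent noise.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.normal(size=n)                  # latent variable
    A = z + rng.normal(size=n)              # observed item A
    B = z + rng.normal(size=n)              # observed item B

    print(np.corrcoef(A, B)[0, 1])          # marginal correlation: about 0.5

    # "Controlling for z": remove z from each item, then correlate the residuals.
    print(np.corrcoef(A - z, B - z)[0, 1])  # about 0, as local independence predicts
    ```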

    Methods for inferring Latent Variables:

    Notes:


Linear Factor Models

  1. Linear Factor Models:
    Linear Factor Models are generative models that form the simplest class of latent variable models [1].
    A linear factor model is defined by the use of a stochastic, linear decoder function that generates \(\boldsymbol{x}\) by adding noise to a linear transformation of \(\boldsymbol{h}\).

    Applications/Motivation:

    • Building blocks of mixture models (Hinton et al., 1995a; Ghahramani and Hinton, 1996; Roweis et al., 2002)
    • Building blocks of larger, deep probabilistic models (Tang et al., 2012)
    • They also show many of the basic approaches necessary to build generative models that the more advanced deep models will extend further.
    • These models are interesting because they allow us to discover explanatory factors that have a simple joint distribution.
    • The simplicity of using a linear decoder made these models some of the first latent variable models to be extensively studied.

    Linear Factor Models as Generative Models:
    Linear factor models are some of the simplest generative models and some of the simplest models that learn a representation of data.

    Data Generation Process:
    A linear factor model describes the data generation process as follows:

    1. Sample the explanatory factors \(\boldsymbol{h}\) from a distribution:

      $$\mathbf{h} \sim p(\boldsymbol{h}) \tag{1}$$

      where \(p(\boldsymbol{h})\) is a factorial distribution, with \(p(\boldsymbol{h})=\prod_{i} p\left(h_{i}\right),\) so that it is easy to sample from.

    2. Sample the real-valued observable variables given the factors:

      $$\boldsymbol{x}=\boldsymbol{W} \boldsymbol{h}+\boldsymbol{b}+ \text{ noise} \tag{2}$$

      where the noise is typically Gaussian and diagonal (independent across dimensions).
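
    As a concrete illustration of this two-step process (a minimal sketch; the dimensions and parameter values below are arbitrary choices of mine):

    ```python
    # Sample from a linear factor model: h ~ p(h) (factorial), x = W h + b + noise.
    import numpy as np

    rng = np.random.default_rng(0)
    d_obs, d_lat = 5, 2
    W = rng.normal(size=(d_obs, d_lat))        # linear decoder weights
    b = rng.normal(size=d_obs)                 # offset
    sigma2 = np.full(d_obs, 0.1)               # per-dimension (diagonal) noise variances

    def sample_x(n):
        h = rng.normal(size=(n, d_lat))                        # step 1: h_i ~ N(0, 1), independently
        noise = rng.normal(size=(n, d_obs)) * np.sqrt(sigma2)  # diagonal Gaussian noise
        return h @ W.T + b + noise                             # step 2: x = W h + b + noise

    X = sample_x(10_000)
    ```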


  2. Factor Analysis:
    Probabilistic PCA (principal components analysis), Factor Analysis and other linear factor models are special cases of the above equations (1 and 2) and only differ in the choices made for the noise distribution and the model’s prior over latent variables \(\boldsymbol{h}\) before observing \(\boldsymbol{x}\).

    Factor Analysis:
    In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian:

    $$\mathbf{h} \sim \mathcal{N}(\boldsymbol{h} ; \mathbf{0}, \boldsymbol{I})$$

    while the observed variables \(x_i\) are assumed to be conditionally independent, given \(\boldsymbol{h}\).
    Specifically, the noise is assumed to be drawn from a diagonal covariance Gaussian distribution, with covariance matrix \(\boldsymbol{\psi}=\operatorname{diag}\left(\boldsymbol{\sigma}^{2}\right),\) with \(\boldsymbol{\sigma}^{2}=\left[\sigma_{1}^{2}, \sigma_{2}^{2}, \ldots, \sigma_{n}^{2}\right]^{\top}\) a vector of per-variable variances.

    The role of the latent variables is thus to capture the dependencies between the different observed variables \(x_i\).
    Indeed, it can easily be shown that \(\boldsymbol{x}\) is just a multivariate normal random variable, with:

    $$\mathbf{x} \sim \mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{b}, \boldsymbol{W} \boldsymbol{W}^{\top}+\boldsymbol{\psi}\right)$$
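
    A quick numerical check of this marginal distribution (a minimal sketch assuming NumPy and scikit-learn; the synthetic data and dimensions are my own choices): fit factor analysis to data generated from the model and compare the model-implied covariance \(\boldsymbol{W} \boldsymbol{W}^{\top}+\boldsymbol{\psi}\) with the sample covariance of \(\boldsymbol{x}\).

    ```python
    # Fit factor analysis and verify that W W^T + psi reproduces the covariance of x.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n, d_obs, d_lat = 20_000, 6, 2
    W_true = rng.normal(size=(d_obs, d_lat))
    b_true = rng.normal(size=d_obs)
    psi_true = rng.uniform(0.1, 0.5, size=d_obs)          # per-variable noise variances

    h = rng.normal(size=(n, d_lat))                       # h ~ N(0, I)
    x = h @ W_true.T + b_true + rng.normal(size=(n, d_obs)) * np.sqrt(psi_true)

    fa = FactorAnalysis(n_components=d_lat).fit(x)
    implied_cov = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
    print(np.abs(implied_cov - np.cov(x, rowvar=False)).max())   # small residual
    ```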

  3. Probabilistic PCA:
    Like factor analysis, probabilistic PCA (principal components analysis) is a special case of equations (1) and (2) above; the two differ only in the choices made for the noise distribution and the prior over the latent variables \(\boldsymbol{h}\).

    Probabilistic PCA:
    In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances \(\sigma_i^2\) equal to each other.
    In that case the covariance of \(\boldsymbol{x}\) is just \(\boldsymbol{W} \boldsymbol{W}^{\top}+\sigma^{2} \boldsymbol{I}\), where \(\sigma^2\) is now a scalar.
    This yields the conditional distribution:

    $$\mathbf{x} \sim \mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{b}, \boldsymbol{W} \boldsymbol{W}^{\top}+\sigma^{2} \boldsymbol{I}\right)$$

    or, equivalently,

    $$\mathbf{x}=\boldsymbol{W} \mathbf{h}+\boldsymbol{b}+\sigma \mathbf{z}$$

    where \(\mathbf{z} \sim \mathcal{N}(\boldsymbol{z} ; \mathbf{0}, \boldsymbol{I})\) is Gaussian noise.

    Notice that \(\boldsymbol{b}\) is the mean of the data (over all observations), including along the directions that are not captured/represented by \(\boldsymbol{h}\).

    This probabilistic PCA model takes advantage of the observation that most variations in the data can be captured by the latent variables \(\boldsymbol{h},\) up to some small residual reconstruction error \(\sigma^2\).

    Learning (parameter estimation):
    Tipping and Bishop (1999) provide an iterative EM algorithm for estimating the parameters \(\boldsymbol{W}\) and \(\sigma^{2}\).
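
    A minimal sketch of these EM updates (the update equations follow Tipping and Bishop, 1999; the code and variable names are my own):

    ```python
    # EM for probabilistic PCA: alternate posterior moments of h (E-step)
    # with closed-form updates of W and sigma^2 (M-step).
    import numpy as np

    def ppca_em(X, d, n_iter=200, seed=0):
        """Estimate W (D x d) and sigma2 from data X (N x D)."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        Xc = X - X.mean(axis=0)                   # b is the data mean
        W = rng.normal(size=(D, d))
        sigma2 = 1.0
        for _ in range(n_iter):
            # E-step: posterior moments of h given each x
            M = W.T @ W + sigma2 * np.eye(d)
            Minv = np.linalg.inv(M)
            Eh = Xc @ W @ Minv                    # rows are E[h_n | x_n]
            S_hh = N * sigma2 * Minv + Eh.T @ Eh  # sum_n E[h_n h_n^T]
            # M-step: maximize the expected complete-data log-likelihood
            W = (Xc.T @ Eh) @ np.linalg.inv(S_hh)
            sigma2 = (np.sum(Xc ** 2)
                      - 2.0 * np.sum(Eh * (Xc @ W))
                      + np.trace(S_hh @ W.T @ W)) / (N * D)
        return W, sigma2
    ```

    Note that EM identifies \(\boldsymbol{W}\) only up to a rotation within the principal subspace: its columns span the same subspace as the leading principal directions but need not coincide with them individually.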

    Relation to PCA - Limit Analysis:
    Tipping and Bishop (1999) show that probabilistic PCA becomes \(\mathrm{PCA}\) as \(\sigma \rightarrow 0\).
    In that case, the conditional expected value of \(\boldsymbol{h}\) given \(\boldsymbol{x}\) becomes an orthogonal projection of \(\boldsymbol{x} - \boldsymbol{b}\) onto the space spanned by the \(d\) columns of \(\boldsymbol{W}\), like in PCA.

    As \(\sigma \rightarrow 0,\) the density model defined by probabilistic PCA becomes very sharp around these \(d\) dimensions spanned by the columns of \(\boldsymbol{W}\).
    This can make the model assign very low likelihood to the data if the data does not actually cluster near a hyperplane.
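
    Concretely (this formula is not stated above; it follows from standard Gaussian conditioning in the probabilistic PCA model), the posterior mean of the latent variables is

    $$\mathbb{E}[\boldsymbol{h} \vert \boldsymbol{x}]=\left(\boldsymbol{W}^{\top} \boldsymbol{W}+\sigma^{2} \boldsymbol{I}\right)^{-1} \boldsymbol{W}^{\top}(\boldsymbol{x}-\boldsymbol{b})$$

    and as \(\sigma \rightarrow 0\) this tends to \(\left(\boldsymbol{W}^{\top} \boldsymbol{W}\right)^{-1} \boldsymbol{W}^{\top}(\boldsymbol{x}-\boldsymbol{b})\), the coordinates of the orthogonal projection of \(\boldsymbol{x}-\boldsymbol{b}\) onto the space spanned by the columns of \(\boldsymbol{W}\).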

    PPCA vs Factor Analysis:

    • Covariance properties
      • PPCA (& PCA) is covariant under rotation of the original data axes
      • Factor analysis is covariant under component-wise rescaling
    • Principal components (or factors)
      • PPCA: different principal components (axes) can be found incrementally
      • Factor analysis: factors from a two-factor model may not correspond to those from a one-factor model

    Manifold Interpretation of PCA:
    Linear factor models including PCA and factor analysis can be interpreted as learning a manifold (Hinton et al., 1997).
    We can view PPCA as defining a thin pancake-shaped region of high probability—a Gaussian distribution that is very narrow along some axes, just as a pancake is very flat along its vertical axis, but is elongated along other axes, just as a pancake is wide along its horizontal axes.

    PCA can be interpreted as aligning this pancake with a linear manifold in a higher-dimensional space.
    This interpretation applies not just to traditional PCA but also to any linear autoencoder that learns matrices \(\boldsymbol{W}\) and \(\boldsymbol{V}\) with the goal of making the reconstruction of \(x\) lie as close to \(x\) as possible:

    • Let the Encoder be:

      $$\boldsymbol{h}=f(\boldsymbol{x})=\boldsymbol{W}^{\top}(\boldsymbol{x}-\boldsymbol{\mu})$$

      The encoder computes a low-dimensional representation \(\boldsymbol{h}\) of \(\boldsymbol{x}\).

    • With the autoencoder view, we have a decoder computing the reconstruction:

      $$\hat{\boldsymbol{x}}=g(\boldsymbol{h})=\boldsymbol{b}+\boldsymbol{V} \boldsymbol{h}$$

    • The choices of linear encoder and decoder that minimize reconstruction error:

      $$\mathbb{E}\left[\|\boldsymbol{x}-\hat{\boldsymbol{x}}\|^{2}\right]$$

      correspond to \(\boldsymbol{V}=\boldsymbol{W}, \boldsymbol{\mu}=\boldsymbol{b}=\mathbb{E}[\boldsymbol{x}]\) and the columns of \(\boldsymbol{W}\) form an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix:

      $$\boldsymbol{C}=\mathbb{E}\left[(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\right]$$

    • In the case of PCA, the columns of \(\boldsymbol{W}\) are these eigenvectors, ordered by the magnitude of the corresponding eigenvalues (which are all real and non-negative).
    • Variances:
      One can also show that eigenvalue \(\lambda_{i}\) of \(\boldsymbol{C}\) corresponds to the variance of \(x\) in the direction of eigenvector \(\boldsymbol{v}^{(i)}\).
    • Optimal Reconstruction:
      • If \(\boldsymbol{x} \in \mathbb{R}^{D}\) and \(\boldsymbol{h} \in \mathbb{R}^{d}\) with \(d<D\), then the optimal reconstruction error (choosing \(\boldsymbol{\mu}, \boldsymbol{b}, \boldsymbol{V}\) and \(\boldsymbol{W}\) as above) is:

        $$\min \mathbb{E}\left[\|\boldsymbol{x}-\hat{\boldsymbol{x}}\|^{2}\right]=\sum_{i=d+1}^{D} \lambda_{i}$$

      • Hence, if the covariance has rank \(d,\) the eigenvalues \(\lambda_{d+1}\) to \(\lambda_{D}\) are \(0\) and the reconstruction error is \(0\) (a numerical check follows this list).
      • Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of \(\boldsymbol{h},\) under orthogonal \(\boldsymbol{W}\), instead of minimizing reconstruction error.
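
    A small numerical check of the optimal-reconstruction result above (a minimal sketch; the synthetic data are my own choice):

    ```python
    # PCA via eigendecomposition of the covariance matrix C; the mean squared
    # reconstruction error equals the sum of the discarded eigenvalues.
    import numpy as np

    rng = np.random.default_rng(0)
    N, D, d = 2000, 5, 2
    X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, D)) + 0.1 * rng.normal(size=(N, D))

    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False, bias=True)         # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)                # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    W = eigvecs[:, :d]                                  # top-d principal eigenvectors
    H = (X - mu) @ W                                    # encoder: h = W^T (x - mu)
    X_hat = mu + H @ W.T                                # decoder: x_hat = b + V h, with V = W, b = mu

    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))     # E[||x - x_hat||^2]
    print(err, eigvals[d:].sum())                       # the two numbers agree
    ```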

    Notes:

    • PPCA - Probabilistic PCA Slides / PPCA Better Slides
    • Probabilistic PCA (Original Paper!)
    • The EM algorithm for PCA can be advantageous compared to the closed-form maximum-likelihood (eigendecomposition) solution, e.g., in high dimensions or when data are missing.
    • Mixtures of probabilistic PCAs: can be defined and are a combination of local probabilistic PCA models.
    • PCA can be generalized to nonlinear autoencoders.
    • ICA can be generalized to a nonlinear generative model, in which we use a nonlinear function \(f\) to generate the observed data.

  4. Independent Component Analysis (ICA):

  5. Slow Feature Analysis:

  6. Sparse Coding:
    Sparse Coding (Olshausen and Field, 1996) is a linear factor model that has been heavily studied as an unsupervised feature learning and feature extraction mechanism.
    In Sparse Coding the noise distribution is Gaussian noise with isotropic precision \(\beta\):

    $$p(\boldsymbol{x} \vert \boldsymbol{h})=\mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{W} \boldsymbol{h}+\boldsymbol{b}, \frac{1}{\beta} \boldsymbol{I}\right)$$

    The latent variable prior \(p(\boldsymbol{h})\) is chosen to be one with sharp peaks near \(0\).
    Common choices include:

    • factorized Laplace:

      $$p\left(h_{i}\right)=\operatorname{Laplace}\left(h_{i} ; 0, \frac{2}{\lambda}\right)=\frac{\lambda}{4} e^{-\frac{1}{2} \lambda\left|h_{i}\right|}$$

    • factorized Student-t distributions:

      $$p\left(h_{i}\right) \propto \frac{1}{\left(1+\frac{h_{i}^{2}}{\nu}\right)^{\frac{\nu+1}{2}}}$$

    • Cauchy

    Learning/Training:

    • Training sparse coding with maximum likelihood is intractable.
    • Instead, the training alternates between encoding the data and training the decoder to better reconstruct the data given the encoding.
      This is a principled approximation to Maximum-Likelihood.
      • Minimization wrt. \(\boldsymbol{h}\)
      • Minimization wrt. \(\boldsymbol{W}\)
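
    Concretely (my notation, not from the text: examples stacked as columns of \(\boldsymbol{X}\), codes as columns of \(\boldsymbol{H}\)), the two alternating subproblems are

    $$\boldsymbol{H} \leftarrow \underset{\boldsymbol{H}}{\arg \min }\; \lambda\|\boldsymbol{H}\|_{1}+\beta\|\boldsymbol{X}-\boldsymbol{W} \boldsymbol{H}\|_{F}^{2}, \qquad \boldsymbol{W} \leftarrow \underset{\boldsymbol{W}}{\arg \min }\;\|\boldsymbol{X}-\boldsymbol{W} \boldsymbol{H}\|_{F}^{2}$$

    The first subproblem decouples across examples into \(L^{1}\)-regularized least-squares (lasso-type) problems; the second is ordinary least squares for the decoder weights.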

    Architecture:

    • Encoder:
      • Non-parametric.
      • It is an optimization algorithm that solves an optimization problem in which we seek the single most likely code value:

        $$\boldsymbol{h}^{* }=f(\boldsymbol{x})=\underset{\boldsymbol{h}}{\arg \max }\; p(\boldsymbol{h} \vert \boldsymbol{x})$$

        • Assuming a Laplace Prior on \(p(\boldsymbol{h})\):

          $$\boldsymbol{h}^{* }=\underset{\boldsymbol{h}}{\arg \min }\; \lambda\|\boldsymbol{h}\|_{1}+\beta\|\boldsymbol{x}-\boldsymbol{W h}\|_{2}^{2}$$

          where we have taken the log, dropped terms not depending on \(\boldsymbol{h}\), and divided by positive scaling factors to simplify the equation (a small iterative-shrinkage sketch of this minimization appears after this list).

        • Hyperparameters:
          Both \(\beta\) and \(\lambda\) are hyperparameters.
          However, \(\beta\) is usually set to \(1\) because its role is shared with \(\lambda\).
          It could also be treated as a parameter of the model and “learned” [2].
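
    A minimal ISTA-style sketch of this MAP inference step (iterative shrinkage-thresholding is one standard way to solve this \(L^{1}\)-regularized least-squares problem; the implementation and names below are my own, not from the notes):

    ```python
    # ISTA for h* = argmin_h  lam * ||h||_1 + beta * ||x - W h||_2^2.
    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def sparse_code(x, W, lam=0.1, beta=1.0, n_iter=500):
        """Compute a sparse code h for a single example x given dictionary W."""
        h = np.zeros(W.shape[1])
        step = 1.0 / (2.0 * beta * np.linalg.norm(W, 2) ** 2)   # 1 / Lipschitz constant
        for _ in range(n_iter):
            grad = -2.0 * beta * W.T @ (x - W @ h)              # gradient of the smooth term
            h = soft_threshold(h - step * grad, step * lam)     # shrinkage (proximal) step
        return h

    # Tiny usage example with a random dictionary; many entries of h come out exactly zero.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(20, 50))
    x = W[:, :3] @ np.array([1.0, -2.0, 0.5])                   # x built from 3 dictionary atoms
    h = sparse_code(x, W, lam=0.1)
    print(np.count_nonzero(h), h.shape[0])
    ```

    Because of the soft-thresholding step, many coordinates of \(\boldsymbol{h}^{* }\) are exactly zero, which is the sparsity discussed below.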

    Variations:
    Not all approaches to sparse coding explicitly build a \(p(\boldsymbol{h})\) and a \(p(\boldsymbol{x} \vert \boldsymbol{h})\).
    Often we are just interested in learning a dictionary of features with activation values that will often be zero when extracted using this inference procedure.

    Sparsity:

    • Due to the imposition of an \(L^{1}\) norm on \(\boldsymbol{h},\) this procedure will yield a sparse \(\boldsymbol{h}^{* }\).
    • If we sample \(\boldsymbol{h}\) from a Laplace prior, it is in fact a zero probability event for an element of \(\boldsymbol{h}\) to actually be zero.
      The generative model itself is not especially sparse, only the feature extractor is.
      • Goodfellow et al. (2013d) describe approximate inference in a different model family, the spike and slab sparse coding model, for which samples from the prior usually contain true zeros.

    Properties:

    • Advantages:
      • The sparse coding approach combined with the use of the non-parametric encoder can in principle minimize the combination of reconstruction error and log-prior better than any specific parametric encoder.
      • Another advantage is that there is no generalization error in the encoder, since the code is computed by an optimization procedure rather than a learned mapping:
        • A parametric encoder must learn how to map \(\boldsymbol{x}\) to \(\boldsymbol{h}\) in a way that generalizes. For unusual \(\boldsymbol{x}\) that do not resemble the training data, a learned, parametric encoder may fail to find an \(\boldsymbol{h}\) that results in accurate reconstruction or a sparse code.
        • For the vast majority of formulations of sparse coding models, where the inference problem is convex, the optimization procedure will always find the optimal code (unless degenerate cases such as replicated weight vectors occur).
        • Obviously, the sparsity and reconstruction costs can still rise on unfamiliar points, but this is due to generalization error in the decoder weights, rather than generalization error in the encoder.
        • Thus, the lack of generalization error in sparse coding’s optimization-based encoding process may result in better generalization when sparse coding is used as a feature extractor for a classifier than when a parametric function is used to predict the code.
    • Disadvantages:
      • The primary disadvantage of the non-parametric encoder is that it requires greater time to compute \(\boldsymbol{h}\) given \(\boldsymbol{x}\) because the non-parametric approach requires running an iterative algorithm.
        • The parametric autoencoder approach uses only a fixed number of layers, often only one.
      • It is not straightforward to back-propagate through the non-parametric encoder, which makes it difficult to pretrain a sparse coding model with an unsupervised criterion and then fine-tune it using a supervised criterion.
        • Modified versions of sparse coding that permit approximate derivatives do exist but are not widely used (Bagnell and Bradley, 2009).

    Generation (Sampling):

    • Sparse coding, like other linear factor models, often produces poor samples.
    • This happens even when the model is able to reconstruct the data well and provide useful features for a classifier.
      • The reason is that each individual feature may be learned well, but the factorial prior on the hidden code results in the model including random subsets of all of the features in each generated sample.
    • Motivating Deep Models:
      This motivates the development of deeper models that can impose a non-factorial distribution on the deepest code layer, as well as the development of more sophisticated shallow models.

    Notes:

  [1] Probabilistic models with latent variables.

  [2] Some terms depending on \(\beta\) were omitted from the equation above; they are needed to learn \(\beta\).