Table of Contents



A Thorough Introduction to Boltzmann Machines
Strategies for Confronting the Partition Function (Blog! + code)
Approximating the Softmax (Ruder)
Confronting the partition function (Slides)
Graphical Models, Exponential Families, and Variational Inference (M Jordan)

Introduction - The Partition Function

  1. The Partition Function:
    The Partition Function is the normalization constant of an unnormalized probability distribution \(\tilde{p}(\mathbf{x} ; \boldsymbol{\theta})\).

    Formally, it is the (possibly infinite) sum or integral of the unnormalized probability \(\tilde{p}(\boldsymbol{x} ; \boldsymbol{\theta})\) over all states/events \(\boldsymbol{x} \in X\):

    • Discrete Variables:

      $$Z(\boldsymbol{\theta}) = \sum_{\boldsymbol{x}} \tilde{p}(\boldsymbol{x} ; \boldsymbol{\theta})$$

    • Continuous Variables:

      $$Z(\boldsymbol{\theta}) = \int \tilde{p}(\boldsymbol{x} ; \boldsymbol{\theta}) \, d\boldsymbol{x}$$

    It is defined such that:

    $$\sum_\mathbf{x} p(\mathbf{x} ; \boldsymbol{\theta}) = \sum_\mathbf{x} \dfrac{\tilde{p}(\mathbf{x} ; \boldsymbol{\theta})}{Z(\boldsymbol{\theta})} = 1$$
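
    To make this concrete, here is a minimal sketch that computes \(Z\) by brute-force enumeration for a small, hypothetical fully visible Boltzmann machine (the weights `W` and biases `b` are made-up illustration values); this sum over all \(2^n\) states is exactly what becomes intractable as \(n\) grows:

    ```python
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical small fully visible Boltzmann machine over n binary units.
    n = 10
    W = rng.normal(0, 0.1, size=(n, n))
    W = (W + W.T) / 2            # symmetric weights
    np.fill_diagonal(W, 0)       # no self-connections
    b = rng.normal(0, 0.1, size=n)

    def p_tilde(x):
        """Unnormalized probability exp(-E(x)) with E(x) = -x'Wx/2 - b'x."""
        return np.exp(0.5 * x @ W @ x + b @ x)

    # Brute-force partition function: sum p_tilde over all 2**n states.
    # Feasible for n = 10; hopeless for, say, n = 100.
    Z = sum(p_tilde(np.array(s)) for s in itertools.product([0, 1], repeat=n))

    x = rng.integers(0, 2, size=n).astype(float)
    print("p(x) =", p_tilde(x) / Z)
    ```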


  2. Handling the Partition Function - Motivation:
    Many Undirected Probabilistic Graphical Models (PGMs) are defined by an unnormalized probability distribution \(\tilde{p}(\mathbf{x} ; \boldsymbol{\theta})\).
    To obtain a valid probability distribution, we need to normalize \(\tilde{p}\) by dividing by a partition function \(Z(\boldsymbol{\theta})\):

    $$p(\mathbf{x} ; \boldsymbol{\theta})=\dfrac{1}{Z(\boldsymbol{\theta})} \tilde{p}(\mathbf{x} ; \boldsymbol{\theta})$$

    Calculating the partition function can be intractable for many interesting models.

    The Partition Function in Deep Probabilistic Models:
    Deep Probabilistic Models are usually designed with the partition function in mind. There are a few approaches taken in their design:

    • Some models are designed to have a tractable normalizing constant.
    • Others are designed to be used in ways (training/inference) that avoid computing the normalized probability altogether.
    • Yet, other models directly confront the challenge of intractable partition functions.
      They use techniques, described below, for training and evaluating models with intractable \(Z\).

    Handling the Partition Function:
    There are a few approaches to handle the (intractable) partition function:

    1. Estimate the partition function as a learned parameter; Noise-Contrastive Estimation.
    2. Estimate the gradient of the partition function directly; Stochastic MLE, Contrastive-Divergence.
    3. Avoid computing quantities related to the partition function altogether; Score Matching, Pseudolikelihood.
    4. Estimate the partition function (itself) explicitly: Annealed IS, Bridge Sampling, Linked IS.

  3. The Log-Likelihood Gradient:
    Phase Decomposition of Learning:
    Learning using MLE requires computing the gradient of the log-likelihood, \(\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} ; \boldsymbol{\theta})\).
    What makes learning undirected models by maximum likelihood particularly difficult is that the partition function depends on the parameters; thus, the gradient of the log-likelihood wrt the parameters involves computing the gradient of \(\log Z(\boldsymbol{\theta})\).
    In undirected models, this gradient can be written as:

    $$\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} ; \boldsymbol{\theta})=\nabla_{\boldsymbol{\theta}} \log \tilde{p}(\mathbf{x} ; \boldsymbol{\theta})-\nabla_{\boldsymbol{\theta}} \log Z(\boldsymbol{\theta})$$

    which decomposes learning into a positive phase (the \(\nabla_{\boldsymbol{\theta}} \log \tilde{p}\) term) and a negative phase (the \(\nabla_{\boldsymbol{\theta}} \log Z\) term).

    Difficulties in Learning wrt the Decomposition:

    • Difficulty in the Negative Phase:
      - For most undirected models of interest, the negative phase is difficult to compute. This is usually due to having to compute the unnormalized probability for all the states.
      - Directed models define many “implicit” conditional independencies between the variables, making it easier to compute the normalization due to many terms canceling out.
      • Example - RBMs:
        The quintessential example of a model with a straightforward positive phase and a difficult negative phase is the RBM.
        Its hidden units are conditionally independent of each other given the visible units, which makes the positive phase tractable; the negative phase remains hard because the partition function couples all configurations of the units.

      Word2vec with a full softmax output is another example: its negative phase requires normalizing over the entire vocabulary.

    • Difficulty in the Positive Phase:
      - Latent variable models generally have an intractable positive phase.
      - Models with no latent variables or with few interactions between latent variables typically have a tractable positive phase.
      • Example - VAEs:
        VAEs define a continuous distribution (over the data) with latent variable \(z\):

        $$p_{\theta}(x)=\int p_{\theta}(z) p_{\theta}(x \vert z) d z$$

        which is intractable to compute, because the integral must be evaluated over all configurations of the latent variables.
        Due to complicated interactions between the latent variables (all \(z_i\) are dependent on each other), evaluating this integral requires exponential time.

    Positive and Negative Phases:
    The terms positive and negative do not refer to the sign of each term in the equation, but rather reflect their effect on the probability density defined by the model.
    - The positive phase increases the probability of training data (by reducing the corresponding free energy)
    - The negative phase decreases the probability of samples generated by the model.

    Monte Carlo Methods for Approximate LL Maximization:
    To use MC methods for approximate learning, we need to rewrite the gradient of the partition function \(\nabla_{\boldsymbol{\theta}} \log Z\) as an expectation, under the model distribution, of the gradient of the unnormalized log-probability \(\log \tilde{p}\):

    $$\nabla_{\boldsymbol{\theta}} \log Z=\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} \nabla_{\boldsymbol{\theta}} \log \tilde{p}(\mathbf{x})$$

    This identity is the basis for a variety of Monte Carlo methods for approximately maximizing the likelihood of models with intractable partition functions.
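
    A minimal sketch of the resulting estimator, assuming `grad_log_p_tilde` computes \(\nabla_{\boldsymbol{\theta}} \log \tilde{p}(\mathbf{x})\) for the model at hand and `model_samples` come from some MCMC sampler (both names are placeholders):

    ```python
    import numpy as np

    def log_likelihood_gradient(data_batch, model_samples, grad_log_p_tilde):
        """Monte Carlo estimate of the log-likelihood gradient:
        positive phase (expectation over the data) minus
        negative phase (expectation over model samples, estimating grad log Z)."""
        positive = np.mean([grad_log_p_tilde(x) for x in data_batch], axis=0)
        negative = np.mean([grad_log_p_tilde(x) for x in model_samples], axis=0)
        return positive - negative
    ```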

    Intuition:
    The Monte Carlo approach to learning provides an intuitive framework in terms of the phases of the learning decomposition:

    • Positive Phase:
      In the positive phase, we increase \(\log \tilde{p}(\mathbf{x})\) for \(\boldsymbol{x}\) drawn from the data.
      • Parametrize \(\log \tilde{p}\) in terms of an Energy Function:
        We interpret the positive phase as pushing down on the energy of training examples.
    • Negative Phase:
      In the negative phase, we decrease the partition function by decreasing \(\log \tilde{p}(\mathbf{x})\) for \(\mathbf{x}\) drawn from the model distribution.
      • Parametrize \(\log \tilde{p}\) in terms of an Energy Function:
        We interpret the negative phase as pushing up on the energy of samples drawn from the model.



  4. Stochastic Maximum Likelihood and Contrastive Divergence:
    To approximately maximize the log-likelihood using the identity derived above, we need MCMC methods to draw samples from the model.

    Motivation - The Naive Approach:
    The naive way to use the identity above is to approximate it by burning in a set of Markov chains from a random initialization every time the gradient is needed.
    When learning is performed using stochastic gradient descent, this means the chains must be burned in once per gradient step.

    The high cost of burning in the Markov chains in the inner loop makes this procedure computationally infeasible.

    Learning Intuition (from naive algorithm):

    • We can view the MCMC approach to maximum likelihood as trying to achieve balance between two forces:
      • One pushing up on the model distribution where the data occurs
        Corresponds to maximizing \(\log \tilde{p}\).
      • Another pushing down on the model distribution where the model samples occur.
        Corresponds to minimizing \(\log Z\).
    • There are several approximations to the negative phase.
      Each of these approximations can be understood as making the negative phase computationally cheaper but also making it push down in the wrong locations.
    • Negative Phase Intuition:
      • Because the negative phase involves drawing samples from the model’s distribution, we can think of it as finding points that the model believes in strongly.
      • Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model’s incorrect beliefs about the world.
        Referred to, in literature, as “hallucinations” or “fantasy particles”.

    Summary:

    The main cost of the naive MCMC algorithm is the cost of burning in the Markov chains from a random initialization at each step.

    Contrastive Divergence:
    One way to avoid the high cost of naive MCMC is to initialize the Markov chains from a distribution that is very close to the model distribution, so that the burn-in operation does not take as many steps.
    The Contrastive Divergence (CD) (or CD-\(k\) to indicate CD with \(k\) Gibbs steps) algorithm initializes the Markov chain at each step with samples from the data distribution (Hinton, 2000, 2010).
    - Obtaining samples from the data distribution is free, because they are already available in the dataset.
    - Initially, the data distribution is not close to the model distribution, so the negative phase is not very accurate.
    - Fortunately, the positive phase can still accurately increase the model’s probability of the data.
    - After the positive phase has had some time to act, the model distribution is closer to the data distribution, and the negative phase starts to become accurate.
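
    A minimal CD-\(k\) sketch for a Bernoulli RBM, assuming weights `W` of shape `(n_visible, n_hidden)` and biases `b` (visible) and `c` (hidden); all names and hyperparameters are illustrative:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd_k_step(v_data, W, b, c, k=1, lr=1e-2):
        """One CD-k parameter update for a Bernoulli RBM.
        v_data: (batch, n_visible) binary training examples."""
        ph_data = sigmoid(v_data @ W + c)     # positive phase: E[h | data]
        v = v_data.copy()                     # the CD trick: chains start at the data
        for _ in range(k):                    # k steps of Gibbs sampling
            h = (rng.random(ph_data.shape) < sigmoid(v @ W + c)).astype(float)
            v = (rng.random(v_data.shape) < sigmoid(h @ W.T + b)).astype(float)
        ph_model = sigmoid(v @ W + c)         # negative phase: E[h | model samples]
        m = len(v_data)
        W += lr * (v_data.T @ ph_data - v.T @ ph_model) / m   # in-place updates
        b += lr * (v_data - v).mean(axis=0)
        c += lr * (ph_data - ph_model).mean(axis=0)
    ```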

    Drawbacks:

    • Spurious Modes:
      Since CD is still an approximation to the correct negative phase, it results in spurious modes; i.e., it fails to suppress regions of high probability that are far from actual training examples.
      Spurious modes are regions that have high probability under the model but low probability under the data-generating distribution.

      - Modes in the distribution that are far from the data distribution will not be visited by Markov chains initialized at training points, unless \(k\) is very large.
    • CD as a Biased Estimator in RBMs and Boltzmann Machines:
      - Carreira-Perpiñan and Hinton (2005) showed experimentally that the CD estimator is biased for RBMs and fully visible Boltzmann machines, in that it converges to different points than the maximum likelihood estimator.
      - They argue that because the bias is small, CD could be used as an inexpensive way to initialize a model that could later be fine-tuned via more expensive MCMC methods.
      • Interpretation: Bengio and Delalleau (2009) show that CD can be interpreted as discarding the smallest terms of the correct MCMC update gradient, which explains the bias.
    • Random Gradients:
      Sutskever and Tieleman (2010) showed that the CD update direction is not the gradient of any function.
      This allows for situations where CD could cycle forever, but in practice this is not a serious problem.
    • Difficulty for Deep Models:
      • CD is useful for training shallow models like RBMs.
      • The RBMs can be stacked to initialize deeper models like DBNs or DBMs.
      • However, CD does NOT provide much help for training deeper models directly.
        • This is because it is difficult to obtain samples of the hidden units given samples of the visible units.
          • Since the hidden units are not included in the data, initializing from training points cannot solve the problem.
          • Even if we initialize the visible units from the data, we will still need to burn in a Markov chain sampling from the distribution over the hidden units conditioned on those visible samples.

    Relation to Autoencoder Training:
    - The CD algorithm can be thought of as penalizing the model for having a Markov chain that changes the input rapidly when the input comes from the data.
    - This means training with CD somewhat resembles autoencoder training.
    - Even though CD is more biased than some of the other training methods, it can be useful for pretraining shallow models that will later be stacked.

    • This is because the earliest models in the stack are encouraged to copy more information up to their latent variables, thereby making it available to the later models.
      This should be thought of more as an often-exploitable side effect of CD training rather than a principled design advantage.

    Stochastic Maximum Likelihood (SML) - Persistent Contrastive Divergence (PCD, PCD-\(k\)):
    SML, AKA PCD, initializes the Markov chains at each gradient step with their states from the previous gradient step, rather than at the data as in CD.
    This strategy resolves many of the problems with CD.
    Idea:

    • The basic idea of this approach is that, as long as the steps taken by the stochastic gradient algorithm are small, the model from the previous step will be similar to the model from the current step.
    • It follows that the samples from the previous model’s distribution will be very close to being fair samples from the current model’s distribution, so a Markov chain initialized with these samples will not require much time to mix.
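
    A minimal SML/PCD sketch, reusing the Bernoulli-RBM conditionals and the `sigmoid`/`rng` helpers from the CD sketch above; the only change is that the negative-phase chains (`v_persist`, the fantasy particles) are carried between gradient steps instead of being re-initialized at the data:

    ```python
    def pcd_step(v_data, v_persist, W, b, c, k=1, lr=1e-2):
        """One SML/PCD update for a Bernoulli RBM. v_persist holds the states
        of the persistent negative-phase chains from the previous step."""
        ph_data = sigmoid(v_data @ W + c)             # positive phase from data
        for _ in range(k):                            # continue the persistent chains
            h = (rng.random((len(v_persist), W.shape[1]))
                 < sigmoid(v_persist @ W + c)).astype(float)
            v_persist = (rng.random(v_persist.shape)
                         < sigmoid(h @ W.T + b)).astype(float)
        ph_model = sigmoid(v_persist @ W + c)         # negative phase from the chains
        W += lr * (v_data.T @ ph_data / len(v_data)
                   - v_persist.T @ ph_model / len(v_persist))
        b += lr * (v_data.mean(axis=0) - v_persist.mean(axis=0))
        c += lr * (ph_data.mean(axis=0) - ph_model.mean(axis=0))
        return v_persist                              # store for the next gradient step
    ```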

    Advantages:

    • SML is considerably more resistant to forming models with spurious modes than CD is:
      Because each Markov chain is continually updated throughout the learning process, rather than restarted at each gradient step, the chains are free to wander far enough to find all the model’s modes.
    • SML is able to train deep models efficiently:
      • SML provides an initialization point for both the hidden and the visible units:
        Because it is possible to store the state of all the sampled variables, whether visible or latent.
      • CD is only able to provide an initialization for the visible units, and therefore requires burn-in for deep models.
    • Performance/Results - In-Practice:
      Marlin et al. (2010) compared SML to many other criteria presented in this section. They found that:
      • SML results in the best test set log-likelihood for an RBM, and that
      • if the RBM’s hidden units are used as features for an SVM classifier, SML results in the best classification accuracy.

    Bias-Variance of CD and SML:
    Berglund and Raiko (2013) performed experiments to examine the bias and variance in the estimate of the gradient provided by CD and SML:

    • CD proves to have lower variance than the estimator based on exact sampling.
      The cause of CD’s low variance is its use of the same training points in both the positive and negative phase.
      If the negative phase is initialized from different training points, the variance rises above that of the estimator based on exact sampling.
    • SML has higher variance.

    Improving CD & SML:

    • MCMC Algorithms:
      All these methods based on using MCMC to draw samples from the model can in principle be used with almost any variant of MCMC. This means that techniques such as SML can be improved by using any of the enhanced MCMC techniques described in chapter 17, such as parallel tempering (Desjardins et al., 2010; Cho et al., 2010).
    • Fast PCD (FPCD):
      Another approach to accelerating mixing during learning relies not on changing the Monte Carlo sampling technology but rather on changing the parametrization of the model and the cost function.
      FPCD is one such method; it replaces the parameters \(\boldsymbol{\theta}\) of a traditional model with the sum:

      $$\boldsymbol{\theta}=\boldsymbol{\theta}^{(\mathrm{slow})}+\boldsymbol{\theta}^{(\mathrm{fast})}$$
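
      A rough sketch of the resulting update, following the fast-weights idea: the fast parameters receive the same gradient but with a much larger learning rate and a decay toward zero, so they perturb the sampling distribution only transiently and thereby speed up mixing (the learning rates and decay below are illustrative):

      ```python
      def fpcd_update(grad, theta_slow, theta_fast,
                      lr_slow=1e-3, lr_fast=1e-1, decay=0.95):
          """Both parameter sets receive the same two-phase gradient estimate.
          The negative-phase sampler uses theta_slow + theta_fast, which mixes
          faster because theta_fast moves quickly and is pulled back to zero."""
          theta_slow = theta_slow + lr_slow * grad
          theta_fast = decay * (theta_fast + lr_fast * grad)
          return theta_slow, theta_fast
      ```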

    Training with Positive Phase Estimators (bound-based, variational methods):
    One key benefit to the MCMC-based methods described in this section is that they provide an estimate of the gradient of \(\log Z,\) and thus we can essentially decompose the problem into the \(\log \tilde{p}\) contribution and the \(\log Z\) contribution.
    We can then use any other method to tackle \(\log \tilde{p}(\mathbf{x})\) and just add our negative phase gradient onto the other method’s gradient.
    In particular, this means that our positive phase can make use of methods that provide only a lower bound on \(\tilde{p}\).
    Most of the other methods of dealing with \(\log Z\) presented in this chapter are incompatible with bound-based positive phase methods.

  5. Pseudolikelihood:
    Motivation:
    We can sidestep the issue of approximating the intractable partition function by training the model without computing it at all.

    Idea:
    Most of these approaches are based on the observation that it is easy to compute ratios of probabilities in an undirected model.
    This is because the partition function appears in both the numerator and the denominator of the ratio and cancels out:

    $$\frac{p(\mathbf{x})}{p(\mathbf{y})}=\frac{\frac{1}{Z} \tilde{p}(\mathbf{x})}{\frac{1}{Z} \tilde{p}(\mathbf{y})}=\frac{\tilde{p}(\mathbf{x})}{\tilde{p}(\mathbf{y})}$$

    Pseudolikelihood:
    The Pseudolikelihood is an objective function, based on predicting the value of feature \(x _ {i}\) given all the other features \(\boldsymbol{x}_ {-i}\):

    $$\sum_{i=1}^{n} \log p\left(x_{i} \vert \boldsymbol{x}_ {-i}\right)$$

    Derivation:

    • The pseudolikelihood is based on the observation that conditional probabilities take this ratio-based form and thus can be computed without knowledge of the partition function.

    Computational Cost:

    • If each random variable has \(k\) different values, this requires only \(k \times n\) evaluations of \(\tilde{p}\) to compute,
    • as opposed to the \(k^{n}\) evaluations needed to compute the partition function.
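
    A minimal sketch of this computation for discrete variables, assuming `p_tilde` is any unnormalized density (e.g. the Boltzmann-machine sketch earlier) and `x` is a numpy array:

    ```python
    import numpy as np

    def log_pseudolikelihood(x, p_tilde, values=(0, 1)):
        """sum_i log p(x_i | x_{-i}) for one example x. Each conditional is a
        ratio of unnormalized probabilities, so Z cancels; the cost is roughly
        k * n evaluations of p_tilde (k = len(values), n = len(x))."""
        log_numer = np.log(p_tilde(x))       # \tilde{p} at the observed configuration
        total = 0.0
        for i in range(len(x)):
            denom = 0.0
            for v in values:                 # k evaluations per variable
                x_v = x.copy()
                x_v[i] = v
                denom += p_tilde(x_v)
            total += log_numer - np.log(denom)
        return total
    ```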

    Justification:
    Estimation by maximizing the pseudolikelihood is asymptotically consistent (Mase, 1995).
    When the datasets do not approach the large sample limit, pseudolikelihood may display different behavior from the maximum likelihood estimator.

    Generalized Pseudolikelihood Estimator:
    The Generalized Pseudolikelihood Estimator gives us a way to trade off computational complexity for deviation from maximum likelihood behavior.
    The GPE objective function:

    $$\sum_{i=1}^{m} \log p\left(\mathbf{x}_{\mathbb{S}^{(i)}} \vert \mathbf{x}_{-\mathbb{S}^{(i)}}\right)$$

    Complexity-Consistency Tradeoff:
    It uses \(m\) different sets \(\mathbb{S}^{(i)}, i=1, \ldots, m\) of indices of variables that appear together on the left side of the conditioning bar:

    • In the extreme case of \(m=1\) and \(\mathbb{S}^{(1)}=\{1, \ldots, n\},\) the generalized pseudolikelihood recovers the log-likelihood.
    • In the extreme case of \(m=n\) and \(\mathbb{S}^{(i)}=\{i\},\) the generalized pseudolikelihood recovers the pseudolikelihood.

    Performance:
    The performance of pseudolikelihood-based approaches depends largely on how the model will be used:

    • Pseudolikelihood tends to perform poorly on tasks that require a good model of the full joint \(p(\mathbf{x}),\) such as density estimation and sampling.
    • It can perform better than maximum likelihood for tasks that require only the conditional distributions used during training, such as filling in small amounts of missing values.
    • Generalized pseudolikelihood techniques are especially powerful if the data has regular structure that allows the \(\mathbb{S}\) index sets to be designed to capture the most important correlations while leaving out groups of variables that have only negligible correlation.
      For example, in natural images, pixels that are widely separated in space also have weak correlation, so the generalized pseudolikelihood can be applied with each \(\mathbb{S}\) set being a small, spatially localized window.

    Drawbacks - Training with Lower-Bound Maximization Methods:

    • One weakness of the pseudolikelihood estimator is that it cannot be used with other approximations that provide only a lower bound on \(\tilde{p}(\mathbf{x}),\) e.g. variational inference.
      • This is because \(\tilde{p}\) appears in the denominator.
        A lower bound on the denominator provides only an upper bound on the expression as a whole, and there is no benefit to maximizing an upper bound.
        This makes it difficult to apply pseudolikelihood approaches to deep models such as deep Boltzmann machines, since variational methods are one of the dominant approaches to approximately marginalizing out the many layers of hidden variables that interact with each other.
    • Nonetheless, pseudolikelihood is still useful for deep learning, because it can be used to train single-layer models or deep models using approximate inference methods that are not based on lower bounds.

    Pseudolikelihood vs SML/PCD - Computational Cost:
    Pseudolikelihood has a much greater cost per gradient step than SML, due to its explicit computation of all the conditionals.
    But generalized pseudolikelihood and similar criteria can still perform well if only one randomly selected conditional is computed per example (Goodfellow et al., 2013b), thereby bringing the computational cost down to match that of SML.

    Relation to the Negative Phase:
    Though the pseudolikelihood estimator does not explicitly minimize \(\log Z\), it can still be thought of as having something resembling a negative phase.
    The denominators of each conditional distribution result in the learning algorithm suppressing the probability of all states that have only one variable differing from a training example.

    Asymptotic Efficiency:
    See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of pseudolikelihood.

  6. Score-Matching and Ratio-Matching:
    Score matching (Hyvärinen, 2005) sidesteps \(Z\) by matching the score of the model to the score of the data, where the score of a density is the gradient of its log wrt the input, \(\nabla_{\mathbf{x}} \log p(\mathbf{x})\).
    Because \(\log Z\) does not depend on \(\mathbf{x}\), its derivative wrt \(\mathbf{x}\) vanishes, and the model's score involves only the unnormalized probability:

    $$\nabla_{\mathbf{x}} \log p(\mathbf{x} ; \boldsymbol{\theta})=\nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x} ; \boldsymbol{\theta})$$

    Ratio matching (Hyvärinen, 2007) extends the same idea to binary data.

    Denoising Score Matching:
    Denoising score matching fits the score of a noise-corrupted (smoothed) version of the data distribution rather than that of the data itself, which both regularizes the estimator and avoids computing second derivatives of \(\log \tilde{p}\).
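
    A minimal denoising-score-matching sketch, assuming a hypothetical parametric score model `score` (the Gaussian-shaped form below is purely illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def score(x, mu, w):
        """Hypothetical score model: gradient of a Gaussian-shaped log-density."""
        return -(x - mu) * w

    def dsm_loss(x, mu, w, sigma=0.1):
        """Denoising score matching: corrupt x with Gaussian noise, then regress
        the model score at the corrupted point onto the score of the corruption
        kernel, (x - x_noisy) / sigma**2. No partition function is needed."""
        x_noisy = x + rng.normal(0.0, sigma, size=x.shape)
        target = (x - x_noisy) / sigma**2
        return np.mean((score(x_noisy, mu, w) - target) ** 2)
    ```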


  7. Noise-Contrastive Estimation (NCE):
    Noise-Contrastive Estimation (NCE) is a method that treats the (negative log) partition function as a learned parameter; the probability distribution estimated by the model is represented explicitly as:

    $$\log p_{\text {model }}(\mathbf{x})=\log \tilde{p}_ {\text {model}}(\mathbf{x} ; \boldsymbol{\theta})+c$$

    where \(c\) is explicitly introduced as an approximation of \(-\log Z(\boldsymbol{\theta})\).

    Rather than estimating only \(\boldsymbol{\theta}\), the noise contrastive estimation procedure treats \(c\) as just another parameter and estimates \(\boldsymbol{\theta}\) and \(c\) simultaneously, using the same algorithm for both.
    The resulting \(\log p_{\text {model}}(\mathbf{x})\) thus may not correspond exactly to a valid probability distribution, but it will become closer and closer to being valid as the estimate of \(c\) improves.
    (NCE is also applicable to problems with a tractable partition function, where there is no need to introduce the extra parameter \(c\); however, it has generated the most interest as a means of estimating models with difficult partition functions.)

    Derivation:

    • Problem with Maximum Likelihood Criterion:
      Such an approach would not be possible using maximum likelihood as the criterion for the estimator.
      The maximum likelihood criterion would choose to set \(c\) arbitrarily high, rather than setting \(c\) to create a valid probability distribution.
    • Solution - New Estimator of the original problem:
      NCE works by reducing the unsupervised learning problem of estimating \(p(\mathrm{x})\) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model.
      This supervised learning problem is constructed in such a way that maximum likelihood estimation defines an asymptotically consistent estimator of the original problem.

      Specifically,

      1. Posit two distributions: the model, and a noise distribution.
        • The Noise Distribution \(p_{\text{noise}}(\mathbf{x})\):
          We introduce a new distribution \(p_{\text{noise}}(\mathbf{x})\) over the noise.
          The noise distribution should be tractable to evaluate and to sample from.
      2. Construct a new joint model over both \(\boldsymbol{x}\) and a binary variable \(y\):
        • We can now construct a model over both \(\mathbf{x}\) and a new, binary class variable \(y\). In the new joint model, we specify that
          (1) \(p_{\mathrm{joint}}(y=1)=\frac{1}{2}\)
          (2) \(p_{\mathrm{joint}}(\mathbf{x} \vert y=1)=p_{\mathrm{model}}(\mathbf{x})\)
          (3) \(p_{\mathrm{joint}}(\mathbf{x} \vert y=0)=p_{\mathrm{noise}}(\mathbf{x})\)
          In other words, \(y\) is a switch variable that determines whether we will generate \(\mathbf{x}\) from the model or from the noise distribution.
        • Equivalently, We can construct a similar joint model of training data.
          Formally, (1) \(p_{\text {train}}(y=1)=\frac{1}{2}\)
          (2) \(p_{\text {train}}(\mathbf{x} \vert y=1)=p_{\text {data }}(\mathbf{x}),\)
          (3) \(p_{\text {train}}(\mathbf{x} \vert y=0)=p_{\text {noise}}(\mathbf{x})\)
          In this case, the switch variable determines whether we draw \(\mathbf{x}\) from the data or from the noise distribution.
      3. Construct the new supervised Binary Classification Task - fitting \(p_{\text {joint}}\) to \(p_{\text {train}}\):
        We can now just use standard maximum likelihood learning on the supervised learning problem of fitting \(p_{\text {joint}}\) to \(p_{\text {train}}\), by swapping \(p_{\text {model}}\) with \(p_{\text {joint}}\):

        $$\boldsymbol{\theta}, c=\underset{\boldsymbol{\theta}, c}{\arg \max } \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p_{\text {train}}} \log p_{\text {joint}}(y \vert \mathbf{x})$$

        • Expanding \(p_{\text{joint}}(y \vert x)\):
          The distribution \(p_{\text{joint}}\) is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

          $$\begin{aligned} p_{\text {joint}}(y=1 \vert \mathbf{x}) &= \frac{p_{\text {model }}(\mathbf{x})}{p_{\text {model }}(\mathbf{x})+p_{\text {noise}}(\mathbf{x})} \\ &= \frac{1}{1+\frac{p_{\text {noise}}(\mathbf{x})}{p_{\text {model}} (\mathbf{x})}} \\ &= \frac{1}{1+\exp \left(\log \frac{p_{\text {noise}}(\mathbf{x})}{p_{\text {model }}(\mathbf{x})}\right)} \\ &= \sigma\left(-\log \frac{p_{\text {noise}}(\mathbf{x})}{p_{\text {model }}(\mathbf{x})}\right) \\ &= \sigma\left(\log p_{\text {model }}(\mathbf{x})-\log p_{\text {noise}}(\mathbf{x})\right) \end{aligned}$$

    Summary:

    1. Posit two distributions: the model, and a noise distribution.
    2. Given a data point, predict from which distribution this point was generated.

    NCE is thus simple to apply as long as \(\log \tilde{p}_{\text {model}}\) is easy to back-propagate through, and, as specified above, \(p_{\text {noise}}\) is easy to evaluate (in order to evaluate \(p_{\text {joint}}\)) and to sample from (to generate the training data).
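
    A minimal sketch of the resulting objective, assuming `log_p_tilde` and `log_p_noise` are vectorized functions and `c` is the learned scalar standing in for \(-\log Z\) (one noise sample per data point):

    ```python
    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def nce_loss(x_data, x_noise, log_p_tilde, log_p_noise, c):
        """Binary-classification NCE objective with equal class priors.
        The classifier's logit is log p_model(x) - log p_noise(x),
        where log p_model(x) = log_p_tilde(x) + c."""
        logit_data = log_p_tilde(x_data) + c - log_p_noise(x_data)
        logit_noise = log_p_tilde(x_noise) + c - log_p_noise(x_noise)
        # Maximize log p_joint(y=1 | data) + log p_joint(y=0 | noise),
        # written here as a loss to minimize.
        return -(np.mean(np.log(sigmoid(logit_data)))
                 + np.mean(np.log(1.0 - sigmoid(logit_noise))))
    ```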

    The Noise Distribution:

    • Better Distributions - Parametric \(p_{\text{noise}}\):
      The noise distribution is generally non-parametric.
      However, there is nothing stopping us from evolving this distribution and giving it trainable parameters, then updating these parameters such that it generates increasingly “optimal” samples.
      • Optimality:
        Of course, we would have to design what “optimal” means.
        • Adversarial Contrastive Estimation:
          One interesting approach is called Adversarial Contrastive Estimation, wherein the authors adapt the noise distribution to generate increasingly “harder” negative examples, which forces the main model to learn a better representation of the data.

    Weaknesses/Drawbacks:

    • Problems with Many RVs:
      When NCE is applied to problems with many random variables, it becomes less efficient.
      • The logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely.
        This means that learning slows down greatly after \(p_{\text {model}}\) has learned the basic marginal statistics.
      • Imagine learning a model of images of faces, using unstructured Gaussian noise as \(p_{\text {noise}}\).
        If \(p_{\text {model }}\) learns about eyes, it can reject almost all unstructured noise samples without having learned anything about other facial features, such as mouths.
    • Noise Distribution Complexity:
      The constraint that \(p_{\text {noise}}\) must be easy to evaluate and easy to sample from can be overly restrictive:
      • To generate the training data, we require the ability to sample from our noise distribution.
      • To evaluate the objective, we require the ability to compute the likelihood of a sample under our noise distribution.

      When \(p_{\text {noise}}\) is simple, most samples are likely to be too obviously distinct from the data to force \(p_{\text {model}}\) to improve noticeably.

    • Training with Lower-Bound Maximizing Methods:
      NCE does not work if only a lower bound on \(\tilde{p}\) is available.
      Such a lower bound could be used to construct a lower bound on \(p_{\text {joint}}(y=1 \vert \mathbf{x}),\) but it can only be used to construct an upper bound on \(p_{\text {joint}}(y=0 \vert \mathbf{x}),\) which appears in half the terms of the NCE objective.
      Likewise, a lower bound on \(p_{\text {noise}}\) is not useful, because it provides only an upper bound on \(p_{\text {joint}}(y=1 \vert \mathbf{x})\).

    Self-Contrastive Estimation:
    When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood (Goodfellow, 2014).
    Interpretation:

    • Self-Contrastive Estimation:
      The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs.
    • NCE:
      Plain NCE, by contrast, achieves a reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).

    Connection to Importance Sampling:
    Jozefowicz et al. (2016) show that NCE and IS are not only similar as both are sampling-based approaches, but are strongly connected.
    While NCE uses a binary classification task, they show that IS can be described similarly using a surrogate loss function: instead of performing binary classification with a logistic loss function like NCE, IS optimizes a multi-class classification problem with a softmax and cross-entropy loss function.
    They observe that as IS performs multi-class classification, it may be a better choice for language modeling, as the loss leads to tied updates between the data and noise samples rather than independent updates as with NCE.
    Indeed, Jozefowicz et al. (2016) use IS for language modeling and obtain state-of-the-art performance on the 1B Word benchmark.

    Relation to Generative Adversarial Networks (GANs):
    Noise contrastive estimation is based on the idea that a good generative model should be able to distinguish data from noise.
    A closely related idea is that a good generative model should be able to generate samples that no classifier can distinguish from data.
    This idea yields generative adversarial networks.

    Self-Normalization:
    Mnih and Teh (2012) and Vaswani et al. (2013) fix the normalizer to \(1\), i.e. they fix \(c = 0\) in the notation above.
    They report that this does not affect the model's performance.
    This assumption has the nice side-effect of reducing the number of model parameters, while ensuring that the model self-normalises by not depending on the explicit normalisation in \(c\).
    Indeed, Zoph et al. (2016) find that, even when learned, the normalizer stays close to \(1\) (and hence \(c\) close to \(0\)) and has low variance.

  8. Negative Sampling:
    Negative Sampling (NEG) can be seen as an approximation to NCE.
    As the number of noise samples \(k\) increases, the NCE gradient can be shown to approach the maximum likelihood gradient of \(\log p_{\text{model}}\).
    NEG simplifies NCE and does away with this guarantee, as the objective of NEG is to learn high-quality word representations rather than achieving low perplexity on a test set, as is the goal in language modeling.

    The key difference from NCE is that NEG approximates the classifier's posterior \(p(y=1 \vert \mathbf{x})\) by making it as cheap to compute as possible.
    It simplifies NCE as follows:

    1. It allows noise distributions whose likelihood we cannot evaluate.
    2. To accommodate this, it simply sets the most expensive term to a constant: \(k\,p_{\text {noise}}(x)=1\).

    Equivalence with NCE:

    • \(k\,p_{\text {noise}}=1\) holds exactly when (for discrete \(X\)):
      1. \(k=\vert X\vert\) and
      2. \(p_{\text {noise}}\) is a uniform distribution.

      In this case, NEG is equivalent to NCE.

    • The reason we set \(k\,p_{\text {noise}}=1\), rather than some other constant, is that this choice makes \(P(y=1 \vert \mathbf{x})\) simplify to a plain sigmoid of the model's score.
    • In all other cases, NEG only approximates NCE, which means that it will not directly optimize the likelihood \(\log p_{\text {model}}(\mathbf{x})\).
    • Asymptotic Consistency:
      Since NEG only approximates NCE, it lacks any asymptotic consistency guarantees.

    Application - Language Modeling and Word Embeddings:
    NEG only approximates NCE, which means that it will not directly optimise the likelihood of correct words, which is key for language modelling. While NEG may thus be useful for learning word embeddings, its lack of asymptotic consistency guarantees makes it inappropriate for language modelling.
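
    A minimal word2vec-style sketch, assuming `s_pos` is the model's score for the observed pair and `s_neg` holds scores for \(k\) sampled noise pairs; with \(k\,p_{\text{noise}} = 1\), the NCE logit reduces to the raw score:

    ```python
    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def neg_loss(s_pos, s_neg):
        """Negative-sampling objective for one training pair:
        -log sigma(s_pos) - sum_j log sigma(-s_neg[j]).
        Dropping p_noise from the logit is what distinguishes NEG from NCE."""
        return -(np.log(sigmoid(s_pos))
                 + np.sum(np.log(sigmoid(-np.asarray(s_neg)))))
    ```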

  9. Self-Normalization:
    Remember from NCE that we decomposed the log likelihood of the model as:

    $$\log p_{\text {model }}(\mathbf{x})=\log \tilde{p}_ {\text {model}}(\mathbf{x} ; \boldsymbol{\theta})+c$$

    where \(c\) is explicitly introduced as an approximation of \(-\log Z(\boldsymbol{\theta})\).

    If we are able to constrain our model so that it sets \(c=0\) (i.e. \(e^c = 1\)), then we can avoid computing the normalization in \(c\) altogether.
    Devlin et al. (2014) thus propose to add a squared error penalty term to the loss function that encourages the model to keep \(c\) as close as possible to \(0\):

    $$\tilde{J} = J + \lambda (c-0)^{2}$$

    where \(\lambda\) allows us to trade-off between model accuracy and mean self-normalisation.

    At inference time, we set

    $$p_{\text {model }}(\mathbf{x})=\dfrac{\tilde{p}_ {\text {model}}(\mathbf{x} ; \boldsymbol{\theta})}{Z(\boldsymbol{\theta})} \approx \dfrac{\tilde{p}_ {\text {model}}(\mathbf{x} ; \boldsymbol{\theta})}{1} = \tilde{p}_ {\text {model}}(\mathbf{x} ; \boldsymbol{\theta})$$
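
    A minimal sketch for a softmax output layer (e.g. a neural language model), assuming `logits` holds the unnormalized log-probabilities for one context and `target` indexes the observed class; `lam` is the trade-off \(\lambda\):

    ```python
    import numpy as np

    def self_normalized_loss(logits, target, lam=0.1):
        """Cross-entropy plus the squared self-normalization penalty,
        J + lam * (log Z)**2, i.e. (c - 0)**2 with c = -log Z.
        The penalty drives Z toward 1 so that, at inference time,
        the unnormalized score can be used directly."""
        log_Z = np.log(np.sum(np.exp(logits)))   # computed during training only
        nll = -(logits[target] - log_Z)          # the usual loss J
        return nll + lam * log_Z ** 2
    ```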

    Results - MT:
    They report that self-normalisation achieves a speed-up factor of about 15, while only resulting in a small degradation of BLEU scores compared to a regular non-self-normalizing neural language model.




Estimating the Partition Function

  1. Estimating the Partition Function:
  2. Annealed Importance Sampling (AIS):
  3. Bridge Sampling:
  4. Linked Importance Sampling (LIS):
  5. Estimating the Partition Function while Training: