Sampling

  1. Monte Carlo Sampling:
    When a sum or an integral cannot be computed exactly we can approximate it using Monte Carlo sampling. The idea is to view the sum or integral as if it were an expectation under some distribution and to approximate the expectation by a corresponding average:
    - Sum:

    $$s=\sum_{\boldsymbol{x}} p(\boldsymbol{x}) f(\boldsymbol{x})=E_{p}[f(\mathbf{x})]$$

    - Integral:

    $$s=\int p(\boldsymbol{x}) f(\boldsymbol{x}) d \boldsymbol{x}=E_{p}[f(\mathbf{x})]$$

    We can approximate \(s\) by drawing \(n\) samples \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(n)}\) from \(p\) and then forming the empirical average:

    $$\hat{s}_{n}=\frac{1}{n} \sum_{i=1}^{n} f\left(\boldsymbol{x}^{(i)}\right)$$
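
    As a minimal sketch, the following NumPy snippet estimates \(s=E_{p}[f(\mathbf{x})]\) for a hypothetical choice \(p=\mathcal{N}(0,1)\) and \(f(x)=x^{2}\), whose true value is \(1\):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # hypothetical integrand; E_p[x^2] = 1 when p = N(0, 1)
        return x ** 2

    n = 100_000
    x = rng.standard_normal(n)   # draw n samples x^(i) from p
    s_hat = f(x).mean()          # empirical average (1/n) * sum_i f(x^(i))
    print(s_hat)                 # ~= 1.0
    ```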


  2. Importance Sampling:
    There is no unique decomposition of the MC approximation because \(p(\boldsymbol{x}) f(\boldsymbol{x})\) can always be rewritten as:

    $$p(\boldsymbol{x}) f(\boldsymbol{x})=q(\boldsymbol{x}) \frac{p(\boldsymbol{x}) f(\boldsymbol{x})}{q(\boldsymbol{x})} $$

    where we now sample from \(q\) and average \(\frac{p f}{q}\).

    Formally, the expectation becomes:

    $$E_{p}[f(\mathbf{x})] = \sum_{\boldsymbol{x}} p(\boldsymbol{x}) f(\boldsymbol{x}) = \sum_{\boldsymbol{x}} q(\boldsymbol{x}) \dfrac{p(\boldsymbol{x})}{q(\boldsymbol{x})} f(\boldsymbol{x}) = E_q\left[\dfrac{p(\boldsymbol{x})}{q(\boldsymbol{x})} f(\boldsymbol{x})\right]$$
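
    A hedged sketch of the same estimate via importance sampling, assuming a hypothetical proposal \(q=\mathcal{N}(0,2^{2})\): each sample drawn from \(q\) is weighted by \(p(x^{(i)})/q(x^{(i)})\):

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2                           # same integrand; E_p[f] = 1 for p = N(0, 1)

    n = 100_000
    q_scale = 2.0
    x = rng.normal(0.0, q_scale, size=n)           # x^(i) ~ q
    w = norm.pdf(x) / norm.pdf(x, scale=q_scale)   # importance weights p(x^(i)) / q(x^(i))
    s_hat = np.mean(w * f(x))                      # E_q[(p/q) f] ~= 1.0
    print(s_hat)
    ```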

    Biased Importance Sampling:
    Another approach is to use biased importance sampling, which has the advantage of not requiring normalized \(p\) or \(q\). In the case of discrete variables, the biased importance sampling estimator is given by

    $$\begin{aligned} \hat{s}_{BIS} &=\frac{\sum_{i=1}^{n} \frac{p\left(\boldsymbol{x}^{(i)}\right)}{q\left(\boldsymbol{x}^{(i)}\right)} f\left(\boldsymbol{x}^{(i)}\right)}{\sum_{i=1}^{n} \frac{p\left(\boldsymbol{x}^{(i)}\right)}{q\left(\boldsymbol{x}^{(i)}\right)}} \\ &=\frac{\sum_{i=1}^{n} \frac{p\left(\boldsymbol{x}^{(i)}\right)}{\tilde{q}\left(\boldsymbol{x}^{(i)}\right)} f\left(\boldsymbol{x}^{(i)}\right)}{\sum_{i=1}^{n} \frac{p\left(\boldsymbol{x}^{(i)}\right)}{\tilde{q}\left(\boldsymbol{x}^{(i)}\right)}} \\ &=\frac{\sum_{i=1}^{n} \frac{\tilde{p}\left(\boldsymbol{x}^{(i)}\right)}{\tilde{q}\left(\boldsymbol{x}^{(i)}\right)} f\left(\boldsymbol{x}^{(i)}\right)}{\sum_{i=1}^{n} \frac{\tilde{p}\left(\boldsymbol{x}^{(i)}\right)}{\tilde{q}\left(\boldsymbol{x}^{(i)}\right)}} \end{aligned}$$

    where \(\tilde{p}\) and \(\tilde{q}\) are the unnormalized forms of \(p\) and \(q\), and the \(\boldsymbol{x}^{(i)}\) are the samples from \(q\).
    Bias:
    This estimator is biased because \(\mathbb{E}[\hat{s}_ {BIS}] \neq s\), except asymptotically: as \(n \rightarrow \infty\), the denominator of the first equation (above), rescaled by \(\frac{1}{n}\), converges to \(1\). Hence this estimator is called asymptotically unbiased.

    Statistical Efficiency:
    Although a good choice of \(q\) can greatly improve the efficiency of Monte Carlo estimation, a poor choice of \(q\) can make the efficiency much worse.
    - If there are samples of \(q\) for which \(\frac{p(\boldsymbol{x})|f(\boldsymbol{x})|}{q(\boldsymbol{x})}\) is large, then the variance of the estimator can get very large.
    This may happen when \(q(\boldsymbol{x})\) is tiny while neither \(p(\boldsymbol{x})\) nor \(f(\boldsymbol{x})\) are small enough to cancel it.
    The \(q\) distribution is usually chosen to be a simple distribution so that it is easy to sample from. When \(\boldsymbol{x}\) is high dimensional, this simplicity in \(q\) causes it to match \(p\) or \(p\vert f\vert\) poorly.
    (1) When \(q\left(\boldsymbol{x}^{(i)}\right) \gg p\left(\boldsymbol{x}^{(i)}\right)\left|f\left(\boldsymbol{x}^{(i)}\right)\right|\), importance sampling collects useless samples (summing tiny numbers or zeros).
    (2) On the other hand, when \(q\left(\boldsymbol{x}^{(i)}\right) \ll p\left(\boldsymbol{x}^{(i)}\right)\left|f\left(\boldsymbol{x}^{(i)}\right)\right|\), which will happen more rarely, the ratio can be huge.
    Because these latter events are rare, they may not show up in a typical sample, yielding typical underestimation of \(s\), compensated rarely by gross overestimation.
    Such very large or very small numbers are typical when \(\boldsymbol{x}\) is high dimensional, because in high dimension the dynamic range of joint probabilities can be very large.

    A good IS sampling distribution \(q\) is one that yields a low-variance estimator: it places high probability mass where \(p(\boldsymbol{x})|f(\boldsymbol{x})|\) is large.

    Applications:
    In spite of this danger, importance sampling and its variants have been found very useful in many machine learning algorithms, including deep learning algorithms. They have been used to:

    • Accelerate training in neural language models with a large vocabulary
    • Accelerate other neural nets with a large number of outputs
    • Estimate a partition function (the normalization constant of a probability distribution)
    • Estimate the log-likelihood in deep directed models, e.g. Variational Autoencoders
    • Improve the estimate of the gradient of the cost function used to train model parameters with stochastic gradient descent
      Particularly for models, such as classifiers, in which most of the total value of the cost function comes from a small number of misclassified examples.
      Sampling more difficult examples more frequently can reduce the variance of the gradient in such cases (Hinton, 2006).

    Approximating Distributions:
    To approximate the expectation (mean) of a distribution \(p\):

    $${\mathbb{E}}_ {p}[x]=\sum_{x} x p(x)$$

    by sampling from a distribution \(q\).
    Notice that:
    (1) \({\displaystyle {\mathbb{E}}_ {p}[x]=\sum_{x} x p(x) = \sum_{x} x\frac{p(x)}{q(x)} q(x)}\)
    (2) \({\displaystyle \sum_{x} x\frac{p(x)}{q(x)} q(x)=\mathbb{E}_ {q}\left[x \frac{p(x)}{q(x)}\right]}\)
    We approximate the expectation over \(q\) in (2) with the empirical distribution:

    $$\mathbb{E}_ {q}\left[x \frac{p(x)}{q(x)}\right] \approx \dfrac{1}{n} \sum_{i=1}^n x_i \dfrac{p(x_i)}{q(x_i)}$$

    Approximating UnNormalized Distributions - Biased Importance Sampling:
    Let \(p(x)=\frac{h(x)}{Z}\), then

    $$\begin{aligned}\mathbb{E}_{p}[x] &= \sum_{x} x \frac{h(x)}{Z} \\ &= \sum_{x} x \frac{h(x)}{Z q(x)} q(x) \\ &\approx \frac{1}{Z} \frac{1}{n} \sum_{i=1}^{n} x_{i} \frac{h\left(x_{i}\right)}{q\left(x_{i}\right)} \end{aligned}$$

    where the samples \(x_i\) are drawn from \(q\).
    To get rid of the \(\dfrac{1}{Z}\) factor,
    - First, we define the importance sample weight:

    $$w_i = \frac{h\left(x_{i}\right)}{q\left(x_{i}\right)}$$

    - then the sample mean weight:

    $$\bar{w} = \dfrac{1}{n} \sum_{i=1}^n w_i = \dfrac{1}{n} \sum_{i=1}^n \frac{h\left(x_{i}\right)}{q\left(x_{i}\right)}$$

    - Now, we decompose \(Z\) by noticing that:

    $$\mathbb{E}_ {p}[1]=1=\sum_{x} \frac{h(x)}{Z}$$

    \(\implies\)

    $$Z = \sum_{x} h(x)$$

    - we approximate the expectation again with IS:

    $$\begin{aligned} Z &= \sum_{x} h(x) \\ &= \sum_{x} \frac{h(x)}{q(x)} q(x) \\ &\approx \dfrac{1}{n} \sum_{i=1}^{n} \frac{h\left(x_{i}\right)}{q\left(x_{i}\right)} \\ &= \bar{w} \end{aligned}$$

    Thus, the sample normalizing constant \(\hat{Z}\) is equal to the sample mean weight:

    $$\hat{Z} = \bar{w}$$

    Finally,

    $$\mathbb{E}_ {p}[x] \approx \frac{1}{\bar{w}} \frac{1}{n} \sum_{i=1}^{n} x_{i} w_i = \dfrac{\overline{xw}}{\bar{w}}$$
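
    A sketch of this self-normalized estimator, assuming a hypothetical unnormalized target \(h(x)=\exp\left(-\frac{(x-3)^{2}}{2}\right)\) (so \(p=\mathcal{N}(3,1)\) and \(\mathbb{E}_{p}[x]=3\)) and proposal \(q=\mathcal{N}(0,3^{2})\):

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def h(x):
        # unnormalized target: p(x) = h(x) / Z with p = N(3, 1), so E_p[x] = 3
        return np.exp(-0.5 * (x - 3.0) ** 2)

    n = 200_000
    x = rng.normal(0.0, 3.0, size=n)     # samples x_i ~ q = N(0, 3^2)
    w = h(x) / norm.pdf(x, scale=3.0)    # importance weights w_i = h(x_i) / q(x_i)
    w_bar = w.mean()                     # Z_hat = mean weight
    xw_bar = (x * w).mean()              # mean of x_i * w_i
    print(xw_bar / w_bar)                # ~= 3.0
    ```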

    Curse of Dimensionality in IS - Variance of the Estimator:
    A big problem with Importance Sampling is that the variance of the IS estimator can be greatly sensitive to the choice of \(q\).
    The Variance is:

    $$\operatorname{Var}\left[\hat{s}_ {q}\right]=\operatorname{Var}\left[\frac{p(\mathbf{x}) f(\mathbf{x})}{q(\mathbf{x})}\right] / n$$

    The Minimum Variance occurs when \(q\) is:

    $$q^{* }(\boldsymbol{x})=\frac{p(\boldsymbol{x})|f(\boldsymbol{x})|}{Z}$$

    where \(Z\) is the normalization constant, chosen so that \(q^{* }(\boldsymbol{x})\) sums or integrates to \(1\) as appropriate.
    - Any choice of sampling distribution \(q\) is valid (in the sense of yielding the correct expected value).
    - \(q^{ * }\) is the optimal one (in the sense of yielding minimum variance).
    - Sampling from \(q^{ * }\) is usually infeasible, but other choices of \(q\) can be feasible while still reducing the variance somewhat.
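
    To see this sensitivity empirically, the following sketch (a hypothetical setup with \(p=\mathcal{N}(0,1)\) and \(f(x)=x^{2}\)) compares the per-sample terms \(\frac{p(\boldsymbol{x})f(\boldsymbol{x})}{q(\boldsymbol{x})}\) under a proposal that covers \(p|f|\) reasonably well and one that mostly misses it:

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2                   # E_p[f] = 1 for p = N(0, 1)
    n = 100_000

    def is_terms(q_loc, q_scale):
        # per-sample terms p(x) f(x) / q(x) for x ~ q; their variance / n is Var[s_hat_q]
        x = rng.normal(q_loc, q_scale, size=n)
        return norm.pdf(x) * f(x) / norm.pdf(x, loc=q_loc, scale=q_scale)

    good = is_terms(0.0, 1.5)              # q covers p|f| reasonably well
    poor = is_terms(3.0, 0.5)              # q places almost no mass where p|f| is large
    print(good.mean(), good.var() / n)     # ~= 1.0, small estimator variance
    print(poor.mean(), poor.var() / n)     # typically underestimates, occasionally blows up
    ```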

  3. Markov Chain Monte Carlo (MCMC) Methods:
    Motivation:
    In many cases, we wish to use a Monte Carlo technique but there is no tractable method for drawing exact samples from the distribution \(p_{\text {model}}(\mathbf{x})\) or from a good (low variance) importance sampling distribution \(q(\mathbf{x})\).
    In the context of deep learning, this most often happens when \(p_{\text {model}}(\mathbf{x})\) is represented by an undirected model.
    In these cases, we introduce a mathematical tool called a Markov chain to approximately sample from \(p_{\text {model}}(\mathbf{x})\). The family of algorithms that use Markov chains to perform Monte Carlo estimates is called Markov Chain Monte Carlo (MCMC) methods.

    Idea of MCs:
    - The core idea of a Markov chain is to have a state \(\boldsymbol{x}\) that begins as an arbitrary value.
    - Over time, we randomly update \(\boldsymbol{x}\) repeatedly.
    - Eventually \(\boldsymbol{x}\) becomes (very nearly) a fair sample from \(p(\boldsymbol{x})\).

    Definition:
    Formally, a Markov chain is defined by:

    • A random state \(x\) and
    • A transition distribution \(T\left(x^{\prime} \vert x\right)\)
      specifying the probability that a random update will go to state \(x^{\prime}\) if it starts in state \(x\).

      Running the Markov chain means repeatedly updating the state \(x\) to a value \(x^{\prime}\) sampled from \(T\left(\mathbf{x}^{\prime} \vert x\right)\).

    Finite, Countable States:
    We take the case where the random variable \(\mathbf{x}\) has countably many states.
    Representation:
    We represent the state as just a positive integer \(x\).
    Different integer values of \(x\) map back to different states \(\boldsymbol{x}\) in the original problem.

    Consider what happens when we run infinitely many Markov chains in parallel.
    - All the states of the different Markov chains are drawn from some distribution \(q^{(t)}(x)\), where \(t\) indicates the number of time steps that have elapsed.
    - At the beginning, \(q^{(0)}\) is some distribution that we used to arbitrarily initialize \(x\) for each Markov chain.
    - Later, \(q^{(t)}\) is influenced by all the Markov chain steps that have run so far.
    - Our goal is for \(q^{(t)}(x)\) to converge to \(p(x)\).

    • Probability of transitioning to a new state:
      Let’s update a single Markov chain’s state \(x\) to a new state \(x^{\prime}\).
      The probability of a single state landing in state \(x^{\prime}\) is given by:

      $$q^{(t+1)}\left(x^{\prime}\right)=\sum_{x} q^{(t)}(x) T\left(x^{\prime} \vert x\right)$$

      • Describing \(q\):
        Because we have reparametrized the problem in terms of a positive integer \(x\), we can describe the probability distribution \(q\) using a vector \(\boldsymbol{v}\) with:

        $$q(\mathrm{x}=i)=v_{i}$$

      • The Transition Operator \(T\) as a Matrix:
        Using our integer parametrization, we can represent the effect of the transition operator \(T\) using a matrix \(A\).
        We define \(A\) so that:

        $$A_{i, j}=T\left(\mathbf{x}^{\prime}=i \vert \mathbf{x}=j\right)$$

      Rather than describing how a single state is updated in terms of \(q\) and \(T\), we may now use \(\boldsymbol{v}\) and \(\boldsymbol{A}\) to describe how the entire distribution over all the different Markov chains (running in parallel) shifts as we apply an update.
      Rewriting the distribution update (whose \(i\)-th entry is the probability of landing in state \(x^{\prime}=i\)) in matrix form:

      $$\boldsymbol{v}^{(t)}=\boldsymbol{A} \boldsymbol{v}^{(t-1)}$$

      • Matrix Exponentiation:
        Applying the Markov chain update repeatedly corresponds to multiplying by the matrix \(A\) repeatedly.
        In other words, we can think of the process as exponentiating the matrix \(\boldsymbol{A}\).

      Thus, \(\boldsymbol{v}^{(t)}\) can, finally, be rewritten as

      $$\boldsymbol{v}^{(t)}=\boldsymbol{A}^{t} \boldsymbol{v}^{(0)}$$
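
      A small numeric sketch (hypothetical 3-state chain): repeatedly applying a column-stochastic \(\boldsymbol{A}\) drives any initial \(\boldsymbol{v}^{(0)}\) toward the eigenvector of \(\boldsymbol{A}\) with eigenvalue \(1\):

      ```python
      import numpy as np

      # columns of A are probability distributions: A[i, j] = T(x' = i | x = j)
      A = np.array([[0.90, 0.10, 0.30],
                    [0.05, 0.80, 0.30],
                    [0.05, 0.10, 0.40]])
      assert np.allclose(A.sum(axis=0), 1.0)

      v = np.array([1.0, 0.0, 0.0])        # arbitrary initial distribution v^(0)
      for _ in range(200):                 # v^(t) = A^t v^(0)
          v = A @ v
      print(v)                             # converges to the stationary distribution

      # the same vector is the eigenvector of A whose eigenvalue equals 1
      eigvals, eigvecs = np.linalg.eig(A)
      stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
      print(stat / stat.sum())             # matches v after normalization
      ```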

    • Convergence - The Stationary Distribution:
      Let’s first examine the matrix \(A\).
      • Stochastic Matrices:
        Stochastic Matrices are ones where each of their columns represents a probability distribution.
        The Matrix \(A\) is a stochastic matrix.
        • Perron-Frobenius Theorem - Largest Eigenvalue:
          If there is a nonzero probability of transitioning from any state \(x\) to any other state \(x^{\prime}\) for some power \(t\), then the Perron-Frobenius theorem guarantees that the largest eigenvalue is real and equal to \(1\).
        • Unique Largest Eigenvalue:
          Under some additional mild conditions, \(A\) is guaranteed to have only one eigenvector with eigenvalue \(1\).
      • Exponentiated Eigenvalues:
        Over time, we can see that all the eigenvalues are exponentiated:

        $$\boldsymbol{v}^{(t)}=\left(\boldsymbol{V} \operatorname{diag}(\boldsymbol{\lambda}) \boldsymbol{V}^{-1}\right)^{t} \boldsymbol{v}^{(0)}=\boldsymbol{V} \operatorname{diag}(\boldsymbol{\lambda})^{t} \boldsymbol{V}^{-1} \boldsymbol{v}^{(0)}$$

        This process causes all the eigenvalues that are not equal to \(1\) to decay to zero.

      The process thus converges to a stationary distribution (equilibrium distribution).

      • Convergence Condition - Eigenvector Equation:
        At convergence, the following eigenvector equation holds:

        $$\boldsymbol{v}^{\prime}=\boldsymbol{A} \boldsymbol{v}=\boldsymbol{v}$$

        and this same condition holds for every additional step.

        • Stationary Point Condition:
          Thus, To be a stationary point, \(\boldsymbol{v}\) must be an eigenvector with corresponding eigenvalue \(1\).
          This condition guarantees that once we have reached the stationary distribution, repeated applications of the transition sampling procedure do not change the distribution over the states of all the various Markov chains (although the transition operator does change each individual state, of course).
      • Convergence to \(p\):
        If we have chosen \(T\) correctly, then the stationary distribution \(q\) will be equal to the distribution \(p\) we wish to sample from.
        Gibbs Sampling is one way to choose \(T\).

    Continuous Variables:
    The same ideas carry over to continuous \(\mathbf{x}\), where the transition distribution acts as an integral kernel rather than a matrix; such a chain is sometimes called a Harris chain, but we use the term Markov chain for both cases.

    Convergence:
    In general, a Markov chain with transition operator \(T\) will converge, under mild conditions, to a fixed point described by the equation

    $$q^{\prime}\left(\mathbf{x}^{\prime}\right)=\mathbb{E}_ {\mathbf{x} \sim q} T\left(\mathbf{x}^{\prime} \vert \mathbf{x}\right)$$

    which is exactly what we had in the discrete case defined as a sum:

    $$q^{\prime}\left(x^{\prime}\right)=\sum_{x} q(x) T\left(x^{\prime} \vert x\right)$$

    and in the continuous case as an integral:

    $$q^{\prime}\left(x^{\prime}\right)=\int q(x)\, T\left(x^{\prime} \vert x\right) d x$$

    Using the Markov Chain:

    Regardless of whether the state is continuous or discrete, all Markov chain methods consist of repeatedly applying stochastic updates until eventually the state begins to yield samples from the equilibrium distribution.

    - Burning In the Markov Chain:
    Running the Markov chain until it reaches its equilibrium distribution is called burning in the Markov chain.

    - Sampling from the Markov Chain:
    After the chain has reached equilibrium, a sequence of infinitely many samples may be drawn from the equilibrium distribution.
    There are difficulties/drawbacks with using Markov Chains for sampling:

    • Representative Samples - Independence:
      The samples are identically distributed, but any two successive samples will be highly correlated with each other.
      • Issue:
        A finite sequence of samples may thus not be very representative of the equilibrium distribution.
      • Solutions:
        1. One way to mitigate this problem is to return only every \(n\)-th sample, so that our estimate of the statistics of the equilibrium distribution is not as biased by the correlation between an MCMC sample and the next several samples (see the sketch after this list).
          Markov chains are thus expensive to use because of the time required to burn in to the equilibrium distribution and the time required to transition from one sample to another reasonably decorrelated sample after reaching equilibrium.
        2. To get truly independent samples, one can run multiple Markov chains in parallel.
          This approach uses extra parallel computation to eliminate latency.

      - The strategy of using only a single Markov chain to generate all samples and the strategy of using one Markov chain for each desired sample are two extremes.
      - In deep learning, we usually use a number of chains that is similar to the number of examples in a minibatch and then draw as many samples as are needed from this fixed set of Markov chains.

      A commonly used number of Markov chains is \(100\).
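
      A generic sketch of how these choices fit together (the `transition` function below is a hypothetical stand-in for \(T\), here a random-walk Metropolis step targeting a standard normal): burn in a set of parallel chains, then keep only every \(k\)-th state.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def transition(x):
          # hypothetical transition operator T(x' | x): random-walk Metropolis
          # step whose stationary distribution is a standard normal
          proposal = x + rng.normal(scale=0.5, size=x.shape)
          accept = rng.random(x.shape) < np.exp(0.5 * (x ** 2 - proposal ** 2))
          return np.where(accept, proposal, x)

      n_chains, burn_in, thin, n_keep = 100, 1_000, 10, 50
      x = rng.normal(size=n_chains)            # arbitrary initialization q^(0)

      for _ in range(burn_in):                 # burn in to (approximate) equilibrium
          x = transition(x)

      samples = []
      for _ in range(n_keep):
          for _ in range(thin):                # keep only every `thin`-th state
              x = transition(x)
          samples.append(x.copy())
      samples = np.concatenate(samples)        # roughly decorrelated samples from all chains
      print(samples.mean(), samples.std())     # ~= 0, 1 for the stand-in target
      ```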

    • Convergence to Equilibrium - Halting:
      The theory of Markov Chains allows us to guarantee convergence to equilibrium. However, it does not specify anything about the convergence criterion:
      • The theory does not allow us to know the Mixing Time in advance.
        The Mixing Time is the number of steps the Markov chain must run before reaching its equilibrium distribution.
      • The theory also does not guide us on how to test/determine whether an MC has reached equilibrium.

      Convergence Criterion Theoretical Analysis:
      If we analyze the Markov chain from the point of view of a matrix \(A\) acting on a vector of probabilities \(\boldsymbol{v}\), then we know that the chain mixes when \(A^{t}\) has effectively lost all the eigenvalues from \(A\) besides the unique eigenvalue of \(1\).
      This means that the magnitude of the second-largest eigenvalue will determine the mixing time.

      Convergence Criterion In Practice:
      In practice, though, we cannot actually represent our Markov chain in terms of a matrix.
      - The number of states that our probabilistic model can visit is exponentially large in the number of variables, so it is infeasible to represent \(\boldsymbol{v}\), \(A\), or the eigenvalues of \(\boldsymbol{A}\).
      Because of these and other obstacles, we usually do not know whether a Markov chain has mixed.
      Instead, we simply run the Markov chain for an amount of time that we roughly estimate to be sufficient, and use heuristic methods to determine whether the chain has mixed.
      These heuristic methods include manually inspecting samples or measuring correlations between successive samples.

    This section described how to draw samples from a distribution \(q(x)\) by repeatedly updating \(\boldsymbol{x} \leftarrow \boldsymbol{x}^{\prime} \sim T\left(\boldsymbol{x}^{\prime} \vert \boldsymbol{x}\right)\).

    Finding a useful \(q(x)\):
    There are two basic approaches to ensure that \(q(x)\) is a useful distribution:

    (1) Derive \(T\) from a given learned \(p_{\text {model}}\). E.g. Gibbs Sampling, Metropolis-Hastings, etc.
    (2) Directly parameterize \(T\) and learn it, so that its stationary distribution implicitly defines the \(p_{\text {model}}\) of interest. E.g. Generative Stochastic Networks, Diffusion Inversion, Approximate Bayesian Computation.

  4. Gibbs Sampling:
    Gibbs Sampling is an MCMC algorithm for obtaining a sequence of observations that are approximately drawn from a specified multivariate probability distribution, when direct sampling is difficult.

    It is a method for finding a useful distribution \(q(x)\) by deriving \(T\) from a given learned \(p_{\text {model}}\), in the case of sampling from EBMs.

    It is a conceptually simple and effective approach to building a Markov Chain that samples from \(p_{\text {model}}(\boldsymbol{x})\), in which sampling from \(T\left(\mathbf{x}^{\prime} \vert \mathbf{x}\right)\) is accomplished by selecting one variable \(\mathbf{x}_ {i}\) and sampling it from \(p_{\text {model}}\) conditioned on its neighbors in the undirected graph \(\mathcal{G}\) defining the structure of the energy-based model.

    Block Gibbs Sampling:
    We can also sample several variables at the same time as long as they are conditionally independent given all their neighbors.
    Block Gibbs Sampling is a Gibbs sampling approach that updates many variables simultaneously.

    Application - RBMs:
    All the hidden units of an RBM may be sampled simultaneously because they are conditionally independent from each other given all the visible units.
    Likewise, all the visible units may be sampled simultaneously because they are conditionally independent from each other given all the hidden units.
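
    As a minimal sketch of block Gibbs sampling in a binary RBM (with hypothetical weights `W` and biases `b`, `c`), all hidden units are sampled in one block given the visibles, then all visible units given the hiddens:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # hypothetical RBM parameters: 6 visible units, 4 hidden units
    W = rng.normal(scale=0.1, size=(6, 4))    # visible-hidden weights
    b = np.zeros(6)                           # visible biases
    c = np.zeros(4)                           # hidden biases

    v = rng.integers(0, 2, size=6).astype(float)    # arbitrary initial visible state

    for _ in range(1000):                     # block Gibbs updates
        # hidden units are conditionally independent given v: sample them all at once
        h = (rng.random(4) < sigmoid(v @ W + c)).astype(float)
        # visible units are conditionally independent given h: sample them all at once
        v = (rng.random(6) < sigmoid(W @ h + b)).astype(float)

    print(v, h)    # an (approximate) sample from the RBM after these updates
    ```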

    In Deep Learning:
    In the context of the deep learning approach to undirected modeling, it is rare to use any approach other than Gibbs sampling. Improved sampling techniques are one possible research frontier.

    Summary:

    • A method for sampling from probability distributions in \(\geq 2\) dimensions.
    • It is an MCMC method; a dependent sampling algorithm.
    • It is a special case of the Metropolis-Hastings Algorithm.
      • But it accepts all proposals (i.e., there are no rejections).
      • It is slightly more efficient than MH because of no rejections.
      • It requires us to know the conditional probabilities \(p(X_i \vert X_{0}^t, \ldots, X_{i-1}^{t}, X_{i+1}^{t-1}, \ldots, X_{n}^{t-1})\) and be able to sample from them.
      • Like MH, it is slow for correlated parameters.
        This can be alleviated by doing block sampling (blocks of correlated variables),
        i.e., sampling \(X_j, X_k \sim p(X_j, X_k \vert X_{0}^t, \ldots, X_{n}^{t-1})\) at the same time.
        This is more efficient than sampling from uni-dimensional conditional distributions, but generally harder.
        • Gibbs walks in a zig-zag pattern.
        • MH walks in the diagonal direction but frequently proposes moves in the orthogonal direction, which have to be rejected.
        • Hamiltonian MC gets the best of both worlds: it walks in the diagonal direction and accepts a high proportion of steps.
    • Often used in Bayesian Inference.
    • Guaranteed to Asymptotically Converge to the true joint distribution.
    • It is an alternative to deterministic algorithms for inference like EM.
    • Gibbs Sampling (Tut. B-Lambert)

  5. The Challenge of Mixing between Separated Modes in MCMC Algorithms:
    The primary difficulty involved with MCMC methods is that they have a tendency to mix poorly.

    Slow Mixing/Failure to Mix:
    Ideally, successive samples from a Markov chain designed to sample from \(p(\boldsymbol{x})\) would be completely independent from each other and would visit many different regions in \(\boldsymbol{x}\) space in proportion to their probability.
    Instead, especially in high-dimensional cases, MCMC samples become very correlated. We refer to such behavior as slow mixing or even failure to mix.

    Intuition - Noisy Gradient Descent:
    MCMC methods with slow mixing can be seen as inadvertently performing something resembling noisy gradient descent on the energy function, or equivalently noisy hill climbing on the probability, with respect to the state of the chain (the random variables being sampled).
    - The chain tends to take small steps (in the space of the state of the Markov chain), from a configuration \(\boldsymbol{x}^{(t-1)}\) to a configuration \(\boldsymbol{x}^{(t)}\), with the energy \(E\left(\boldsymbol{x}^{(t)}\right)\) generally lower or approximately equal to the energy \(E\left(\boldsymbol{x}^{(t-1)}\right)\), with a preference for moves that yield lower energy configurations.
    - When starting from a rather improbable configuration (higher energy than the typical ones from \(p(\mathbf{x})\)), the chain tends to gradually reduce the energy of the state and only occasionally move to another mode.
    - Once the chain has found a region of low energy (for example, if the variables are pixels in an image, a region of low energy might be a connected manifold of images of the same object), which we call a mode, the chain will tend to walk around that mode (following a kind of random walk).
    - Once in a while it will step out of that mode and generally return to it or (if it finds an escape route) move toward another mode.
    - The problem is that successful escape routes are rare for many interesting distributions, so the Markov chain will continue to sample the same mode longer than it should.

    In Gibbs Sampling:
    The problem is very clear when we consider the Gibbs Sampling algorithm.
    The probability of going from one mode to a nearby mode within a given number of steps is determined by the shape of the “energy barrier” between these modes.
    - Transitions between two modes that are separated by a high energy barrier (a region of low probability) are exponentially less likely (in terms of the height of the energy barrier).

    The problem arises when there are multiple modes with high probability that are separated by regions of low probability, especially when each Gibbs sampling step must update only a small subset of variables whose values are largely determined by the other variables.
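
    As a toy sketch of this effect (a hypothetical two-variable model): for \(p(a, b) \propto \exp(w\, a\, b)\) with \(a, b \in \{-1,+1\}\) and a large coupling \(w\), each Gibbs update almost always keeps \(a\) and \(b\) in agreement, so the chain switches between the \((+1,+1)\) and \((-1,-1)\) modes only rarely:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    w = 5.0                                    # strong coupling -> high energy barrier

    def gibbs_step(a, b):
        # p(a = +1 | b) = sigmoid(2 * w * b); then update b given the new a
        a = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2 * w * b)) else -1
        b = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2 * w * a)) else -1
        return a, b

    a, b = 1, 1
    switches, prev_sign = 0, 1
    for _ in range(100_000):
        a, b = gibbs_step(a, b)
        if a == b and a != prev_sign:          # chain has reached the other mode
            switches += 1
            prev_sign = a
    print(switches)                            # only a handful of mode switches in 100k steps
    ```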

    Possible Solution - Block Gibbs Sampling:
    Sometimes this problem can be resolved by finding groups of highly dependent units and updating all of them simultaneously in a block. Unfortunately, when the dependencies are complicated, it can be computationally intractable to draw a sample from the group. After all, the problem that the Markov chain was originally introduced to solve is this problem of sampling from a large group of variables.

    In (Generative) Latent Variable Models:
    In the context of models with latent variables, which define a joint distribution \(p_{\text {model}}(\boldsymbol{x}, \boldsymbol{h}),\) we often draw samples of \(\boldsymbol{x}\) by alternating between sampling from \(p_{\text {model}}(\boldsymbol{x} \vert \boldsymbol{h})\) and sampling from \(p_{\text {model}}(\boldsymbol{h} \vert \boldsymbol{x})\).

    Learning-Mixing Tradeoff:
    - From the pov of mixing rapidly, we would like \(p_{\text {model}}(\boldsymbol{h} \vert \boldsymbol{x})\) to have high entropy.
    - From the pov of learning a useful representation of \(\boldsymbol{h}\), we would like \(\boldsymbol{h}\) to encode enough information about \(\boldsymbol{x}\) to reconstruct it well, which implies that \(\boldsymbol{h}\) and \(\boldsymbol{x}\) should have high mutual information.
    These two goals are at odds with each other. We often learn generative models that very precisely encode \(\boldsymbol{x}\) into \(\boldsymbol{h}\) but are not able to mix very well.

    In Boltzmann Machines:
    This situation arises frequently with Boltzmann machines: the sharper the distribution a Boltzmann machine learns, the harder it is for a Markov chain sampling from the model distribution to mix well.

    Summary - Takeaways:
    All this could make MCMC methods less useful when the distribution of interest has a manifold structure with a separate manifold for each class: the distribution is concentrated around many modes, and these modes are separated by vast regions of high energy.
    This type of distribution is what we expect in many classification problems, and it would make MCMC methods converge very slowly because of poor mixing between modes.

  6. Solutions for the Slow Mixing Problem:
    Since it is difficult to mix between the different modes of a distribution when the distribution has sharp peaks of high probability surrounded by regions of low probability, several techniques for faster mixing are based on constructing alternative versions of the target distribution in which the peaks are not as high and the surrounding valleys are not as low.
    - A particularly simple way to do so is to use Energy-based Models:

    $$p(\boldsymbol{x}) \propto \exp (-E(\boldsymbol{x}))$$

    - Energy-based models may be augmented with an extra parameter \(\beta\) controlling how sharply peaked the distribution is:

    $$p_{\beta}(\boldsymbol{x}) \propto \exp (-\beta E(\boldsymbol{x}))$$

    - The \(\beta\) parameter is often described as being the reciprocal of the temperature, reflecting the origin of energy-based models in statistical physics.
      - When the temperature falls to zero and \(\beta\) rises to infinity, the EBM becomes deterministic.
      - When the temperature rises to infinity and \(\beta\) falls to zero, the distribution (for discrete \(\boldsymbol{x}\)) becomes uniform.

    Typically, a model is trained to be evaluated at \(\beta=1\). However, we can make use of other temperatures, particularly those where \(\beta<1\).
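
    A small numeric sketch of this effect on a hypothetical discrete energy function: lowering \(\beta\) below \(1\) flattens the peaks of \(p_{\beta}(\boldsymbol{x}) \propto \exp(-\beta E(\boldsymbol{x}))\), which is what makes mixing between modes easier.

    ```python
    import numpy as np

    E = np.array([0.0, 8.0, 8.0, 0.5, 8.0])   # hypothetical energies: two low-energy modes
    for beta in [2.0, 1.0, 0.3, 0.0]:
        p = np.exp(-beta * E)
        p /= p.sum()                           # p_beta(x) proportional to exp(-beta * E(x))
        print(beta, np.round(p, 3))            # beta -> 0 gives the uniform distribution
    ```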

    Tempering:
    Tempering is a general strategy of mixing between modes of \(p_{1}\) rapidly by drawing samples with \(\beta<1\).
    Markov chains based on tempered transitions (Neal, 1994) temporarily sample from higher-temperature distributions to mix to different modes, then resume sampling from the unit temperature distribution.
    These techniques have been applied to models such as RBMs (Salakhutdinov, 2010).

    Parallel Tempering:
    Another approach is to use parallel tempering (Iba, 2001), in which the Markov chain simulates many different states in parallel, at different temperatures.
    - The highest temperature states mix easily between modes, while the lowest temperature states, at temperature \(1\), provide accurate samples from the model.
    - The transition operator includes stochastically swapping states between two different temperature levels, so that a sufficiently high-probability sample from a high-temperature slot can jump into a lower temperature slot. This approach has also been applied to RBMs (Desjardins et al., 2010 ; Cho et al., 2010).

    Results - In Practice:
    Although tempering is a promising approach, at this point it has not allowed researchers to make a strong advance in solving the challenge of sampling from complex EBMs.
    One possible reason is that there are critical temperatures around which the temperature transition must be very slow (as the temperature is gradually reduced) for tempering to be effective.

    Depth for Mixing (in Latent-Variable Models):

    • Problem - Mixing in Latent Variable Models:
      When drawing samples from a latent variable model \(p(\boldsymbol{h}, \boldsymbol{x}),\) we have seen that if \(p(\boldsymbol{h} \vert \boldsymbol{x})\) encodes \(\boldsymbol{x}\) too well, then sampling from \(p(\boldsymbol{x} \vert \boldsymbol{h})\) will not change \(\boldsymbol{x}\) very much, and mixing will be poor.
      • Example of the problem \((\alpha)\):
        Many representation learning algorithms, such as Autoencoders and RBMs, tend to yield a marginal distribution over \(\boldsymbol{h}\) that is more uniform and more unimodal than the original data distribution over \(\boldsymbol{x}\).
      • Reason for \((\alpha)\):
        It can be argued that this arises from trying to minimize reconstruction error while using all the available representation space, because minimizing reconstruction error over the training examples will be better achieved when different training examples are easily distinguishable from each other in \(\boldsymbol{h}\)-space, and thus well separated.
    • Solution - Deep Representations:
      One way to resolve this problem is to make \(\boldsymbol{h}\) a deep representation, encoding \(\boldsymbol{x}\) into \(\boldsymbol{h}\) in such a way that a Markov chain in the space of \(\boldsymbol{h}\) can mix more easily.
      • Solution to the problem \((\alpha)\):
        - Bengio et al. (2013 a) observed that deeper stacks of regularized autoencoders or RBMs yield marginal distributions in the top-level \(\boldsymbol{h}\)-space that appeared more spread out and more uniform, with less of a gap between the regions corresponding to different modes (categories, in the experiments).
        - Training an RBM in that higher-level space allowed Gibbs sampling to mix faster between modes.

        It remains unclear, however, how to exploit this observation to help better train and sample from deep generative models.

    Summary/Takeaway of MCMC methods In-Practice (DL):
    Despite the difficulty of mixing, Monte Carlo techniques are useful and are often the best tool available.
    Indeed, they are the primary tool used to confront the intractable partition function of undirected models.