Table of Contents
Resources:
- A Thorough Introduction to Boltzmann Machines
- RBMs Developments (Hinton Talk)
- A Tutorial on Energy-Based Learning (LeCun)
- DBMs (paper Hinton)
- Generative training of quantum Boltzmann machines with hidden units (paper)
- Binary Stochastic Neurons in TF
- Geometry of the Restricted Boltzmann Machine (paper)
Preliminaries
- The Boltzmann Distribution:
The Boltzmann Distribution is a probability distribution (or probability measure) that gives the probability that a system will be in a certain state as a function of that state’s energy and the temperature of the system:$$p_{i} = \dfrac{1}{Z} e^{-\frac{\varepsilon_{i}}{k_B T}}$$
where \(p_{i}\) is the probability of the system being in state \(i\), \(\varepsilon_{i}\) is the energy of that state, \(k_B\) is Boltzmann’s constant, \(T\) is the thermodynamic temperature (the constant \(k_B T\) is their product), and \(Z\) is the partition function.
The distribution shows that states with lower energy will always have a higher probability of being occupied.
The ratio of probabilities of two states (AKA the Boltzmann factor) depends only on the states’ energy difference (AKA the energy gap):$$\frac{p_{i}}{p_{j}}=e^{\frac{\varepsilon_{j}-\varepsilon_{i}}{k_B T}}$$
Derivation:
The Boltzmann distribution is the distribution that maximizes the entropy:$$H\left(p_{1}, p_{2}, \cdots, p_{M}\right)=-\sum_{i=1}^{M} p_{i} \log_{2} p_{i}$$
subject to the constraint that \(\sum p_{i} \varepsilon_{i}\) equals a particular mean energy value.
This is a simple Lagrange Multipliers maximization problem (can be found here).
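As a sketch of that derivation (using the natural log, which only rescales the entropy and does not change the maximizer), with the normalization constraint \(\sum_i p_i = 1\) included, the Lagrangian is:$$\mathcal{L}=-\sum_{i} p_{i} \ln p_{i}-\beta\left(\sum_{i} p_{i} \varepsilon_{i}-\bar{E}\right)-\lambda\left(\sum_{i} p_{i}-1\right)$$
Setting \(\partial \mathcal{L} / \partial p_{i}=-\ln p_{i}-1-\beta \varepsilon_{i}-\lambda=0\) gives \(p_{i} \propto e^{-\beta \varepsilon_{i}}\); normalizing yields \(p_{i}=e^{-\beta \varepsilon_{i}} / Z\) with \(Z=\sum_{i} e^{-\beta \varepsilon_{i}}\), and identifying \(\beta=1 /\left(k_{B} T\right)\) recovers the Boltzmann distribution.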
Applications in Different Fields:
- Statistical Mechanics:
  The canonical ensemble is a probability distribution with the form of the Boltzmann distribution.
  It gives the probabilities of the various possible states of a closed system of fixed volume, in thermal equilibrium with a heat bath.
- Measure Theory:
  The Boltzmann distribution is also known as the Gibbs Measure.
  The Gibbs Measure is a probability measure that generalizes the canonical ensemble to infinite systems.
- Statistics/Machine-Learning:
  The Boltzmann distribution is known as a log-linear model.
- Probability Theory/Machine-Learning:
  The Boltzmann distribution is known as the softmax function.
  The softmax function is used to represent a categorical distribution (see the sketch after this list).
- Deep Learning:
  The Boltzmann distribution is the sampling distribution of stochastic neural networks (e.g. RBMs).
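A minimal NumPy sketch of these connections (illustrative only, with \(k_B\) absorbed into \(T\)): it computes the Boltzmann distribution over a few states, checks the Boltzmann-factor property, and verifies the softmax equivalence.

```python
import numpy as np

def boltzmann(energies, T=1.0):
    """p_i = exp(-eps_i / T) / Z, with k_B absorbed into T."""
    logits = -np.asarray(energies, dtype=float) / T
    logits -= logits.max()            # stability shift; cancels in Z
    p = np.exp(logits)
    return p / p.sum()                # p.sum() plays the role of Z

eps = np.array([0.0, 1.0, 2.0])
p = boltzmann(eps)                    # lower energy -> higher probability
print(p)                              # approx. [0.665, 0.245, 0.090]

# Boltzmann factor: the ratio depends only on the energy gap
print(np.isclose(p[0] / p[1], np.exp((eps[1] - eps[0]) / 1.0)))   # True

# Softmax over logits z is the Boltzmann distribution with eps = -z, T = 1
z = -eps
softmax = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(np.allclose(p, softmax))        # True
```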
Boltzmann Machines
- Boltzmann Machines (BMs):
A Boltzmann Machine (BM) is a type of stochastic recurrent neural network and Markov Random Field (MRF).
Goal - What do BMs Learn:
Boltzmann Machines were originally introduced as a general “connectionist” approach to learning arbitrary probability distributions over binary vectors.
They are capable of learning internal representations of data.
They are also able to represent and solve (difficult) combinatorial problems.
Structure:
- Input:
BMs are defined over a \(d\)-dimensional binary random vector \(\mathrm{x} \in\{0,1\}^{d}\).
- Output:
  The units produce binary results.
- Units:
- Visible Units: \(\boldsymbol{v}\)
- Hidden Units: \(\boldsymbol{h}\)
- Probabilistic Model:
It is an energy-based model; it defines the joint probability distribution using an energy function:$$P(\boldsymbol{x})=\frac{\exp (-E(\boldsymbol{x}))}{Z}$$
where \(E(\boldsymbol{x})\) is the energy function and \(Z\) is the partition function.
- The Energy Function:
- With only visible units:
$$E(\boldsymbol{x})=-\boldsymbol{x}^{\top} \boldsymbol{U} \boldsymbol{x}-\boldsymbol{b}^{\top} \boldsymbol{x}$$
where \(U\) is the “weight” matrix of model parameters and \(\boldsymbol{b}\) is the vector of bias parameters.
- With both visible and hidden units:
$$E(\boldsymbol{v}, \boldsymbol{h})=-\boldsymbol{v}^{\top} \boldsymbol{R} \boldsymbol{v}-\boldsymbol{v}^{\top} \boldsymbol{W} \boldsymbol{h}-\boldsymbol{h}^{\top} \boldsymbol{S} \boldsymbol{h}-\boldsymbol{b}^{\top} \boldsymbol{v}-\boldsymbol{c}^{\top} \boldsymbol{h}$$
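A toy sketch of these energy computations: the parameters \(R, W, S, \boldsymbol{b}, \boldsymbol{c}\) below are random illustrative values (not a trained model), with the weight matrices symmetric and zero-diagonal as in a standard BM. The partition function \(Z\) is computed by brute-force enumeration, which is only feasible for a model this tiny.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nv, nh = 3, 2
R = rng.normal(scale=0.1, size=(nv, nv)); R = (R + R.T) / 2; np.fill_diagonal(R, 0)
S = rng.normal(scale=0.1, size=(nh, nh)); S = (S + S.T) / 2; np.fill_diagonal(S, 0)
W = rng.normal(scale=0.1, size=(nv, nh))
b = np.zeros(nv)
c = np.zeros(nh)

def energy(v, h):
    """E(v,h) = -v'Rv - v'Wh - h'Sh - b'v - c'h."""
    return -(v @ R @ v + v @ W @ h + h @ S @ h + b @ v + c @ h)

# P(v,h) = exp(-E(v,h)) / Z, where Z sums exp(-E) over all 2^(nv+nh) states
states = [(np.array(vs, float), np.array(hs, float))
          for vs in product([0, 1], repeat=nv)
          for hs in product([0, 1], repeat=nh)]
Z = sum(np.exp(-energy(v, h)) for v, h in states)
v, h = states[5]
print(np.exp(-energy(v, h)) / Z)      # P(v, h) for one joint configuration
```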
Approximation Capabilities:
A BM with only visible units is limited to modeling linear relationships between variables, as described by the weight matrix.[^1]
A BM with hidden units is a universal approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008).
Relation to Hopfield Networks:
A Boltzmann Machine is just a Stochastic Hopfield Network with Hidden Units.
BMs can be viewed as the stochastic, generative counterpart of Hopfield networks.
It is important to note that although Boltzmann Machines bear a strong structural resemblance to Hopfield Networks, they differ substantially in their functionality.
- Similarities:
- They are both networks of binary units.
- They are both energy-based models with the same form of energy function.
- They both have the same update rule/condition (a unit’s update is driven by the weighted sum of its inputs).
- Differences:
  - Goal: BMs are NOT memory networks. They are not trying to store things. Instead, they serve a different computational role: learning latent representations of the data.
    The goal is representation learning.
  - Units: BMs have an extra set of units, beyond the visible units, called hidden units. These units represent latent variables that are not observed but are learned from the data.
    They are necessary for representation learning.
  - Objective: BMs have a different objective; instead of minimizing the energy function, they minimize the error (KL-Divergence) between the “real” distribution over the data and the model distribution over global states (marginalized over the hidden units).
    This can be interpreted as the error between the input data and the reconstruction produced by the hidden units and their weights.
    This is necessary to capture the training-data probability distribution (a sketch of the resulting learning rule follows this list).
  - Energy Minima: energy minima were useful for Hopfield Nets, serving as storage points for the input data (memories). However, they are harmful for BMs, since the global objective is to find the best distribution that approximates the real distribution.
    This is necessary to capture the training-data probability distribution “well”.
  - Activation Functions: the activation function of a BM is just a stochastic version of the binary threshold function. A unit still updates to a binary state according to a threshold value, but the update is governed by a probability distribution (the Boltzmann distribution).
    This is necessary (important\(^{ * }\)) to escape energy minima.
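The gradient of this KL objective yields the classic BM learning rule (Ackley, Hinton & Sejnowski, 1985). A schematic sketch, where `data_states` and `model_states` are assumed to be binary unit states collected with the visible units clamped to data (positive phase) and free-running (negative phase), respectively:

```python
import numpy as np

def bm_weight_gradient(data_states, model_states):
    """Classic BM learning rule: the KL objective decreases along
    <s_i s_j>_data - <s_i s_j>_model.
    Both inputs are (num_samples, num_units) binary arrays of full
    unit states (visible + hidden)."""
    pos = data_states.T @ data_states / len(data_states)     # clamped phase
    neg = model_states.T @ model_states / len(model_states)  # free phase
    grad = pos - neg
    np.fill_diagonal(grad, 0.0)   # no self-connections
    return grad

# Usage: W += learning_rate * bm_weight_gradient(data_states, model_states)
```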
Relation to the Ising Model:
The global energy \(E\) in a Boltzmann Machine is identical in form to that of the Ising Model.
Notes:
- Factor Analysis is a Causal Model with continuous variables.
- Unit-State Probability:
- The units in a BM are binary units.
- Thus, each unit has two possible states \(s_i \in \{0,1\}\):
- On: \(s_i = 1\)
- Off: \(s_i = 0\)
- The probability that the \(i\)-th unit will be on (\(s_i = 1\)) is:
$$p(s_i=1)=\dfrac{1}{1+ e^{-\Delta E_{i}/T}}$$
where \(\Delta E_{i} = E(s_i=0) - E(s_i=1)\) is the unit’s energy gap and the scalar \(T\) is the temperature of the system.
- The RHS is just the logistic function. Rewriting the probability:
$$p(s_i=1)=\sigma(\Delta E_{i}/T)$$
- Using the Boltzmann Factor (the ratio of probabilities of two states), with the energy gap \(\Delta E_i = E(s_i=0) - E(s_i=1)\):
$$\begin{align} \dfrac{p(s_i=0)}{p(s_i=1)} &= e^{\frac{E\left(s_{i}=1\right)-E\left(s_{i}=0\right)}{k T}} \\ \dfrac{1 - p(s_i=1)}{p(s_i=1)} &= e^{\frac{-\left(E\left(s_{i}=0\right)-E\left(s_{i}=1\right)\right)}{k T}} \\ \dfrac{1}{p(s_i=1)} - 1 &= e^{-\Delta E_i/(k T)} \\ p(s_i=1) &= \dfrac{1}{1 + e^{-\Delta E_i/T}} \end{align} $$
where we absorb the Boltzmann constant \(k\) into the artificial temperature constant \(T\).
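A minimal sketch of one such stochastic unit update, assuming the pairwise energy \(E(\boldsymbol{s})=-\sum_{i<j} w_{ij} s_i s_j-\sum_i b_i s_i\) (each pair counted once, \(w_{ii}=0\)), so that \(\Delta E_i=\sum_j w_{ij} s_j+b_i\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_unit(s, W, b, i, T=1.0, rng=np.random.default_rng()):
    """One asynchronous stochastic update of unit i.
    Energy gap (W symmetric, zero diagonal): Delta E_i = sum_j w_ij s_j + b_i.
    The unit turns on with probability sigma(Delta E_i / T)."""
    delta_E = W[i] @ s + b[i]
    s[i] = 1.0 if rng.random() < sigmoid(delta_E / T) else 0.0
    return s

# As T -> 0 this approaches the deterministic Hopfield threshold update;
# larger T makes energy-increasing moves more likely (helps escape minima).
```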
Restricted Boltzmann Machines (RBMs)
Deep Boltzmann Machines (DBMs)
[^1]: Specifically, the probability of one unit being on is given by a linear model (logistic regression) on the values of the other units.