Table of Contents
Hopfield Networks Exercises
Hopfield Networks Example and Code (medium)
INTRODUCTION TO HOPFIELD NEURAL NETWORKS (blog)
The Hopfield Model (paper!)
Hopfield Network Demo (github)
Hopfield Networks Tutorial + Code (blog)
Why learn Hopfield Nets and why they work (blog)
Hopfield Networks (Quantum ML Book)
- A Tutorial on Energy-Based Learning (LeCun)
- Hopfield Nets (Hinton Lecs)
- Hopfield Nets (CMU Lecs!)
- Hopfield Nets - Proof of Decreasing Energy (vid)
- On the Convergence Properties of the Hopfield Model (paper)
Hopfield Networks
Hopfield Networks:
Hopfield Networks are a form of recurrent artificial neural network that serves as a content-addressable (“associative”) memory system with binary threshold nodes. They are energy-based models, i.e. their properties derive from a global energy function.
- A Hopfield net is composed of binary threshold units with recurrent connections between them.
Motivation:
Recurrent networks of non-linear units are generally very hard to analyze.
They can behave in many different ways:
- Settle to a stable state
- Oscillate
- Follow chaotic trajectories that cannot be predicted far into the future
Main Idea:
John Hopfield realized that if the connections are symmetric, there is a global energy function:
- Each binary “configuration” of the whole network has an energy.
- Binary Configuration: is an assignment of binary values to each neuron in the network.
Every neuron has a particular binary value in a configuration.
- The binary threshold decision rule causes the network to settle to a minimum of this energy function.
The rule causes the network to go downhill in energy, and by repeatedly applying the rule, the network will end up in an energy minimum.
The Energy Function:
- The global energy is the sum of many contributions:
$$E=-\sum_{i} s_{i} b_{i}-\sum_{i< j} s_{i} s_{j} w_{i j}$$
- Each contribution depends on:
- One connection weight: \(w_{i j}\)
A symmetric connection between two neurons; thus, we have the following restrictions:
- \(w_{i i}=0, \forall i\) (no unit has a connection with itself)
- \(w_{i j}=w_{j i}, \forall i, j\) (connections are symmetric)
- The binary states of two neurons: \(s_{i}\) and \(s_{j}\), where \(s_{j} \in \{-1, 1\}\) (or \(\in \{0, 1\}\)) is the state of unit \(j\) (\(\theta_{j}\) denotes the threshold of unit \(j\)).
- To make up the following terms:
- The quadratic term \(s_{i} s_{j} \in \{-1, 1\}\), involving the states of two units and
- The bias term \(s_i b_i \in \{-b_i, b_i\}\), involving the states of individual units.
- This simple quadratic energy function makes it possible for each unit to compute locally how its state affects the global energy:
The Energy Gap is the difference in the global energy of the whole configuration depending on whether \(i\) is on:$$\begin{align} \text{Energy Gap} &= \Delta E_{i} \\ &= E\left(s_{i}=0\right)-E\left(s_{i}=1\right) \\ &= b_{i}+\sum_{j} s_{j} w_{i j} \end{align} $$
i.e. the difference between the energy when \(i\) is off and the energy when \(i\) is on.
- The Energy Gap and the Binary Threshold Decision Rule:
- This difference (energy gap) is exactly what the binary threshold decision rule computes.
- Equivalently, since the global energy is linear in \(s_i\), the energy gap is the negative of the derivative of the energy wrt the state of the \(i\)-th unit: \(\Delta E_{i}=-\partial E / \partial s_{i}\).
Settling to an Energy Minimum:
To find an energy minimum in this net:
- Start from a random state, then
- Update units one at a time in random order:
- Update each unit to whichever of its two states gives the lowest global energy.
i.e. use binary threshold units.
A Deeper Energy Minimum:
The net has two triangles in which the three units mostly support each other.
- Each triangle mostly hates the other triangle.
The triangle on the left differs from the one on the right by having a weight of \(2\) where the other one has a weight of \(3\).
- So turning on the units in the triangle on the right gives the deeper minimum.
Sequential Updating - Justification:
- If units make simultaneous decisions the energy could go up.
- With simultaneous parallel updating we can get oscillations.
- They always have a period of \(2\) (bi-phasic oscillations).
- If the updates occur in parallel but with random timing, the oscillations are usually destroyed.
Using Energy Models (with binary threshold rule) for Storing Memories:
- Hopfield (1982) proposed that memories could be energy minima of a neural net (w/ symmetric weights).
- The binary threshold decision rule can then be used to “clean up” incomplete or corrupted memories.
Transforms partial memories to full memories.
- The idea of memories as energy minima was proposed by I. A. Richards (1924) in “Principles of Literary Criticism”.
- Using energy minima to represent memories gives a content-addressable (“associative”) memory:
- An item can be accessed by just knowing part of its content.
- This was really amazing in the year 16 BG (Before Google).
- It is robust against hardware damage.
- It’s like reconstructing a dinosaur from a few bones.
Because you have an idea about how the bones are meant to fit together.
Storing memories in a Hopfield net:
- If we use activities of \(1\) and \(-1\) we can store a binary state vector by incrementing the weight between any two units by the product of their activities.
$$\Delta w_{i j}=s_{i} s_{j}$$
- This is a very simple rule that is not error-driven (i.e. does not learn by correcting errors).
That is both its strength and its weakness:
- It is an online rule
- It is not very efficient to store things
- We treat biases as weights from a permanently on unit.
- With states of \(0\) and \(1\) the rule is slightly more complicated:
$$\Delta w_{i j}=4\left(s_{i}-\frac{1}{2}\right)\left(s_{j}-\frac{1}{2}\right)$$
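As a concrete illustration, here is a minimal sketch of this storage rule (in Python with numpy; the helper name is ours, and biases are omitted, or could be treated as weights from a permanently-on unit as noted above):

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian storage rule: increment w_ij by the product of activities s_i * s_j.

    `patterns` has shape (num_patterns, N) with entries in {-1, +1}.
    """
    _, N = patterns.shape
    W = np.zeros((N, N))
    for s in patterns:
        W += np.outer(s, s)     # delta w_ij = s_i * s_j
    np.fill_diagonal(W, 0)      # w_ii = 0: no self-connections
    return W                    # symmetric by construction
```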
Summary - Big Ideas of Hopfield Networks:
- Idea #1: we can find a local energy minimum by using a network of symmetrically connected binary threshold units.
- Idea #2: these local energy minima might correspond to memories.
Notes:
- They were responsible for the resurgence of interest in Neural Networks in the 1980s
- They can be used to store memories as distributed patterns of activity
- The constraint that weights are symmetric guarantees that the energy function decreases monotonically while following the activation rules
- The Hopfield Network is a non-linear dynamical system that converges to an attractor.
- A Hopfield net the size of a brain (whose connectivity patterns are quite different, of course) could store a memory per second for 450 years.
- Structure:
The Hopfield Network is formally described as a complete undirected graph \(G=\langle V, f\rangle,\) where \(V\) is a set of McCulloch-Pitts neurons and \(f : V^{2} \rightarrow \mathbb{R}\) is a function that links pairs of units to a real value, the connectivity weight.
- The Units:
The units in a Hopfield Net are binary threshold units,
i.e. the units only take on two different values for their states, and the value is determined by whether or not the unit's input exceeds its threshold.
- The States:
The state \(s_i\) for unit \(i\) takes on values of \(1\) or \(-1\),
i.e. \(s_i \in \{-1, 1\}\).
- The Weights:
Every pair of units \(i\) and \(j\) in a Hopfield network has a connection that is described by the connectivity weight \(w_{i j}\).
- Symmetric Connections (weights):
- The connections in a Hopfield net are constrained to be symmetric by making the following restrictions:
- \(w_{i i}=0, \forall i\) (no unit has a connection with itself)
- \(w_{i j}=w_{j i}, \forall i, j\) (connections are symmetric)
- The constraint that weights are symmetric guarantees that the energy function decreases monotonically while following the activation rules.
A network with asymmetric weights may exhibit some periodic or chaotic behaviour; however, Hopfield found that this behavior is confined to relatively small parts of the phase space and does not impair the network’s ability to act as a content-addressable associative memory system.
- Update Rule:
Updating one unit (node in the graph simulating the artificial neuron) in the Hopfield network is performed using the following rule:$$s_{i} \leftarrow\left\{\begin{array}{ll}{+1} & {\text { if } \sum_{j} w_{i j} s_{j} \geq \theta_{i}} \\ {-1} & {\text { otherwise }}\end{array}\right.$$
Updates in the Hopfield network can be performed in two different ways:
- Asynchronous: Only one unit is updated at a time. This unit can be picked at random, or a pre-defined order can be imposed from the very beginning.
- Synchronous: All units are updated at the same time. This requires a central clock to the system in order to maintain synchronization.
This method is viewed by some as less biologically realistic, since no global clock has been observed to influence analogous biological or physical systems of interest.
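A minimal sketch of the asynchronous update rule (assuming numpy, \(\pm 1\) states, and a weight matrix like the one produced by the storage sketch above):

```python
import numpy as np

def update_unit(s, W, theta, i):
    """Binary threshold rule: s_i <- +1 if sum_j w_ij s_j >= theta_i, else -1."""
    s[i] = 1 if W[i] @ s >= theta[i] else -1   # w_ii = 0, so the i-term is harmless

def async_sweep(s, W, theta, rng):
    """Asynchronous updating: visit every unit once, in random order."""
    for i in rng.permutation(len(s)):
        update_unit(s, W, theta, i)
    return s
```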
Neural Attraction and Repulsion (in state-space):
Neurons “attract or repel each other” in state-space.
The weight between two units has a powerful impact upon the values of the neurons. Consider the connection weight \(w_{ij}\) between two neurons \(i\) and \(j\).
If \(w_{i j}>0\), the updating rule implies that:
- when \(s_{j}=1,\) the contribution of \(j\) in the weighted sum is positive; thus, \(s_{i}\) is pulled by \(j\) towards the value \(s_{i}=1\)
- when \(s_{j}=-1,\) the contribution of \(j\) in the weighted sum is negative; likewise, \(s_{i}\) is pushed by \(j\) towards the value \(s_{i}=-1\)
Thus, the values of neurons \(i\) and \(j\) will converge if the weight between them is positive.
Similarly, they will diverge if the weight is negative.
- Energy:
Hopfield nets have a scalar value associated with each state of the network, referred to as the “energy”, \(E\), of the network, where:$$E=-\frac{1}{2} \sum_{i, j} w_{i j} s_{i} s_{j}+\sum_{i} \theta_{i} s_{i}$$
This quantity is called “energy” because it either decreases or stays the same upon network units being updated.
Furthermore, under repeated updating the network will eventually converge to a state which is a local minimum in the energy function.
Thus, if a state is a local minimum in the energy function, it is a stable state for the network.
Relation to Ising Models:
Note that this energy function belongs to a general class of models in physics under the name of Ising models.
These in turn are a special case of Markov Random Fields (MRFs), since the associated probability measure, the Gibbs measure, has the Markov property.
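In code, the energy formula above is a one-liner (a sketch assuming numpy and \(\pm 1\) state vectors):

```python
import numpy as np

def energy(s, W, theta):
    """E = -(1/2) * sum_{i,j} w_ij s_i s_j + sum_i theta_i s_i.

    The factor 1/2 compensates for the double counting of each
    symmetric pair (i, j) in the full sum.
    """
    return -0.5 * (s @ W @ s) + theta @ s
```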
Initialization and Running:
Initialization of a Hopfield Network is done by setting the values of the units to the desired start pattern.
Repeated updates are then performed until the network converges to an attractor pattern.
Convergence:
The Hopfield Network converges to an attractor pattern describing a stable state of the network (viewed as a non-linear dynamical system).
Convergence is generally assured, as Hopfield proved that the attractors of this non-linear dynamical system are stable, not periodic or chaotic as in some other systems.
Therefore, in the context of Hopfield Networks, an attractor pattern is a final stable state, a pattern that cannot change any value within it under updating.
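Putting the pieces together, a toy recall run might look like the following (hypothetical data; it reuses the `store_patterns` and `async_sweep` sketches from above):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
pattern = rng.choice([-1, 1], size=N)
W = store_patterns(pattern[None, :])     # store one memory
theta = np.zeros(N)

probe = pattern.copy()
probe[:4] *= -1                          # corrupt 4 bits of the memory

while True:                              # repeated updates until convergence
    previous = probe.copy()
    async_sweep(probe, W, theta, rng)
    if np.array_equal(probe, previous):  # fixed point: an attractor pattern
        break

print("recovered the memory:", np.array_equal(probe, pattern))
```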
- Training:
- Training a Hopfield net involves lowering the energy of states that the net should “remember”.
- This allows the net to serve as a content-addressable memory system, i.e. the network will converge to a “remembered” state if it is given only part of the state.
- The net can be used to recover from a distorted input to the trained state that is most similar to that input.
This is called associative memory because it recovers memories on the basis of similarity.
- Thus, the network is properly trained when the energies of the states that the network should remember are local minima.
- Note that, in contrast to Perceptron training, the thresholds of the neurons are never updated.
Learning Rules:
There are various different learning rules that can be used to store information in the memory of the Hopfield Network.
Desirable Properties:
- Local: A learning rule is local if each weight is updated using information available to neurons on either side of the connection that is associated with that particular weight.
- Incremental: New patterns can be learned without using information from the old patterns that have been also used for training.
That is, when a new pattern is used for training, the new values for the weights only depend on the old values and on the new pattern.
These properties are desirable, since a learning rule satisfying them is more biologically plausible.
For example, since the human brain is always learning new concepts, one can reason that human learning is incremental. A learning system that is not incremental would generally be trained only once, with a huge batch of training data.
Hebbian Learning Rule:
The Hebbian rule is both local and incremental.
For the Hopfield Networks, it is implemented in the following manner, when learning \(n\) binary patterns:$$w_{i j}=\frac{1}{n} \sum_{\mu=1}^{n} \epsilon_{i}^{\mu} \epsilon_{j}^{\mu}$$
where \(\epsilon_{i}^{\mu}\) represents bit \(i\) from pattern \(\mu\).
- If the bits corresponding to neurons \(i\) and \(j\) are equal in pattern \(\mu,\) then the product \(\epsilon_{i}^{\mu} \epsilon_{j}^{\mu}\) will be positive.
This would, in turn, have a positive effect on the weight \(w_{i j}\) and the values of \(i\) and \(j\) will tend to become equal.
- The opposite happens if the bits corresponding to neurons \(i\) and \(j\) are different.
The Storkey Learning Rule:
This rule was introduced by Amos Storkey (1997) and is both local and incremental.
The weight matrix of an attractor neural network is said to follow the Storkey learning rule if it obeys:$$w_{i j}^{\nu}=w_{i j}^{\nu-1}+\frac{1}{n} \epsilon_{i}^{\nu} \epsilon_{j}^{\nu}-\frac{1}{n} \epsilon_{i}^{\nu} h_{j i}^{\nu}-\frac{1}{n} \epsilon_{j}^{\nu} h_{i j}^{\nu}$$
where \(h_{i j}^{\nu}=\sum_{k=1, k \neq i, j}^{n} w_{i k}^{\nu-1} \epsilon_{k}^{\nu}\) is a form of local field at neuron \(i\).
This learning rule is local, since each synapse takes into account only the neurons on either side of it.
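A vectorized sketch of one Storkey increment (assuming numpy; zeroing the diagonal to keep \(w_{ii}=0\) is our assumption, matching the restrictions stated earlier):

```python
import numpy as np

def storkey_update(W, eps):
    """One Storkey increment for a single pattern eps with entries in {-1, +1}.

    Implements w_ij += (eps_i*eps_j - eps_i*h_ji - eps_j*h_ij) / n, with
    h_ij = sum_{k != i, j} w_ik * eps_k (the local field at neuron i).
    """
    n = len(eps)
    full = W @ eps                                 # sum_k w_ik eps_k, for every i
    # h[i, j] = full[i] - w_ii*eps_i - w_ij*eps_j  (drop the k = i and k = j terms)
    h = full[:, None] - np.diag(W)[:, None] * eps[:, None] - W * eps[None, :]
    W = W + (np.outer(eps, eps) - eps[:, None] * h.T - h * eps[None, :]) / n
    np.fill_diagonal(W, 0)                         # assumption: keep w_ii = 0
    return W
```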
Storkey vs Hebbian Learning Rules:
Storkey showed that a Hopfield network trained using this rule has a greater capacity than a corresponding network trained using the Hebbian rule.
The Storkey rule makes use of more information from the patterns and weights than the generalized Hebbian rule, due to the effect of the local field.
- Spurious Patterns:
- Patterns that the network uses for training (called retrieval states) become attractors of the system.
- Repeated updates would eventually lead to convergence to one of the retrieval states.
- However, sometimes the network will converge to spurious patterns (different from the training patterns).
- Spurious Patterns arise due to spurious minima.
The energy in these spurious patterns is also a local minimum:
- For each stored pattern \(x,\) the negation \(-x\) is also a spurious pattern.
- A spurious state can also be a linear combination of an odd number of retrieval states. For example, when using \(3\) patterns \(\mu_{1}, \mu_{2}, \mu_{3},\) one can get the following spurious state:
$$\epsilon_{i}^{\operatorname{mix}}=\pm \operatorname{sgn}\left( \pm \epsilon_{i}^{\mu_{1}} \pm \epsilon_{i}^{\mu_{2}} \pm \epsilon_{i}^{\mu_{3}}\right)$$
Mixtures of an even number of retrieval states cannot be spurious patterns, since the sum of an even number of \(\pm 1\) bits can be zero, which leaves the sign undefined.
- Spurious Patterns (memories) occur when two nearby energy minima combine to make a new minimum in the wrong place.
- Physicists, in trying to increase the capacity of Hopfield nets, rediscovered the Perceptron convergence procedure.
-
Capacity:
The network capacity of the Hopfield model is determined by the number of neurons and connections within a given network; thus, the number of memories that can be stored depends on the numbers of neurons and connections.
Capacity:
- It was shown that the ratio of recallable vectors to nodes is about \(0.138\) (approximately \(138\) vectors can be recalled from storage for every \(1000\) nodes) (Hertz et al. 1991).
- Using Hopfield’s storage rule the capacity of a totally connected net with \(N\) units is only about \(0.15N\) memories:
- At \(N\) bits per memory the total information stored is only \(0.15 N^{2}\) bits.
- This does not make efficient use of the bits required to store the weights.
- The \(0.15\) here is an empirically determined constant.
- Capacity Requirements for Efficient Storage:
- The net has \(N^{2}\) weights and biases.
- After storing \(M\) memories, each connection weight has an integer value in the range \([-M, M]\).
- So the number of bits required to efficiently store the weights and biases is:
$$ N^{2} \log (2 M+1)$$
- It scales logarithmically with the number of stored memories \(M\).
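A quick back-of-the-envelope check of this inefficiency (our numbers, for \(N=1000\)):

```python
import numpy as np

N = 1000
M = int(0.15 * N)                        # ~150 memories at Hopfield's storage rule
stored_bits = 0.15 * N**2                # information stored, at N bits per memory
weight_bits = N**2 * np.log2(2 * M + 1)  # bits to hold integer weights in [-M, M]
print(stored_bits / weight_bits)         # ~0.018: the weights are used inefficiently
```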
Effects of Limited Capacity:
- Since the capacity of Hopfield Nets is limited to \(\approx 0.15N\), it is evident that many mistakes will occur if one tries to store a large number of vectors.
When the Hopfield model does not recall the right pattern, it is possible that an intrusion has taken place, since semantically related items tend to confuse the individual, and recollection of the wrong pattern occurs.
Therefore, the Hopfield network model can confuse one stored item with another upon retrieval.
- Perfect recall and higher capacity, \(>0.14\), can be achieved in the network by the Storkey learning method.
Later models inspired by the Hopfield network were devised to raise the storage limit and reduce the retrieval error rate, with some being capable of one-shot learning.
Spurious Minima Limit the Capacity:
- Each time we memorize a configuration, we hope to create a new energy minimum.
- The problem arises if two nearby minima merge to create a minimum at an intermediate location:
- Then we would get a blend of them rather than individual memories.
- The Merging of Nearby Minima limits the capacity of a Hopfield Net.
Avoiding Spurious Minima by Unlearning:
Unlearning is a strategy proposed by Hopfield, Feinstein and Palmer to avoid spurious minima.
It involves applying the opposite of the storage rule to the binary state that the network settles to.
Strategy:
- Let the net settle from a random initial state and then do Unlearning.
Whatever binary state it settles to, apply the opposite of the storage rule.
Starting from the (red) merged minimum, doing unlearning will produce the two separate minima:
- This will get rid of deep, spurious minima and increase memory capacity.
- The strategy was shown to work but with no good analysis.
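A sketch of one unlearning step (reusing the `async_sweep` helper above; the unlearning rate is our assumption, since, as just noted, the right amount of unlearning had no principled derivation):

```python
import numpy as np

def unlearning_step(W, theta, rng, num_sweeps=10, rate=0.01):
    """Settle from a random state, then apply the opposite of the storage rule."""
    N = W.shape[0]
    s = rng.choice([-1, 1], size=N)
    for _ in range(num_sweeps):          # let the net settle to (near) a minimum
        async_sweep(s, W, theta, rng)
    W -= rate * np.outer(s, s)           # the reverse of the Hebbian increment
    np.fill_diagonal(W, 0)
    return W
```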
Unlearning and Biological Dreaming:
The question of why we dream, and what the function of dreaming is, is a long-standing one:
- When dreaming, the state of the brain is extremely similar to its state when awake; except it is not driven by real input, but rather by a relay station.
- We dream for several hours a day, yet we don't remember most, if not all, of our dreams.
Crick and Mitchison proposed unlearning as a model of what dreams are for:
- During the day, we store a lot of things and get spurious minima.
- At night, we put the network (brain) in a random state, settle to a minimum, and then unlearn what we settled to.
- The function of dreams is to get rid of those spurious minima.
- That’s why we don’t remember them, even though we dream for many hours (unless we wake up during the dream).
i.e. we don't store our dreams.
Optimal Amount of Unlearning:
From a mathematical point of view, we want to derive exactly how much unlearning we need to do.
If unlearning is part of the process of fitting a model to data, then doing maximum-likelihood fitting of that model should automatically yield both the form of unlearning and the amount of unlearning that needs to be done.
Thus, the solution is to derive unlearning as the right way to minimize some cost function, where the cost function is how well the network models the data that it saw during the day.
The Maximal Capacity of a given network architecture, over all possible learning rules:
Elizabeth Gardner showed that the capacity of fully connected networks of binary neurons with dense patterns scales as \(2N\), a storage capacity which is much larger than the one of the Hopfield model.
Learning Rules that are able to saturate the Gardner Bound:
A simple learning rule that is guaranteed to achieve this bound is the Perceptron Learning Algorithm (PLA) applied to each neuron independently.
However, unlike the rule used in the Hopfield model, PLA is a supervised rule that needs an explicit “error signal” in order to achieve the Gardner bound.
Increasing the Capacity of Hopfield Networks:
Elizabeth Gardner showed that there was a much better storage rule that uses the full capacity of the weights:
- Instead of trying to store vectors in one shot, cycle through the training set many times.
- Use the Perceptron Learning Algorithm (PLA) to train each unit to have the correct state given the states of all the other units in that vector.
- It loses the online learning property in the interest of more efficient storage.
- Statisticians call this technique “pseudo-likelihood”.
- Procedure Description:
- Set the network to the memory state you want to store
- Take each unit separately and check if this unit adopts the state we want for it given the states of all the other units:
- if it would, leave its incoming weights alone
- if it wouldn't, change its incoming weights in the way specified by the PLA
Note that these will be integer changes to the weights
- Repeat several times, as needed.
- Convergence:
- If there are too many memories the Perceptron Convergence Procedure won’t converge
- PLA converges only if there is a set of weights that solves the problem
Assuming there is, this is a much more efficient way to store memories.
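A sketch of the procedure (in numpy; note that training each unit's incoming weights independently does not, in general, keep the weight matrix symmetric, which is one way this departs from the original Hopfield storage rule):

```python
import numpy as np

def perceptron_storage(patterns, theta, num_cycles=100):
    """Cycle through the memories; for each unit, apply the Perceptron rule to
    its incoming weights until every unit adopts its desired state given the
    states of all the other units. Weight changes are integers.
    """
    _, N = patterns.shape
    W = np.zeros((N, N))
    for _ in range(num_cycles):
        converged = True
        for s in patterns:                      # set the net to the memory state
            for i in range(N):
                actual = 1 if W[i] @ s >= theta[i] else -1
                if actual != s[i]:              # wrong state: PLA update
                    W[i] += s[i] * s            # integer change to incoming weights
                    W[i, i] = 0                 # keep w_ii = 0
                    converged = False
        if converged:                           # every memory is a stable state
            return W
    return W                                    # may not have converged
```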
Hopfield Networks with Hidden Units:
We add some hidden units to the network with the goal of making the states of those hidden units represent an interpretation of the perceptual input that’s shown on the visible units.
The idea is that the weights between units represent constraints on good interpretations, and by finding a low energy state we find a good interpretation of the input data.
Different Computational Role for Hopfield Nets:
Instead of using the Hopfield Net to store memories, we can use it to construct interpretations of sensory input, where
- The input is represented by the visible units.
- The interpretation is represented by the states of the hidden units.
- The Quality of the interpretations is represented by the (negative) Energy function.
Example - Interpreting a Line Drawing
Two Difficult Computational Issues:
Using the states of the hidden units to represent an interpretation of the input raises two difficult issues:
- Search: How do we avoid getting trapped in poor local minima of the energy function?
Poor minima represent sub-optimal interpretations.
- Learning: How do we learn the weights on the connections to the hidden units, and between the hidden units?
Notice that there is no supervision in the problem.
Stochastic Units to improve Search:
Adding noise helps systems escape local energy minima.
Noisy Networks Find Better Energy Minima:
- A Hopfield net always makes decisions that reduce the energy.
- This makes it impossible to escape from local minima.
- We can use random noise to escape from poor minima.
- Start with a lot of noise so it's easy to cross energy barriers.
- Slowly reduce the noise so that the system ends up in a deep minimum.
This is “simulated annealing” (Kirkpatrick et al. 1981).
Effects of Temperature on the Transition Probabilities:
The temperature in a physical system (or a simulated system with an Energy Function) affects the transition probabilities.
- High Temperature System:
- The probability of crossing barriers is high.
i.e. the probability of going uphill from \(B\) to \(A\) is lower than the probability of going downhill from \(A\) to \(B\), but not much lower.
- In effect, the high temperature flattens the energy landscape.
- So, the Ratio of the Probabilities is low.
Thus,
- It is easy to cross barriers
- It is hard to stay in a deep minimum once you’ve got there
- Low Temperature System:
- The probability of crossing barriers is much smaller.
i.e. the probability of going uphill from \(B\) to \(A\) is much lower than the probability of going downhill from \(A\) to \(B\).
- So, the Ratio of the Probabilities is much higher.
Thus,
- It is harder to cross barriers
- It is easy to stay in a deep minimum once you’ve got there
- Thus, if we run the system long enough, we expect all the particles to end up in \(B\).
However, if we run it at a low temperature, it will take a very long time for particles to escape from \(A\).
- To increase the speed of convergence, starting at a high temperature and then gradually decreasing it is a good compromise.
Stochastic Binary Units:
To inject noise in a Hopfield Net, we replace the binary threshold units with binary stochastic units that make biased random decisions.
- The Temperature controls the amount of noise.
- Raising the noise level is equivalent to decreasing all the energy gaps between configurations:
$$p\left(s_{i}=1\right)=\frac{1}{1+e^{-\Delta E_{i} / T}}$$
- This is a normal logistic equation, but with the energy gap scaled by a temperature \(T\):
- High Temperature: the exponent \(-\Delta E_{i} / T\) will be \(\approx 0\), so the exponential is \(\approx 1\) and \(p\left(s_{i}=1\right) \approx \dfrac{1}{2}\).
i.e. the probability of a unit turning on is about a half.
It will be in its on and off states equally often.
- Low Temperature: depending on the sign of \(\Delta E_{i}\), the unit will become more firmly on or more firmly off.
- Zero Temperature: (e.g. in Hopfield Nets) the sign of \(\Delta E_{i}\) determines whether the RHS is \(0\) or \(1\).
I.E. the unit will behave deterministically; a standard binary threshold unit, that will always adopt whichever of the two states gives the lowest energy.
- Boltzmann Machines use stochastic binary units, with temperature \(T=1\) (i.e. standard logistic equation).
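A sketch of a stochastic binary unit (numpy, \(0/1\) states to match the energy-gap formula above):

```python
import numpy as np

def stochastic_update(s, W, b, i, T, rng):
    """p(s_i = 1) = 1 / (1 + exp(-dE_i / T)), with dE_i = b_i + sum_j w_ij s_j.

    As T -> 0 this recovers the deterministic binary threshold unit;
    a Boltzmann Machine fixes T = 1.
    """
    dE = b[i] + W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-dE / T))
    s[i] = 1 if rng.random() < p_on else 0
```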
Thermal Equilibrium at a fixed temperature \(T=1\):
- Thermal Equilibrium does not mean that the system has settled down into the lowest energy configuration.
i.e. it is not the states of the individual units that settle down.
The individual units still rattle around at equilibrium, unless the temperature is zero (\(T=0\)).
- What settles down is the probability distribution over configurations.
- It settles to the stationary distribution.
- The stationary distribution is determined by the energy function of the system.
- In the stationary distribution, the probability of any configuration is \(\propto e^{-E}\).
- Intuitive Interpretation of Thermal Equilibrium:
- Imagine a huge ensemble of systems that all have exactly the same energy function.
- The probability of a configuration is just the fraction of the systems that have that configuration.
- Approaching Thermal Equilibrium:
- Start with any distribution we like over all the identical systems.
- We could start with all the systems in the same configuration (Dirac distribution).
- Or with an equal number of systems in each possible configuration (uniform distribution).
- Then we keep applying our stochastic update rule to pick the next configuration for each individual system.
- After running the systems stochastically in the right way, we may eventually reach a situation where the fraction of systems in each configuration remains constant.
- This is the stationary distribution that physicists call thermal equilibrium.
- Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.
- Analogy:
- Imagine a casino in Las Vegas that is full of card dealers (we need many more than \(52!\) of them).
- We start with all the card packs in standard order and then the dealers all start shuffling their packs.
- After a few shuffling steps, the king of spades still has a good chance of being next to the queen of spades. The packs have not yet forgotten where they started.
- After prolonged shuffling, the packs will have forgotten where they started. There will be an equal number of packs in each of the \(52!\) possible orders.
- Once equilibrium has been reached, the number of packs that leave a configuration at each time step will be equal to the number that enter the configuration.
- The only thing wrong with this analogy is that all the configurations have equal energy, so they all end up with the same probability.
- We are generally interested in reaching equilibrium for systems where certain configurations have lower energy than others.
Boltzmann Machines:
Boltzmann Machines are just Stochastic Hopfield Nets with Hidden Units.
- How a Boltzmann machine models a set of binary data vectors
- Why model a set of binary data vectors, and what we could do with such a model if we had it
- How the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine
- BMs are good at modeling binary data
Modeling Binary Data:
Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector.
- This is useful for deciding if other binary vectors come from the same distribution (e.g. documents represented by binary features, each of which represents the occurrence of a particular word).
- It can be used for monitoring complex systems to detect unusual behavior.
- If we have models of several different distributions, it can be used to compute the posterior probability that a particular distribution produced the observed data (assuming equal prior probabilities for the models):
$$p(\text {Model}_ i | \text { data })=\dfrac{p(\text {data} | \text {Model}_ i)}{\sum_{j} p(\text {data} | \text {Model}_ j)}$$
Models for Generating Data:
There are different kinds of models to generate data:
- Causal Models
- Energy-based Models
How a Causal Model Generates Data:
- In a Causal Model we generate data in two sequential steps:
- First pick the hidden states from their prior distribution.
In causal models, the hidden states are often independent in the prior.
- Then pick the visible states from their conditional distribution given the hidden states.
- The probability of generating a visible vector, \(\boldsymbol{v},\) is computed by summing over all possible hidden states. Each hidden state is an “explanation” of \(\boldsymbol{v}\):
$$p(\boldsymbol{v})=\sum_{\boldsymbol{h}} p(\boldsymbol{h}) p(\boldsymbol{v} | \boldsymbol{h})$$
Generating a binary vector: first generate the states of some latent variables, and then use the latent variables to generate the binary vector.
How a Boltzmann Machine Generates Data:
- It is not a causal generative model.
- Instead, everything is defined in terms of the energies of joint configurations of the visible and hidden units.
- The energies of joint configurations are related to their probabilities in two ways:
- We can simply define the probability to be:
$$p(\boldsymbol{v}, \boldsymbol{h}) \propto e^{-E(\boldsymbol{v}, \boldsymbol{h})}$$
- Alternatively, we can define the probability to be the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times (until thermal equilibrium).
These two definitions agree - analysis below.
- The Energy of a joint configuration:
$$\begin{align} E(\boldsymbol{v}, \boldsymbol{h}) &= - \sum_{i \in v_{i s}} v_{i} b_{i}-\sum_{k \in h_{i d}} h_{k} b_{k}-\sum_{i< j} v_{i} v_{j} w_{i j}-\sum_{i, k} v_{i} h_{k} w_{i k}-\sum_{k< l} h_{k} h_{l} w_{k l} \\ &= -\boldsymbol{v}^{\top} \boldsymbol{R} \boldsymbol{v}-\boldsymbol{v}^{\top} \boldsymbol{W} \boldsymbol{h}-\boldsymbol{h}^{\top} \boldsymbol{S} \boldsymbol{h}-\boldsymbol{b}^{\top} \boldsymbol{v}-\boldsymbol{c}^{\top} \boldsymbol{h} \end{align} $$
where \(v_{i}\) is the binary state of unit \(i\) in \(\boldsymbol{v}\), \(b_{k}\) is the bias of unit \(k\), \(i< j\) indexes every non-identical pair of \(i\) and \(j\) once (avoiding self-interactions and double counting), and \(w_{i k}\) is the weight between visible unit \(i\) and hidden unit \(k\).
- Using Energies to define Probabilities:
- The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:
$$p(\boldsymbol{v}, \boldsymbol{h})=\dfrac{e^{-E(\boldsymbol{v}, \boldsymbol{h})}}{\sum_{\boldsymbol{u}, \boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}}$$
- The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:
$$p(\boldsymbol{v})=\dfrac{\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}}{\sum_{\boldsymbol{u}, \boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}}$$
where the denominators are the partition function \(Z\).
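For a net small enough to enumerate, these probabilities can be computed exactly; the brute-force sketch below (our helper names; `energy_fn(v, h)` is assumed to implement \(E(\boldsymbol{v}, \boldsymbol{h})\) above) also makes it obvious why \(Z\) becomes intractable:

```python
import itertools
import numpy as np

def exact_p_v(v, num_hidden, energy_fn):
    """Exact p(v) for a tiny Boltzmann Machine by enumerating all configurations."""
    def states(n):
        return itertools.product([0, 1], repeat=n)
    numer = sum(np.exp(-energy_fn(np.asarray(v), np.asarray(h)))
                for h in states(num_hidden))
    Z = sum(np.exp(-energy_fn(np.asarray(u), np.asarray(g)))  # the partition function
            for u in states(len(v)) for g in states(num_hidden))
    return numer / Z
```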
- Sampling from the Model:
- If there are more than a few hidden units, we cannot compute the normalizing term (the partition function) because it has exponentially many terms, i.e. it is intractable.
- So we use Markov Chain Monte Carlo to get samples from the model starting from a random global configuration:
- Keep picking units at random and allowing them to stochastically update their states based on their energy gaps.
- Run the Markov chain until it reaches its stationary distribution (thermal equilibrium at a temperature of \(1\)).
- The probability of a global configuration is then related to its energy by the Boltzmann Distribution:
$$p(\mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}$$
- Sampling from the Posterior distribution over hidden configurations (for a given Data vector):
- The number of possible hidden configurations is exponential so we need MCMC to sample from the posterior.
- It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector.
I.E. Only the hidden units are allowed to change states (updated)
- Samples from the posterior are required for learning the weights.
Each hidden configuration is an “explanation” of an observed visible configuration.
Better explanations have lower energy.
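Both kinds of sampling fit in one sketch (numpy, \(0/1\) states): with `clamped` empty it samples from the model; with the visible units clamped to a data vector it samples from the posterior over hidden configurations.

```python
import numpy as np

def gibbs_sample(s, W, b, clamped, num_sweeps, rng, T=1.0):
    """Run the Markov chain: repeatedly pick free units at random and update
    them stochastically from their energy gaps (thermal equilibrium at T)."""
    free = [i for i in range(len(s)) if i not in clamped]
    for _ in range(num_sweeps):
        for i in rng.permutation(free):
            dE = b[i] + W[i] @ s                     # energy gap for unit i
            p_on = 1.0 / (1.0 + np.exp(-dE / T))
            s[i] = 1 if rng.random() < p_on else 0
    return s
```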
The Goal of Learning:
We want to maximize the product of the probabilities (sum of log-probabilities) that the Boltzmann Machine assigns to the binary vectors in the training set.
This is equivalent to maximizing the probability of obtaining exactly \(N\) training cases if we ran the BM as follows:
- For \(i\) in \([1, \ldots, N]\):
- Run the network with no external input and let it settle to its stationary distribution
- Sample the visible vector
Possible Difficulty in Learning - Global Information:
Consider a chain of units with visible units at the ends:
If the training set is \((1,0)\) and \((0,1)\) we want the product of all the weights to be negative.
So to know how to change \(w_1\) or \(w_5\) we must know \(w_3\).
Learning with Local Information:
A very surprising fact is the following:
Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.
The derivative of the log probability of one training vector wrt one weight \(w_{ij}\) is:
$$\dfrac{\partial \log p(\mathbf{v})}{\partial w_{i j}}=\left\langle s_{i} s_{j}\right\rangle_{\mathbf{v}}-\left\langle s_{i} s_{j}\right\rangle_{\text {free}}$$
where:
- \(\left\langle s_{i} s_{j}\right\rangle_{\mathbf{v}}\): is the expected value of product of states at thermal equilibrium when the training vector is clamped on the visible units.
This is the positive phase of learning.
- Effect: Raise the weights in proportion to the product of the activities the units have when you are presenting data.
- Interpretation: similar to the storage term for a Hopfield Net.
It is a Hebbian Learning Rule.
- \(\left\langle s_{i} s_{j}\right\rangle_{\text {free}}\): is the expected value of product of states at thermal equilibrium when nothing is clamped.
This is the negative phase of learning.
- Effect: Reduce the weights in proportion to how often the two units are on together when sampling from the model's distribution.
- Interpretation: similar to the Unlearning term i.e. the opposite of the storage rule for avoiding (getting rid of) spurious minima.
Moreover, this rule specifies the exact amount of unlearning to be applied.
So, the change in the weight is proportional to the expected product of the activities, averaged over all visible vectors in the training set (the “data” term), minus the expected product of the same two activities when nothing is clamped and the network has reached thermal equilibrium with no external interference (the “model” term):
$$\Delta w_{i j} \propto\left\langle s_{i} s_{j}\right\rangle_{\text{data}}-\left\langle s_{i} s_{j}\right\rangle_{\text{model}}$$
Thus, the learning algorithm only requires local information.
Effects of the Positive and Negative Phases of Learning:
Given the probability of a training data vector \(\boldsymbol{v}\):
$$p(\boldsymbol{v})=\dfrac{\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}}{\sum_{\boldsymbol{u}} \sum_{\boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}}$$
and the log probability:
$$\begin{align} \log p(\boldsymbol{v}) &= \log \left(\dfrac{\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}}{\sum_{\boldsymbol{u}} \sum_{\boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}}\right) \\ &= \log \left(\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}\right) - \log \left(\sum_{\boldsymbol{u}} \sum_{\boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}\right) \end{align} $$
(Note that the log of a sum does not split into a sum of logs; differentiating these two log-sum-exp terms wrt a weight is what produces the two expectations \(\left\langle s_{i} s_{j}\right\rangle_{\mathbf{v}}\) and \(\left\langle s_{i} s_{j}\right\rangle_{\text {free}}\) above.)
- Positive Phase:
- The first term is decreasing the energy of terms in that sum that are already large.
- It finds those terms by settling to thermal equilibrium with the vector \(\boldsymbol{v}\) clamped, so it can find an \(\boldsymbol{h}\) that produces a low energy with \(\boldsymbol{v}\).
- Having sampled those vectors \(\boldsymbol{h}\), it then changes the weights to make that energy even lower.
- Summary:
The positive phase finds hidden configurations that work well with \(\boldsymbol{v}\) and lowers their energies.
- Negative Phase:
- The second term is doing the same thing but for the partition function.
It's finding global configurations (combinations of visible and hidden states) that give low energy and therefore are large contributors to the partition function.
- Having found those global configurations \((\boldsymbol{v}', \boldsymbol{h}')\), it tries to raise their energy so that they contribute less.
- Summary:
The negative phase finds the joint configurations that are the best competitors and raises their energies.
Thus, the positive term is making the top term in \(p(\boldsymbol{v})\) bigger and the negative term is making the bottom term in \(p(\boldsymbol{v})\) smaller.
If we only use the Hebbian rule (positive phase) without the unlearning term (negative phase) the synapse strengths will keep getting stronger and stronger, the weights will all become very positive, and the whole system will blow up.
The Unlearning counteracts the positive phase’s tendency to just add a large constant to the unnormalized probability everywhere.
Learning Rule Justification - Why the Derivative is so simple:
- The probability of a global configuration at thermal equilibrium is an exponential function of its energy.
- Thus, settling to equilibrium makes the log probability a linear function of the energy.
- The energy is a linear function of the weights and states:
$$\dfrac{\partial E}{\partial w_{i j}}=s_{i} s_{j}$$
It is a log-linear model.
This is an important fact, because we are trying to manipulate the log probabilities by manipulating the weights.
- The process of settling to thermal equilibrium propagates information about the weights.
- We don’t need an explicit back-propagation stage.
- We still need two stages:
- Settle with the data
- Settle with NO data
- However, the network behaves, basically, in the same way in the two phases, while the forward and backprop stages are very different.
The Batch Learning Algorithm - An inefficient way to collect the Learning Statistics:
- Positive phase:
- Clamp a data-vector on the visible units.
- Let the hidden units reach thermal equilibrium at a temperature of 1 (may use annealing to speed this up)
by updating the hidden units, one at a time.
- Sample \(\left\langle s_{i} s_{j}\right\rangle\) for all pairs of units
- Repeat for all data-vectors in the training set.
- Negative phase:
- Do not clamp any of the units
- Set all the units to random binary states.
- Let the whole network reach thermal equilibrium at a temperature of 1, by updating all the units, one at a time.
- Difficulty: where do we start?
- Sample \(\left\langle s_{i} s_{j}\right\rangle\) for all pairs of units
- Repeat many times to get good estimates
- Difficulty: how many times? (especially w/ multiple modes)
- Weight updates:
- Update each weight by an amount proportional to the difference in \(\left\langle s_{i} s_{j}\right\rangle\) in the two phases.
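A sketch of one step of this batch algorithm (reusing the `gibbs_sample` helper above; the visible units are assumed to be the first `num_visible` indices, bias updates are omitted for brevity, and the learning rate and sweep counts are our assumptions):

```python
import numpy as np

def boltzmann_batch_step(W, b, data, rng, lr=0.01, num_sweeps=50):
    """data: array of shape (num_cases, num_visible) with 0/1 entries."""
    N = W.shape[0]
    num_visible = data.shape[1]
    visible = list(range(num_visible))

    pos = np.zeros((N, N))                # positive phase: clamp each data vector
    for v in data:
        s = rng.integers(0, 2, size=N)
        s[:num_visible] = v
        s = gibbs_sample(s, W, b, clamped=visible, num_sweeps=num_sweeps, rng=rng)
        pos += np.outer(s, s)             # sample <s_i s_j> with v clamped
    pos /= len(data)

    neg = np.zeros((N, N))                # negative phase: nothing clamped
    for _ in range(len(data)):
        s = rng.integers(0, 2, size=N)    # random binary starting state
        s = gibbs_sample(s, W, b, clamped=[], num_sweeps=num_sweeps, rng=rng)
        neg += np.outer(s, s)             # sample <s_i s_j> free-running
    neg /= len(data)

    W += lr * (pos - neg)                 # dw_ij ∝ <s_i s_j>_data - <s_i s_j>_model
    np.fill_diagonal(W, 0)
    return W
```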