Table of Contents



Hopfield Networks Exercises
Hopfield Networks Example and Code (medium)
INTRODUCTION TO HOPFIELD NEURAL NETWORKS (blog)
The Hopfield Model (paper!)
Hopfield Network Demo (github)
Hopfield Networks Tutorial + Code (blog)
Why learn Hopfield Nets and why they work (blog)
Hopfield Networks (Quantum ML Book)

Hopfield Networks

  1. Hopfield Networks:
    Hopfield Networks are a form of recurrent artificial neural networks that serve as content-addressable (“associative”) memory systems with binary threshold nodes.

    They are energy-based models, i.e. their properties derive from a global energy function.

    • A Hopfield net is composed of binary threshold units with recurrent connections between them.

    Motivation:
    Recurrent networks of non-linear units are generally very hard to analyze.
    They can behave in many different ways:

    • Settle to a stable state
    • Oscillate
    • Follow chaotic trajectories that cannot be predicted far into the future

    Main Idea:
    John Hopfield realized that if the connections are symmetric, there is a global energy function:

    • Each binary “configuration” of the whole network has an energy.
      • Binary Configuration: is an assignment of binary values to each neuron in the network.
        Every neuron has a particular binary value in a configuration.
    • The binary threshold decision rule causes the network to settle to a minimum of this energy function.
      The rule causes the network to go downhill in energy, and by repeatedly applying it, the network will end up in an energy minimum.

    The Energy Function:

    • The global energy is the sum of many contributions:

      $$E=-\sum_{i} s_{i} b_{i}-\sum_{i< j} s_{i} s_{j} w_{i j}$$

      • Each contribution depends on:
        • One connection weight: \(w_{i j}\)
          A symmetric connection between two neurons; the weights thus have the following restrictions:
          • \(w_{i i}=0, \forall i\) (no unit has a connection with itself)
          • \(w_{i j}=w_{j i}, \forall i, j\) (connections are symmetric)
            and
        • The binary states of two neurons: \(s_{i}\) and \(s_{j}\), where \(s_{j} \in \{-1, 1\}\) (or \(\in \{0, 1\}\)) is the state of unit \(j\), and \(b_{i}\) is the bias of unit \(i\).
      • To make up the following terms:
        • The quadratic term \(s_{i} s_{j} w_{i j}\), where \(s_{i} s_{j} \in \{-1, 1\}\), involving the states of two units, and
        • The bias term \(s_{i} b_{i} \in \{-b_{i}, b_{i}\}\), involving the states of individual units.
    • This simple quadratic energy function makes it possible for each unit to compute locally how its state affects the global energy:
      The Energy Gap is the difference in the global energy of the whole configuration depending on whether \(i\) is on:

      $$\begin{align} \text{Energy Gap} &= \Delta E_{i} \\ &= E\left(s_{i}=0\right)-E\left(s_{i}=1\right) \\ &= b_{i}+\sum_{j} s_{j} w_{i j} \end{align} $$

      i.e. the difference between the energy when \(i\) is off and the energy when \(i\) is on (when \(s_{i}=0\) the terms involving \(s_{i}\) vanish; when \(s_{i}=1\) they contribute \(-b_{i}-\sum_{j} s_{j} w_{i j}\)).

      • The Energy Gap and the Binary Threshold Decision Rule:
        • This difference (energy gap) is exactly what the binary threshold decision rule computes.
        • Equivalently, the energy gap is the negative of the (discrete) derivative of the global energy with respect to the state \(s_i\) of the \(i\)-th unit; the binary threshold rule turns unit \(i\) on exactly when this gap is non-negative.

    Settling to an Energy Minimum:
    To find an energy minimum in this net:

    • Start from a random state, then
    • Update units one at a time in random order:
      • Update each unit to whichever of its two states gives the lowest global energy.
        i.e. use binary threshold units.
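    A minimal NumPy sketch of this settling procedure (illustrative, not from the lecture; it assumes \(\pm 1\) states and a symmetric, zero-diagonal weight matrix):

```python
import numpy as np

def energy(W, b, s):
    """Global energy E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij,
    assuming W is symmetric with a zero diagonal."""
    return float(-s @ b - 0.5 * s @ W @ s)

def settle(W, b, s, rng, max_sweeps=100):
    """Update units one at a time in random order until no unit changes."""
    s = s.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            new = 1 if b[i] + W[i] @ s >= 0 else -1   # binary threshold rule
            changed |= bool(new != s[i])
            s[i] = new
        if not changed:
            break
    return s
```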

    A Deeper Energy Minimum:
    Consider, as an example, a net with two triangles of units, in which the three units of each triangle mostly support each other.

    • Each triangle mostly hates the other triangle.
      The triangle on the left differs from the one on the right by having a weight of \(2\) where the other one has a weight of \(3\).
    • So turning on the units in the triangle on the right gives the deepest minimum.

    Sequential Updating - Justification:

    • If units make simultaneous decisions the energy could go up.
    • With simultaneous parallel updating we can get oscillations.
      • They always have a period of \(2\) (bi-phasic oscillations).
    • If the updates occur in parallel but with random timing, the oscillations are usually destroyed.
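    A tiny sketch of this failure mode (the two-unit net and its weights are made-up values): two units joined by a positive weight "want" to agree, but simultaneous updates from disagreeing states oscillate forever with period 2:

```python
import numpy as np

W = np.array([[0., 1.],
              [1., 0.]])
s = np.array([1., -1.])            # start the two units disagreeing
for t in range(4):
    s = np.where(W @ s >= 0, 1., -1.)   # simultaneous (parallel) update
    print(t, s)
# Output alternates [-1. 1.], [1. -1.], [-1. 1.], ... : a period-2 oscillation.
# Updating one unit at a time instead settles immediately to [1. 1.] or [-1. -1.].
```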

    Using Energy Models (with binary threshold rule) for Storing Memories:

    • Hopfield (1982) proposed that memories could be energy minima of a neural net (w/ symmetric weights).
      • The binary threshold decision rule can then be used to “clean up” incomplete or corrupted memories.
        Transforms partial memories to full memories.
    • The idea of memories as energy minima was proposed by I. A. Richards (1924) in “Principles of Literary Criticism”.
    • Using energy minima to represent memories gives a content-addressable (“associative”) memory:
      • An item can be accessed by just knowing part of its content.
        • This was really amazing in the year 16 BG (Before Google).
      • It is robust against hardware damage.
      • It’s like reconstructing a dinosaur from a few bones.
        Because you have an idea about how the bones are meant to fit together.

    Storing memories in a Hopfield net:

    • If we use activities of \(1\) and \(-1\) we can store a binary state vector by incrementing the weight between any two units by the product of their activities.

      $$\Delta w_{i j}=s_{i} s_{j}$$

      • This is a very simple rule that is not error-driven (i.e. does not learn by correcting errors).
        That is both its strength and its weakness:
        • It is an online rule
        • It is not a very efficient way to store things
      • We treat biases as weights from a permanently on unit.
    • With states of \(0\) and \(1\) the rule is slightly more complicated:

      $$\Delta w_{i j}=4\left(s_{i}-\frac{1}{2}\right)\left(s_{j}-\frac{1}{2}\right)$$
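    A small end-to-end sketch of storing and recalling memories with this rule (the sizes and corruption level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
patterns = rng.choice([-1, 1], size=(5, N))    # 5 memories, well under ~0.15*N

# Storage rule: increment w_ij by s_i * s_j for each memory; no self-connections.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

# Corrupt one memory, then let the binary threshold rule clean it up.
probe = patterns[0].copy()
probe[rng.choice(N, size=15, replace=False)] *= -1

s = probe.copy()
for _ in range(10):                            # asynchronous sweeps
    for i in rng.permutation(N):
        s[i] = 1 if W[i] @ s >= 0 else -1
print("recovered:", np.array_equal(s, patterns[0]))   # typically True
```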

    Summary - Big Ideas of Hopfield Networks:

    • Idea #1: we can find a local energy minimum by using a network of symmetrically connected binary threshold units.
    • Idea #2: these local energy minima might correspond to memories.

    Notes:

    • They were responsible for the resurgence of interest in neural networks in the 1980s
    • They can be used to store memories as distributed patterns of activity
    • The constraint that weights are symmetric guarantees that the energy function decreases monotonically while following the activation rules
    • The Hopfield Network is a non-linear dynamical system that converges to an attractor.
    • A Hopfield net the size of a brain (connectivity patterns are quite different, of course) could store a memory per second for 450 years.

  2. Structure:
    The Hopfield Network is formally described as a complete undirected graph \(G=\langle V, f\rangle,\) where \(V\) is a set of McCulloch-Pitts neurons and \(f : V^{2} \rightarrow \mathbb{R}\) is a function that maps pairs of units to a real value, the connectivity weight.
    • The Units:
      The units in a Hopfield Net are binary threshold units,
      i.e. the units only take on two different values for their states and the value is determined by whether or not the units’ input exceeds their threshold.
    • The States:
      The state \(s_i\) for unit \(i\) takes on values of \(1\) or \(-1\),
      i.e. \(s_i \in \{-1, 1\}\).
    • The Weights:
      Every pair of units \(i\) and \(j\) in a Hopfield network has a connection that is described by the connectivity weight \(w_{i j}\).
      • Symmetric Connections (weights):
        • The connections in a Hopfield net are constrained to be symmetric by making the following restrictions:
          • \(w_{i i}=0, \forall i\) (no unit has a connection with itself)
          • \(w_{i j}=w_{j i}, \forall i, j\) (connections are symmetric)
        • The constraint that weights are symmetric guarantees that the energy function decreases monotonically while following the activation rules.
          A network with asymmetric weights may exhibit some periodic or chaotic behaviour; however, Hopfield found that this behavior is confined to relatively small parts of the phase space and does not impair the network’s ability to act as a content-addressable associative memory system.
  3. Update Rule:
    Updating one unit (node in the graph simulating the artificial neuron) in the Hopfield network is performed using the following rule:

    $$s_{i} \leftarrow\left\{\begin{array}{ll}{+1} & {\text { if } \sum_{j} w_{i j} s_{j} \geq \theta_{i}} \\ {-1} & {\text { otherwise }}\end{array}\right.$$

    Updates in the Hopfield network can be performed in two different ways:

    • Asynchronous: Only one unit is updated at a time. This unit can be picked at random, or a pre-defined order can be imposed from the very beginning.
    • Synchronous: All units are updated at the same time. This requires a central clock in order to maintain synchronization.
      This method is viewed by some as less realistic, since no global clock has been observed to influence analogous biological or physical systems.
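    The two update schemes side by side in NumPy (a sketch; it assumes \(\pm 1\) states and per-unit thresholds \(\theta\)):

```python
import numpy as np

def update_async(W, theta, s, rng):
    """Asynchronous: visit units one at a time in random order."""
    s = s.copy()
    for i in rng.permutation(len(s)):
        s[i] = 1 if W[i] @ s >= theta[i] else -1
    return s

def update_sync(W, theta, s):
    """Synchronous: every unit reads the old state; all switch together."""
    return np.where(W @ s >= theta, 1, -1)
```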

    Neural Attraction and Repulsion (in state-space):
    Neurons “attract or repel each other” in state-space.
    The weight between two units has a powerful impact upon the values of the neurons. Consider the connection weight \(w_{ij}\) between two neurons \(i\) and \(j\).
    If \(w_{ij}>0\), the updating rule implies that:

    • when \(s_{j}=1,\) the contribution of \(j\) in the weighted sum is positive. Thus, \(s_{i}\) is pulled by \(j\) towards \(s_{i}=1\)
    • when \(s_{j}=-1,\) the contribution of \(j\) in the weighted sum is negative. Similarly, \(s_{i}\) is pushed by \(j\) towards \(s_{i}=-1\)

    Thus, the values of neurons \(i\) and \(j\) will converge if the weight between them is positive.
    Similarly, they will diverge if the weight is negative.

  4. Energy:
    Hopfield nets have a scalar value associated with each state of the network, referred to as the “energy”, \(E\), of the network, where:

    $$E=-\frac{1}{2} \sum_{i, j} w_{i j} s_{i} s_{j}+\sum_{i} \theta_{i} s_{i}$$


    This quantity is called “energy” because it either decreases or stays the same upon network units being updated.
    Furthermore, under repeated updating the network will eventually converge to a state which is a local minimum in the energy function.
    Thus, if a state is a local minimum in the energy function it is a stable state for the network.
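    A quick empirical check of this claim (a sketch with random symmetric weights; the small tolerance guards against floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                       # enforce symmetry
np.fill_diagonal(W, 0)
theta = rng.normal(size=N)
s = rng.choice([-1, 1], size=N).astype(float)

def E(s):
    return -0.5 * s @ W @ s + theta @ s

prev = E(s)
for i in rng.permutation(N):
    s[i] = 1 if W[i] @ s >= theta[i] else -1
    assert E(s) <= prev + 1e-9          # the energy never increases
    prev = E(s)
print("final energy:", E(s))
```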

    Relation to Ising Models:
    Note that this energy function belongs to a general class of models in physics under the name of Ising models.
    These in turn are a special case of Markov Random Fields (MRFs), since the associated probability measure, the Gibbs measure, has the Markov property.

  5. Initialization and Running:
    Initialization of the Hopfield Networks is done by setting the values of the units to the desired start pattern.
    Repeated updates are then performed until the network converges to an attractor pattern.

  6. Convergence:
    The Hopfield Network converges to an attractor pattern describing a stable state of the network (as a non-linear dynamical system).

    Convergence is generally assured, as Hopfield proved that the attractors of this non-linear dynamical system are stable, rather than periodic or chaotic as in some other systems.

    Therefore, in the context of Hopfield Networks, an attractor pattern is a final stable state, a pattern that cannot change any value within it under updating.

  7. Training:
    • Training a Hopfield net involves lowering the energy of states that the net should “remember”.
    • This allows the net to serve as a content-addressable memory system, i.e. the network will converge to a “remembered” state if it is given only part of the state.
    • The net can be used to recover from a distorted input to the trained state that is most similar to that input.
      This is called associative memory because it recovers memories on the basis of similarity.
    • Thus, the network is properly trained when the energy of states which the network should remember are local minima.
    • Note that, in contrast to Perceptron training, the thresholds of the neurons are never updated.

  8. Learning Rules:
    There are various different learning rules that can be used to store information in the memory of the Hopfield Network.

    Desirable Properties:

    • Local: A learning rule is local if each weight is updated using information available to neurons on either side of the connection that is associated with that particular weight.
    • Incremental: New patterns can be learned without using information from the old patterns that have been also used for training.
      That is, when a new pattern is used for training, the new values for the weights only depend on the old values and on the new pattern.

    These properties are desirable, since a learning rule satisfying them is more biologically plausible.

    For example, since the human brain is always learning new concepts, one can reason that human learning is incremental. A learning system that was not incremental would generally be trained only once, with a huge batch of training data.

    Hebbian Learning Rule:
    The Hebbian rule is both local and incremental.
    For the Hopfield Networks, it is implemented in the following manner, when learning \(n\) binary patterns:

    $$w_{i j}=\frac{1}{n} \sum_{\mu=1}^{n} \epsilon_{i}^{\mu} \epsilon_{j}^{\mu}$$

    where \(\epsilon_{i}^{\mu}\) represents bit \(i\) from pattern \(\mu\).

    • If the bits corresponding to neurons \(i\) and \(j\) are equal in pattern \(\mu,\) then the product \(\epsilon_{i}^{\mu} \epsilon_{j}^{\mu}\) will be positive.
      This, in turn, has a positive effect on the weight \(w_{i j}\), and the values of \(i\) and \(j\) will tend to become equal.
    • The opposite happens if the bits corresponding to neurons \(i\) and \(j\) are different.
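    A sketch verifying that, with few patterns, each stored pattern is a fixed point of the update rule under the Hebbian weights (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 64
eps = rng.choice([-1, 1], size=(n, N))      # n binary patterns, N units
W = (eps.T @ eps).astype(float) / n         # w_ij = (1/n) sum_mu eps_i^mu eps_j^mu
np.fill_diagonal(W, 0)

# With few patterns, the crosstalk is small and each pattern is stable.
for mu in range(n):
    s = np.where(W @ eps[mu] >= 0, 1, -1)
    print(mu, "is a fixed point:", np.array_equal(s, eps[mu]))   # typically True
```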

    The Storkey Learning Rule:
    This rule was introduced by Amos Storkey (1997) and is both local and incremental.
    The weight matrix of an attractor neural network is said to follow the Storkey learning rule if it obeys:

    $$w_{i j}^{\nu}=w_{i j}^{\nu-1}+\frac{1}{n} \epsilon_{i}^{\nu} \epsilon_{j}^{\nu}-\frac{1}{n} \epsilon_{i}^{\nu} h_{j i}^{\nu}-\frac{1}{n} \epsilon_{j}^{\nu} h_{i j}^{\nu}$$

    where \(h_{i j}^{\nu}=\sum_{k=1, k \neq i, j}^{n} w_{i k}^{\nu-1} \epsilon_{k}^{\nu}\) is a form of local field at neuron \(i\).

    This learning rule is local, since the synapses take into account only neurons at their sides.
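    A vectorized sketch of one Storkey increment (here \(n\) is taken to be the number of units, and \(h_{ij}\) is computed as \(h_i - w_{ij}\epsilon_j\), which holds because \(w_{ii}=0\)):

```python
import numpy as np

def storkey_increment(W, eps):
    """One Storkey update for a single +-1 pattern eps.
    A sketch; assumes W is symmetric with a zero diagonal."""
    n = len(eps)
    h = W @ eps                          # full local fields (w_ii = 0)
    # h_ij = sum_{k != i,j} w_ik eps_k = h_i - w_ij eps_j
    H = h[:, None] - W * eps[None, :]    # H[i, j] = h_ij
    W_new = W + (np.outer(eps, eps)
                 - eps[:, None] * H.T    # eps_i * h_ji
                 - eps[None, :] * H) / n # eps_j * h_ij
    np.fill_diagonal(W_new, 0)
    return W_new
```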

    Storkey vs Hebbian Learning Rules:
    Storkey showed that a Hopfield network trained using this rule has a greater capacity than a corresponding network trained using the Hebbian rule.
    The Storkey rule makes use of more information from the patterns and weights than the generalized Hebbian rule, due to the effect of the local field.

  9. Spurious Patterns:
    • Patterns that the network uses for training (called retrieval states) become attractors of the system.
    • Repeated updates would eventually lead to convergence to one of the retrieval states.
    • However, sometimes the network will converge to spurious patterns (different from the training patterns).
    • Spurious Patterns arise due to spurious minima.
      The energy in these spurious patterns is also a local minimum:
      • For each stored pattern \(x,\) the negation \(-x\) is also a spurious pattern.
      • A spurious state can also be a linear combination of an odd number of retrieval states. For example, when using \(3\) patterns \(\mu_{1}, \mu_{2}, \mu_{3},\) one can get the following spurious state:

      $$\epsilon_{i}^{\operatorname{mix}}=\pm \operatorname{sgn}\left( \pm \epsilon_{i}^{\mu_{1}} \pm \epsilon_{i}^{\mu_{2}} \pm \epsilon_{i}^{\mu_{3}}\right)$$

    • Mixtures of an even number of retrieval states cannot be spurious states, since the sum inside the sign function could be zero (leaving the sign undefined).

    • Spurious Patterns (memories) occur when two nearby energy minima combine to make a new minimum in the wrong place.
    • Physicists, in trying to increase the capacity of Hopfield nets, rediscovered the Perceptron convergence procedure.

  10. Capacity:
    The capacity of the Hopfield network model is determined by the number of neurons and connections within a given network; the number of memories that can be stored depends on both.

    Capacity:

    • It was shown that the maximal ratio of stored vectors to nodes for reliable recall is about \(0.138\) (approximately \(138\) vectors can be recalled from storage for every \(1000\) nodes) (Hertz et al., 1991).
    • Using Hopfield’s storage rule the capacity of a totally connected net with \(N\) units is only about \(0.15N\) memories:
      • At \(N\) bits per memory the total information stored is only \(0.15 N^{2}\) bits.
      • This does not make efficient use of the bits required to store the weights.
      • It depends on a constant \(0.15\)
    • Capacity Requirements for Efficient Storage:
      • The net has \(N^{2}\) weights and biases.
      • After storing \(M\) memories, each connection weight has an integer value in the range \([-M, M]\).
      • So the number of bits required to efficiently store the weights and biases is:

        $$ N^{2} \log_2 (2 M+1)$$

      • It scales logarithmically with the number of stored memories \(M\).
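    As a rough worked example (illustrative numbers): with \(N=1000\) units, the storage rule holds about \(M \approx 0.15N = 150\) memories of \(N\) bits each, i.e. roughly \(0.15 N^{2} = 1.5 \times 10^{5}\) bits of content, while representing the weights exactly takes about \(N^{2} \log_2(2M+1) \approx 10^{6} \times 8.2 = 8.2 \times 10^{6}\) bits; the stored content is only a small fraction of the bits used to store the weights.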

    Effects of Limited Capacity:

    • Since the capacity of Hopfield Nets is limited to \(\approx 0.15N\), many mistakes will occur if one tries to store a large number of vectors.
      When the Hopfield model does not recall the right pattern, an intrusion may have taken place: semantically related items tend to be confused, and the wrong pattern is recollected.
      The Hopfield network model is thus shown to confuse one stored item with another upon retrieval.
    • Perfect recall and higher capacity (\(>0.14\) patterns per neuron) can be achieved with the Storkey learning method.
      Later models inspired by the Hopfield network were devised to raise the storage limit and reduce the retrieval error rate, with some being capable of one-shot learning.

    Spurious Minima Limit the Capacity:

    • Each time we memorize a configuration, we hope to create a new energy minimum.
    • The problem is if two nearby minima merge to create a minimum at an intermediate location [1]:
      img
      • Then we would get a blend of them rather than individual memories.
      • The Merging of Nearby Minima limits the capacity of a Hopfield Net.

    Avoiding Spurious Minima by Unlearning:
    Unlearning is a strategy proposed by Hopfield, Feinstein and Palmer to avoid spurious minima.
    It involves applying the opposite of the storage rule to the binary state the network settles to.

    Strategy:

    • Let the net settle from a random initial state and then do Unlearning.
      Whatever binary state it settles to, apply the opposite of the storage rule.
      Starting from the red merged minimum, unlearning will produce the two separate minima:
      img
    • This will get rid of deep, spurious minima and increase memory capacity.
    • The strategy was shown to work, but without a good analysis of why it works.
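    A sketch of one unlearning step (the learning rate and the fixed number of settling sweeps are arbitrary choices):

```python
import numpy as np

def unlearn_once(W, b, rng, eta=0.01, sweeps=20):
    """Settle from a random state, then apply the reverse of the storage rule."""
    N = len(b)
    s = rng.choice([-1, 1], size=N).astype(float)
    for _ in range(sweeps):                  # settle (possibly to a spurious minimum)
        for i in rng.permutation(N):
            s[i] = 1 if b[i] + W[i] @ s >= 0 else -1
    dW = np.outer(s, s)
    np.fill_diagonal(dW, 0)
    return W - eta * dW                      # opposite of  delta_w_ij = s_i s_j
```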

    Unlearning and Biological Dreaming:
    The question of why we dream, and what the function of dreaming is, is a long-standing one:

    • When dreaming, the state of the brain is extremely similar to the state of the brain when it is awake, except that it is not driven by real input; rather, it is driven by a relay station.
    • We dream for several hours a day, yet we remember almost none of our dreams.

    Crick and Mitchison proposed unlearning as a model of what dreams are for:

    • During the day, we store a lot of things and get spurious minima.
    • At night, we put the network (brain) in a random state, settle to a minimum, and then unlearn what we settled to.
    • The function of dreams is to get rid of those spurious minima.
    • That’s why we don’t remember them, even though we dream for many hours (unless we wake up during the dream).
      I.E. We don’t store our dreams.

    Optimal Amount of Unlearning:
    From a mathematical point of view, we would like to derive exactly how much unlearning is needed.
    If unlearning is part of fitting a model to data by maximum likelihood, then both the form of unlearning and the amount required should fall out of the fitting procedure automatically.
    Thus, the solution is to derive unlearning as the correct way to minimize some cost function, where the cost function measures how well the network models the data seen during the day.

    The Maximal Capacity of a given network architecture, over all possible learning rules:
    Elizabeth Gardner showed that the capacity of fully connected networks of binary neurons with dense patterns scales as \(2N\), a storage capacity which is much larger than the one of the Hopfield model.
    Learning Rules that are able to saturate the Gardner Bound:
    A simple learning rule that is guaranteed to achieve this bound is the Perceptron Learning Algorithm (PLA) applied to each neuron independently.
    However, unlike the rule used in the Hopfield model, PLA is a supervised rule that needs an explicit “error signal” in order to achieve the Gardner bound.

    Increasing the Capacity of Hopfield Networks:
    Elizabeth Gardner showed that there was a much better storage rule that uses the full capacity of the weights:

    • Instead of trying to store vectors in one shot, cycle through the training set many times.
    • Use the Perceptron Learning Algorithm (PLA) to train each unit to have the correct state given the states of all the other units in that vector.
    • It loses the online learning property in the interest of more efficient storage.
    • Statisticians call this technique “pseudo-likelihood”.
    • Procedure Description:
      • Set the network to the memory state you want to store
      • Take each unit separately and check if this unit adopts the state we want for it given the states of all the other units:
        • if it would, leave its incoming weights alone
        • if it wouldn't, change its incoming weights in the way specified by the PLA
          Notice these will be integer changes to the weights
      • Repeat several times, as needed.
    • Convergence:
      • If there are too many memories, the Perceptron Convergence Procedure won't converge
      • The PLA converges only if there is a set of weights that will solve the problem
        Assuming there is, this is a much more efficient way to store memories.
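    A sketch of this storage procedure (the pattern shapes and pass limit are arbitrary; note that training each unit's incoming weights independently need not keep the weight matrix symmetric):

```python
import numpy as np

def store_with_pla(patterns, passes=100):
    """Cycle through the memories, training each unit with the perceptron rule
    to output its own bit given all the other bits (pseudo-likelihood style).
    Assumes +-1 patterns of shape (n_patterns, N)."""
    n_pat, N = patterns.shape
    W = np.zeros((N, N))
    for _ in range(passes):
        errors = 0
        for p in patterns:
            for i in range(N):
                pred = 1 if W[i] @ p >= 0 else -1
                if pred != p[i]:              # wrong: nudge incoming weights
                    W[i] += p[i] * p          # integer (+-1) changes
                    W[i, i] = 0               # keep no self-connection
                    errors += 1
        if errors == 0:                       # every memory is a fixed point
            break
    return W
```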

  11. Hopfield Networks with Hidden Units:
    We add some hidden units to the network with the goal of making the states of those hidden units represent an interpretation of the perceptual input that’s shown on the visible units.
    The idea is that the weights between units represent constraints on good interpretations and by finding a low energy state we find a good interpretation of the input data.

    Different Computational Role for Hopfield Nets:
    Instead of using the Hopfield Net to store memories, we can use it to construct interpretations of sensory input, where

    • The input is represented by the visible units.
    • The interpretation is represented by the states of the hidden units.
    • The Quality of the interpretations is represented by the (negative) Energy function.

    img

    Example - Interpreting a Line Drawing

    Two Difficult Computational Issues:
    Using the states of the hidden units to represent an interpretation of the input raises two difficult issues:

    • Search: How do we avoid getting trapped in poor local minima of the energy function?
      Poor minima represent sub-optimal interpretations.
    • Learning: How do we learn the weights on the connections to the hidden units? and between the hidden units?
      Notice that there is no supervision in the problem.
  12. Stochastic Units to improve Search:
    Adding noise helps systems escape local energy minima.

    Noisy Networks Find Better Energy Minima:

    • A Hopfield net always makes decisions that reduce the energy.
      • This makes it impossible to escape from local minima.
    • We can use random noise to escape from poor minima.
      • Start with a lot of noise so it's easy to cross energy barriers.
      • Slowly reduce the noise so that the system ends up in a deep minimum.
        This is “simulated annealing” (Kirkpatrick et al., 1981).

    Effects of Temperature on the Transition Probabilities:
    The temperature in a physical system (or a simulated system with an Energy Function) affects the transition probabilities.

    • High Temperature System:
      img
      • The probability of crossing barriers is high.
        i.e. the probability of going uphill from \(B\) to \(A\) is lower than the probability of going downhill from \(A\) to \(B\), but not much lower.
        • In effect, the temperature flattens the energy landscape.
      • So, the Ratio of the Probabilities is low.
        Thus,
        • It is easy to cross barriers
        • It is hard to stay in a deep minimum once you’ve got there
    • Low Temperature System:
      img
      • The probability of crossing barriers is much smaller.
        i.e. the probability of going uphill from \(B\) to \(A\) is much lower than the probability of going downhill from \(A\) to \(B\).
      • So, the Ratio of the Probabilities is much higher.
        Thus,
        • It is harder to cross barriers
        • It is easy to stay in a deep minimum once you’ve got there
      • Thus, if we run the system long enough, we expect all the particles to end up in \(B\).
        However, if we run it at a low temperature, it will take a very long time for particles to escape from \(A\).
      • A good compromise, which speeds up convergence, is to start at a high temperature and then gradually decrease it.

    Stochastic Binary Units:
    To inject noise into a Hopfield Net, we replace the binary threshold units with binary stochastic units that make biased random decisions.
    • The Temperature controls the amount of noise.
    • Raising the noise level is equivalent to decreasing all the energy gaps between configurations.

    $$p\left(s_{i}=1\right)=\frac{1}{1+e^{-\Delta E_{i} / T}}$$

    • This is a normal logistic equation, but with the energy gap scaled by a temperature \(T\):
      • High Temperature: the exponent will be \(\approx 0\), so the exponential is \(\approx 1\) and \(p\left(s_{i}=1\right) \approx \dfrac{1}{2}\).
        i.e. the probability of a unit turning on is about a half.
        It will be in its on and off states equally often.
      • Low Temperature: depending on the sign of \(\Delta E_{i}\), the unit will become more firmly on or more firmly off.
      • Zero Temperature: (e.g. in Hopfield Nets) the sign of \(\Delta E_{i}\) determines whether RHS is \(0\) or \(1\).
        I.E. the unit will behave deterministically; a standard binary threshold unit, that will always adopt whichever of the two states gives the lowest energy.
    • Boltzmann Machines use stochastic binary units, with temperature \(T=1\) (i.e. standard logistic equation).
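    A sketch of one sweep of stochastic updates, using the \(0/1\) convention of the energy-gap formula above (the function name is illustrative):

```python
import numpy as np

def stochastic_sweep(W, b, s, T, rng):
    """One sweep of stochastic binary units (0/1 states):
    p(s_i = 1) = 1 / (1 + exp(-dE_i / T)), with dE_i = b_i + sum_j w_ij s_j."""
    for i in rng.permutation(len(s)):
        dE = b[i] + W[i] @ s                  # energy gap for unit i
        s[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-dE / T)) else 0.0
    return s

# Simulated annealing: start hot and cool slowly, e.g.
#   for T in np.geomspace(10.0, 0.1, 50):
#       s = stochastic_sweep(W, b, s, T, rng)
```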

    Thermal Equilibrium at a fixed temperature \(T=1\):

    • Thermal Equilibrium does not mean that the system has settled down into the lowest energy configuration.
      i.e. it is not the states of the individual units that settle down.
      The individual units still rattle around at Equilibrium, unless the temperature is zero \(T=0\).
    • What settles down is the probability distribution over configurations.
      • It settles to the stationary distribution.
        • The stationary distribution is determined by the energy function of the system.
        • In the stationary distribution, the probability of any configuration is \(\propto e^{-E}\).
    • Intuitive Interpretation of Thermal Equilibrium:
      • Imagine a huge ensemble of systems that all have exactly the same energy function.
      • The probability of a configuration is just the fraction of the systems that have that configuration.
    • Approaching Thermal Equilibrium:
      • Start with any distribution we like over all the identical systems.
        • We could start with all the systems in the same configuration (Dirac distribution).
        • Or with an equal number of systems in each possible configuration (uniform distribution).
      • Then we keep applying our stochastic update rule to pick the next configuration for each individual system.
      • After running the systems stochastically in the right way, we may eventually reach a situation where the fraction of systems in each configuration remains constant.
        • This is the stationary distribution that physicists call thermal equilibrium.
        • Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.
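    An empirical sketch of approaching thermal equilibrium on a tiny made-up net: the long-run configuration frequencies under the stochastic rule should be proportional to \(e^{-E}\):

```python
import numpy as np

rng = np.random.default_rng(3)
W = np.array([[0., 2., -1.],
              [2., 0., 1.],
              [-1., 1., 0.]])
b = np.zeros(3)
s = rng.integers(0, 2, size=3).astype(float)   # 0/1 states, T = 1
counts = {}
for step in range(100000):
    i = rng.integers(3)
    dE = b[i] + W[i] @ s                       # energy gap for unit i
    s[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-dE)) else 0.0
    if step > 5000:                            # discard burn-in
        key = tuple(int(x) for x in s)
        counts[key] = counts.get(key, 0) + 1
print(counts)   # compare with exp(-E)/Z over the 8 possible configurations
```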
    As Boltzmann Machines:
    Boltzmann Machines are just Stochastic Hopfield Nets with Hidden Units.




How a Boltzmann machine models a set of binary data vectors
Why model a set of binary data vectors, and what we could do with such a model if we had it.
The probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.

BMs are good at modeling binary data

Modeling Binary Data:
Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector.

Models for Generating Data:
There are different kinds of models to generate data:

How a Causal Model Generates Data:

Generating a binary vector: first generate the states of some latent variables, and then use the latent variables to generate the binary vector.

How a Boltzmann Machine Generates Data:

Unlike a causal model, a Boltzmann Machine does not generate data in two stages; everything is defined in terms of the energies of joint configurations of the visible and hidden units. To generate a binary vector, the network settles to thermal equilibrium at \(T=1\) starting from a random global state, and the states of the visible units are then read off.

The Goal of Learning:
We want to maximize the product of the probabilities (sum of log-probabilities) that the Boltzmann Machine assigns to the binary vectors in the training set.
This is equivalent to maximizing the probability of obtaining exactly the \(N\) training cases if we ran the BM as follows: let the network settle to its stationary distribution \(N\) different times with no external input, and sample the visible vector once each time.

Possible Difficulty in Learning - Global Information:
Consider a chain of units with visible units at the ends:
img
If the training set is \((1,0)\) and \((0,1)\) we want the product of all the weights to be negative.
So to know how to change \(w_1\) or \(w_5\) we must know \(w_3\).

Learning with Local Information:
A very surprising fact is the following:
Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.
The derivative of log probability of one training vector wrt. one weight \(w_{ij}\):

$$\dfrac{\partial \log p(\mathbf{v})}{\partial w_{i j}}=\left\langle s_{i} s_{j}\right\rangle_{\mathbf{v}}-\left\langle s_{i} s_{j}\right\rangle_{\text {free}}$$

where:

• \(\left\langle s_{i} s_{j}\right\rangle_{\mathbf{v}}\) is the expected product of the two states at thermal equilibrium when \(\mathbf{v}\) is clamped on the visible units, and
• \(\left\langle s_{i} s_{j}\right\rangle_{\text{free}}\) is the same expectation when the network is running free, with nothing clamped.

So the change in the weight is proportional to the expected product of the activities, averaged over all visible vectors in the training set (the “data” term), minus the product of the same two activities when the network has reached thermal equilibrium with no clamping and no external interference (the “model” term):

$$\Delta w_{i j} \propto\left\langle s_{i} s_{j}\right\rangle_{\text{data}}-\left\langle s_{i} s_{j}\right\rangle_{\text{model}}$$

Thus, the learning algorithm only requires local information.
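A sketch of this update, assuming equilibrium samples have already been collected in the two phases (clamped vs. free-running; the function name is illustrative):

```python
import numpy as np

def boltzmann_weight_update(samples_clamped, samples_free, lr=0.01):
    """delta_w proportional to <s_i s_j>_data - <s_i s_j>_model, estimated
    from lists of equilibrium state vectors gathered in each phase."""
    corr_data = np.mean([np.outer(s, s) for s in samples_clamped], axis=0)
    corr_model = np.mean([np.outer(s, s) for s in samples_free], axis=0)
    dW = lr * (corr_data - corr_model)
    np.fill_diagonal(dW, 0)              # no self-connections
    return dW
```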

Effects of the Positive and Negative Phases of Learning:
Given the probability of a training data vector \(\boldsymbol{v}\):

$$p(\boldsymbol{v})=\dfrac{\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}}{\sum_{\boldsymbol{u}} \sum_{\boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}}$$

and the log probability:

$$\begin{align} \log p(\boldsymbol{v}) &= \log \left(\dfrac{\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}}{\sum_{\boldsymbol{u}} \sum_{\boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}}\right) \\ &= \log \left(\sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v}, \boldsymbol{h})}\right) - \log \left(\sum_{\boldsymbol{u}} \sum_{\boldsymbol{g}} e^{-E(\boldsymbol{u}, \boldsymbol{g})}\right) \end{align} $$

Thus, the positive phase of learning lowers the energies \(E(\boldsymbol{v}, \boldsymbol{h})\) of configurations containing the training vectors, making the numerator of \(p(\boldsymbol{v})\) bigger, while the negative phase raises the energies of all global configurations, making the denominator (the partition function) smaller.

Learning Rule Justification - Why the Derivative is so simple:
At thermal equilibrium, the probability of a global configuration is an exponential function of its energy, so the log probability is a linear function of the energy; and the energy is itself a linear function of the weights and states, with \(-\dfrac{\partial E}{\partial w_{i j}}=s_{i} s_{j}\). The process of settling to thermal equilibrium propagates information about the weights, so no explicit backpropagation is needed.

The Batch Learning Algorithm - An inefficient way to collect the Learning Statistics:
• Positive phase: clamp a data vector on the visible units, let the hidden units reach thermal equilibrium, sample \(s_{i} s_{j}\) for every connected pair of units, and average over all data vectors.
• Negative phase: clamp nothing; let the whole network reach thermal equilibrium from a random state, sample \(s_{i} s_{j}\), and average over many repetitions.
This is inefficient because settling to thermal equilibrium, especially in the negative phase with its many modes, can take a very long time.

[1] The state space is the corners of a hypercube. Showing it as a \(1-D\) continuous space is a misrepresentation.