Neural Architectures

  1. Neural Architectures - Graphical and Probabilistic Properties:

    Notes:

  2. Neural Architectures:
    FeedForward Network:
    • Representations:
      • Representational-Power: Universal Function Approximator (given at least one hidden layer with a non-linear activation and enough hidden units).
        Learns non-linear features of the input.
    • Input Structure:
      • Size: Fixed-Sized Inputs.
    • Transformation/Operation: Affine Transformations (Matrix-Multiplication plus bias) followed by element-wise non-linearities (see the sketch after this list).
    • Inductive Biases:
      • Weak/Minimal: fully-connected layers assume no particular structure in the input.
    • Computational Power:
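    A minimal sketch of such a network, assuming illustrative layer sizes and a ReLU non-linearity (the helper names are ours, not from any library):

    ```python
    # A feed-forward network: a stack of affine transformations (matrix
    # multiplication plus bias) with element-wise non-linearities in between.
    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in, n_out):
        # Small random weights scaled by fan-in, zero biases (see the Initializations notes below).
        return rng.normal(0, 1 / np.sqrt(n_in), size=(n_out, n_in)), np.zeros(n_out)

    W1, b1 = init_layer(4, 32)
    W2, b2 = init_layer(32, 1)

    def forward(x):
        h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: affine + ReLU
        return W2 @ h + b2                  # output layer: affine only

    print(forward(rng.normal(size=4)))      # fixed-size input -> fixed-size output
    ```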

    Convolutional Network:

    • Representations:
      • Representational-Power: Universal Function Approximator.
      • Representations Properties:
        • Translation-Equivariance via Convolutions (translation-equivariant feature maps)
        • (Approximate) Translation-Invariance to small shifts via Pooling
    • Input Structure:
      • Inputs with grid-like topology.
        Images, Time-series, Sentences.
      • Size: Variable-Sized Inputs.
    • Transformation/Operation: Convolution.
    • Inductive Biases (see the sketch after this list):
      • Local-Connectivity: features depend on spatially local correlations (sparse interactions).
      • Parameter-Sharing: the same kernel is applied at every position.
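    A minimal sketch of these biases on a 1-D signal, with an illustrative kernel (not any particular library's layer):

    ```python
    # The same small kernel (parameter sharing) is applied at every position
    # (local connectivity), so shifting the input shifts the feature map
    # (translation equivariance); pooling then gives (approximate) invariance.
    import numpy as np

    kernel = np.array([1.0, 0.0, -1.0])          # a local edge-detector-like filter
    signal = np.zeros(16)
    signal[5] = 1.0                              # a single "event" at position 5
    shifted = np.roll(signal, 3)                 # the same event, shifted by 3

    out = np.convolve(signal, kernel, mode="valid")
    out_shifted = np.convolve(shifted, kernel, mode="valid")

    # Equivariance: convolving the shifted input equals shifting the convolved output.
    print(np.allclose(out_shifted, np.roll(out, 3)))   # True

    # Invariance via pooling: a global max-pool gives the same value either way.
    print(out.max() == out_shifted.max())              # True
    ```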

    Recurrent Network:

    • Representations:
      • Representational-Power:
    • Input Structure:
      • Sequential Data.
        Sentences, Time-series, Images.
    • Transformation/Operation: Recurrent Affine Transformations (Matrix-Multiplication) with non-linearities; gated variants (LSTM, GRU) add multiplicative gates (see the sketch after this list).
    • Inductive Biases:
    • Computational Power (Model of Computation): Turing Complete (can simulate a Universal Turing Machine, assuming unbounded precision and running time).

    • Mathematical Model/System: Non-Linear Dynamical System.
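    A minimal sketch of one recurrent step, assuming a vanilla tanh RNN cell with illustrative sizes (gated variants add multiplicative gates on top of the same recurrent affine transformation):

    ```python
    # One step of the non-linear dynamical system h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 8, 16

    # Parameters are shared across all time steps (the recurrence's weight tying).
    W_xh = rng.normal(0, 0.1, size=(d_hidden, d_in))
    W_hh = rng.normal(0, 0.1, size=(d_hidden, d_hidden))
    b_h = np.zeros(d_hidden)

    def rnn_step(x_t, h_prev):
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

    # Unroll over a (variable-length) sequence.
    h = np.zeros(d_hidden)
    for t in range(5):
        x_t = rng.normal(size=d_in)
        h = rnn_step(x_t, h)
    print(h.shape)   # (16,)
    ```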

    Transformer Network:

    • Representations:
      • Representational-Power: Universal approximators of sequence-to-sequence functions (given enough layers/heads).
    • Input Structure:
      • Sequences (or sets) of tokens; order is injected via positional encodings.
      • Size: Variable-Sized Inputs.
    • Transformation/Operation: Scaled dot-product self-attention plus position-wise feed-forward layers (see the sketch after this list).
    • Inductive Biases:
      • Without positional encodings, self-attention is permutation-equivariant.
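    A minimal sketch of single-head scaled dot-product self-attention with illustrative shapes (real transformers add multiple heads, residual connections, layer norm, and feed-forward blocks):

    ```python
    # Each token's output is a softmax-weighted mixture of all tokens' value vectors.
    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 16                     # sequence length, model/key dimension

    X = rng.normal(size=(T, d))      # one token embedding per row
    W_q, W_k, W_v = (rng.normal(0, 1 / np.sqrt(d), size=(d, d)) for _ in range(3))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)                      # pairwise token-token scores, (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    output = weights @ V                               # (T, d): one mixed vector per token
    print(output.shape)                                # (5, 16)
    ```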

    Recursive Network:

    • Representational-Power:
    • Input Structure: Any Hierarchical (tree-structured) Input, e.g., parse trees of sentences.

    Further Network Architectures (More Specialized):

  3. Types/Taxonomy of Neural Networks:


  4. Neural Networks and Graphical Models:
    Deep NNs as PGMs:
    You can view a deep neural network as a graphical model in which the CPDs are not probabilistic but deterministic. Consider, for example, a neuron whose input is \(\vec{x}\) and whose output is \(y\). Its CPD satisfies \(p(y \mid \vec{x})=1\) and \(p(\hat{y} \mid \vec{x})=0\) for every \(\hat{y} \neq y\). Refer to Section 10.2.3 of the Deep Learning Book for more details.
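    Concretely, if the neuron computes \(y=f(\vec{x})\) for a deterministic function \(f\) (an activation applied to an affine combination of its inputs), the corresponding CPD is a point mass:
    \[
      p(y \mid \vec{x}) \;=\; \mathbb{1}\big[\, y = f(\vec{x}) \,\big]
      \;=\;
      \begin{cases}
        1 & y = f(\vec{x}) \\
        0 & \text{otherwise.}
      \end{cases}
    \]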

  5. Neural Networks as Gaussian Processes:
    It has long been known that deep neural networks can be related to Gaussian processes. Take a neural network (a recursive application of affine transformations followed by non-linear functions), put a probability distribution over each weight (a normal distribution, for example), and with infinitely many weights you recover a Gaussian process (see Neal or Williams for more details).

    We can think of the finite model as an approximation to a Gaussian process.
    When we optimise our objective, we minimise some “distance” (the KL divergence, to be exact) between our model and the Gaussian process.
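    A minimal numerical sketch of this limit, assuming a single tanh hidden layer with zero-mean Gaussian priors and fan-in-scaled variances (the scalings and sample counts are illustrative, not the exact construction in the references): sampling many random networks at a fixed input shows the output distribution settling towards a Gaussian as the hidden width grows.

    ```python
    # Outputs of random single-hidden-layer networks at one fixed input: as the
    # width H grows, the excess kurtosis of the output distribution approaches 0
    # (a Gaussian), illustrating the infinite-width GP limit.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_network_outputs(x, hidden_width, n_samples=2000):
        d = x.shape[0]
        W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_samples, hidden_width, d))
        b1 = rng.normal(0.0, 1.0, size=(n_samples, hidden_width))
        W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_width), size=(n_samples, hidden_width))
        h = np.tanh(W1 @ x + b1)           # hidden activations, (n_samples, H)
        return np.sum(W2 * h, axis=1)      # one scalar output per sampled network

    def excess_kurtosis(f):
        z = (f - f.mean()) / f.std()
        return np.mean(z ** 4) - 3.0       # 0 for a Gaussian

    x = np.array([0.5, -1.0, 2.0])
    for H in (1, 10, 500):
        f = sample_network_outputs(x, H)
        print(f"H={H:4d}  std={f.std():.3f}  excess kurtosis={excess_kurtosis(f):+.3f}")
    ```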

  6. Neural Layers and Block Architectures:
    • Feed-Forward Layer:
      • Representational-Power: Universal Function Approximator.
        Learns non-linear features.
      • Input Structure:
    • Convolutional Layer:
      • Representational-Power:
      • Input Structure:
    • Recurrent Layer:
      • Representational-Power:
      • Input Structure:
    • Recursive Layer:
      • Representational-Power:
    • Attention Layer:
      • Representational-Power:
      • Input Structure:
    • Attention Block:
      • Representational-Power:
    • Residual Block:
      • Representational-Power: The identity skip connection \(y = x + F(x)\) does not add representational power per se, but it makes (near-)identity mappings easy to represent and eases the optimization of very deep stacks (see the sketch after this list).
    • Reversible Block:
      • Representational-Power: A pair of coupled residual-style functions whose inputs can be reconstructed exactly from their outputs, so activations need not be cached for backpropagation.
    • Reversible Layer:
      • Representational-Power:
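    A minimal sketch of a residual block and a reversible (additive-coupling) block, assuming simple illustrative functions F and G in place of real sub-networks:

    ```python
    import numpy as np

    def F(x):                      # stand-in for an arbitrary sub-network
        return np.tanh(x)

    def G(x):
        return 0.5 * np.tanh(x)

    def residual_block(x):
        # y = x + F(x): the identity path lets information and gradients pass through.
        return x + F(x)

    def reversible_forward(x1, x2):
        # Additive coupling: the outputs determine the inputs exactly.
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def reversible_inverse(y1, y2):
        # Recompute the inputs from the outputs -- no activation caching needed.
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
        return x1, x2

    x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
    y1, y2 = reversible_forward(x1, x2)
    r1, r2 = reversible_inverse(y1, y2)
    print(np.allclose(x1, r1) and np.allclose(x2, r2))   # True
    ```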
  7. Notes:
    • Complexity:
      • Caching the activations of a NN:
        We need to cache each layer's values \(Z^{[l]}\) (and activations \(A^{[l]}\)) computed during the forward pass because they are required in the backward computation of the gradients (see the sketch below).
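    A minimal sketch for a single sigmoid layer, with illustrative helper names: the backward pass reuses the quantities cached by the forward pass.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(W, b, A_prev):
        Z = W @ A_prev + b
        A = sigmoid(Z)
        cache = (A_prev, Z)          # stored during forward, reused during backward
        return A, cache

    def backward(dA, W, cache):
        A_prev, Z = cache
        dZ = dA * sigmoid(Z) * (1 - sigmoid(Z))   # needs the cached Z
        dW = dZ[:, None] @ A_prev[None, :]        # needs the cached A_prev
        db = dZ
        dA_prev = W.T @ dZ
        return dA_prev, dW, db

    rng = np.random.default_rng(0)
    W, b, A_prev = rng.normal(size=(3, 4)), np.zeros(3), rng.normal(size=4)
    A, cache = forward(W, b, A_prev)
    grads = backward(np.ones(3), W, cache)
    print([g.shape for g in grads])   # [(4,), (3, 4), (3,)]
    ```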
    • Initializations:
      • Initializing NN:
        • Don’t initialize the weights to zero. Identical initial weights make every hidden unit compute the same output and receive the same gradient, so (by induction over the updates) all rows of the weight matrix stay equal and the units never differentiate.
        • It’s OK to initialize the bias term to zero.
        • Since a neuron sums \(n\) input\(\times\)weight terms, if \(n\) is large you want smaller \(w_i\)s: initialize with a variance \(\propto \dfrac{1}{n}\) (i.e. multiply standard-normal samples by \(\dfrac{1}{\sqrt{n}}\), where \(n\) is the number of weights coming from the previous layer), as sketched below.
          This doesn’t solve the vanishing/exploding-gradient problem, but it reduces it, because each pre-activation \(z\) then keeps a similar (unit-scale) distribution across layers.
          • Xavier Initialization: assumes a \(\tanh\) activation; uses the logic above, sampling from a normal distribution and scaling by \(\dfrac{1}{\sqrt{n}}\).
          • He Initialization: with ReLU activations it turns out to be better to make the variance \(\propto \dfrac{2}{n}\) instead.
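    A minimal sketch of these fan-in-scaled initializations (function names are ours; the scalings follow the \(1/n\) and \(2/n\) rules above):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def init_xavier(n_in, n_out):
        # Var(w) = 1/n_in, suited to tanh activations.
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

    def init_he(n_in, n_out):
        # Var(w) = 2/n_in, suited to ReLU activations.
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

    W_tanh, b_tanh = init_xavier(512, 256)   # biases start at zero
    W_relu, b_relu = init_he(512, 256)
    print(W_tanh.std(), W_relu.std())        # roughly sqrt(1/512) and sqrt(2/512)
    ```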
    • Training:
    • The Bias Parameter:
    • Failures of Neural Networks:
    • Bayesian Deep Learning: