Neural Architectures

  1. Neural Architectures - Graphical and Probabilistic Properties:

    Notes:

  2. Neural Architectures:
    FeedForward Network:
    • Representations:
      • Representational-Power: Universal Function Approximator (given at least one hidden layer with a non-linear activation and enough hidden units).
        Learns non-linear features of the input.
    • Input Structure:
      • Size: Fixed-Sized Inputs.
    • Transformation/Operation: Affine Transformations (Matrix-Multiplication plus bias) followed by element-wise non-linearities (see the sketch after this list).
    • Inductive Biases:
      • Weak/Minimal: fully-connected layers assume no particular structure in the input.
    • Computational Power:
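    A minimal sketch of such a network, assuming illustrative layer sizes and a ReLU non-linearity (the helper names are ours, not from any library):

    ```python
    # A feed-forward network: a stack of affine transformations (matrix
    # multiplication plus bias) with element-wise non-linearities in between.
    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in, n_out):
        # Small random weights scaled by fan-in, zero biases (see the Initializations notes below).
        return rng.normal(0, 1 / np.sqrt(n_in), size=(n_out, n_in)), np.zeros(n_out)

    W1, b1 = init_layer(4, 32)
    W2, b2 = init_layer(32, 1)

    def forward(x):
        h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: affine + ReLU
        return W2 @ h + b2                  # output layer: affine only

    print(forward(rng.normal(size=4)))      # fixed-size input -> fixed-size output
    ```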

    Convolutional Network:

    • Representations:
      • Representational-Power: Universal Function Approximator.
      • Representations Properties:
        • Translation-Equivariance via Convolutions (translation-equivariant feature maps)
        • (Approximate) Translation-Invariance to small shifts via Pooling
    • Input Structure:
      • Inputs with grid-like topology.
        Images, Time-series, Sentences.
      • Size: Variable-Sized Inputs.
    • Transformation/Operation: Convolution.
    • Inductive Biases (see the sketch after this list):
      • Local-Connectivity: features depend on spatially local correlations (sparse interactions).
      • Parameter-Sharing: the same kernel is applied at every position.
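    A minimal sketch of these biases on a 1-D signal, with an illustrative kernel (not any particular library's layer):

    ```python
    # The same small kernel (parameter sharing) is applied at every position
    # (local connectivity), so shifting the input shifts the feature map
    # (translation equivariance); pooling then gives (approximate) invariance.
    import numpy as np

    kernel = np.array([1.0, 0.0, -1.0])          # a local edge-detector-like filter
    signal = np.zeros(16)
    signal[5] = 1.0                              # a single "event" at position 5
    shifted = np.roll(signal, 3)                 # the same event, shifted by 3

    out = np.convolve(signal, kernel, mode="valid")
    out_shifted = np.convolve(shifted, kernel, mode="valid")

    # Equivariance: convolving the shifted input equals shifting the convolved output.
    print(np.allclose(out_shifted, np.roll(out, 3)))   # True

    # Invariance via pooling: a global max-pool gives the same value either way.
    print(out.max() == out_shifted.max())              # True
    ```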

    Recurrent Network:

    • Representations:
      • Representational-Power:
    • Input Structure:
      • Sequential Data.
        Sentences, Time-series, Images.
    • Transformation/Operation: Recurrent Affine Transformations (Matrix-Multiplication) with non-linearities; gated variants (LSTM, GRU) add multiplicative gates (see the sketch after this list).
    • Inductive Biases:
    • Computational Power (Model of Computation): Turing Complete (can simulate a Universal Turing Machine, assuming unbounded precision and running time).

    • Mathematical Model/System: Non-Linear Dynamical System.
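    A minimal sketch of one recurrent step, assuming a vanilla tanh RNN cell with illustrative sizes (gated variants add multiplicative gates on top of the same recurrent affine transformation):

    ```python
    # One step of the non-linear dynamical system h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 8, 16

    # Parameters are shared across all time steps (the recurrence's weight tying).
    W_xh = rng.normal(0, 0.1, size=(d_hidden, d_in))
    W_hh = rng.normal(0, 0.1, size=(d_hidden, d_hidden))
    b_h = np.zeros(d_hidden)

    def rnn_step(x_t, h_prev):
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

    # Unroll over a (variable-length) sequence.
    h = np.zeros(d_hidden)
    for t in range(5):
        x_t = rng.normal(size=d_in)
        h = rnn_step(x_t, h)
    print(h.shape)   # (16,)
    ```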

    Transformer Network:

    • Representations:
      • Representational-Power: Universal approximators of sequence-to-sequence functions (given enough layers/heads).
    • Input Structure:
      • Sequences (or sets) of tokens; order is injected via positional encodings.
      • Size: Variable-Sized Inputs.
    • Transformation/Operation: Scaled dot-product self-attention plus position-wise feed-forward layers (see the sketch after this list).
    • Inductive Biases:
      • Without positional encodings, self-attention is permutation-equivariant.
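    A minimal sketch of single-head scaled dot-product self-attention with illustrative shapes (real transformers add multiple heads, residual connections, layer norm, and feed-forward blocks):

    ```python
    # Each token's output is a softmax-weighted mixture of all tokens' value vectors.
    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 16                     # sequence length, model/key dimension

    X = rng.normal(size=(T, d))      # one token embedding per row
    W_q, W_k, W_v = (rng.normal(0, 1 / np.sqrt(d), size=(d, d)) for _ in range(3))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)                      # pairwise token-token scores, (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    output = weights @ V                               # (T, d): one mixed vector per token
    print(output.shape)                                # (5, 16)
    ```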

    Recursive Network:

    • Representational-Power:
    • Input Structure: Any Hierarchical (tree-structured) Input, e.g., parse trees of sentences.

    Further Network Architectures (More Specialized):

  3. Types/Taxonomy of Neural Networks:


  4. Neural Networks and Graphical Models:
    Deep NNs as PGMs:
    You can view a deep neural network as a graphical model in which the CPDs are not probabilistic but deterministic. Consider, for example, a neuron whose input is \(\vec{x}\) and whose output is \(y\). Its CPD satisfies \(p(y \mid \vec{x})=1\) and \(p(\hat{y} \mid \vec{x})=0\) for every \(\hat{y} \neq y\). Refer to Section 10.2.3 of the Deep Learning Book for more details.
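    Concretely, if the neuron computes \(y=f(\vec{x})\) for a deterministic function \(f\) (an activation applied to an affine combination of its inputs), the corresponding CPD is a point mass:
    \[
      p(y \mid \vec{x}) \;=\; \mathbb{1}\big[\, y = f(\vec{x}) \,\big]
      \;=\;
      \begin{cases}
        1 & y = f(\vec{x}) \\
        0 & \text{otherwise.}
      \end{cases}
    \]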

  5. Neural Networks as Gaussian Processes:
    It has long been known that deep neural networks can be related to Gaussian processes. Take a neural network (a recursive application of affine transformations followed by non-linear functions), put a probability distribution over each weight (a normal distribution, for example), and with infinitely many weights you recover a Gaussian process (see Neal or Williams for more details).

    We can think of the finite model as an approximation to a Gaussian process.
    When we optimise our objective, we minimise some “distance” (the KL divergence, to be exact) between our model and the Gaussian process.
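    A minimal numerical sketch of this limit, assuming a single tanh hidden layer with zero-mean Gaussian priors and fan-in-scaled variances (the scalings and sample counts are illustrative, not the exact construction in the references): sampling many random networks at a fixed input shows the output distribution settling towards a Gaussian as the hidden width grows.

    ```python
    # Outputs of random single-hidden-layer networks at one fixed input: as the
    # width H grows, the excess kurtosis of the output distribution approaches 0
    # (a Gaussian), illustrating the infinite-width GP limit.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_network_outputs(x, hidden_width, n_samples=2000):
        d = x.shape[0]
        W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_samples, hidden_width, d))
        b1 = rng.normal(0.0, 1.0, size=(n_samples, hidden_width))
        W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_width), size=(n_samples, hidden_width))
        h = np.tanh(W1 @ x + b1)           # hidden activations, (n_samples, H)
        return np.sum(W2 * h, axis=1)      # one scalar output per sampled network

    def excess_kurtosis(f):
        z = (f - f.mean()) / f.std()
        return np.mean(z ** 4) - 3.0       # 0 for a Gaussian

    x = np.array([0.5, -1.0, 2.0])
    for H in (1, 10, 500):
        f = sample_network_outputs(x, H)
        print(f"H={H:4d}  std={f.std():.3f}  excess kurtosis={excess_kurtosis(f):+.3f}")
    ```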

  6. Neural Layers and Block Architectures:
    • Feed-Forward Layer:
      • Representational-Power: Universal Function Approximator.
        Learns non-linear features.
      • Input Structure:
    • Convolutional Layer:
      • Representational-Power:
      • Input Structure:
    • Recurrent Layer:
      • Representational-Power:
      • Input Structure:
    • Recursive Layer:
      • Representational-Power:
    • Attention Layer:
      • Representational-Power:
      • Input Structure:
    • Attention Block:
      • Representational-Power:
    • Residual Block:
      • Representational-Power: The identity skip connection \(y = x + F(x)\) does not add representational power per se, but it makes (near-)identity mappings easy to represent and eases the optimization of very deep stacks (see the sketch after this list).
    • Reversible Block:
      • Representational-Power: A pair of coupled residual-style functions whose inputs can be reconstructed exactly from their outputs, so activations need not be cached for backpropagation.
    • Reversible Layer:
      • Representational-Power:
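    A minimal sketch of a residual block and a reversible (additive-coupling) block, assuming simple illustrative functions F and G in place of real sub-networks:

    ```python
    import numpy as np

    def F(x):                      # stand-in for an arbitrary sub-network
        return np.tanh(x)

    def G(x):
        return 0.5 * np.tanh(x)

    def residual_block(x):
        # y = x + F(x): the identity path lets information and gradients pass through.
        return x + F(x)

    def reversible_forward(x1, x2):
        # Additive coupling: the outputs determine the inputs exactly.
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def reversible_inverse(y1, y2):
        # Recompute the inputs from the outputs -- no activation caching needed.
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
        return x1, x2

    x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
    y1, y2 = reversible_forward(x1, x2)
    r1, r2 = reversible_inverse(y1, y2)
    print(np.allclose(x1, r1) and np.allclose(x2, r2))   # True
    ```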
  7. Notes:
    • Complexity:
      • Caching the activations of a NN:
        We need to cache each layer's values \(Z^{[l]}\) (and activations \(A^{[l]}\)) computed during the forward pass because they are required in the backward computation of the gradients (see the sketch below).
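    A minimal sketch for a single sigmoid layer, with illustrative helper names: the backward pass reuses the quantities cached by the forward pass.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(W, b, A_prev):
        Z = W @ A_prev + b
        A = sigmoid(Z)
        cache = (A_prev, Z)          # stored during forward, reused during backward
        return A, cache

    def backward(dA, W, cache):
        A_prev, Z = cache
        dZ = dA * sigmoid(Z) * (1 - sigmoid(Z))   # needs the cached Z
        dW = dZ[:, None] @ A_prev[None, :]        # needs the cached A_prev
        db = dZ
        dA_prev = W.T @ dZ
        return dA_prev, dW, db

    rng = np.random.default_rng(0)
    W, b, A_prev = rng.normal(size=(3, 4)), np.zeros(3), rng.normal(size=4)
    A, cache = forward(W, b, A_prev)
    grads = backward(np.ones(3), W, cache)
    print([g.shape for g in grads])   # [(4,), (3, 4), (3,)]
    ```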
    • Initializations:
      • Initializing NN:
        • Don’t initialize the weights to zero. Identical initial weights make every hidden unit compute the same output and receive the same gradient, so (by induction over the updates) all rows of the weight matrix stay equal and the units never differentiate.
        • It’s OK to initialize the bias term to zero.
        • Since a neuron sums \(n\) input\(\times\)weight terms, if \(n\) is large you want smaller \(w_i\)s: initialize with a variance \(\propto \dfrac{1}{n}\) (i.e. multiply standard-normal samples by \(\dfrac{1}{\sqrt{n}}\), where \(n\) is the number of weights coming from the previous layer), as sketched below.
          This doesn’t solve the vanishing/exploding-gradient problem, but it reduces it, because each pre-activation \(z\) then keeps a similar (unit-scale) distribution across layers.
          • Xavier Initialization: assumes a \(\tanh\) activation; uses the logic above, sampling from a normal distribution and scaling by \(\dfrac{1}{\sqrt{n}}\).
          • He Initialization: with ReLU activations it turns out to be better to make the variance \(\propto \dfrac{2}{n}\) instead.
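    A minimal sketch of these fan-in-scaled initializations (function names are ours; the scalings follow the \(1/n\) and \(2/n\) rules above):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def init_xavier(n_in, n_out):
        # Var(w) = 1/n_in, suited to tanh activations.
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

    def init_he(n_in, n_out):
        # Var(w) = 2/n_in, suited to ReLU activations.
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

    W_tanh, b_tanh = init_xavier(512, 256)   # biases start at zero
    W_relu, b_relu = init_he(512, 256)
    print(W_tanh.std(), W_relu.std())        # roughly sqrt(1/512) and sqrt(2/512)
    ```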
    • Training:
    • The Bias Parameter:
    • Failures of Neural Networks:
    • Bayesian Deep Learning: