LeNet-5 (LeCun et al., 1998)
img

AlexNet (Krizhevsky et al., 2012)

img

  1. Architecture:
    [CONV1-MAX-POOL1-NORM1-CONV2-MAX-POOL2-NORM2-CONV3-CONV4-CONV5-MAX-POOL3-FC6-FC7-FC8]
  2. Parameters:
    • First Layer (CONV1):
      • 96 filters of size 11x11, applied at stride 4 (see the sketch at the end of this section)
    img
  3. Key Insights:
    • first use of ReLU
    • used Norm layers (not common anymore)
    • heavy data augmentation
    • dropout 0.5
    • batch size 128
    • SGD Momentum 0.9
    • Learning rate 1e-2, divided by 10 manually when validation accuracy plateaus
    • L2 weight decay 5e-4
    • 7 CNN ensemble: 18.2% -> 15.4%
  4. Results:
    img
  5. ZFNet (Zeiler and Fergus, 2013):
    img
    img
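
  A minimal sketch (not part of the original notes, Python) to check the CONV1 numbers quoted above, assuming the standard 227x227x3 input and 96 filters of size 11x11 applied at stride 4 with no padding:

    # Spatial output size of a conv layer: (W - F + 2P) / S + 1
    def conv_output_size(input_size, filter_size, stride, pad=0):
        return (input_size - filter_size + 2 * pad) // stride + 1

    W, C_in = 227, 3      # assumed input: 227x227x3
    F, S, K = 11, 4, 96   # CONV1: 11x11 filters, stride 4, 96 filters

    out = conv_output_size(W, F, S)   # (227 - 11) / 4 + 1 = 55
    params = (F * F * C_in + 1) * K   # weights plus one bias per filter

    print(f"CONV1 output volume: {out}x{out}x{K}")  # 55x55x96
    print(f"CONV1 parameters: {params}")            # 34944, roughly 35K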

VGGNet (Simonyan and Zisserman, 2014)

img

  1. Parameters:
    • CONV: F=3, S=1, P=1 (3x3 filters, stride 1, pad 1)
    • POOL: F=2, S=2 (2x2 max pooling, stride 2)
    • Same settings are used for all CONV and POOL layers

    img

    Notice:
    • Parameters are mostly in the FC layers
    • Memory is consumed mostly in the CONV layers

  2. Key Insights:
    • Smaller Filters
    • Deeper Networks
    • Training procedure similar to AlexNet
    • No LRN Layer
    • Both VGG16 and VGG19
    • Uses Ensembles for Best Results
  3. Smaller Filters Justification:
    • A stack of three 3x3 conv layers (stride 1) has the same effective receptive field as one 7x7 conv layer
    • However, the stack gives a deeper network with more non-linearities
    • It also has fewer parameters:
      3 * (3^2 C^2) = 27C^2 vs. 7^2 C^2 = 49C^2 for C channels per layer (see the sketch at the end of this section)
  4. Properties:
    • FC7 Features generalize well to other tasks
  5. VGG16 vs VGG19:
    VGG19 is only slightly better and uses more memory
  6. Results:
    ILSVRC’14 2nd in classification, 1st in localization
    img
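
  A small sketch (Python; assumes C input and C output channels per layer and ignores biases) of the parameter comparison from item 3, three stacked 3x3 convs vs. one 7x7 conv:

    # Parameters of num_layers stacked conv layers with C input and C output channels, biases ignored.
    def conv_params(filter_size, channels, num_layers=1):
        return num_layers * (filter_size ** 2) * channels * channels

    C = 256  # illustrative channel count
    stacked_3x3 = conv_params(3, C, num_layers=3)  # 3 * 3^2 * C^2 = 27 * C^2
    single_7x7 = conv_params(7, C)                 # 7^2 * C^2     = 49 * C^2

    # Same effective receptive field, but roughly 45% fewer parameters for the 3x3 stack.
    print(stacked_3x3, single_7x7)  # 1769472 vs. 3211264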

GoogLeNet (Szegedy et al., 2014)

img

  1. Architecture:
    img
  2. Parameters:
    Parameters are as specified in the architecture diagram and the Inception modules
  3. Key Insights:
    • (Even) Deeper Networks
    • Computationally Efficient
    • 22 layers
    • Efficient “Inception” module
    • No FC layers
    • Only 5 million parameters: 12x fewer than AlexNet
  4. Inception Module:
    • Idea: design a good local network topology (network within a network) and then stack these modules on top of each other
    img
    • Architecture:
      • Apply parallel filter operations on the input from previous layer:
        • Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
        • Pooling operation (3x3)
      • Concatenate all filter outputs together depth-wise
    • Issue: Computational Complexity is very high
      img
    • Solution: use bottleneck layers that apply 1x1 convolutions to reduce feature depth
      (a 1x1 convolution preserves the spatial dimensions while reducing depth; see the sketch at the end of this section)
    img
  5. Results:
    ILSVRC’14 classification winner
    img
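
  A minimal PyTorch sketch of the Inception-with-bottlenecks idea from item 4 (the channel counts are illustrative, not the exact GoogLeNet values): 1x1 convolutions reduce the depth fed into the expensive 3x3 and 5x5 branches, and all branch outputs are concatenated depth-wise:

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)           # 1x1 conv branch
            self.branch3 = nn.Sequential(
                nn.Conv2d(in_ch, 64, kernel_size=1),                     # 1x1 bottleneck reduces depth
                nn.Conv2d(64, 128, kernel_size=3, padding=1))            # 3x3 conv
            self.branch5 = nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=1),                     # 1x1 bottleneck
                nn.Conv2d(32, 32, kernel_size=5, padding=2))             # 5x5 conv
            self.branch_pool = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),        # 3x3 pooling, spatial size preserved
                nn.Conv2d(in_ch, 32, kernel_size=1))                     # 1x1 conv after pooling

        def forward(self, x):
            # every branch preserves the spatial size, so outputs can be concatenated depth-wise
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    x = torch.randn(1, 256, 28, 28)
    print(InceptionModule(256)(x).shape)  # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 = 256 channels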

ResNet (He et al., 2015)

  1. Architecture:
    img
  2. Key Insights:
    • Very Deep Network: 152 layers
    • Uses Residual Connections
    • Deep networks perform badly NOT because of overfitting but because of inadequate optimization
  3. Motivation:
    img
    • Observation: deeper plain networks have worse test error but also worse training error (so overfitting is not the cause)
    • Assumption: a deeper model should be able to perform at least as well as a shallower one, e.g. by copying the shallower model and setting the extra layers to identity
    • Hypothesis: the problem is an optimization problem; deeper models are harder to optimize
    • Solution (work-around): use network layers to fit a residual mapping F(x) = H(x) - x instead of directly fitting the desired underlying mapping H(x) (see the residual-block sketch at the end of this section)
  4. Residuals:
    img
  5. Bottlenecks:
    img
  6. Training:
    • Batch normalization after every CONV layer
    • Xavier/2 initialization from He et al.
    • SGD + momentum (0.9)
    • Learning rate: 0.1, divided by 10 when validation error plateaus
    • Mini-batch size 256
    • Weight decay of 1e-4
    • No dropout used
  7. Results:
    • ILSVRC’15 classification winner (3.57% top 5 error)
    • Swept all classification and detection competitions in ILSVRC’15 and COCO’15
    • Able to train very deep networks without degradation (152 layers on ImageNet, 1202 layers on CIFAR)
    • Deeper networks now achieve lower training error, as expected
    img
    img
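
  A minimal PyTorch sketch (channel counts are illustrative) of a bottleneck residual block as in items 4-6: a 1x1 conv reduces depth, a 3x3 conv operates on the thinner volume, a 1x1 conv restores depth, batch normalization follows every conv, and the identity shortcut is added back to the output:

    import torch
    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        def __init__(self, channels, bottleneck=64):
            super().__init__()
            self.residual = nn.Sequential(
                nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),              # 1x1: reduce depth
                nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),  # 3x3 conv
                nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),              # 1x1: restore depth
                nn.BatchNorm2d(channels))
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            # the stacked layers fit the residual F(x); the identity shortcut adds x back
            return self.relu(self.residual(x) + x)

    x = torch.randn(1, 256, 56, 56)
    print(BottleneckBlock(256)(x).shape)  # torch.Size([1, 256, 56, 56])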

Comparisons

  1. Complexity:
    img
    img
  2. Forward-Pass Time and Power Consumption:
    img
    img

Interesting Architectures

  1. Network in Network (NiN) [Lin et al. 2014]:
    • Mlpconv layer with “micronetwork” within each conv layer to compute more abstract features for local patches
    • Micronetwork uses multilayer perceptron (FC, i.e. 1x1 conv layers)
    • Precursor to GoogLeNet and ResNet “bottleneck” layers
    • Philosophical inspiration for GoogLeNet
    img
  2. Identity Mappings in Deep Residual Networks (Improved ResNets) [He et al. 2016]:
    • Improved ResNet block design from creators of ResNet
    • Creates a more direct path for propagating information throughout network (moves activation to residual mapping pathway)
    • Gives better performance
    img
  3. Wide Residual Networks (Improved ResNets) [Zagoruyko et al. 2016]:
    • Argues that residuals are the important factor, not depth
    • Uses wider residual blocks (F x k filters instead of F filters in each layer)
    • 50-layer wide ResNet outperforms 152-layer original ResNet
    • Increasing width instead of depth is more computationally efficient (more parallelizable)
    img
  4. Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) [Xie et al. 2016]:
    • Also from creators of ResNet
    • Increases width of residual block through multiple parallel pathways (“cardinality”)
    • Parallel pathways similar in spirit to Inception module
    img
  5. Deep Networks with Stochastic Depth (Improved ResNets) [Huang et al. 2016]:
    • Motivation: reduce vanishing gradients and training time through short networks during training
    • Randomly drop a subset of layers during each training pass
    • Bypass with identity function
    • Use the full deep network at test time (see the sketch after this list)
    img
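
  A minimal PyTorch sketch of stochastic depth from item 5 (the class name and survival_prob argument are illustrative): during training each residual branch is randomly dropped and bypassed with the identity; at test time the full network is used, with each branch scaled by its survival probability:

    import torch
    import torch.nn as nn

    class StochasticDepthBlock(nn.Module):
        def __init__(self, residual_branch, survival_prob=0.8):
            super().__init__()
            self.residual_branch = residual_branch
            self.survival_prob = survival_prob

        def forward(self, x):
            if self.training:
                if torch.rand(1).item() < self.survival_prob:
                    return x + self.residual_branch(x)  # keep this block for the current pass
                return x                                # drop the block, bypass with the identity
            # test time: full deep network, branch scaled by its survival probability
            return x + self.survival_prob * self.residual_branch(x)

    # example use: wrap any shape-preserving residual branch
    block = StochasticDepthBlock(nn.Conv2d(64, 64, kernel_size=3, padding=1))
    print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])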

Beyond ResNets

  1. FractalNet: Ultra-Deep Neural Networks without Residuals [Larsson et al. 2017]:
    • Argues that key is transitioning effectively from shallow to deep and residual representations are not necessary
    • Fractal architecture with both shallow and deep paths to output
    • Trained by dropping out sub-paths
    • Full network at test time
    img
  2. Densely Connected Convolutional Networks [Huang et al. 2017]:
    • Dense blocks where each layer is connected to every other layer in feedforward fashion
    • Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse
    img
  3. SqueezeNet (Efficient NetWork) [Iandola et al. 2017]:
    • AlexNet-level accuracy with 50x fewer parameters and a model size under 0.5MB
    • Fire modules consist of a ‘squeeze’ layer with 1x1 filters feeding an ‘expand’ layer with 1x1 and 3x3 filters (see the sketch below)
    • Can be compressed to 510x smaller than AlexNet (0.5MB)
    img
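
  A minimal PyTorch sketch of a SqueezeNet Fire module as described in item 3 (channel counts are illustrative): a 1x1 ‘squeeze’ layer feeds parallel 1x1 and 3x3 ‘expand’ layers whose outputs are concatenated depth-wise:

    import torch
    import torch.nn as nn

    class FireModule(nn.Module):
        def __init__(self, in_ch, squeeze=16, expand=64):
            super().__init__()
            self.squeeze = nn.Sequential(
                nn.Conv2d(in_ch, squeeze, kernel_size=1), nn.ReLU(inplace=True))              # 1x1 squeeze
            self.expand1x1 = nn.Sequential(
                nn.Conv2d(squeeze, expand, kernel_size=1), nn.ReLU(inplace=True))             # 1x1 expand
            self.expand3x3 = nn.Sequential(
                nn.Conv2d(squeeze, expand, kernel_size=3, padding=1), nn.ReLU(inplace=True))  # 3x3 expand

        def forward(self, x):
            s = self.squeeze(x)
            return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)  # depth-wise concat

    x = torch.randn(1, 96, 55, 55)
    print(FireModule(96)(x).shape)  # torch.Size([1, 128, 55, 55])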