Deep Learning Generalization

  1. Casting ML Algorithms as Bayesian Approximations:
  2. Why do Deep Nets Generalize?:
    One possibility is: “because they are really just an approximation to Bayesian machine learning.” - Ferenc

    • SGD:
      SGD could be responsible for the good generalization capabilities of Deep Nets.
      • SGD finds Flat Minima.
        • A flat minimum is a minimum where the Hessian (and hence, near the optimum, the Fisher information matrix) has small eigenvalues.
        • Flat might be better than sharp minima:
          If you are in a flat minimum, there is a relatively large region of parameter space where many parameter settings are almost equivalent, in that they result in almost equally low error. Given an error tolerance, the parameters at a flat minimum can therefore be described with limited precision, using fewer bits, while keeping the error within tolerance. In a sharp minimum, the location of the minimum must be described very precisely, otherwise the error may increase by a lot (see the numerical sketch after this list).
      • (Keskar et al., 2017) show that deep nets generalize better with smaller batch sizes when no other form of regularization is used.
        • And it may be because SGD biases learning towards flat minima, rather than sharp minima.
      • (Wilson et al, 2017) show that these good generalization properties afforded by SGD diminish somewhat when using popular adaptive SGD methods such as Adam or rmsprop.
      • Though, there is contradictory work by (Dinh et al., 2017), who claim that sharp minima can generalize well too.
        See also (Zhang et al., 2017).
      • One conclusion is: The reason deep networks work so well (and generalize at all) is not just because they are some brilliant model, but because of the specific details of how we optimize them.
        Stochastic gradient descent does more than just converge to a local optimum, it is biased to favor local optima with certain desirable properties, resulting in better generalization.
      • Is SGD Bayesian?
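      A minimal numerical sketch of the flat-vs-sharp argument above (the curvatures and tolerance are made up for illustration): near a minimum, \(L(w^* + \delta) \approx L(w^*) + \tfrac{1}{2}\lambda\delta^2\) in 1-D, so the largest perturbation that keeps the extra error below a tolerance \(\epsilon\) is \(\delta_{\max} = \sqrt{2\epsilon/\lambda}\); a flat minimum (small \(\lambda\)) tolerates a much coarser description of the parameters.

        import numpy as np

        # Two toy 1-D losses with a minimum at w = 0 (illustrative curvatures only):
        # a "flat" minimum (small curvature) and a "sharp" one (large curvature).
        curvatures = {"flat": 0.01, "sharp": 100.0}
        eps = 1e-3  # error tolerance around the minimum

        for name, curv in curvatures.items():
            loss = lambda w, c=curv: 0.5 * c * w ** 2
            # Largest perturbation whose loss increase stays below eps,
            # from the second-order expansion L(delta) = 0.5 * curv * delta^2.
            delta_max = np.sqrt(2 * eps / curv)
            assert loss(delta_max) <= eps + 1e-12
            print(f"{name}: curvature={curv:g}, tolerated |delta| <= {delta_max:.4f}")

      Here the flat minimum tolerates a perturbation roughly 100x larger, which is the "fewer bits to describe the parameters" intuition stated above.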

    Ways to Explain Generalization:

    • The bigger (deeper) the network, the easier it is to train (because the optimization landscape becomes simpler); this, combined with early stopping, can lead to good solutions that never use the more exotic functions the bigger network could represent (see the sketch after this list).
      Intuition:
      • When optimizing/searching over a huge function space (e.g. all possible functions), it is easier to “steer” toward the correct one; if you are restricted to only certain types of functions, you have to find a path from one to another, which is generally harder.
      • Another view of the same idea: if you have a really large network with random initialization (e.g. effectively infinite), then a subnetwork that solves the problem (i.e. what you are searching for) already exists; since it is already present, backprop can push the parameters in the direction of that subnetwork (because that yields the biggest improvement), and you learn very quickly.
      • On the other hand, having a huge number of parameters means many parameter settings can be equally good and lead to the same solution (the NN solution is never unique), so finding any local minimum yields a good solution.
        Moreover, some of these local minima might actually be bad if optimized fully; with early stopping, however, you can stop at a parameter configuration that yields a good result, without descending so far into the minimum that the model overlearns.
        In a small network, by contrast, even with early stopping the parameter setting we learn may already sit so deep in a local minimum that it has effectively “overlearned” (or most of its local minima are simply not great).
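    A minimal sketch of early stopping as used in the argument above (synthetic data, an overparameterized linear model, and arbitrary hyperparameters stand in for a deep net): keep the parameters from the step with the best validation loss instead of training to convergence.

      import numpy as np

      rng = np.random.default_rng(0)

      # Synthetic, noisy regression task; dim >> n_train, so the training set
      # (including its noise) can be fit exactly if we train to convergence.
      n_train, n_val, dim = 50, 50, 200
      w_true = rng.normal(size=dim) * (rng.random(dim) < 0.05)   # sparse "signal"
      X_tr, X_va = rng.normal(size=(n_train, dim)), rng.normal(size=(n_val, dim))
      y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
      y_va = X_va @ w_true + 0.5 * rng.normal(size=n_val)

      w = np.zeros(dim)
      lr, n_steps = 1e-3, 2000
      best_val, best_w, best_step = np.inf, w.copy(), 0   # best_w holds the early-stopped parameters

      for step in range(n_steps):
          grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / n_train   # gradient of the training MSE
          w -= lr * grad
          val_mse = np.mean((X_va @ w - y_va) ** 2)
          if val_mse < best_val:                            # early stopping: keep the best-so-far
              best_val, best_w, best_step = val_mse, w.copy(), step

      print(f"best validation MSE {best_val:.3f} at step {best_step}; "
            f"validation MSE after all {n_steps} steps: {np.mean((X_va @ w - y_va) ** 2):.3f}")

    The point of the sketch is only the mechanism: the kept parameters are the ones from before the model starts fitting the sampling noise, not the ones a full optimization would end at.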

    Notes:

    • “We can connect this finding to recent work examining the generalization of large neural networks. Zhang et al. (2017) observe that deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error.” - On Calibration of Modern Neural Networks

Misc.

  1. NLP Research and What’s Next:
    Progress in NLP/AI:
    • Machine learning with feature engineering:
      Learning weights for engineered features.
    • Deep learning for feature learning:
      Using DL to automatically learn features (e.g. embeddings).
    • Deep architecture engineering for single tasks:
      Each sub-field in NLP converged to a particular Network Architecture.
    • (NOW) Deep Single MultiTask Model

    Limits of Single-Task Learning:

    • Great performance improvements in recent years given {dataset, task, model, metric}
    • We can hill-climb to local optima as long as \(\vert \text{dataset} \vert > 1000 \times C\)
    • For more general AI, we need continuous learning in a single model instead
    • Models typically start from random or are only partly pre-trained

    There is no single blocking task in Natural Language (compared to classification in Vision).
    HOWEVER, MultiTask Learning is a blocker for general NLP systems.

    Why has weight and model sharing not happened as much in NLP?:

    • NLP requires many types of reasoning:
      Logical, Linguistic, Emotional, Visual, etc.
    • Requires Short and Long-term Memory
    • NLP had been divided into intermediate and separate tasks to make progress:
      \(\rightarrow\) Benchmark chasing in each community
    • Can a single unsupervised task solve it all? No.
      • Language clearly requires supervision in nature (a kid growing up alone in the jungle would still develop vision, but not language).

    How to express many NLP tasks in the same framework?:

    • NLP Frameworks:
      • Sequence Tagging: named entity recognition, aspect specific sentiment
      • Text Classification: dialogue state tracking, sentiment classification
      • Seq2seq: machine translation, summarization, question answering
    • NLP SuperTasks:
      Hypothesis: the following are three equivalent SuperTasks of NLP, and any NLP task can be posed as any one of them:
      • Language Modeling: condition on Question+Context, then generate
      • Question Answering: Question=Task
      • Dialogue: open-ended, limited datasets

      Thus, Question Answering is the most appropriate SuperTask in which to cast NLP problems.

    The Natural Language Decathlon (decaNLP):

    • Multitask Learning as Question Answering:
      Casts all NLP tasks as Question Answering problems.
    • decaNLP Tasks:
      1. Question Answering
      2. Machine Translation
      3. Summarization
      4. Natural Language Inference
      5. Sentiment Classification
      6. Semantic Role Labeling
      7. Relation Extraction
      8. Dialogue
      9. Semantic Parsing
      10. Commonsense Reasoning
    • Meta-Supervised Learning: From \(\{x, y\}\) to \(\{x, t, y\}\) (\(t\) is the task)
    • Use a question, \(q\), as a natural description of the task, \(t\), to allow the model to use linguistic information to connect tasks
    • \(y\) is the answer to \(q\) and \(x\) is the context necessary to answer \(q\) (see the sketch after this list)
    • Model Specifications for decaNLP:
      • No task-specific modules or parameters because we assume the task ID is not available
      • Must be able to adjust internally to perform disparate tasks
      • Should leave open the possibility of zero-shot inference for unseen tasks
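    A minimal sketch of the \(\{x, q, y\}\) framing described above; the question wordings and examples are illustrative paraphrases of the natural-language task descriptions, not the exact decaNLP prompt strings.

      from dataclasses import dataclass

      @dataclass
      class QAExample:
          question: str   # q: a natural-language description of the task t
          context: str    # x: the text needed to answer the question
          answer: str     # y: the target output

      # Casting a few different NLP tasks into the same (q, x, y) format.
      examples = [
          QAExample("Is this sentence positive or negative?",          # sentiment classification
                    "The movie was a complete waste of time.",
                    "negative"),
          QAExample("What is the translation from English to German?", # machine translation
                    "The house is small.",
                    "Das Haus ist klein."),
          QAExample("What is the summary?",                            # summarization
                    "Heavy rain flooded several streets downtown on Monday, ...",
                    "Downtown streets flooded after heavy rain."),
      ]

      for ex in examples:
          print(f"Q: {ex.question}\nC: {ex.context}\nA: {ex.answer}\n")

    A single model trained on such triples never sees an explicit task ID; the question itself carries the task information, which is what leaves the door open to zero-shot inference on unseen tasks.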


Papers

  1. Fast Weights:
    Fast-Weights:



Observations, Ideas, Questions, etc.

  1. Observations from Papers/Blogs/etc.:
    • deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error. paper
    • It is also interesting to see that the global average pooling operation can significantly increase the classification accuracy for both CNNs and CNTKs. From this observation, we suspect that many techniques that improve the performance of neural networks are in some sense universal, i.e., these techniques might benefit kernel methods as well. link
    • Is Optimization a Sufficient Language for Understanding Deep Learning?
      • Conventional View (CV) of Optimization:
        Find a solution of minimum possible value of the objective, as fast as possible.
      • If our goal is mathematical understanding of deep learning, then the CV of Opt is potentially insufficient.
    • Representable Does Not Imply Learnable.
    • Recurrent models can, in practice, be approximated with feed-forward models:
      Feed-forward models seem to match or exceed the performance of recurrent models on almost all tasks,
      suggesting that the extra expressiveness of recurrent models might not be needed/used.
      The following is conjectured: “Recurrent models trained in practice are effectively feed-forward”.
      • This paper (Stable Recurrent Models) proves that stable recurrent neural networks are well approximated by feed-forward networks for the purposes of both inference and training by gradient descent.
    • The unlimited context offered by recurrent models is not strictly necessary for language modeling:
      i.e. it is possible that you do not need a large amount of context to do well on the prediction task on average; recent theoretical work offers some evidence in favor of this view (see the sketch below).
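    A minimal sketch of the limited-context point above (the window size and toy corpus are arbitrary): a predictor that conditions only on the last \(k\) characters, i.e. a fixed-window (k-gram) model rather than a recurrent one with unbounded context.

      from collections import Counter, defaultdict

      # A fixed-window predictor: nothing outside the last k characters can
      # influence the prediction, unlike a recurrent model's unbounded state.
      k = 4
      text = "the quick brown fox jumps over the lazy dog " * 50   # toy corpus

      counts = defaultdict(Counter)
      for i in range(len(text) - k):
          window, nxt = text[i:i + k], text[i + k]
          counts[window][nxt] += 1

      def predict(context: str) -> str:
          """Most likely next character given only the last k characters."""
          window = context[-k:]
          if window not in counts:
              return " "                      # crude fallback for unseen windows
          return counts[window].most_common(1)[0][0]

      print(predict("the quick brown fo"))    # -> 'x'

    If a window this small already gives reasonable next-token predictions on average, the unbounded context of a recurrent model is doing less work than its expressiveness suggests.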


  2. Ideas:
    Project Ideas:
    • NMF on Text Data
    • read old Schmidhuber/Hinton Papers and reapply them on current hardware/datasets
    • word-embeddings and topic-modeling and NMF, on ARXIV ML Papers
    • Generative Adversarial Framework for Speech Recognition

    Research Ideas:

    • Overfitting on NLL to explain Deep-NN generalization
    • DeepSets and Attention theoretical guarantees
    • Experimenting w/ Mutual Info (IB) w/ knowledge distillation
    • Language Modeling Decoding using Attention (main problem is beam size = greedy)
    • Experiment with Lots of data w/ simpler models VS Less data w/ advanced models
      (“A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.”)
    • To measure the effect of depth: construct a dataset that requires a deep network to model efficiently
    • Weights that generalize the most have the least gradient when training on a new dataset (remember: cats vs dogs -> foxes)
    • K-Separable learning instead of linearly (2-)separable learning; the output layer dictates the configuration of the transformed input data to be “classified”
    • wrt. Karpathy’s “Unreasonable Effectiveness of RNNs” post and Y. Goldberg’s “Unreasonable Effectiveness of (Char-Level) n-grams” post: do the complex rules learned by an LSTM architecture (given its inductive biases), e.g. a neuron that counts or tracks indentation, generalize better than n-gram probabilities? Do these rules imply learning a better method, algorithm, mechanism, etc.?
      • Goldberg claims that RNNs are impressive because of “context awareness” (in generating syntactically valid C code).
      • (compare the number \(n\) of n-tuples VS dimension size of \(h\))
        (hint: Google seems to think that \(n=13\) beats an infinite \(h\))
      • Is this why/how NNs generalize?
    • wrt. Unintended Memorization in Neural Networks (Blog+Paper): it proposes an attack to extract sensitive info from a model trained on private data (using a “canary”). I can refine the attack much further by exploiting the fact that the model is trained to maximize the likelihood of (i.e. minimize NLL on) its training data, so it assigns that data higher probability (see the sketch after this list).
    • Can we use the ideas in the paper on differential privacy and unintended memorization in NNs to help learn more generalizable models/weights/patterns?
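    A minimal sketch of that ranking idea (the scoring function below is a hand-made stand-in, not a trained model, and the candidate strings are hypothetical): score each candidate secret by the average NLL the model assigns to it; memorized training strings should rank unusually high.

      import math

      def avg_nll(sequence, token_logprob):
          """Average negative log-likelihood of a sequence, where
          token_logprob(prefix, token) approximates log p(token | prefix)."""
          return -sum(token_logprob(sequence[:i], t) for i, t in enumerate(sequence)) / len(sequence)

      def rank_candidates(candidates, token_logprob):
          """Rank candidate secrets from lowest to highest average NLL;
          memorized training strings are expected to float to the top."""
          return sorted(candidates, key=lambda s: avg_nll(s, token_logprob))

      # Stand-in scorer (NOT a real language model, and not even normalized):
      # it pretends the model memorized one canary and prefers its continuations.
      MEMORIZED = "the canary code is 314159"
      def toy_logprob(prefix, token):
          return math.log(0.9) if MEMORIZED.startswith(prefix + token) else math.log(0.01)

      candidates = ["the canary code is 271828",
                    "the canary code is 314159",
                    "the canary code is 161803"]
      print(rank_candidates(candidates, toy_logprob)[0])   # -> the memorized canary

    With a real model, token_logprob would come from the trained network's softmax outputs; the ranking logic stays the same.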

  3. Questions:
    • Does Hinge Loss Maximize a margin when used with any classifier?
    • How does LDA do feature extraction / dim-red?
    • Is time-series data known to possess linearity?
    • How can we frame the Abstract Modeling Problem?
    • How do humans speak (generate sentences)? They definitely do not just randomly sample from the distribution of natural language. Then, how should we teach models to speak, respond, think, etc.?
    • Is it true that research “should” go from hypothesis to experiments, and not the other way around, as is common in AI?
    • Where does “Game Theory” fit into AI, really?

  4. General Notes & Observations:
    • Three Schools of Learning: (1) Bayesians (2) Kernel People (3) Frequentists
    • Bayesians, INCORRECTLY, claimed that:
      • Highly over-parametrised models fitted via maximum likelihood can’t possibly work: they will overfit, won’t generalise, etc.
      • Any model with infinite parameters should be strictly better than any large, but finite parametric model.
        (e.g. nonparametric models like kernel machines are a principled way to build models with effectively infinite number of parameters)
    • Don’t try averaging if you want to synchronize a bunch of clocks! (Ensemble Averaging/Interview Q)
      The noise is not Gaussian.
      Instead, you expect that many of the clocks will be slightly wrong and a few will have stopped or will be wildly wrong, so by averaging you end up making all of them significantly wrong.
    • Generalization:
      • It seems that Occam’s Razor is equivalent to saying that a “non-economical” model is not a good model. So, can we use information theory to quantify the information in these models?
        The idea is that a simple model, e.g. “birds fly”, is much better than a more complicated and harder-to-encode model, e.g. “birds fly, except chickens, penguins, etc.”
    • Width vs Depth in NN Architectures:
      Thinking of the NN as running a computer program that performs a calculation, you can think of width as a measure of how much parallelization you can have in your computation, and depth as a measure of serialization.
    • A Hopfield net the size of a brain (whose connectivity patterns are quite different, of course) could store a memory per second for 450 years.
    • Overfitting in the Brain: you can call it superstition or bad habits; you can even teach some of these to animals.
    • That real-world data prefers lower Kolmogorov complexity (and hence enables learning at all) is a very strange fundamental asymmetry in nature.
      It is as puzzling as there being so much more matter than antimatter.
    • An SVM is, in a way, a type of neural network (you can learn a SVM solution through backpropagation)
    • In CNNs there are no FC layers; they are equivalent to \(1 \times 1\) convolutions (see the check after this list). link
    • In support of IB (Information Bottleneck) Theory: this paper suggests that Memorization happens early in training.
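    A minimal numpy check of the \(1 \times 1\)-convolution note above (shapes and weights are arbitrary): a \(1 \times 1\) convolution is exactly a fully connected layer applied independently at every spatial position, with the weight matrix shared across positions.

      import numpy as np

      rng = np.random.default_rng(0)

      # Feature map with C_in channels at each of H x W spatial positions.
      C_in, C_out, H, W = 8, 4, 5, 5
      x = rng.normal(size=(C_in, H, W))
      weight = rng.normal(size=(C_out, C_in))    # shared FC weights / 1x1 conv kernel
      bias = rng.normal(size=C_out)

      # (1) Fully connected layer applied at every spatial position.
      fc_out = np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

      # (2) Explicit 1x1 convolution: the kernel has spatial extent 1, so it is
      #     the same per-position matrix multiply.
      conv_out = np.zeros((C_out, H, W))
      for i in range(H):
          for j in range(W):
              conv_out[:, i, j] = weight @ x[:, i, j] + bias

      print(np.allclose(fc_out, conv_out))       # True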

  5. Insights:
    • Utkarsh’s idea of co-adaptation is similar to the DROPOUT motivation: hidden units co-adapting to each other on the training data
    • Attention Functions Properties: monotonicity, sparsity etc.
    • Think about Learning, Overfitting, and Regularization in terms of accidental regularities/patterns due to the particular sample
      • Process of Learning:
        1. Fit Most Common Pattern vs Fit Easiest Patterns?
        2. Fit next Most Common Pattern vs Fit next Easiest Patterns? ..
        3. Fit patterns that only exist in the particular sample (e.g. Noise)
      • Overfitting:
        Happens when there are patterns that manifest in the particular sample that might not have been that common when looking at a larger/different sample.
      • Regularization:
        Stops the model from learning the least-common/hardest patterns by imposing some sort of threshold.
      • When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
        The higher the capacity, the better it fits the sampling error.
      • This ties in nicely with the idea of “match your model capacity to the amount of data that you have and NOT to the target capacity”.
    • Although Recurrent models do not have an “explicit way to model long and short range dependencies”, FastWeights does.
  6. Experiments & Results: