Deep Learning Generalization

  1. Casting ML Algorithms as Bayesian Approximations:
  2. Why do Deep Nets Generalize?:
    One possibility is: “because they are really just an approximation to Bayesian machine learning.” - Ferenc

    • SGD:
      SGD could be responsible for the good generalization capabilities of Deep Nets.
      • SGD finds Flat Minima.
        • A flat minimum is a minimum where the Hessian (and hence, near the optimum, the Fisher information matrix) has small eigenvalues.
        • Flat might be better than sharp minima:
          If you are in a flat minimum, there is a relatively large region of parameter space where many parameter settings are almost equivalent, in that they result in almost equally low error. Given an error tolerance, the parameters at a flat minimum can therefore be described with limited precision, using fewer bits, while keeping the error within tolerance. In a sharp minimum, the location of the minimum must be described very precisely, otherwise the error may increase by a lot (see the numerical sketch after this list).
      • (Keskar et al., 2017) show that deep nets generalize better with smaller batch sizes when no other form of regularization is used.
        • And it may be because SGD biases learning towards flat minima, rather than sharp minima.
      • (Wilson et al, 2017) show that these good generalization properties afforded by SGD diminish somewhat when using popular adaptive SGD methods such as Adam or rmsprop.
      • Though, there is contradictory work by (Dinh et al., 2017), who claim that sharp minima can generalize well too.
        See also (Zhang et al., 2017).
      • One conclusion is: The reason deep networks work so well (and generalize at all) is not just because they are some brilliant model, but because of the specific details of how we optimize them.
        Stochastic gradient descent does more than just converge to a local optimum, it is biased to favor local optima with certain desirable properties, resulting in better generalization.
      • Is SGD Bayesian?
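      A minimal numerical sketch of the flat-vs-sharp argument above (the curvatures and tolerance are made up for illustration): near a minimum, \(L(w^* + \delta) \approx L(w^*) + \tfrac{1}{2}\lambda\delta^2\) in 1-D, so the largest perturbation that keeps the extra error below a tolerance \(\epsilon\) is \(\delta_{\max} = \sqrt{2\epsilon/\lambda}\); a flat minimum (small \(\lambda\)) tolerates a much coarser description of the parameters.

        import numpy as np

        # Two toy 1-D losses with a minimum at w = 0 (illustrative curvatures only):
        # a "flat" minimum (small curvature) and a "sharp" one (large curvature).
        curvatures = {"flat": 0.01, "sharp": 100.0}
        eps = 1e-3  # error tolerance around the minimum

        for name, curv in curvatures.items():
            loss = lambda w, c=curv: 0.5 * c * w ** 2
            # Largest perturbation whose loss increase stays below eps,
            # from the second-order expansion L(delta) = 0.5 * curv * delta^2.
            delta_max = np.sqrt(2 * eps / curv)
            assert loss(delta_max) <= eps + 1e-12
            print(f"{name}: curvature={curv:g}, tolerated |delta| <= {delta_max:.4f}")

      Here the flat minimum tolerates a perturbation roughly 100x larger, which is the "fewer bits to describe the parameters" intuition stated above.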

    Ways to Explain Generalization:

    • The bigger (deeper) the network, the easier it is to train (because the optimization landscape becomes simpler); this, combined with early stopping, can lead to good solutions that never use the more exotic functions the bigger network could represent (see the sketch after this list).
      Intuition:
      • When optimizing/searching over a huge function space (e.g. all possible functions), it is easier to “steer” toward the correct one; if you are restricted to only certain types of functions, you have to find a path from one to another, which is generally harder.
      • Another view of the same idea: if you have a really large network with random initialization (e.g. effectively infinite), then a subnetwork that solves the problem (i.e. what you are searching for) already exists; since it is already present, backprop can push the parameters in the direction of that subnetwork (because that yields the biggest improvement), and you learn very quickly.
      • On the other hand, having a huge number of parameters means many parameter settings can be equally good and lead to the same solution (the NN solution is never unique), so finding any local minimum yields a good solution.
        Moreover, some of these local minima might actually be bad if optimized fully; with early stopping, however, you can stop at a parameter configuration that yields a good result, without descending so far into the minimum that the model overlearns.
        In a small network, by contrast, even with early stopping the parameter setting we learn may already sit so deep in a local minimum that it has effectively “overlearned” (or most of its local minima are simply not great).
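    A minimal sketch of early stopping as used in the argument above (synthetic data, an overparameterized linear model, and arbitrary hyperparameters stand in for a deep net): keep the parameters from the step with the best validation loss instead of training to convergence.

      import numpy as np

      rng = np.random.default_rng(0)

      # Synthetic, noisy regression task; dim >> n_train, so the training set
      # (including its noise) can be fit exactly if we train to convergence.
      n_train, n_val, dim = 50, 50, 200
      w_true = rng.normal(size=dim) * (rng.random(dim) < 0.05)   # sparse "signal"
      X_tr, X_va = rng.normal(size=(n_train, dim)), rng.normal(size=(n_val, dim))
      y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
      y_va = X_va @ w_true + 0.5 * rng.normal(size=n_val)

      w = np.zeros(dim)
      lr, n_steps = 1e-3, 2000
      best_val, best_w, best_step = np.inf, w.copy(), 0   # best_w holds the early-stopped parameters

      for step in range(n_steps):
          grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / n_train   # gradient of the training MSE
          w -= lr * grad
          val_mse = np.mean((X_va @ w - y_va) ** 2)
          if val_mse < best_val:                            # early stopping: keep the best-so-far
              best_val, best_w, best_step = val_mse, w.copy(), step

      print(f"best validation MSE {best_val:.3f} at step {best_step}; "
            f"validation MSE after all {n_steps} steps: {np.mean((X_va @ w - y_va) ** 2):.3f}")

    The point of the sketch is only the mechanism: the kept parameters are the ones from before the model starts fitting the sampling noise, not the ones a full optimization would end at.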

    Notes:

    • “We can connect this finding to recent work examining the generalization of large neural networks. Zhang et al. (2017) observe that deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error.” - On Calibration of Modern Neural Networks

Misc.

  1. NLP Research and What’s Next:
    Progress in NLP/AI:
    • Machine learning with feature engineering:
      Learning weights for engineered features.
    • Deep learning for feature learning:
      Using DL to automatically learn features (e.g. embeddings).
    • Deep architecture engineering for single tasks:
      Each sub-field in NLP converged to a particular Network Architecture.
    • (NOW) Deep Single MultiTask Model

    Limits of Single-Task Learning:

    • Great performance improvements in recent years given {dataset, task, model, metric}
    • We can hill-climb to local optima as long as \(\vert \text{dataset} \vert > 1000 \times C\)
    • For more general AI, we need continuous learning in a single model instead
    • Models typically start from random or are only partly pre-trained

    There is no single blocking task in Natural Language (compared to classification in Vision).
    HOWEVER, MultiTask Learning is a blocker for general NLP systems.

    Why has weight and model sharing not happened as much in NLP?:

    • NLP requires many types of reasoning:
      Logical, Linguistic, Emotional, Visual, etc.
    • Requires Short and Long-term Memory
    • NLP had been divided into intermediate and separate tasks to make progress:
      \(\rightarrow\) Benchmark chasing in each community
    • Can a single unsupervised task solve it all? No.
      • Language clearly requires supervision in nature (a kid growing up alone in the jungle would still develop vision, but not language).

    How to express many NLP tasks in the same framework?:

    • NLP Frameworks:
      • Sequence Tagging: named entity recognition, aspect specific sentiment
      • Text Classification: dialogue state tracking, sentiment classification
      • Seq2seq: machine translation, summarization, question answering
    • NLP SuperTasks:
      Hypothesis: the following are three equivalent SuperTasks of NLP, and any NLP task can be posed as any one of them:
      • Language Modeling: condition on Question+Context, then generate
      • Question Answering: Question=Task
      • Dialogue: open-ended, limited datasets

      Thus, Question Answering is the most appropriate SuperTask in which to cast NLP problems.

    The Natural Language Decathlon (decaNLP):

    • Multitask Learning as Question Answering:
      Casts all NLP tasks as Question Answering problems.
    • decaNLP Tasks:
      1. Question Answering
      2. Machine Translation
      3. Summarization
      4. Natural Language Inference
      5. Sentiment Classification
      6. Semantic Role Labeling
      7. Relation Extraction
      8. Dialogue
      9. Semantic Parsing
      10. Commonsense Reasoning
    • Meta-Supervised Learning: From \(\{x, y\}\) to \(\{x, t, y\}\) (\(t\) is the task)
    • Use a question, \(q\), as a natural description of the task, \(t\), to allow the model to use linguistic information to connect tasks
    • \(y\) is the answer to \(q\) and \(x\) is the context necessary to answer \(q\) (see the sketch after this list)
    • Model Specifications for decaNLP:
      • No task-specific modules or parameters because we assume the task ID is not available
      • Must be able to adjust internally to perform disparate tasks
      • Should leave open the possibility of zero-shot inference for unseen tasks
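    A minimal sketch of the \(\{x, q, y\}\) framing described above; the question wordings and examples are illustrative paraphrases of the natural-language task descriptions, not the exact decaNLP prompt strings.

      from dataclasses import dataclass

      @dataclass
      class QAExample:
          question: str   # q: a natural-language description of the task t
          context: str    # x: the text needed to answer the question
          answer: str     # y: the target output

      # Casting a few different NLP tasks into the same (q, x, y) format.
      examples = [
          QAExample("Is this sentence positive or negative?",          # sentiment classification
                    "The movie was a complete waste of time.",
                    "negative"),
          QAExample("What is the translation from English to German?", # machine translation
                    "The house is small.",
                    "Das Haus ist klein."),
          QAExample("What is the summary?",                            # summarization
                    "Heavy rain flooded several streets downtown on Monday, ...",
                    "Downtown streets flooded after heavy rain."),
      ]

      for ex in examples:
          print(f"Q: {ex.question}\nC: {ex.context}\nA: {ex.answer}\n")

    A single model trained on such triples never sees an explicit task ID; the question itself carries the task information, which is what leaves the door open to zero-shot inference on unseen tasks.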


Papers

  1. Fast Weights:
    Fast-Weights:



Observations, Ideas, Questions, etc.

  1. Observations from Papers/Blogs/etc.:
    • deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error. paper
    • It is also interesting to see that the global average pooling operation can significantly increase the classification accuracy for both CNNs and CNTKs. From this observation, we suspect that many techniques that improve the performance of neural networks are in some sense universal, i.e., these techniques might benefit kernel methods as well. link
    • Is Optimization a Sufficient Language for Understanding Deep Learning?
      • Conventional View (CV) of Optimization:
        Find a solution of minimum possible value of the objective, as fast as possible.
      • If our goal is mathematical understanding of deep learning, then the CV of Opt is potentially insufficient.
    • Representable Does Not Imply Learnable.
    • Recurrent models can, in practice, be approximated with feed-forward models:
      Feed-forward models seem to match or exceed the performance of recurrent models on almost all tasks,
      suggesting that the extra expressiveness of recurrent models might not be needed/used.
      The following is conjectured: “Recurrent models trained in practice are effectively feed-forward”.
      • This paper (Stable Recurrent Models) proves that stable recurrent neural networks are well approximated by feed-forward networks for the purposes of both inference and training by gradient descent.
    • The unlimited context offered by recurrent models is not strictly necessary for language modeling:
      i.e. it is possible that you do not need a large amount of context to do well on the prediction task on average; recent theoretical work offers some evidence in favor of this view (see the sketch below).
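    A minimal sketch of the limited-context point above (the window size and toy corpus are arbitrary): a predictor that conditions only on the last \(k\) characters, i.e. a fixed-window (k-gram) model rather than a recurrent one with unbounded context.

      from collections import Counter, defaultdict

      # A fixed-window predictor: nothing outside the last k characters can
      # influence the prediction, unlike a recurrent model's unbounded state.
      k = 4
      text = "the quick brown fox jumps over the lazy dog " * 50   # toy corpus

      counts = defaultdict(Counter)
      for i in range(len(text) - k):
          window, nxt = text[i:i + k], text[i + k]
          counts[window][nxt] += 1

      def predict(context: str) -> str:
          """Most likely next character given only the last k characters."""
          window = context[-k:]
          if window not in counts:
              return " "                      # crude fallback for unseen windows
          return counts[window].most_common(1)[0][0]

      print(predict("the quick brown fo"))    # -> 'x'

    If a window this small already gives reasonable next-token predictions on average, the unbounded context of a recurrent model is doing less work than its expressiveness suggests.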


  2. Ideas:
    Project Ideas:
    • NMF on Text Data
    • read old Schmidhuber/Hinton Papers and reapply them on current hardware/datasets
    • word-embeddings and topic-modeling and NMF, on ARXIV ML Papers
    • Generative Adversarial Framework for Speech Recognition

    Research Ideas:

    • Overfitting on NLL to explain Deep-NN generalization
    • DeepSets and Attention theoretical guarantees
    • Experimenting w/ Mutual Info (IB) w/ knowledge distillation
    • Language Modeling Decoding using Attention (main problem is beam size = greedy)
    • Experiment with Lots of data w/ simpler models VS Less data w/ advanced models
      (“A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.”)
    • To measure the effect of depth: construct a dataset that requires a deep network to model efficiently
    • Weights that generalize the most have the least gradient when training on a new dataset (remember: cats vs dogs -> foxes)
    • K-Separable learning instead of linearly (2-)separable learning; the output layer dictates the configuration of the transformed input data to be “classified”
    • wrt. Karpathy’s “Unreasonable Effectiveness of RNNs” post and Y. Goldberg’s “Unreasonable Effectiveness of (Char-Level) n-grams” post: do the complex rules learned by an LSTM architecture (given its inductive biases), e.g. a neuron that counts or tracks indentation, generalize better than n-gram probabilities? Do these rules imply learning a better method, algorithm, mechanism, etc.?
      • Goldberg claims that RNNs are impressive because of “context awareness” (in generating syntactically valid C code).
      • (compare the number \(n\) of n-tuples VS dimension size of \(h\))
        (hint: Google seems to think that \(n=13\) beats an infinite \(h\))
      • Is this why/how NNs generalize?
    • wrt. Unintended Memorization in Neural Networks (Blog+Paper): it proposes an attack to extract sensitive info from a model trained on private data (using a “canary”). I can refine the attack much further by exploiting the fact that the model is trained to maximize the likelihood of (i.e. minimize NLL on) its training data, so it assigns that data higher probability (see the sketch after this list).
    • Can we use the ideas in the paper on differential privacy and unintended memorization in NNs to help learn more generalizable models/weights/patterns?
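    A minimal sketch of that ranking idea (the scoring function below is a hand-made stand-in, not a trained model, and the candidate strings are hypothetical): score each candidate secret by the average NLL the model assigns to it; memorized training strings should rank unusually high.

      import math

      def avg_nll(sequence, token_logprob):
          """Average negative log-likelihood of a sequence, where
          token_logprob(prefix, token) approximates log p(token | prefix)."""
          return -sum(token_logprob(sequence[:i], t) for i, t in enumerate(sequence)) / len(sequence)

      def rank_candidates(candidates, token_logprob):
          """Rank candidate secrets from lowest to highest average NLL;
          memorized training strings are expected to float to the top."""
          return sorted(candidates, key=lambda s: avg_nll(s, token_logprob))

      # Stand-in scorer (NOT a real language model, and not even normalized):
      # it pretends the model memorized one canary and prefers its continuations.
      MEMORIZED = "the canary code is 314159"
      def toy_logprob(prefix, token):
          return math.log(0.9) if MEMORIZED.startswith(prefix + token) else math.log(0.01)

      candidates = ["the canary code is 271828",
                    "the canary code is 314159",
                    "the canary code is 161803"]
      print(rank_candidates(candidates, toy_logprob)[0])   # -> the memorized canary

    With a real model, token_logprob would come from the trained network's softmax outputs; the ranking logic stays the same.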

  3. Questions:
    • Does Hinge Loss Maximize a margin when used with any classifier?
    • How does LDA do feature extraction / dim-red?
    • Is time-series data known to possess linearity?
    • How can we frame the Abstract Modeling Problem?
    • How do humans speak (generate sentences)? They definitely do not just randomly sample from the distribution of natural language. Then, how should we teach models to speak, respond, think, etc.?
    • Is it true that research “should” go from hypothesis to experiments, and not the other way around, as is common in AI?
    • Where does “Game Theory” fit into AI, really?

  4. General Notes & Observations:
    • Three Schools of Learning: (1) Bayesians (2) Kernel People (3) Frequentists
    • Bayesians, INCORRECTLY, claimed that:
      • Highly over-parametrised models fitted via maximum likelihood can’t possibly work: they will overfit, won’t generalise, etc.
      • Any model with infinite parameters should be strictly better than any large, but finite parametric model.
        (e.g. nonparametric models like kernel machines are a principled way to build models with effectively infinite number of parameters)
    • Don’t try averaging if you want to synchronize a bunch of clocks! (Ensemble Averaging/Interview Q)
      The noise is not Gaussian.
      Instead, you expect that many of the clocks will be slightly wrong and a few will have stopped or will be wildly wrong, so by averaging you end up making all of them significantly wrong.
    • Generalization:
      • It seems that Occam’s Razor is equivalent to saying that a “non-economical” model is not a good model. So, can we use information theory to quantify the information in these models?
        The idea is that a simple model, e.g. “birds fly”, is much better than a more complicated and harder-to-encode model, e.g. “birds fly, except chickens, penguins, etc.”
    • Width vs Depth in NN Architectures:
      Thinking of the NN as running a computer program that performs a calculation, you can think of width as a measure of how much parallelization you can have in your computation, and depth as a measure of serialization.
    • A Hopfield net the size of a brain (whose connectivity patterns are quite different, of course) could store a memory per second for 450 years.
    • Overfitting in the Brain: you can call it superstition or bad habits; you can even teach some of these to animals.
    • That real-world data prefers lower Kolmogorov complexity (and hence enables learning at all) is a very strange fundamental asymmetry in nature.
      It is as puzzling as there being so much more matter than antimatter.
    • An SVM is, in a way, a type of neural network (you can learn a SVM solution through backpropagation)
    • In CNNs there are no FC layers; they are equivalent to \(1 \times 1\) convolutions (see the check after this list). link
    • In support of IB (Information Bottleneck) Theory: this paper suggests that Memorization happens early in training.
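    A minimal numpy check of the \(1 \times 1\)-convolution note above (shapes and weights are arbitrary): a \(1 \times 1\) convolution is exactly a fully connected layer applied independently at every spatial position, with the weight matrix shared across positions.

      import numpy as np

      rng = np.random.default_rng(0)

      # Feature map with C_in channels at each of H x W spatial positions.
      C_in, C_out, H, W = 8, 4, 5, 5
      x = rng.normal(size=(C_in, H, W))
      weight = rng.normal(size=(C_out, C_in))    # shared FC weights / 1x1 conv kernel
      bias = rng.normal(size=C_out)

      # (1) Fully connected layer applied at every spatial position.
      fc_out = np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

      # (2) Explicit 1x1 convolution: the kernel has spatial extent 1, so it is
      #     the same per-position matrix multiply.
      conv_out = np.zeros((C_out, H, W))
      for i in range(H):
          for j in range(W):
              conv_out[:, i, j] = weight @ x[:, i, j] + bias

      print(np.allclose(fc_out, conv_out))       # True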

  5. Insights:
    • Utkarsh’s idea of co-adaptation is similar to the DROPOUT motivation: hidden units co-adapting to each other on the training data
    • Attention Functions Properties: monotonicity, sparsity etc.
    • Think about Learning, Overfitting, and Regularization in terms of accidental regularities/patterns due to the particular sample
      • Process of Learning:
        1. Fit Most Common Pattern vs Fit Easiest Patterns?
        2. Fit next Most Common Pattern vs Fit next Easiest Patterns? ..
        3. Fit patterns that only exist in the particular sample (e.g. Noise)
      • Overfitting:
        Happens when there are patterns that manifest in the particular sample that might not have been that common when looking at a larger/different sample.
      • Regularization:
        Stops the model from learning the least-common/hardest patterns by imposing some sort of threshold.
      • When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
        The higher the capacity, the better it fits the sampling error.
      • This ties in nicely with the idea of “match your model capacity to the amount of data that you have and NOT to the target capacity”.
    • Although Recurrent models do not have an “explicit way to model long and short range dependencies”, FastWeights does.
  6. Experiments & Results: