Table of Contents



PyTorch:

CNN Tensor Shape: \([B, C, H, W]\)
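A minimal sketch of this convention (sizes are illustrative):

```python
import torch
import torch.nn as nn

# A batch of 8 RGB images, 32x32 pixels: [B, C, H, W] = [8, 3, 32, 32]
x = torch.randn(8, 3, 32, 32)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
y = conv(x)

print(x.shape)  # torch.Size([8, 3, 32, 32])
print(y.shape)  # torch.Size([8, 16, 32, 32]): C changed, H/W preserved by padding=1
```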

CNNs



RNNs

  1. Modeling Sequences - Memory as a model property:

    Notes:

  2. Stability - Vanishing and Exploding Gradients:


Maths

  1. Metrics and Quasi-Metrics:
  2. Binary Relations (abstract algebra):
  3. Set Theory:
    • Number of subsets of a set of \(N\) elements \(= 2^N\)
    • Number of ordered pairs (e.g. \((a,b)\)) from a set of \(N\) elements \(= N^2\)
      e.g. \(\mathbb{R} \times \mathbb{R} = \mathbb{R}^2\)
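      A quick sanity check of both counts (a minimal sketch; the set is illustrative):

      ```python
      from itertools import product

      s = {1, 2, 3}  # N = 3
      n = len(s)

      # Number of subsets: 2^N (each element is either in or out)
      print(2 ** n)  # 8

      # Number of ordered pairs (a, b): |S x S| = N^2
      print(len(list(product(s, repeat=2))))  # 9
      ```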
  4. Proof Methods:
  5. Mathematical Concepts - The Map of Mathematics:

    Further Reading

  6. From Topology to Algebraic Topology:
  7. Notes:
    • Dot Product Scale/Magnitude: the scale of the dot product increases as dimensions get larger.
      Assume that the components of \(q\) and \(k\) are independent random variables with mean \(0\) and variance \(1\). Then their dot product, \(q^Tk = \sum_{i=1}^{d_k} q_ik_i\) (where \(d_k\) is the dimension of \(q, k \in \mathbb{R}^{d_k}\)), has mean \(0\) and variance \(d_k\): the \(d_k\) terms \(q_ik_i\) are independent, each with mean \(0\) and variance \(1\), so their means and variances add.
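      A quick numpy simulation of this claim (sample counts are illustrative); it is also why scaled dot-product attention divides the scores by \(\sqrt{d_k}\):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      for d_k in (4, 64, 512):
          # q, k with i.i.d. components, mean 0, variance 1
          q = rng.standard_normal((10_000, d_k))
          k = rng.standard_normal((10_000, d_k))
          dots = (q * k).sum(axis=1)  # q^T k for each of the 10,000 samples
          # empirical variance grows like d_k; dividing by sqrt(d_k) restores variance ~1
          print(d_k, round(dots.mean(), 2), round(dots.var(), 1))
      ```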
  8. Formulas:
  9. Linear Algebra:
    Matrices:
    • Symmetric Matrices:
      • their eigenvectors can be chosen to be orthonormal (spectral theorem)
    • PSD:
    • PD:


Statistics and Probability Theory

  1. ROC Curve:
    • A way to quantify how good a binary classifier separates two classes
    • True-Positive-Rate / False-Positive-Rate
    • A good classifier has a ROC curve that hugs the top-left corner
    • A bad classifier has a ROC curve that stays close to the diagonal line (equivalent to random guessing)
    • It allows you to set the classification threshold
      • You can minimize False-positive rate or maximize the True-Positive Rate

    Notes:

    • ROC curves (& AUC) are useful even if the predicted probabilities are not “properly calibrated”
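    A minimal scikit-learn sketch of the above (labels/scores are toy values; `thresholds` is what lets you choose the operating point, e.g. via Youden's \(J = TPR - FPR\)):

    ```python
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    print(roc_auc_score(y_true, scores))      # area under the ROC curve

    best = thresholds[np.argmax(tpr - fpr)]   # threshold maximizing TPR - FPR
    print(best)
    ```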

  2. AUC - AUROC:
    • Range \(= 0.5 - 1.0\), from random guessing (no discrimination) to perfect

  3. Statistical Efficiency:
    Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance.
    Efficiencies are often defined using the variance or mean square error as the measure of desirability.
    An efficient estimator is also the minimum variance unbiased estimator (MVUE).

    • An Efficient Estimator has lower variance than an inefficient one
    • The use of an inefficient estimator gives results equivalent to those obtainable from a subset of the data, and is therefore wasteful of data
  4. Errors VS Residuals:
    The Error of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest.

    The Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest.

Notes:



Optimization

  1. Sigmoid:
    \(\sigma(-x) = 1 - \sigma(x)\)
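    This follows directly from the definition \(\sigma(x) = \frac{1}{1+e^{-x}}\):

    $$\sigma(-x)=\frac{1}{1+e^{x}}=\frac{e^{-x}}{e^{-x}+1}=\frac{(1+e^{-x})-1}{1+e^{-x}}=1-\frac{1}{1+e^{-x}}=1-\sigma(x)$$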

  2. Gradient-Based Optimization:
    Notes:
    • Gradients can’t propagate through argmax
    • Derivative for a max function:
      The derivative is not defined at the threshold/kink (it is discontinuous there), but we can define it by setting the value at the threshold/zero to be either the right or the left derivative.
      This is known as a Sub-Gradient, and that's why gradient descent still works.
    • Subgradient Descent
    • Hessian-Free Optimization | Conjugate Gradient Method (Hinton)
  3. Backpropagation:

    NOTES:

  4. Error Measures - Loss Functions:
    • Cross Entropy:
      • Deriving Binary Cross Entropy:
        It is the negative log-likelihood of a Bernoulli probability model (the sketch below checks this numerically).

        $$\begin{array}{c}{L(p)=p^{y}(1-p)^{1-y}} \\ {\log (L(p))=y \log p+(1-y) \log (1-p)}\end{array}$$

      • Cross Entropy > MSE for Classification
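      A minimal numpy sketch of BCE as the negative Bernoulli log-likelihood (values are illustrative; clipping guards against \(\log 0\)):

      ```python
      import numpy as np

      def binary_cross_entropy(y, p, eps=1e-12):
          """Negative Bernoulli log-likelihood, averaged over samples."""
          p = np.clip(p, eps, 1 - eps)  # avoid log(0)
          return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

      y = np.array([1, 0, 1, 1])           # targets
      p = np.array([0.9, 0.2, 0.7, 0.99])  # predicted p(y=1|x)
      print(binary_cross_entropy(y, p))    # ~0.17
      ```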

    Notes:

    • Loss VS Cost Function:
      • Loss is just the pointwise Error function (as in the Caltech course)
      • Cost is more general than Loss: usually the sum (or average) of all the losses
      • Objective function is even more general; a Cost is one type of Objective Function
        • The Risk Function is an objective function equal to the expected loss
    • Loss Functions for Regression
    • MSE:
      The principle of mean square error can be derived from the principle of maximum likelihood, after assuming a linear model whose errors are normally distributed.
    • Hinge Loss and 0-1 Loss:
      • Hinge loss upper bounds 0-1 loss
      • It is the tightest convex upper bound on the 0/1 loss
      • Minimizing 0-1 loss is NP-hard in the worst-case
    • Loss functions of common ML models:
      • maximize the posterior probabilities (e.g., naive Bayes)
      • maximize a fitness function (genetic programming)
      • maximize the total reward/value function (reinforcement learning)
      • maximize information gain/minimize child node impurities (CART decision tree classification)
      • minimize a mean squared error cost (or loss) function (CART decision tree regression, linear regression, adaptive linear neurons, …)
      • maximize log-likelihood or minimize cross-entropy loss (or cost) function
      • minimize hinge loss (support vector machine)

  5. Mathematical Properties, Aspects, Considerations:
    • The Composition of Invariant Functions:
      A function that is invariant to some transformation (e.g. rotation, permutation, etc.) can be constructed by averaging over all transformations (e.g. averaging over rotations gives rotation-invariant filters).
      Equivalently, for BoW, we average over all permutations (by averaging the word vectors).
      This causes smearing.
    • Smearing in Invariant Functions:
      In the linear case, a rotation-invariant function commutes with all rotations of the elements in \(\mathbb{R}\); any commutative transformation (or a combination of commutative transformations) yields this behavior; hence the smearing.

      This implies that one should not use linear functions to aggregate over a set when we want invariance to some transformation.

    • Permutation Invariance:
    • The Weight vector of a linear signal is orthogonal to the decision boundary:
      The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
      If you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) (obtained by subtracting the two points) is parallel to the plane, and the following equations must hold:

      $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$

    Identities:

    • Math Identities:

      $$\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{n}-y_{n}\right)^{2} = \frac{1}{N}\|\mathrm{Xw}-\mathrm{y}\|^{2}$$

      • \(\dfrac{\partial}{\partial y} \vert{x-y}\vert = - \text{sign}(x-y)\)

    Notes:

    • The well-behaved property, from an optimization standpoint, implies that \(f''(x)\) doesn't change too much or too rapidly, leading to a nearly quadratic function that is easy to optimize by gradient methods.
  6. The Method of Lagrange Multipliers:
    The constrained optimization problem:

    $$\min_{\mathbf{x}} f(\mathbf{x}) \text { subject to } g(\mathbf{x}) \leq 0$$

    is equivalent, for an appropriate value of \(\lambda \geq 0\), to the unconstrained optimization problem:

    $$\min_{\mathbf{x}}(f(\mathbf{x})+\lambda g(\mathbf{x}))$$

    where \(\lambda\) is a scalar called the Lagrange multiplier.
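    A one-line worked example: minimize \(f(x)=x^2\) subject to \(g(x)=1-x \leq 0\) (i.e. \(x \geq 1\)):

    $$\mathcal{L}(x, \lambda)=x^{2}+\lambda(1-x), \quad \frac{\partial \mathcal{L}}{\partial x}=2x-\lambda=0, \quad \lambda(1-x)=0 \implies x^{*}=1, \: \lambda^{*}=2$$

    The multiplier \(\lambda^{*}=2>0\) signals that the constraint is active, matching the geometric answer: the unconstrained minimum \(x=0\) is infeasible, so the solution sits on the boundary \(x=1\).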

NOTES:




Machine Learning

  1. Theory:
    • ML from a Probabilistic Approach:
      When employing a probabilistic approach to doing “learning” (i.e. choosing the hypothesis), you are trying to find: What is the most probable hypothesis, given the Data.

    Why are NNs not enough?
    The gist of it is this: neural nets do pattern recognition, which achieves local generalization (which works great for supervised perception). But many simple problems require some (small) amount of abstract modeling, which modern neural nets can’t learn.

    Is there enough info in the labels to learn good, general features in Classification problems?

    Neural Tangent Kernel:

    NTK Theorem: A properly randomly initialized sufficiently wide deep neural network trained by gradient descent with infinitesimal step size (a.k.a. gradient flow) is equivalent to a kernel regression predictor with a deterministic kernel called neural tangent kernel (NTK).

    Thus, as width (of the NN) \(\rightarrow \infty\), the trajectory approaches the trajectory of GD for a kernel regression problem, where the (fixed) kernel in question is the so-called Neural Tangent Kernel (NTK). (For convolutional nets the kernel is the Convolutional NTK, or CNTK.)
    The paper proves that the evolution of an ANN during training can also be described by a kernel.

    Analysis and Discussion:

    • Start with: Fully-Connected Network, Any Depth, Lipschitz Non-Linearity
    • Gather all parameters in the network into one vector \(\theta\): initialized randomly, trained w/ GD
    • Since cost is non-convex, the analysis is difficult
    • Instead, study the network function \(f_{\theta} : \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L}\) which maps inputs to outputs
      We fully characterize the behavior of \(f_{\theta}\) in the infinite-width limit (# of neurons in hidden layer \(\rightarrow \infty\))
    • Behavior in the limit of the width:
      • In the limit, the network function has a Gaussian distribution at initialization
      • The effect of GD on a single point \(\boldsymbol{x}_ 0\) at initialization is to move that point and nearby points slightly
      • The difference between the two time-steps results in a smooth spike centered at \(\boldsymbol{x}_ 0\)
      • The difference is the same for different initializations.
      • As we increase the width of the network, these differences become even more similar
      • The behavior is linear i.e. adding another datapoint \(\boldsymbol{x}_ 1\) results in the two spikes being added up
    • This behavior in the limit can be nicely described by a kernel.
      The Neural Tangent Kernel (NTK):
      • Defined: in terms of the derivatives of the function wrt the parameters \(\theta\)
      • Describes: how modifying the network function \(f_{\theta}\) at the point \(\boldsymbol{x}\) will influence another point \(\boldsymbol{y}\)
    • Properties:
      • Finite Width:
        Depends on the parameters, thus it is:
        • Random at initialization
        • Time-dependent: varies during training
      • Infinite width limit:

        $$\Theta^{(L)}(x, y) \rightarrow \Theta_{\infty}^{(L)}(x, y) \quad \text{as} \quad n_i \rightarrow \infty \:\: \forall i \in \{1, \ldots, L-1\}$$

        Independent of the parameters, thus it is:

        • Deterministic: converges to a deterministic limit at initialization
        • Fixed: its rate of change during training \(\rightarrow 0\)

        This explains why the effect of GD was so similar for different initializations.

    • Now we have all the tools to fully describe the behavior of the network function during training:
      • Ex. Least-Squares Regression on 3 points:
        • Start w/ random Gaussian Process
        • Follow the kernel gradient of the cost wrt NTK
    • Kernel GD is simply a generalization of GD to Function Spaces
      • Because the cost is convex in function space the function will converge to the minimum if the kernel is Positive Definite
    • As width (of the NN) \(\rightarrow \infty\), the trajectory approaches the trajectory of GD for a kernel regression problem, where the (fixed) kernel in question is the so-called Neural Tangent Kernel (NTK). (For convolutional nets the kernel is the Convolutional NTK, or CNTK.)
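    A minimal sketch of the empirical (finite-width) NTK for a scalar-output network, via torch.autograd (all names and sizes are illustrative). The kernel is the inner product of the parameter-gradients of the network function at two inputs:

    ```python
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(2, 512), nn.Tanh(), nn.Linear(512, 1))

    def param_grad(net, x):
        """Flattened gradient of the scalar output f_theta(x) w.r.t. all parameters."""
        out = net(x).squeeze()
        grads = torch.autograd.grad(out, net.parameters())
        return torch.cat([g.reshape(-1) for g in grads])

    def empirical_ntk(net, x, y):
        # Theta(x, y) = < df/dtheta(x), df/dtheta(y) >
        return param_grad(net, x) @ param_grad(net, y)

    x0 = torch.tensor([[1.0, 0.0]])
    x1 = torch.tensor([[0.0, 1.0]])
    print(empirical_ntk(net, x0, x1).item())  # random at init; stabilizes as width grows
    ```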

    Notes:

    • Intuition of why DL Works:
      Circuit Theory: There are functions you can compute with a "small" L-layer deep NN that shallower networks require exponentially more hidden units to compute (this comes from viewing networks as logic gates).
      • Example:
        Computing \(x_1 \oplus x_2 \oplus \cdots \oplus x_n\) (\(\oplus = \text{XOR}\)) takes:
        • \(\mathcal{O}(\log n)\) depth in a tree representation (with \(\mathcal{O}(n)\) gates).
        • \(\mathcal{O}(2^n)\) hidden units in a one-hidden-layer network, because you need to exhaustively enumerate all \(2^n\) possible configurations of the input bits to decide whether the \(\text{XOR}\) is \(1\) or \(0\).
    • Curse of Dimensionality (ipynb)
    • The Hypothesis space of Neural Networks is Convex:
      Composition of affine-relu functions; induction.
    • Generalization in Deep Learning
    • Catastrophic Forgetting:
      • mitigating catastrophic forgetting [McCloskey and Cohen, 1989; Ratcliff, 1990; Kemker et al., 2017] by:
        • penalizing the norm of parameters when training on a new task [Kirkpatrick et al., 2017]
        • penalizing the norm of the difference between parameters for previously learned tasks during parameter updates [Hashimoto et al., 2016]
        • incrementally matching modes [Lee et al., 2017]
        • rehearsing on old tasks [Robins, 1995]
        • using adaptive memory buffers [Gepperth and Karaoguz, 2016]
        • finding task-specific paths through networks [Fernando et al., 2017]
        • packing new tasks into already trained networks [Mallya and Lazebnik, 2017]
      • Overcoming catastrophic forgetting in neural networks (paper)

  2. The Big Formulations:
    ML Formulation:
    Improve on TASK T with respect to PERFORMANCE METRIC P based on EXPERIENCE E.

    Problems in ML:

  1. Types of Learning:
    • Multi-Task Learning: general term for training on multiple tasks
      • Joint Learning: by choosing mini-batches from two different tasks simultaneously/alternately
      • Pre-Training: first train on one task, then train on another

        widely used for word embeddings

    • Transfer Learning:
      a type of multi-task learning where we are focused on one task: we learn on another task, then apply those models to our main task
    • Domain Adaptation:
      a type of transfer learning, where the output is the same, but we want to handle different inputs/topics/genres
    • Zero-Shot Learning:

    Notes:

    • Relationship between Supervised and Unsupervised Learning:
      Many ml algorithms can be used to perform both tasks. E.g., the chain rule of probability states that for a vector \(x \in \mathbb{R}^n\), the joint distribution can be decomposed as:
      \(p(\mathbf{x})=\prod_{i=1}^{n} p\left(\mathrm{x}_{i} \vert \mathrm{x}_{1}, \ldots, \mathrm{x}_{i-1}\right)\)
      which implies that we can solve the Unsupervised problem of modeling \(p(x)\) by splitting it into \(n\) supervised learning problems.
      Alternatively, we can solve the supervised learning problem of learning \(p(y \vert x)\) by using traditional unsupervised learning techniques to learn the joint distribution \(p(x, y)\), then inferring:
      \(p(y \vert \mathbf{x})=\frac{p(\mathbf{x}, y)}{\sum_{y^{\prime}} p\left(\mathbf{x}, y^{\prime}\right)}\)
    • Intuition on Why Unsupervised Learning works:
      • Goal: Learn Portuguese
      • For 1 month you listen to Portuguese on the radio (this is unlabeled data)
      • You develop an intuition for the language, phrases, and grammar (a model in your head)
      • It is easier to learn now from a tutor because you have a better (higher-level) representation of the data/language
  2. Linear Models:
    A Linear Model takes an input \(x\) and computes a signal \(s = \sum_{i=0}^d w_ix_i\) that is a linear combination of the input with weights, then applies a scoring function to the signal \(s\).
    • Linear Classifier as a Parametric Model:
      Linear classifiers \(f(x, W)=W x+b\) are an example of a parametric model that sums up the knowledge of the training data in the parameter: weight-matrix \(W\).
    • Scoring Function:
      • Linear Classification:
        \(h(x) = \text{sign}(s)\)
      • Linear Regression:
        \(h(x) = s\)
      • Logistic Regression:
        \(h(x) = \sigma(s)\)
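    A minimal sketch of the three scoring functions on the same signal (weights/inputs are illustrative; \(x_0 = 1\) absorbs the bias):

    ```python
    import numpy as np

    def sigmoid(s):
        return 1 / (1 + np.exp(-s))

    w = np.array([0.5, -1.0, 2.0])  # w_0 is the bias weight
    x = np.array([1.0, 0.3, 0.8])   # x_0 = 1

    s = w @ x                       # the linear signal s = sum_i w_i x_i

    print(np.sign(s))   # linear classification: h(x) = sign(s)
    print(s)            # linear regression:     h(x) = s
    print(sigmoid(s))   # logistic regression:   h(x) = sigma(s)
    ```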

    Notes:

    • The Weight vector of a linear signal is orthogonal to the decision boundary:
      The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
      If you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) (obtained by subtracting the two points) is parallel to the plane, and the following equations must hold:

      $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$

  3. Bias-Variance Decomposition Theory:
    Bias-Variance for Neural-Networks:

    Dealing with Bias and Variance for NN:

    • High Bias (\(E_{\text{train}}\)) \(\rightarrow\) (1) Bigger Net (2) Train longer (3) Different NN architecture
    • High Variance (\(E_{\text{dev}}\)) \(\rightarrow\) (1) More Data (2) Regularization (3) Different NN architecture

  4. Models:
    Parametric Models:
    A parametric model is a set of probability distributions indexed by a parameter \(\theta \in \Theta\). We denote this as:

    $$\{p(y ; \theta) \vert \theta \in \Theta\},$$

    where \(\theta\) is the parameter and \(\Theta\) is the Parameter-Space.

  5. Output Units/Functions:

  6. Model Varieties - Regression and Classification:
    Generalized Regression also known as Conditional Distribution Estimation:
    • Given \(x\), predict probability distribution \(p(y\vert x)\)
    • How do we represent the probability distribution?
      • We’ll consider parametric families of distributions
        • distribution represented by parameter vector
      • Examples:
        • Generalized Linear Models (GLM)
          • Logistic regression (Bernoulli distribution)
          • Probit regression (Bernoulli distribution)
          • Poisson regression (Poisson distribution)
          • Linear regression (Normal distribution, fixed variance)
        • Generalized Additive Models (GAM)
        • Gradient Boosting Machines (GBM) / AnyBoost

    Probabilistic Binary Classifiers:

    • Setting: \(X=\mathbb{R}^{d}, y=\{0,1\}\)
    • For each \(x\), need to predict a distribution on \(y=\{0,1\}\)
    • To define a distribution supported on \(\{0,1\}\), it is sufficient to specify the Bernoulli parameter \(\theta=p(y=1 \vert x)\)
    • We can refer to this distribution as \(\text{Bernoulli}(\theta)\)

    Linear Probabilistic Classifiers:

    • Setting: \(X=\mathbb{R}^{d}, y=\{0,1\}\)
    • Want the prediction function to map each \(x \in \mathbb{R}^{d}\) to the right \(\theta \in[0,1]\)
    • We first extract information from \(x \in \mathbb{R}^{d}\) and summarize it in a single number
      • That number is analogous to the Score in classification
    • For a linear method/model, this extraction is done with a linear function;

      $$\underbrace{x}_{\in \mathbb{R}^{d}} \mapsto \underbrace{w^{T} x}_{\in \mathbb{R}}$$

    • As usual, \(x \mapsto w^{T} x\) will include affine functions if we include a constant feature in \(x\)
    • \(w^Tx\) is called the linear predictor
    • Still need to map this to \([0, 1]\); we do so by Transfer/Response/Inverse-Link function; usually the logistic function (Sigmoid)

      It's a function that maps the linear predictor in \(\mathbb{R}\) to \([0,1]\):
      \(\underbrace{x}_ {\in \mathbb{R}^{d}} \mapsto \underbrace{w^{T} x}_{\in \mathbb{R}} \mapsto \underbrace{f\left(w^{T} x\right)}_{\in[0,1]}=\theta = p(y=1 \vert x)\)

    Learning:
    The hypothesis space/set:

    $$\mathcal{H}=\left\{x \mapsto f\left(w^{T} x\right) \vert w \in \mathbb{R}^{d}\right\}$$

    where the only “parameter” in this model is \(w \in \mathbb{R}^d\).
    We can choose \(w\) using maximum likelihood:
    Likelihood Scoring | Bernoulli Regression:

    • Suppose we have data \(\mathcal{D}=\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}\)
    • The model likelihood for \(\mathcal{D}\):

      $$\begin{aligned} p_{w}(\mathcal{D}) &=\prod_{i=1}^{n} p_{w}\left(y_{i} \vert x_{i}\right)[\text { by independence }] \\ &=\prod_{i=1}^{n}\left[f\left(w^{T} x_{i}\right)\right]^{y_{i}}\left[1-f\left(w^{T} x_{i}\right)\right]^{1-y_{i}} \end{aligned}$$

      • The probability of each data-point, \(p_w(y_i \vert x_i)\), can be written compactly as \(\left[f\left(w^{T} x_{i}\right)\right]^{y_{i}}\left[1-f\left(w^{T} x_{i}\right)\right]^{1-y_{i}}\), which captures both cases: \(p_w(y_i = 1 \vert x_i) = f\left(w^{T} x_{i}\right)\) and \(p_w(y_i = 0 \vert x_i) = 1 - f\left(w^{T} x_{i}\right)\)
    • The log likelihood:

      $$\log p_{w}(\mathcal{D})=\sum_{i=1}^{n} y_{i} \log f\left(w^{T} x_{i}\right)+\left(1-y_{i}\right) \log \left[1-f\left(w^{T} x_{i}\right)\right]$$

    • Equivalently, minimize the objective function:

      $$J(w)=-\left[\sum_{i=1}^{n} y_{i} \log f\left(w^{T} x_{i}\right)+\left(1-y_{i}\right) \log \left[1-f\left(w^{T} x_{i}\right)\right]\right]$$
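    A minimal numpy sketch of minimizing \(J(w)\) by gradient descent for \(f = \sigma\) (i.e. logistic regression), on toy data with illustrative names; for this \(f\) the gradient simplifies to \(X^{\top}(\sigma(Xw)-y)\):

    ```python
    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def J(w, X, y, eps=1e-12):
        """Negative Bernoulli log-likelihood of the data."""
        p = np.clip(sigmoid(X @ w), eps, 1 - eps)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def J_grad(w, X, y):
        return X.T @ (sigmoid(X @ w) - y)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # toy labels

    w = np.zeros(3)
    for _ in range(500):
        w -= 0.01 * J_grad(w, X, y)  # gradient-descent step
    print(w, J(w, X, y))
    ```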

    Gaussian Linear Regression/Conditional Gaussian Regression:

    Generalized Regression as Statistical Learning:

    Generalized Linear Models:

  7. Bayesian Conditional Probabilistic Models:
  8. Recommendation Systems:
  9. Regularization:
    • Overfitting and Regularization:
      To address over-fitting:
      • Increase size of training dataset
      • Reduce number of features
      • Do Regularization:
        • Keep all the features but reduce the magnitude/values of the weights/parameters
        • Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)
    • Regularization Theory:
    • Theoretical Justification for Regularization:
      A theoretical justification for regularization is that it attempts to impose Occam’s razor on the solution.
      From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
    • Tikhonov Regularization: is essentially a trade-off between fitting the data and reducing a norm of the solution.

    Data Regularization:

    • The Design Matrix contains sample points in each row
    • Feature Scaling/Mean Normalization (of data):
      • Define the mean \(\mu_j\) of each feature of the datapoints \(x^{(i)}\):

      $$\mu_{j}=\frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}$$

      • Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
    • Centering: subtracting \(\mu\) from each row of \(X\)
    • Sphering: applying the transform \(X' = X \Sigma^{-1/2}\)
    • Whitening: Centering + Sphering (also known as Decorrelating feature space)
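    A minimal numpy sketch of these transforms (toy correlated Gaussian data; `eigh` is used to form \(\Sigma^{-1/2}\), valid since \(\Sigma\) is symmetric PSD):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([3.0, -1.0], [[2.0, 1.2], [1.2, 1.0]], size=1000)

    # Centering: subtract the per-feature mean from each row
    Xc = X - X.mean(axis=0)

    # Sphering: X' = Xc @ Sigma^{-1/2}
    Sigma = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    Xw = Xc @ Sigma_inv_sqrt

    # Whitening = Centering + Sphering: covariance is (approximately) the identity
    print(np.cov(Xw, rowvar=False).round(2))
    ```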

    Pre-processing for Deep Learning (data regularization)

    Ch.7 dl-book summary

  10. Aggregation - Ensemble Methods:

    • Boosting: create different hypotheses \(h_i\) sequentially + make each new hypothesis decorrelated with the previous hypotheses.
      • Assumes that this will be combined/ensembled
      • Ensures that each new model/hypothesis will give a different/independent output
  11. Kernels:
    • Kernels (implemented in the sketch at the end of this section):
      • Polynomial Kernel of degree, exactly, \(d\):

        $$K(\mathbf{u}, \mathbf{v})=(\mathbf{u} \cdot \mathbf{v})^{d}$$

      • Polynomial Kernel of degree, up to, \(d\):

        $$K(\mathbf{u}, \mathbf{v})=(\mathbf{u} \cdot \mathbf{v}+1)^{d}$$

      • Gaussian Kernel:

        $$K(\vec{u}, \vec{v})=\exp \left(-\frac{\|\vec{u}-\vec{v}\|_ {2}^{2}}{2 \sigma^{2}}\right)$$

      • Sigmoid Kernel:

        $$K(\mathbf{u}, \mathbf{v})=\tanh (\eta \mathbf{u} \cdot \mathbf{v}+\nu)$$

    • Local Kernels: a kernel where \(k(u, v)\) is large when \(u=v\) and decreases as \(u\) and \(v\) grow further apart from each other.
      A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example \(x\) resembles each training example \(x^{(i)}\).
    • The kernel trick can NOT be applied to any learning algorithm

    • Kernel Regression Introduction
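    A minimal numpy sketch of the kernels listed above (parameter values are illustrative):

    ```python
    import numpy as np

    def poly_exact(u, v, d):
        return (u @ v) ** d      # polynomial kernel of degree exactly d

    def poly_up_to(u, v, d):
        return (u @ v + 1) ** d  # polynomial kernel of degree up to d

    def gaussian(u, v, sigma):
        return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

    def sigmoid_kernel(u, v, eta, nu):
        return np.tanh(eta * (u @ v) + nu)

    u = np.array([1.0, 2.0])
    v = np.array([0.5, -1.0])
    print(poly_up_to(u, v, 2), gaussian(u, v, 1.0), sigmoid_kernel(u, v, 0.5, 0.0))
    ```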
  12. Curse of Dimensionality:
    The Curse of Dimensionality, in general, refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces and that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

    Common Theme:
    When the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties (clustering); in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

    Sampling:
    The sampling density is proportional to \(N^{1/p}\), where \(p\) is the dimension of the input space and \(N\) is the sample size. Thus, if \(N_1 = 100\) represents a dense sample for a single input problem, then \(N_{10} = 100^{10}\) is the sample size required for the same sampling density with \(10\) inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.

    Story on Geometry in high-dimensional spaces

  13. Notes:
    • Complexity:
      • Caching the activations of a NN:
        We need to cache the activation vectors of a NN after each layer \(Z^{[l]}\) because they are required in the backward computation.
    • Initializations:
      • Initializing NN:
        • Don’t initialize the weights to Zero. The symmetry of hidden units results in a similar computation for each hidden unit, making all the rows of the weight matrix to be equal (by induction).
        • It’s OK to initialize the bias term to zero.
        • Since a neuron takes the sum of \(N\) inputs \(\times\) weights, if \(N\) is large, you want smaller \(w_i\)s. You want to initialize with a variance \(\propto \dfrac{1}{n}\) (i.e. multiply by \(\dfrac{1}{\sqrt{n}}\); \(n\) is the number of weights in the previous layer).
          This doesn't solve the vanishing/exploding gradient problem, but it reduces it, because \(z\) then takes a similar distribution at every layer.
          • Xavier Initialization: assumes \(\tanh\) activation; uses the logic above; samples from a normal distribution and multiplies by \(\dfrac{1}{\sqrt{n}}\).
          • If using a ReLU activation, it turns out to be better to make the variance \(\propto \dfrac{2}{n}\) instead (He initialization).
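          A minimal numpy sketch of the two scalings (layer sizes are illustrative):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          n = 512  # fan-in: number of weights feeding each neuron

          # Xavier-style (tanh): variance ~ 1/n  ->  scale by 1/sqrt(n)
          W_xavier = rng.standard_normal((256, n)) / np.sqrt(n)

          # He-style (ReLU): variance ~ 2/n  ->  scale by sqrt(2/n)
          W_he = rng.standard_normal((256, n)) * np.sqrt(2 / n)

          # Pre-activations z = Wx keep variance near 1 (resp. 2) regardless of n
          x = rng.standard_normal(n)
          print((W_xavier @ x).var(), (W_he @ x).var())
          ```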
    • Training:
    • Feature Importance:
      • In linear models, feature importance can be calculated by the scale of the coefficients
      • In tree-based methods (such as random forest), important features are likely to appear closer to the root of the tree. We can get a feature's importance for random forest by computing the average depth at which it appears across all trees in the forest
    • Probabilistic Calibration:
    • Complexity in ML:
      • Definitions of the complexity of an object (\(h\)):
        • Minimum Description Length (MDL): the number of bits for specifying an object.
        • Order of a Polynomial
      • Definitions of the complexity of a class of objects (\(\mathcal{H}\)):
        • Entropy
        • VC-dim
    • Gaussian Discriminant Analysis:
      • models \(P(Y=y \vert X)\) as a logistic function.
      • is a generative model.
      • can be used to classify points without ever computing an exponential
      • decision boundary shapes:
        • Hyperplane
        • Nonlinear quadric surface (quadric = the isosurface of a quadratic function)
        • The empty set (the classifier always returns the same class)
      • Logistic Regression vs LDA? (ESL)
    • Geometry of Gaussian Distributions:
      • Multivariate Normal Distribution:
        • Isotropic:
          • I.E. Isosurfaces are spheres
          • Covariance Matrix \(\Sigma = \sigma^2 I\)
            where \(\sigma^2\) is the variance of any one feature.
    • The Bias Parameter:
    • When does an ML problem become a Research Problem:
      A problem that you are trying to solve using ML becomes a research problem as opposed to those solved by applied practitioners when the only way to learn the problem is to improve the learning algorithm itself. This happens in two situations:
      1. You can fit the training data well but cannot improve generalization error:
        This happens when: It is not feasible to gather more data
      2. You cannot fit the training data well even when you increase the capacity of your models as much as you “feasibly” can
    • Bayesian Deep Learning:
    • Learning Problems:
      • Counting/Arithmetic
      • Copying
      • Identity Mapping
      • Pointing (Copying?)
    • Learning Features:
      • Higher-Order/k-Order Interactions
      • Global Dependencies
      • Local Dependencies
      • Self-Similarity
      • Non-Locality (Non-local Neural Networks (paper!))
      • Long-Range Dependencies
      • Memory: Long-Term, Short-Term (working memory)
        • Associative Retrieval
      • Mathematical (Symbolic?) Manipulation: Arithmetic?: Counting
    • Input Representations:
      • Localist Representations
      • Point Attractors


Computer Vision



NLP

The Norvig-Chomsky Debate
Modern Deep Learning Techniques Applied to Natural Language Processing (Amazing Resource)
Deep Learning for NLP: Advancements & Trends

  1. Language Modeling:
    Towards Better Language Modeling (Lec.9 highlight, 38m):

    To improve a Language Model:

    1. Better Inputs: Word \(\rightarrow\) Subword \(\rightarrow\) Char

      Subword Language Modeling , Mikolov et al. 2012
      Character-Aware Neural Language Model , Kim et al. 2015.
    2. Better Regularization/Preprocessing:
      Similar to computer vision, we can do both Regularization and Preprocessing on the data to increase its relevance to the true distribution.
      Preprocessing acts as a data augmentation technique. This allows us to achieve a smoother distribution, since we are removing more common words and reinforcing rarer words.
      Zoneout, Kruger et al. 2016
      Data Noising as Smoothing, Xie et al. 2016
      • Regularization:
        • Use Dropout (Zaremba, et al. 2014).
        • Use Stochastic FeedForward depth (Huang et al. 2016)
        • Use Norm Stabilization (Memisevic 2015) …
      • Preprocessing:
        • Randomly replacing words in a sentence with other words
        • Use bigram statistics to generate Kneser-Ney inspired replacement (Xie et al. 2016).
        • Replace a word with fixed drop rate
        • Replace a word with adaptive drop rate, by how rare two words appear together (i.e. “Humpty Dumpty”), and replace by a unigram draw over vocab
        • Replace a word with adaptive drop rate, and draw word from a proposal distribution (i.e. “New York”)
    3. Better Model (+ all above)

    Recurrent Neural Networks as Language Models:

    Notes:

    • The ML-Estimate of \(p(w_i \vert w_{i-1})\) \(= \dfrac{c(w_{i-1}\:, w_i)}{\sum_{w} c(w_{i-1}\:, w)}\)
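      A minimal sketch of this count-based estimate (toy corpus; note that \(\sum_{w} c(w_{i-1}, w) = c(w_{i-1})\)):

      ```python
      from collections import Counter

      corpus = "the cat sat on the mat the cat ran".split()
      bigrams = Counter(zip(corpus, corpus[1:]))
      histories = Counter(corpus[:-1])  # c(w_{i-1}) counted as a history

      def p_ml(w_prev, w):
          # c(w_{i-1}, w_i) / sum_w c(w_{i-1}, w)
          return bigrams[(w_prev, w)] / histories[w_prev]

      print(p_ml("the", "cat"))  # 2/3
      ```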

  2. Neural Text Generation:
    • Traditionally:
      • Often Auto-regressive language models (ie. seq2seq)
      • These models generate text by sampling words sequentially, with each word conditioned on the previously generated words
      • Benchmarked on validation perplexity even though this is not a direct measure of the quality of the generated text
      • The models are typically trained via maximum likelihood and teacher forcing

        These methods are well-suited to optimizing perplexity but can result in poor sample quality since generating text requires conditioning on sequences of words that may have never been observed at training time

  3. NLP & DL (R. Sorcher):
  4. Text Classification:
    Word-Window Classification:

    CNN Text Classification:

  5. Coreference Resolution:
    Coreference Resolution: Identify all mentions that refer to the same real world entity.

    Applications:

    • Full text understanding:
      • information extraction, question answering, summarization, …
      • “He was born in 1961” (Who?)
    • Machine Translation:
      • languages have different features for gender, number, dropped pronouns, etc.
    • Dialogue Systems:
      “Book tickets to see James Bond
      Spectre is playing near you at 2:00 and 3:00 today. How many tickets would you like?”
      Two tickets for the showing at three

    An approach for Coref-Res in 2 steps:

    1. Detect the Mentions (easy)
      “Book tickets to see James Bond
      Spectre is playing near you at 2:00 and 3:00 today. How many tickets would you like?”
      Two tickets for the showing at three
    2. Cluster the Mentions (hard)
      “Book tickets to see James Bond
      Spectre is playing near you at 2:00 and 3:00 today. How many tickets would you like?”
      Two tickets for the showing at three

    Mention Detection:
    Mention: span of text referring to some entity.

    Types of Mention:

    1. Pronouns
      “I”, “your”, “it”, “she”, “him”
    2. Named Entities
      People, places, etc.
    3. Noun Phrases
      “a dog”, “the big fluffy cat stuck in the tree”

    Detection:
    Use other NLP systems for the detection task:

    1. Pronouns: POS-Tagger
    2. Named Entities: NER
    3. Noun Phrases: (Constituency) Parser

    Problem with Detection - Extra/bad mentions:
    Notice that the systems above will over-generate, proposing mentions that don't have a concrete entity that they refer to: e.g. [It] is sunny, [Every student], [No student], [The best donut in the world], [100 miles].

    Dealing with bad mentions:

    • Train a classifier to filter out spurious mentions
    • (more commonly) keep all mentions as “candidate mentions”
      • After your coreference system is done running discard all singleton mentions (i.e., ones that have not been marked as coreference with anything else)

    Continue Lecture (CS224N)

  6. Word Embeddings:
    Word Vectors:

    Notes:

    • Categorization is a method for evaluating w2v embeddings: create categories by clustering, then measure the purity of the clusters

Notes:



Resources: