Gradient-Based Optimization

  1. Define Gradient Methods:
    Gradient Methods are algorithms for solving (optimization) minimization problems of the form:

    $$\min_{x \in \mathbb{R}^{n}} f(x)$$

    by doing local search (~ hill-climbing) with the search directions defined by the gradient of the function at the current point.

  2. Give examples of Gradient-Based Algorithms:
    1. Gradient Descent: minimizes arbitrary differentiable functions
    2. Conjugate Gradient: solves (possibly sparse) linear systems with symmetric, positive-definite matrices (equivalently, minimizes the corresponding quadratic form)
    3. Coordinate Descent: minimizes a multivariate function by optimizing over one coordinate (or block of coordinates) at a time
  3. What is Gradient Descent:
    GD is a first-order, iterative algorithm to minimize an objective function \(J(\theta)\), parametrized by a model's parameters \(\theta \in \mathbb{R}^d\), by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_{\theta} J(\theta)\) w.r.t. the parameters.
  4. Explain it intuitively:
    1) Local Search from a starting location on a hill
    2) Feel around how a small movement/step around your location would change the height of the surrounding hill (is the ground higher or lower)
    3) Make the movement/step consistent as a small fixed step along some direction
    4) Measure the height of the hill at the new location in the chosen direction
    5) Do so by approximating the height with some local information
    6) Measure the change in height from your current point to the new point
    7) Find the direction that decreases the height the most
    \(\iff\)

    1) Local Search from an initial point \(x_0\) on a function
    2) Explore the value of the function at different small nudges around \(x_0\)
    3) Make the nudges consistent as a small fixed step \(\delta\) along a normalized direction \(\hat{\boldsymbol{u}}\)
    4) Evaluate the function at the new location \(x_0 + \delta \hat{\boldsymbol{u}}\)
    5) Do so by Approximating the function w/ first-order information (Taylor expansion)
    6) Measure the change in value of the objective \(\Delta f\), from current point to the new point
    7) Find the direction \(\hat{\boldsymbol{u}}\) that minimizes the function the most

  5. Give its derivation:
    1) Start at \(\boldsymbol{x} = x_0\)
    2) We would like to know how a small change in \(\boldsymbol{x}\), namely \(\Delta \boldsymbol{x}\), would affect the value of the function \(f(x)\). This will allow us to evaluate the function:

    $$f(\mathbf{x}+\Delta \mathbf{x})$$

    to find the direction that makes \(f\) decrease the fastest
    3) Let’s set up \(\Delta \boldsymbol{x}\), the change in \(\boldsymbol{x}\), as a fixed step \(\delta\) along some normalized direction \(\hat{\boldsymbol{u}}\):

    $$\Delta \boldsymbol{x} = \delta \hat{\boldsymbol{u}}$$

    4) Evaluate the function at the new location:

    $$f(\mathbf{x}+\Delta \mathbf{x}) = f(\mathbf{x}+\delta \hat{\boldsymbol{u}})$$

    5) Using the first-order approximation:

    $$\begin{aligned} f(\boldsymbol{x}_0+\delta \hat{\boldsymbol{u}}) &\approx f({\boldsymbol{x}}_0) + \left(({\boldsymbol{x}}_0 + \delta \hat{\boldsymbol{u}}) - {\boldsymbol{x}}_0 \right)^T \nabla_{\boldsymbol{x}} f({\boldsymbol{x}}_0) \\ &= f(\boldsymbol{x}_0) + \delta \hat{\boldsymbol{u}}^T \nabla_x f(\boldsymbol{x}_0) \end{aligned}$$

    6) The change in \(f\) is:

    $$\begin{aligned} \Delta f &= f(\boldsymbol{x}_ 0 + \Delta \boldsymbol{x}) - f(\boldsymbol{x}_ 0) \\ &= f(\boldsymbol{x}_ 0 + \delta \hat{\boldsymbol{u}}) - f(\boldsymbol{x}_ 0)\\ &= \delta \nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} + \mathcal{O}(\delta^2) \\ &= \delta \nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} \\ &\geq -\delta\|\nabla f(\boldsymbol{x}_ 0)\|_ 2 \end{aligned}$$

    where

    $$\nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} \in \left[-\|\nabla f(\boldsymbol{x}_ 0)\|_ 2, \|\nabla f(\boldsymbol{x}_ 0)\|_ 2\right]$$

    since \(\hat{\boldsymbol{u}}\) is a unit vector, the dot product is bounded in magnitude by \(\|\nabla_x f(\boldsymbol{x}_ 0)\|_ 2\) (Cauchy–Schwarz); the extremes are attained when \(\hat{\boldsymbol{u}}\) is aligned with \(\nabla_x f(\boldsymbol{x}_ 0)\) or points in the opposite direction.
    7) So, the \(\hat{\boldsymbol{u}}\) that turns the above inequality into an equality achieves the largest negative change (moves the most downhill). That \(\hat{\boldsymbol{u}}\) points in the negative direction of \(\nabla_x f(\boldsymbol{x}_ 0)\); the opposite direction of the gradient.
    \(\implies\)

    $$\hat{\boldsymbol{u}} = - \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2}$$

    Finally, the method of steepest/gradient descent proposes a new point to decrease the value of \(f\):

    $$\boldsymbol{x}^{\prime}=\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$$

    where \(\epsilon\) is the learning rate, defined as:

    $$\epsilon = \dfrac{\delta}{\left\|\nabla_{x} f(x)\right\|_ {2}}$$
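
    Below is a minimal NumPy sketch of the update rule above, assuming a user-supplied gradient function `grad_f` (an illustrative name, not from the notes) and a fixed learning rate; it stops once every element of the gradient is (near) zero.

    ```python
    import numpy as np

    def gradient_descent(grad_f, x0, lr=0.1, tol=1e-8, max_iter=1000):
        """Minimize f by repeatedly stepping opposite its gradient: x' = x - lr * grad_f(x)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:   # converged: gradient ~ 0
                break
            x = x - lr * g
        return x

    # Example: f(x) = ||x - 3||^2 has gradient 2 * (x - 3) and minimum at x = (3, 3).
    x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=np.zeros(2))
    ```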

  6. What is the learning rate?
    The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient; defined as:

    $$\epsilon = \dfrac{\delta}{\left\|\nabla_{x} f(x)\right\|_ {2}}$$

    1. Where does it come from?
      The learning rate comes from a modification of the step-size in the GD derivation.
      We get the learning rate by employing a simple idea:
      We have a fixed step-size \(\delta\) that dictates how far we move in the direction of steepest descent. However, we would like to keep the step-size from being too small or from overshooting. The idea is to make the step-size proportional to the magnitude of the gradient (i.e. some constant multiplied by the magnitude of the gradient):

      $$\delta = \epsilon \left\|\nabla_{x} f(x)\right\|_ {2}$$

      If we do so, we get a nice cancellation as follows:

      $$\begin{aligned}\Delta \boldsymbol{x} &= \delta \hat{\boldsymbol{u}} \\ &= -\delta \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2} \\ &= - \epsilon \left\|\nabla_{x} f(x)\right\|_ {2} \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2} \\ &= - \dfrac{\epsilon \left\|\nabla_{x} f(x)\right\|_ {2}}{\|\nabla_x f(x)\|_ 2} \nabla_x f(x) \\ &= - \epsilon \nabla_x f(x) \end{aligned}$$

      where now we have a fixed learning rate instead of a fixed step-size.

    1. What is its range?
      Typically \([0.0001, 0.4]\) (a rule of thumb; tuned per problem).
    2. How do we choose the learning rate?
      1. Choose a starting lr from the range \([0.0001, 0.4]\) and adjust using cross-validation (lr is a HP).
      2. Do
        1. Set it to a small constant
        2. Smooth Functions (non-NN):
          1. Line Search: evaluate \(f\left(\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right)\) for several values of \(\epsilon\) and choose the one that results in the smallest objective value.
            E.g. Secant Method, Newton-Raphson Method (may need Hessian, hard for large dims)
          2. Trust Region Method
        3. Non-Smooth Functions (non-NN):
          1. Direct Line Search
            E.g. golden section search
        4. Grid Search: is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. It is guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
        5. Population-Based Training (PBT): is an elegant implementation of using a genetic algorithm for hyper-parameter choice.
        6. Bayesian Optimization: is a global optimization method for noisy black-box functions.
      3. Use larger lr and use a learning rate schedule

  7. Describe the convergence of the algorithm:
    Gradient Descent converges when every element of the gradient is zero, or very close to zero within some threshold.

    With certain assumptions on \(f\) (convex, \(\nabla f\) lipschitz) and particular choices of \(\epsilon\) (chosen via line-search etc.), convergence to a local minimum can be guaranteed.
    Moreover, if \(f\) is convex, all local minima are global minima, so convergence is to the global minimum.

  8. How does GD relate to Euler?
    Gradient descent can be viewed as applying Euler’s method to the gradient-flow ordinary differential equation \({\displaystyle x'(t)=-\nabla f(x(t))}\).
  9. List the variants of GD:
    1. Batch Gradient Descent
    2. Stochastic GD
    3. Mini-Batch GD
    4. How do they differ? Why?:
      There are three variants of gradient descent, which differ in the amount of data used to compute the gradient. The amount of data imposes a trade-off between the accuracy of the parameter updates and the time it takes to perform the update.
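
      A small sketch of that trade-off, assuming a callable `grad_loss(theta, X_batch, y_batch)` (an illustrative name) that returns the gradient of the loss on a batch: `batch_size = len(X)` gives Batch GD, `1` gives Stochastic GD, and anything in between gives Mini-Batch GD.

      ```python
      import numpy as np

      def minibatch_gd(X, y, grad_loss, theta0, lr=0.01, batch_size=32, epochs=10, seed=0):
          """Run (mini-batch) gradient descent; the batch size sets the accuracy/speed trade-off."""
          rng = np.random.default_rng(seed)
          theta = np.asarray(theta0, dtype=float)
          n = len(X)
          for _ in range(epochs):
              order = rng.permutation(n)                     # reshuffle the data each epoch
              for start in range(0, n, batch_size):
                  idx = order[start:start + batch_size]
                  theta = theta - lr * grad_loss(theta, X[idx], y[idx])
          return theta
      ```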

  10. What is the problem of vanilla approaches to GD?
    All the variants described above, however, do not guarantee “good” convergence due to some challenges.

  11. List the different strategies for optimizing GD:
    1. Adapt our updates to the slope of our error function
    2. Adapt our updates to the importance of each individual parameter
  12. List the different variants for optimizing GD:
    1. Adapt our updates to the slope of our error function:
      1. Momentum
      2. Nesterov Accelerated Gradient
    2. Adapt our updates to the importance of each individual parameter:
      1. Adagrad
      2. Adadelta
      3. RMSprop
      4. Adam
      5. AdaMax
      6. Nadam
      7. AMSGrad


Maximum Margin Classifiers

  1. Define Margin Classifiers:
    A margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example.
  2. What is a Margin for a linear classifier?
    The margin of a linear classifier is the distance from the decision boundary to the nearest sample point.
  3. Give the motivation for margin classifiers:
    Non-margin classifiers (e.g. Centroid, Perceptron, LR) will converge to a correct classifier on linearly separable data; however, the classifier they converge to is neither unique nor necessarily the best.

  4. Define the notion of the “best” possible classifier
    We assume that if we can maximize the distance between the data points to be classified and the hyperplane that classifies them, then we have reached a boundary that allows for the “best-fit”, i.e. allows for the most room for error.
  5. How can we achieve the “best” classifier?
    We enforce a constraint that achieves a classifier that has a maximum-margin.
  6. What unique vector is orthogonal to the hp? Prove it:
    The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
    Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:

    $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$

  7. What do we mean by “signed distance”? Derive its formula:
    The signed distance is the perpendicular distance from a point to a hyperplane, with a sign indicating which side of the hyperplane the point lies on. Solving for its magnitude yields the formula: \(d = \dfrac{\vert w \cdot \mathbf{x}_ n + b \vert}{\|w\|},\)
    where we have an n-dimensional hyperplane: \(w \cdot x + b = 0\) and a point \(\mathbf{x}_ n\).
    Derivation:
    1. Suppose we have an affine hyperplane defined by \(w \cdot x + b\) and a point \(\mathbf{x}_ n\).
    2. Suppose that \(\mathbf{x} \in \mathbf{R}^n\) is a point satisfying \(w \cdot \mathbf{x} + b = 0\), i.e. it is a point on the plane.
    3. We construct the vector \(\mathbf{x}_ n−\mathbf{x}\) which points from \(\mathbf{x}\) to \(\mathbf{x}_ n\), and then, project (scalar projection==signed distance) it onto the unique vector perpendicular to the plane, i.e. \(w\),

      $$d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \left| \frac{(\mathbf{x}_ n-\mathbf{x})\cdot w}{\|w\|} \right| = \frac{|\mathbf{x}_ n \cdot w - \mathbf{x} \cdot w|}{\|w\|}.$$

    4. Since \(\mathbf{x}\) is a vector on the plane, it must satisfy \(w\cdot \mathbf{x}=-b\) so we get

      $$d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \frac{|\mathbf{x}_ n \cdot w +b|}{\|w\|}$$

    Thus, we conclude that if \(\|w\| = 1\), the distance from a datapoint \(X_i\) to the hyperplane is simply \(\vert w \cdot X_i + b\vert\).
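
    A small sketch of the distance formula above (the function name is illustrative):

    ```python
    import numpy as np

    def signed_distance(w, b, x):
        """Signed distance from point x to the hyperplane w . x + b = 0."""
        w, x = np.asarray(w, dtype=float), np.asarray(x, dtype=float)
        return (w @ x + b) / np.linalg.norm(w)

    # Example: the plane x1 + x2 - 1 = 0 and the point (2, 2): distance = 3 / sqrt(2).
    d = signed_distance([1.0, 1.0], -1.0, [2.0, 2.0])
    ```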

  8. Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
    Let \(X_n\) be the point that is closest to the plane, and let \(\hat{w} = \dfrac{w}{\|w\|}\).
    Take any point \(X\) on the plane, and let \(\vec{v}\) be the vector \(\vec{v} = X_n - X\).
    Now, the distance, \(d\) is equal to

    $$\begin{align} d & \ = \vert\hat{w}^{\top}\vec{v}\vert \\ & \ = \vert\hat{w}^{\top}(X_n - X)\vert \\ & \ = \vert\hat{w}^{\top}X_n - \hat{w}^{\top}X\vert \\ & \ = \dfrac{1}{\|w\|}\vert w^{\top}X_n + b - w^{\top}X - b\vert , & \text{we add and subtract the bias } b\\ & \ = \dfrac{1}{\|w\|}\vert (w^{\top}X_n + b) - (w^{\top}X + b)\vert \\ & \ = \dfrac{1}{\|w\|}\vert (w^{\top}X_n + b) - (0)\vert , & \text{from the eq. of the plane on a point on the plane} \\ & \ = \dfrac{\vert (w^{\top}X_n + b)\vert}{\|w\|} \end{align} $$

  9. Use geometric properties of the hp to Simplify the expression for the distance of the closest point to the hp, above
    First, we notice that for any given plane \(w^Tx = 0\), the equations \(\gamma \, w^Tx = 0\), where \(\gamma \in \mathbf{R},\ \gamma \neq 0\) is a scalar, all characterize the same plane, not many different planes.
    This is because \(w^Tx = 0 \iff \gamma * w^Tx = \gamma * 0 \iff \gamma * w^Tx = 0\).
    The above implies that any model that takes input \(w\) and produces a margin, will have to be Scale Invariant.
    To get around this and simplify the analysis, I am going to consider all the representations of the same plane, and I am going to pick one where we normalize (re-scale) the weight \(w\) such that the signed distance (distance to the point closest to the margin) is equal to one:

    $$|w^Tx_n| > 0 \rightarrow |w^Tx_n| = 1$$

    where \(x_n\) is the point closest to the plane.
    We constrain the hyperplane by normalizing \(w\) so that \(|w^Tx_n| = 1\), or with the added bias, \(|w^Tx_n + b| = 1\), for the closest point.
    \(\implies\)

    $$\begin{align} d & \ = \dfrac{\vert (w^{\top}X_n + b)\vert}{\|w\|} \\ & \ = \dfrac{\vert (1)\vert}{\|w\|} , & \text{from the constraint on the distance of the closest point} \\ & \ = \dfrac{1}{\|w\|} \end{align} $$

  10. Characterize the margin, mathematically:
    we can characterize the margin, with its size, as the distance, \(\frac{1}{\|\mathbf{w}\|}\), between the hyperplane/boundary and the closest point to the plane \(\mathbf{x}_ n\), in both directions (multiply by 2) \(= \frac{2}{\|\mathbf{w}\|}\) ; given the condition we specified earlier \(\left|\mathbf{w}^{\top} \mathbf{x}_ {n} + b\right|=1\) for the closest point \(\mathbf{x}_ n\).
  11. Characterize the “Slab Existence”:
    The analysis done above allows us to conclusively prove that there exists a slab of width \(\dfrac{2}{\|w\|}\) containing no sample points where the hyperplane runs through (bisects) its center.
  12. Formulate the optimization problem of maximizing the margin wrt analysis above:
    We formulate the optimization problem of maximizing the margin by maximizing the distance, subject to the condition on how we derived the distance:

    $$\max_{\mathbf{w}} \dfrac{2}{\|\mathbf{w}\|} \:\:\: : \:\: \min _{n=1,2, \ldots, N}\left|\mathbf{w}^{\top} \mathbf{x}_{n}+b\right|=1$$

  13. Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
    We can reformulate by (1) flipping and minimizing, (2) taking a square since it’s monotonic and convex, and (3) noticing that \(\left|\mathbf{w}^{T} \mathbf{x}_ {n}+b\right|=y_{n}\left(\mathbf{w}^{T} \mathbf{x}_ {n}+b\right)\) (since the signal and label must agree, their product will always be positive), and that the \(\min\) operator can be replaced by requiring that the condition \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1\) holds for all the points (which can be proved by contradiction):

    $$\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall n \in [1,N]$$

    Now when we solve the “friendly” equation above, we will get the separating plane with the best possible margin (best=biggest).

    1. Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:

      $$\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall n \in [1,N]$$

    2. What kind of formulation is it (wrt optimization)? What are the parameters?
      The above problem is a Quadratic Program, in \(d + 1\)-dimensions and \(n\)-constraints, in standard form.

Hard-Margin SVMs

  1. Define:
    1. SVMs:
      **Support Vector Machines** (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
      The SVM is a Maximum Margin Classifier that aims to find the “maximum-margin hyperplane” that divides the group of points \({\vec {x}}_{i}\) for which \({\displaystyle y_{i}=1}\) from the group of points for which \({\displaystyle y_{i}=-1}\).
    2. Support Vectors:
      **Support Vectors** are the data-points that lie exactly on the margin (i.e. on the boundary of the slab).
      They satisfy \(\vert w^TX' + b\vert = 1, \forall\) support vectors \(X'\)
    3. Hard-Margin SVM:
      The Hard-Margin SVM is just a maximum-margin classifier with features and kernels (discussed later).
  2. Define the following wrt hard-margin SVM:
    1. Goal:
      Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classifies the data-points and, moreover, does so in the “best” possible way.
    2. Procedure:
      (1) Use a linear classifier (2) But, Maximize the Margin (3) Do so by Minimizing \(\|w\|\)
    3. Decision Function:

      $${\displaystyle f(x)={\begin{cases}1&{\text{if }}\ w\cdot x+b>0\\0&{\text{otherwise}}\end{cases}}}$$

    4. Constraints:

      $$y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$

    5. The Optimization Problem:
      Find weights ‘\(w\)’ and scalar ‘\(b\)’ that minimize

      $$ \dfrac{1}{2} w^Tw$$

      Subject to

      $$y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$

      Formally,

      $$\min_w \dfrac{1}{2}w^Tw \:\:\: : \:\: y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$

    6. The Optimization Method:
      The SVM optimization problem reduces to a Quadratic Program.
  3. Elaborate on the generalization analysis:
    We notice that, geometrically, the hyperplane (the maximum margin classifier) is completely characterized by the support vectors (the vectors that lie on the margin).
    A very important conclusion arises.
    The maximum margin classifier (SVM) depends only on the number of support vectors and not on the dimension of the problem.
    This implies that the computation doesn’t scale up with the dimension and, also implies, that the kernel trick works very well.
  4. List the properties:
    1. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
    2. The hyperplane is determined solely by its support vectors.
    3. The SVM always converges on linearly separable data.
    4. The Hard-Margin SVM fails if the data is not linearly separable.
    5. The Hard-Margin SVM is quite sensitive to outliers
  5. Give the solution to the optimization problem for H-M SVM:
    To solve the above problem, we need something that deals with inequality constraints; thus, we use the KKT method for solving a Lagrangian under inequality constraints.
    The Lagrange Formulation:
    1. Formulate the Lagrangian:
      1. Take each inequality constraint and put them in the zero-form (equality with Zero)
      2. Multiply each inequality by a Lagrange Multiplier \(\alpha_n\)
      3. Add them to the objective function \(\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w}\)
        The sign will be \(-\) (negative) simply because the inequality is \(\geq 0\)

      $$\min_{w, b} \max_{\alpha_n} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \mathbf{w}^T\mathbf{w} -\sum_{n=1}^{N} \alpha_{n}\left(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1\right) \:\:\: : \:\: \alpha_n \geq 0$$

    2. Optimize the objective independently, for each of the unconstrained variables:
      1. Gradient w.r.t. \(\mathbf{w}\):

        $$\nabla_{\mathrm{w}} \mathcal{L}=\mathrm{w}-\sum_{n=1}^{N} \alpha_{n} y_{n} \mathrm{x}_ {n}=0 \\ \implies \\ \mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}$$

      2. Derivative w.r.t. \(b\):

        $$\frac{\partial \mathcal{L}}{\partial b}=-\sum_{n=1}^{N} \alpha_{n} y_{n}=0 \\ \implies \\ \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$

    3. Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
      1. Substitute the above conditions into the original Lagrangian (so that the optimization w.r.t. \(\alpha_n\) becomes free of \(\mathbf{w}\) and \(b\)):

        $$\mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}$$

      2. Notice that the first condition \(\mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}\) does not constrain \(\alpha_n\), so it is a vacuous constraint; the second condition \(\sum_{n=1}^{N} \alpha_{n} y_{n}=0\), however, does constrain \(\alpha_n\) and must be kept.
      3. Set the optimization objective and the constraints, a quadratic function in \(\alpha_n\):

      $$\max_{\alpha} \mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m} \\ \:\:\:\:\:\:\:\:\:\: : \:\: \alpha_n \geq 0 \:\: \forall \: n= 1, \ldots, N \:\: \wedge \:\: \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$

    4. Set the problem as a Quadratic Programming problem:
      1. Change the maximization to minimization by flipping the signs:

        $$\min _{\alpha} \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}-\sum_{n=1}^{N} \alpha_{n}$$

      2. Isolate the Coefficients from the \(\alpha_n\)s and set in matrix-form:

        $$\min _{\alpha} \frac{1}{2} \alpha^{\top} \underbrace{\begin{bmatrix} y_{1} y_{1} \mathbf{x}_{1}^{\top} \mathbf{x}_{1} & y_{1} y_{2} \mathbf{x}_{1}^{\top} \mathbf{x}_{2} & \ldots & y_{1} y_{N} \mathbf{x}_{1}^{\top} \mathbf{x}_{N} \\ y_{2} y_{1} \mathbf{x}_{2}^{\top} \mathbf{x}_{1} & y_{2} y_{2} \mathbf{x}_{2}^{\top} \mathbf{x}_{2} & \ldots & y_{2} y_{N} \mathbf{x}_{2}^{\top} \mathbf{x}_{N} \\ \ldots & \ldots & \ldots & \ldots \\ y_{N} y_{1} \mathbf{x}_{N}^{\top} \mathbf{x}_{1} & y_{N} y_{2} \mathbf{x}_{N}^{\top} \mathbf{x}_{2} & \ldots & y_{N} y_{N} \mathbf{x}_{N}^{\top} \mathbf{x}_{N} \end{bmatrix}}_{\text{quadratic coefficients}} \alpha+\underbrace{\left(-1^{\top}\right)}_ {\text {linear }} \alpha \\ \:\:\:\:\:\:\:\:\:\: : \:\: \underbrace{\mathbf{y}^{\top} \boldsymbol{\alpha}=0}_{\text {linear constraint }} \:\: \wedge \:\: \underbrace{0}_{\text {lower bounds }} \leq \alpha \leq \underbrace{\infty}_{\text {upper bounds }} $$

        The Quadratic Programming Package asks you for the Quadratic Term (Matrix) and the Linear Term, and for the Linear Constraint and the Range of \(\alpha_n\)s; and then, gives you back an \(\mathbf{\alpha}\).

      Equivalently:

      $$\min _{\alpha} \frac{1}{2} \boldsymbol{\alpha}^{\mathrm{T}} \mathrm{Q} \boldsymbol{\alpha}-\mathbf{1}^{\mathrm{T}} \boldsymbol{\alpha} \quad \text {subject to } \quad \mathbf{y}^{\mathrm{T}} \boldsymbol{\alpha}=0 ; \quad \boldsymbol{\alpha} \geq \mathbf{0}$$
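
      As an illustration (not the dedicated QP package the notes refer to), the dual above can be handed to a general-purpose constrained optimizer. The sketch below builds \(Q\) and solves the problem with SciPy's SLSQP; it assumes the data are linearly separable and labels are \(\pm 1\). Real SVM implementations use specialized QP/SMO solvers instead.

      ```python
      import numpy as np
      from scipy.optimize import minimize

      def hard_margin_svm_dual(X, y):
          """Solve min_a 1/2 a^T Q a - 1^T a  s.t.  y^T a = 0, a >= 0, then recover w and b."""
          X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
          n = len(y)
          Q = (y[:, None] * y[None, :]) * (X @ X.T)            # quadratic coefficients
          obj = lambda a: 0.5 * a @ Q @ a - a.sum()
          grad = lambda a: Q @ a - np.ones(n)
          cons = [{"type": "eq", "fun": lambda a: y @ a, "jac": lambda a: y}]
          res = minimize(obj, np.zeros(n), jac=grad, method="SLSQP",
                         bounds=[(0.0, None)] * n, constraints=cons)
          alpha = res.x
          w = (alpha * y) @ X                                  # w = sum_n alpha_n y_n x_n
          sv = alpha > 1e-6                                    # support vectors have alpha > 0
          b = np.mean(y[sv] - X[sv] @ w)                       # b from any support vector
          return w, b, alpha
      ```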



Soft-Margin SVM

  1. Motivate the soft-margin SVM:
    The Hard-Margin SVM faces a few issues:
    1. The Hard-Margin SVM fails if the data is not linearly separable.
    2. The Hard-Margin SVM is quite sensitive to outliers

    The Soft-Margin SVM aims to fix/reconcile these problems.

  2. What is the main idea behind it?
    Allow some points to violate the margin, by introducing slack variables.
  3. Define the following wrt soft-margin SVM:
    1. Goal:
      Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classifies the data-points and, moreover, does so in the “best” possible way, but allow some points to violate the margin, by introducing slack variables.
    2. Procedure:
      (1) Use a linear classifier
      (2) But, Maximize the Margin
      (3) Do so by Minimizing \(\|w\|\)
      (4) But allow some points to penetrate the margin
    3. Decision Function:
      Same as the Hard-Margin SVM: classify by the sign of \(w \cdot x + b\); the slack variables affect only training, not prediction.
    4. Constraints:

      $$y_i(wX_i + b) \geq 1 - \zeta_i, \forall i \in [1,n]$$

      where the \(\zeta_i\)s are slack variables.
      We, also, enforce the non-negativity constraint on the slack variables:

      $$\zeta_i \geq 0, \:\:\: \forall i \in [1, n]$$

      1. Why is there a non-negativity constraint?
        The non-negativity constraint forces the slack variables to be zero for all points that do not violate the original constraint:
        i.e. are not inside the slab.
    5. Objective/Cost Function:

      $$ R(w) = \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$

    6. The Optimization Problem:
      Find weights ‘\(w\)’, scalar ‘\(b\)’, and \(\zeta_i\)s that minimize

      $$ \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$

      Subject to

      $$y_i(wX_i + b) \geq 1 - \zeta_i, \zeta_i \geq 0, \forall i \in [1,n]$$

      Formally,

      $$\min_{w, b, \zeta} \dfrac{1}{2}w^Tw + C \sum_{i=1}^n \zeta_i \:\:\: : \:\: y_i(wX_i + b) \geq 1 - \zeta_i, \:\: \zeta_i \geq 0, \forall i \in [1,n]$$

    7. The Optimization Method:
      The SVM optimization problem reduces to a Quadratic Program in \(d + n + 1\)-dimensions and \(2n\)-constraints.
    8. Properties:
      1. The Soft-Margin SVM will converge on non-linearly separable data.
  4. Specify the effects of the regularization hyperparameter \(C\):

    |              | Small \(C\)                               | Large \(C\)                                |
    |--------------|-------------------------------------------|--------------------------------------------|
    | **Desire**   | Maximizing Margin = \(\dfrac{1}{\|w\|}\)  | keep most slack variables zero or small    |
    | **Danger**   | underfitting (High Misclassification)     | overfitting (awesome training, awful test) |
    | **Outliers** | less sensitive                            | very sensitive                             |
    | **Boundary** | more “flat”                               | more sinuous                               |
    1. Describe the effect wrt over/under fitting:
      Increasing the \(C\) hyperparameter tends to cause overfitting; decreasing it tends to cause underfitting.
  5. How do we choose \(C\)?
    We choose ‘\(C\)’ with cross-validation.
  6. Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)
    In function estimation we prefer the standard-form objective to minimize (and trade-off); the loss + penalty form.
    We introduce a loss function to moderate the use of the slack variables (i.e. to avoid abusing the slack variables), namely, Hinge Loss:

    $${\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}$$

    The motivation: we motivate it by comparing it to the traditional \(0-1\) loss function.
    Notice that the \(0-1\) loss is non-convex and discontinuous at \(0\) (effectively an infinite slope), which makes it hard to optimize. The hinge loss, on the other hand, is convex and upper-bounds the \(0-1\) loss.

    Analysis wrt maximum margin:
    This function is zero if the constraint, \(y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\geq 1\), is satisfied, in other words, if \({\vec {x}}_{i}\) lies on the correct side of the margin.
    For data on the wrong side of the margin, the function’s value is proportional to the distance from the margin.

    Modified Objective Function:

    $$R(w) = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}$$

    Proof of equivalence:

    $$\begin{align} y_if\left(x_i\right) & \ \geq 1-\zeta_i, & \text{from the 1st constraint} \\ \implies \zeta_i & \ \geq 1-y_if\left(x_i\right) \\ \zeta_i & \ \geq 1-y_if\left(x_i\right) \:\wedge\: \zeta_i \geq 0, & \text{with the 2nd (non-negativity) constraint on } \zeta_i \\ \iff \zeta_i & \ \geq \max \{0, 1-y_if\left(x_i\right)\} \\ \zeta_i & \ = \max \{0, 1-y_if\left(x_i\right)\}, & \text{minimizing drives } \zeta_i \text{ to its lower bound}\\ \implies R(w) & \ = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}, & \text{plugging in, with } \lambda = \dfrac{1}{C} \end{align}$$

    The (reformulated) Optimization Problem:

    $$\min_{w, b}\dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}$$
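
    A minimal subgradient-descent sketch of this loss + penalty objective, assuming labels \(y_i \in \{-1, +1\}\) and using the \(w \cdot x + b\) sign convention from the constraints above (function and parameter names are illustrative):

    ```python
    import numpy as np

    def soft_margin_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
        """Minimize (lam/2)*||w||^2 + sum_i max(0, 1 - y_i*(w.x_i + b)) by subgradient descent."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in rng.permutation(n):
                margin = y[i] * (X[i] @ w + b)
                # Hinge subgradient is -y_i * x_i when the margin is violated, 0 otherwise.
                g_w = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
                g_b = -y[i] if margin < 1 else 0.0
                w, b = w - lr * g_w, b - lr * g_b
        return w, b
    ```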


Loss Functions

  1. Define:
    1. Loss Functions - Abstractly and Mathematically:
      Abstractly, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number, intuitively, representing some “cost” associated with the event.

      Formally, a loss function is a function \(L :(\hat{y}, y) \in \mathbb{R} \times Y \longmapsto L(\hat{y}, y) \in \mathbb{R}\) that takes as inputs the predicted value \(\hat{y}\) corresponding to the real data value \(y\) and outputs how different they are.

    2. Distance-Based Loss Functions:
      A Loss function \(L(\hat{y}, y)\) is called distance-based if it:
      1. Only depends on the residual:

        $$L(\hat{y}, y) = \psi(y-\hat{y}) \:\: \text{for some } \psi : \mathbb{R} \longmapsto \mathbb{R}$$

      2. Loss is \(0\) when residual is \(0\):

        $$\psi(0) = 0$$

      3. Describe an important property of dist-based losses:
        Translation Invariance:
        Distance-based losses are translation-invariant:

        $$L(\hat{y}+a, y+a) = L(\hat{y}, y)$$

      4. What are they used for?
        Regression.
    3. Relative Error - What does it lack?
      Relative-Error \(\dfrac{\hat{y}-y}{y}\) is a more natural loss but it is NOT translation-invariant.
  2. List 3 Regression Loss Functions
    1. MSE
    2. MAE
    3. Huber
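
    Minimal NumPy versions of the three regression losses (illustrative only; the Huber threshold `delta` is an assumed default):

    ```python
    import numpy as np

    def mse(y_hat, y):
        return np.mean((y_hat - y) ** 2)

    def mae(y_hat, y):
        return np.mean(np.abs(y_hat - y))

    def huber(y_hat, y, delta=1.0):
        """Quadratic for small residuals, linear for large ones (less sensitive to outliers)."""
        r = np.abs(y_hat - y)
        return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))
    ```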

  1. List 7 Classification Loss Functions
    1. \(0-1\) Loss
    2. Square Loss
    3. Hinge Loss
    4. Logistic Loss
    5. Cross-Entropy
    6. Exponential Loss
    7. Perceptron Loss


Information Theory

  1. What is Information Theory? In the context of ML?
    Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal.

    In the context of machine learning, we can also apply information theory to continuous variables, where some of these message-length interpretations do not apply. Instead, we mostly use a few key ideas from information theory to characterize probability distributions or to quantify the similarity between probability distributions.

  2. Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
    The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

    Thus, information theory quantifies information in a way that formalizes this intuition:

    1. Likely events should have low information content - in the extreme case, guaranteed events have no information at all
    2. Less likely events should have higher information content
    3. Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
  3. Measuring Information - Definitions and Formulas:
    1. In Shannons Theory, how do we quantify “transmitting 1 bit of information”?
      To transmit \(1\) bit of information means to divide the recipient's uncertainty by a factor of \(2\).
    2. What is the amount of information transmitted?
      The amount of information transmitted is the logarithm (base \(2\)) of the uncertainty reduction factor.
    3. What is the uncertainty reduction factor?
      It is the inverse of the probability of the event being communicated.
    4. What is the amount of information in an event \(x\)?
      The amount of information in an event \(\mathbf{x} = x\), called the Self-Information is:

      $$I(x) = \log (1/p(x)) = -\log(p(x))$$

  4. Define the Self-Information - Give the formula:
    The Self-Information or surprisal quantifies the level of surprise associated with sampling a particular outcome of a random variable.
    The Self-Information of an event \(\mathrm{x} = x\):

    $$I(x) = - \log P(x)$$

    1. What is it defined with respect to?
      Self-information deals only with a single outcome.
  5. Define Shannon Entropy - What is it used for?
    Shannon Entropy is defined as the average amount of information produced by a stochastic source of data.
    Equivalently, the amount of information that you get from one sample drawn from a given probability distribution \(p\).

    To quantify the amount of uncertainty in an entire probability distribution, we use Shannon Entropy.

    $$H(x) = {\displaystyle \operatorname {E}_{x \sim P} [I(x)]} = - {\displaystyle \operatorname {E}_{x \sim P} [\log P(x)] = -\sum_{i=1}^{n} p\left(x_{i}\right) \log p\left(x_{i}\right)}$$

    1. Describe how Shannon Entropy relates to distributions (with a graph):
      (figure omitted)
  6. Define Differential Entropy:
    Differential Entropy is Shannon's entropy of a continuous random variable \(x\).
  7. How does entropy characterize distributions?
    Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy.
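
    A small sketch computing the entropy of a discrete distribution, illustrating the low/high-entropy contrast above:

    ```python
    import numpy as np

    def shannon_entropy(p, base=2):
        """H(p) = -sum_i p_i * log(p_i), with 0 * log(0) taken as 0."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)) / np.log(base))

    shannon_entropy([0.5, 0.5])     # 1.0 bit  : fair coin, maximal uncertainty
    shannon_entropy([0.99, 0.01])   # ~0.08 bit: nearly deterministic, low entropy
    ```
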
  8. Define Relative Entropy - Give its formula:
    The Kullback–Leibler divergence (Relative Entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution.

    Mathematically:

    $${\displaystyle D_{\text{KL}}(P\parallel Q)=\operatorname{E}_{x \sim P} \left[\log \dfrac{P(x)}{Q(x)}\right]=\operatorname{E}_{x \sim P} \left[\log P(x) - \log Q(x)\right]}$$

    1. Discrete:

    $${\displaystyle D_{\text{KL}}(P\parallel Q)=\sum_{i}P(i)\log \left({\frac {P(i)}{Q(i)}}\right)}$$

    1. Continuous:

    $${\displaystyle D_{\text{KL}}(P\parallel Q)=\int_{-\infty }^{\infty }p(x)\log \left({\frac {p(x)}{q(x)}}\right)\,dx,}$$

    1. Give an interpretation:
      1. Discrete variables:
        It is the extra amount of information needed to send a message containing symbols drawn from probability distribution \(P\), when we use a code that was designed to minimize the length of messages drawn from probability distribution \(Q\).
    2. List the properties:
      1. Non-Negativity:
        \({\displaystyle D_{\mathrm {KL} }(P\|Q) \geq 0}\)
      2. \({\displaystyle D_{\mathrm {KL} }(P\|Q) = 0 \iff}\) \(P\) and \(Q\) are:
        1. Discrete Variables:
          the same distribution
        2. Continuous Variables:
          equal “almost everywhere”
      3. Additivity of Independent Distributions:
        \({\displaystyle D_{\text{KL}}(P\parallel Q)=D_{\text{KL}}(P_{1}\parallel Q_{1})+D_{\text{KL}}(P_{2}\parallel Q_{2}).}\)
      4. \({\displaystyle D_{\mathrm {KL} }(P\|Q) \neq D_{\mathrm {KL} }(Q\|P)}\)

        This asymmetry means that there are important consequences to the choice of the ordering

      5. Convexity in the pair of PMFs \((p, q)\) (i.e. \({\displaystyle (p_{1},q_{1})}\) and \({\displaystyle (p_{2},q_{2})}\) are two pairs of PMFs):
        \({\displaystyle D_{\text{KL}}(\lambda p_{1}+(1-\lambda )p_{2}\parallel \lambda q_{1}+(1-\lambda )q_{2})\leq \lambda D_{\text{KL}}(p_{1}\parallel q_{1})+(1-\lambda )D_{\text{KL}}(p_{2}\parallel q_{2}){\text{ for }}0\leq \lambda \leq 1.}\)
    3. Describe it as a distance:
      Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions.
      However, it is not a true distance measure because it is not symmetric.

      Strictly, KL-div is a divergence rather than a metric: it is non-negative and vanishes iff the two distributions coincide, but it is asymmetric and does not, in general, satisfy the triangle inequality.

    4. List the applications of relative entropy:
      Characterizing:
      1. Relative (Shannon) entropy in information systems
      2. Randomness in continuous time-series
      3. Information gain when comparing statistical models of inference
    5. How does the direction of minimization affect the optimization:
      Suppose we have a distribution \(p(x)\) and we wish to approximate it with another distribution \(q(x)\).
      We have a choice of minimizing either:
      1. \({\displaystyle D_{\text{KL}}(p\|q)} \implies q^\ast = \operatorname {arg\,min}_q {\displaystyle D_{\text{KL}}(p\|q)}\)
        Produces an approximation that usually places high probability anywhere that the true distribution places high probability.
      2. \({\displaystyle D_{\text{KL}}(q\|p)} \implies q^\ast = \operatorname {arg\,min}_q {\displaystyle D_{\text{KL}}(q\|p)}\)
        Produces an approximation that rarely places high probability anywhere that the true distribution places low probability.

        which are different due to the asymmetry of the KL-divergence

  9. Define Cross Entropy - Give its formula:
    The Cross Entropy between two probability distributions \({\displaystyle p}\) and \({\displaystyle q}\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution \({\displaystyle q}\), rather than the “true” distribution \({\displaystyle p}\).

    $$H(p,q) = \operatorname{E}_{p}[-\log q]= H(p) + D_{\mathrm{KL}}(p\|q) =-\sum_{x }p(x)\,\log q(x)$$

    1. What does it measure?
      The average number of bits that need to be transmitted using a different probability distribution \(q\) (for encoding) than the “true” distribution \(p\), to convey the information in \(p\).
    2. How does it relate to relative entropy?
      It is similar to KL-Div but with an additional quantity - the entropy of \(p\).
    3. When are they equivalent?
      Minimizing the cross-entropy with respect to \(Q\) is equivalent to minimizing the KL divergence, because \(Q\) does not participate in the omitted term.
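
      Minimal discrete versions of both quantities (illustrative; assumes \(q > 0\) wherever \(p > 0\)):

      ```python
      import numpy as np

      def kl_divergence(p, q):
          """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
          p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
          mask = p > 0
          return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

      def cross_entropy(p, q):
          """H(p, q) = -sum_i p_i * log(q_i) = H(p) + D_KL(p || q)."""
          p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
          mask = p > 0
          return float(-np.sum(p[mask] * np.log(q[mask])))
      ```
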
  10. Mutual Information:
    1. Definition:
      The Mutual Information (MI) of two random variables is a measure of the mutual dependence between the two variables.
      More specifically, it quantifies the “amount of information” (in bits) obtained about one random variable through observing the other random variable.
    2. What does it measure?
      It can be seen as a way of measuring the reduction in uncertainty (information content) about one part of the system after observing the outcome of another part of the system; given two R.V.s, knowing the value of one of them gives a corresponding reduction in (the uncertainty, or information content, of) measuring the other.
    3. Intuitive Definitions:
      • Measures the information that \(X\) and \(Y\) share:
        It measures how much knowing one of these variables reduces uncertainty about the other.
        • \(X, Y\) Independent \(\implies I(X; Y) = 0\): their MI is zero
        • \(X\) deterministic function of \(Y\) and vice versa \(\implies I(X; Y) = H(X) = H(Y)\) their MI is equal to entropy of each variable
      • It’s a Measure of the inherent dependence expressed in the joint distribution of \(X\) and \(Y\) relative to the joint distribution of \(X\) and \(Y\) under the assumption of independence.
        i.e. The price for encoding \({\displaystyle (X,Y)}\) as a pair of independent random variables, when in reality they are not.
    4. Properties:
      • The KL-divergence formulation shows that \(I(X; Y)\) is equal to zero precisely when the joint distribution coincides with the product of the marginals, i.e. when \(X\) and \(Y\) are independent.
      • The MI is non-negative: \(I(X; Y) \geq 0\)
        • It is a measure of the price for encoding \({\displaystyle (X,Y)}\) as a pair of independent random variables, when in reality they are not.
      • It is symmetric: \(I(X; Y) = I(Y; X)\)
      • Related to conditional and joint entropies:

        $${\displaystyle {\begin{aligned}\operatorname {I} (X;Y)&{}\equiv \mathrm {H} (X)-\mathrm {H} (X|Y)\\&{}\equiv \mathrm {H} (Y)-\mathrm {H} (Y|X)\\&{}\equiv \mathrm {H} (X)+\mathrm {H} (Y)-\mathrm {H} (X,Y)\\&{}\equiv \mathrm {H} (X,Y)-\mathrm {H} (X|Y)-\mathrm {H} (Y|X)\end{aligned}}}$$

        where \(\mathrm{H}(X)\) and \(\mathrm{H}(Y)\) are the marginal entropies, \(\mathrm{H}(X | Y)\) and \(\mathrm{H}(Y | X)\) are the conditional entropies, and \(\mathrm{H}(X, Y)\) is the joint entropy of \(X\) and \(Y\).

        • Note the analogy to the union, difference, and intersection of two sets:
          (figure omitted: Venn-diagram view of the entropies and mutual information)
      • Related to KL-div of conditional distribution:

        $$\mathrm{I}(X ; Y)=\mathbb{E}_{Y}\left[D_{\mathrm{KL}}\left(p_{X | Y} \| p_{X}\right)\right]$$

    5. Applications:
    6. As KL-Divergence:
      Let \((X, Y)\) be a pair of random variables with values over the space \(\mathcal{X} \times \mathcal{Y}\) . If their joint distribution is \(P_{(X, Y)}\) and the marginal distributions are \(P_{X}\) and \(P_{Y},\) the mutual information is defined as:

      $$I(X ; Y)=D_{\mathrm{KL}}\left(P_{(X, Y)} \| P_{X} \otimes P_{Y}\right)$$

    7. In-terms of PMFs for discrete distributions:
      The mutual information of two jointly discrete random variables \(X\) and \(Y\) is calculated as a double sum:

      $$\mathrm{I}(X ; Y)=\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_{(X, Y)}(x, y) \log \left(\frac{p_{(X, Y)}(x, y)}{p_{X}(x) p_{Y}(y)}\right)$$

      where \({\displaystyle p_{(X,Y)}}\) is the joint probability mass function of \({\displaystyle X}\) and \({\displaystyle Y}\), and \({\displaystyle p_{X}}\) and \({\displaystyle p_{Y}}\) are the marginal probability mass functions of \({\displaystyle X}\) and \({\displaystyle Y}\) respectively.
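
      A small sketch of the double sum, taking a joint PMF table as a NumPy array (illustrative):

      ```python
      import numpy as np

      def mutual_information(p_xy):
          """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) * p(y)) ) from a joint PMF table."""
          p_xy = np.asarray(p_xy, dtype=float)
          p_x = p_xy.sum(axis=1, keepdims=True)        # marginal of X (rows)
          p_y = p_xy.sum(axis=0, keepdims=True)        # marginal of Y (columns)
          mask = p_xy > 0
          return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

      mutual_information([[0.25, 0.25], [0.25, 0.25]])   # independent  => 0.0
      mutual_information([[0.5, 0.0], [0.0, 0.5]])       # deterministic => log(2) = H(X)
      ```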

    8. In terms of PDFs for continuous distributions:
      In the case of jointly continuous random variables, the double sum is replaced by a double integral:

      $$\mathrm{I}(X ; Y)=\int_{\mathcal{Y}} \int_{\mathcal{X}} p_{(X, Y)}(x, y) \log \left(\frac{p_{(X, Y)}(x, y)}{p_{X}(x) p_{Y}(y)}\right) d x d y$$

      where \(p_{(X, Y)}\) is now the joint probability density function of \(X\) and \(Y\) and \(p_{X}\) and \(p_{Y}\) are the marginal probability density functions of \(X\) and \(Y\) respectively.

    9. Relation to PMI:
      The mutual information (MI) of the random variables \(X\) and \(Y\) is the expected value of the PMI (over all possible outcomes).
  11. Pointwise Mutual Information (PMI):
    1. Definition:
      The PMI of a pair of outcomes \(x\) and \(y\) belonging to discrete random variables \(X\) and \(Y\) quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:

      $$\operatorname{pmi}(x ; y) \equiv \log \frac{p(x, y)}{p(x) p(y)}=\log \frac{p(x | y)}{p(x)}=\log \frac{p(y | x)}{p(y)}$$

    2. Relation to MI:
      PMI refers to single events, whereas MI (which builds upon PMI) is the average over all possible events.
      The mutual information (MI) of the random variables \(X\) and \(Y\) is the expected value of the PMI (over all possible outcomes).

Recommendation Systems

  1. Describe the different algorithms for recommendation systems:

Ensemble Learning

  1. What are the two paradigms of ensemble methods?
    1. Parallel
    2. Sequential
  2. Random Forest VS GBM?
    The fundamental difference is that Random Forest uses bagging: each tree is trained independently on a bootstrap sample and the predictions are averaged, whereas GBM uses boosting: trees are trained sequentially, each new tree fitting the errors of the ensemble built so far.

Data Processing and Analysis

  1. What are 3 data preprocessing techniques to handle outliers?
    1. Winsorizing/Winsorization (cap at threshold).
    2. Transform to reduce skew (using Box-Cox or similar).
    3. Remove outliers if you’re certain they are anomalies or measurement errors.
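
    A minimal winsorization sketch (the percentile thresholds are illustrative defaults):

    ```python
    import numpy as np

    def winsorize(x, lower_pct=1.0, upper_pct=99.0):
        """Cap values below/above the given percentiles (a simple form of winsorization)."""
        lo, hi = np.percentile(x, [lower_pct, upper_pct])
        return np.clip(x, lo, hi)
    ```
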
  2. Describe the strategies to dimensionality reduction?
    1. Feature Selection
    2. Feature Projection/Extraction
  3. What are 3 ways of reducing dimensionality?
    1. Removing Collinear Features
    2. Performing PCA, ICA, etc.
    3. Feature Engineering
    4. AutoEncoder
    5. Non-negative matrix factorization (NMF)
    6. LDA
    7. MDS (Multidimensional Scaling)
  4. List methods for Feature Selection
    1. Variance Threshold: normalize first (variance depends on scale)
    2. Correlation Threshold: remove the one with larger mean absolute correlation with other features.
    3. Genetic Algorithms
    4. Stepwise Search: bad performance, regularization much better, it’s a greedy algorithm (can’t account for future effects of each change)
    5. LASSO, Elastic-Net
  5. List methods for Feature Extraction
    1. PCA, ICA, CCA
    2. AutoEncoders
    3. LDA: Linear Discriminant Analysis is a supervised linear transformation technique, since the dependent variable (the class label) is used in the model.
      1. It finds a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
      2. Unlike PCA, LDA extracts the \(k\) new independent variables that maximize the separation between the classes of the dependent variable.
    4. Latent Semantic Analysis
    5. Isomap
  6. How to detect correlation of “categorical variables”?
    1. Chi-Squared test: it is a statistical test applied to the groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
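
    A minimal sketch using SciPy's chi-squared test of independence on a hypothetical contingency table:

    ```python
    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical contingency table of counts for two categorical features.
    observed = np.array([[30, 10],
                         [20, 40]])
    chi2, p_value, dof, expected = chi2_contingency(observed)
    # A small p-value suggests the two categorical variables are associated (not independent).
    ```
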
  7. Feature Importance
    1. Use linear regression and select variables based on \(p\) values
    2. Use Random Forest, Xgboost and plot variable importance chart
    3. Lasso
    4. Measure information gain for the available set of features and select top \(n\) features accordingly.
    5. Use Forward Selection, Backward Selection, Stepwise Selection
    6. Remove the correlated variables prior to selecting important variables
    7. In linear models, feature importance can be calculated by the scale of the coefficients
    8. In tree-based methods (such as random forest), important features are likely to appear closer to the root of the tree. We can get a feature’s importance for a random forest by computing the average depth at which it appears across all trees in the forest.
  8. Capturing the correlation between continuous and categorical variable? If yes, how?
    Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.
  9. What cross validation technique would you use on time series data set?
    Forward chaining strategy with k folds.
  10. How to deal with missing features? (Imputation?)
    1. Assign a unique category to missing values, who knows the missing values might decipher some trend.
    2. Remove them blatantly
    3. We can check their distribution against the target variable; if we find a pattern, we keep those missing values and assign them a new category, while removing the others.
  11. Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?
    For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature.
  12. What are collinearity and multicollinearity?
    1. Collinearity occurs when two predictor variables (e.g., \(x_1\) and \(x_2\)) in a multiple regression have some correlation.
    2. Multicollinearity occurs when more than two predictor variables (e.g., \(x_1, x_2, \text{ and } x_3\)) are inter-correlated.
  13. What is data normalization and why do we need it?

ML/Statistical Models

  1. What are parametric models?
    Parametric models are those with a finite number of parameters. To predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs.
  2. What is a classifier?
    A function that maps…

K-NN


PCA

  1. What is PCA?
    It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  2. What is the Goal of PCA?
    Given points \(\mathbf{x}_ i \in \mathbf{R}^d\), find \(k\) directions that capture most of the variation.
  3. List the applications of PCA:
    1. Find a small basis for representing variations in complex things.

      e.g. faces, genes.

    2. Reducing the number of dimensions makes some computations cheaper.
    3. Remove irrelevant dimensions to reduce over-fitting in learning algorithms.

      Like “subset selection” but the features are not axis aligned.
      They are linear combinations of input features.

    4. Represent the data with fewer parameters (dimensions)
  4. Give formulas for the following:
    1. Assumptions on \(X\):
      The formulas below are valid only when (1) \(X\) has samples in rows and variables in columns, and (2) \(X\) is centered (column means are zero)
    2. SVD of \(X\):
      \(X = USV^{T}\)
    3. Principal Directions/Axes:
      \(V\)
    4. Principal Components (scores):
      \(US\)
    5. The \(j\)-th principal component:
      \(Xv_j = Us_j\)
  5. Define the transformation, mathematically:
    Mathematically, the transformation is defined by a set of \(p\)-dimensional vectors of weights or coefficients \({\displaystyle \mathbf {v}_ {(k)}=(v_{1},\dots ,v_{p})_ {(k)}}\) that map each row vector \({\displaystyle \mathbf {x}_ {(i)}}\) of \(X\) to a new vector of principal component scores \({\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_ {(i)}}\), given by:

    $${\displaystyle {t_{k}}_{(i)}=\mathbf {x}_ {(i)}\cdot \mathbf {v}_ {(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l}$$

    in such a way that the individual variables \({\displaystyle t_{1},\dots ,t_{l}}\) of \(t\) considered over the data set successively inherit the maximum possible variance from \(X\), with each coefficient vector \(v\) constrained to be a unit vector (where \(l\) is usually selected to be less than \({\displaystyle p}\) to reduce dimensionality).

  6. What does PCA produce/result in?
  7. Describe the PCA algorithm:
    1. Data Preprocessing:
      1. Training set: \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\)
      2. Preprocessing (feature scaling + mean normalization):
        1. mean normalization:
          \(\mu_{j}=\frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}\)
          Replace each \(x_{j}^{(i)}\) with \(x_j^{(i)} - \mu_j\)
        2. feature scaling:
          If different features are on different scales, scale the features so they have comparable ranges
          \(s_j = S.D(X_j)\) (the standard deviation of feature \(j\))
          Replace each \(x_{j}^{(i)}\) with \(\dfrac{x_j^{(i)} - \mu_j}{s_j}\)
    2. Computing the Principal Components:
      1. Compute the SVD of the matrix \(X = U S V^T\)
      2. Compute the Principal Components:

        $$T = US = XV$$

        Note: The \(j\)-th principal component is: \(Xv_j\)

      3. Keep only the top \(k\) singular values in \(S\), giving \(S_k\)
      4. Compute the Truncated Principal Components:

        $$T_k = US_k$$

    3. Computing the Low-rank Approximation Matrix \(X_k\):
      1. Compute the reconstruction matrix:

        $$X_k = T_kV^T = US_kV^T$$
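
    A compact SVD-based sketch of the whole algorithm (centering, scores \(T = US\), directions \(V\), and the rank-\(k\) reconstruction); function and variable names are illustrative:

    ```python
    import numpy as np

    def pca_svd(X, k):
        """PCA via SVD: center X, factor X = U S V^T, keep the top-k components."""
        mu = X.mean(axis=0)
        Xc = X - mu                                      # mean normalization
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        T_k = U[:, :k] * s[:k]                           # principal component scores, T = U S
        V_k = Vt[:k].T                                   # principal directions/axes (columns)
        X_k = T_k @ V_k.T + mu                           # rank-k reconstruction of the data
        explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # ratio of variance retained
        return T_k, V_k, X_k, explained
    ```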

  8. Describe the Optimality of PCA:
    Optimal for Finding a lower dimensional subspace (PCs) that Minimizes the RSS of projection errors.
  9. List limitations of PCA:
    1. PCA is highly sensitive to the (relative) scaling of the data; no consensus on best scaling.
  10. Intuition:
    1. PCA can be thought of as fitting a \(p\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information.
    2. Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data.

  11. Should you remove correlated features b4 PCA?
    Yes. Discarding correlated variables has a substantial effect on PCA: in the presence of correlated variables, the variance explained by a particular component gets inflated.
  12. How can we measure the “Total Variance” of the data?
    **The Total Variance** of the data can be expressed as the sum of all the eigenvalues:

    $$\mathbf{Tr} \Sigma = \mathbf{Tr} (U \Lambda U^T) = \mathbf{Tr} (U^T U \Lambda) = \mathbf{Tr} \Lambda = \lambda_1 + \ldots + \lambda_n.$$

  13. How can we measure the “Total Variance” of the projected data?
    **The Total Variance** of the Projected data is:

    $$\mathbf{Tr} (P \Sigma P^T ) = \lambda_1 + \lambda_2 + \cdots + \lambda_k. $$

  14. How can we measure the “Error in the Projection”?
    **The Error in the Projection** can be measured with respect to variance.
    1. We define the ratio of variance “explained” by the projected data (equivalently, the ratio of information “retained”) as:

    $$\dfrac{\lambda_1 + \ldots + \lambda_k}{\lambda_1 + \ldots + \lambda_n}. $$

    1. What does it mean when this ratio is high?
      If the ratio is high, we can say that much of the variation in the data can be observed on the projected plane.
  15. How does PCA relate to CCA?
    1. CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while
    2. PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.
  16. How does PCA relate to ICA?
    Independent component analysis (ICA) is directed to similar problems as principal component analysis, but finds additively separable components rather than successive approximations.
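
  A minimal NumPy sketch of the SVD-based procedure above (centering, SVD, scores \(T = US = XV\), and the rank-\(k\) reconstruction \(X_k = US_kV^T\)); the variable names and the toy data are illustrative only:

  ```python
  import numpy as np

  def pca_svd(X, k):
      """PCA via SVD. X: (n_samples, n_features); k: number of components kept."""
      X_centered = X - X.mean(axis=0)                    # center each feature (mean = 0)
      U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
      T = U * s                                          # principal component scores, T = US = XV
      T_k = T[:, :k]                                     # truncated scores T_k = U S_k
      X_k = T_k @ Vt[:k, :]                              # rank-k reconstruction X_k = U S_k V^T
      explained = (s ** 2) / np.sum(s ** 2)              # fraction of total variance per component
      return T_k, X_k, Vt[:k].T, explained

  # Example usage on random data
  X = np.random.randn(100, 5)
  T_k, X_k, V_k, explained = pca_svd(X, k=2)
  print(T_k.shape, X_k.shape, explained[:2].sum())
  ```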

The Centroid Method

  1. Define “The Centroid”:
    In mathematics and physics, the centroid or geometric center of a plane figure is the arithmetic mean (“average”) position of all the points in the shape.

    The definition extends to any object in n-dimensional space: its centroid is the mean position of all the points in all of the coordinate directions.

  2. Describe the Procedure:
    Compute the mean (\(\mu_c\)) of all the vectors in class \(C\) and the mean (\(\mu_x\)) of all the vectors not in \(C\)

  3. What is the Decision Function:

    $$f(x) = (\mu_c - \mu_x) \cdot \vec{x} - (\mu_c - \mu_x) \cdot \dfrac{\mu_c + \mu_x}{2}$$

  4. Describe the Decision Boundary:
    The decision boundary is a Hyperplane that bisects the line segment with endpoints \(\mu_c\) and \(\mu_x\)
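
  A minimal NumPy sketch of the procedure and decision function above, \(f(x) = (\mu_c - \mu_x) \cdot x - (\mu_c - \mu_x) \cdot \frac{\mu_c + \mu_x}{2}\); the toy data is illustrative:

  ```python
  import numpy as np

  def centroid_classifier(X_pos, X_neg):
      """Fit the centroid method: f(x) > 0  =>  predict class C."""
      mu_c = X_pos.mean(axis=0)              # mean of the vectors in class C
      mu_x = X_neg.mean(axis=0)              # mean of the vectors not in C
      w = mu_c - mu_x                        # normal of the bisecting hyperplane
      midpoint = (mu_c + mu_x) / 2.0
      return lambda x: w @ x - w @ midpoint

  # Example usage on two Gaussian blobs
  rng = np.random.default_rng(0)
  f = centroid_classifier(rng.normal(+2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2)))
  print(f(np.array([2.0, 2.0])) > 0)         # expected: True (closer to the class-C centroid)
  ```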

K-Means

  1. What is K-Means?
    It is a clustering algorithm. It aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean. It results in a partitioning of the data space into Voronoi Cells.
  2. What is the idea behind K-Means?
    1. Minimize the aggregate intra-cluster distance
    2. Equivalent to minimizing the variance
    3. Thus, it finds \(k-\)clusters with minimum aggregate variance
  3. What does K-Mean find?
    It finds \(k-\)clusters with minimum aggregate variance.
  4. Formal Description of the Model:
    Given a set of observations \(\left(\mathbf{x}_{1}, \mathbf{x} _{2}, \ldots, \mathbf{x}_{n}\right)\), \(\mathbf{x}_ i \in \mathbb{R}^d\), the algorithm aims to partition the \(n\) observations into \(k\) sets \(\mathbf{S}=\left\{S_{1}, S_{2}, \ldots, S_{k}\right\}\) so as to minimize the intra-cluster Sum-of-Squares (i.e. variance).
    1. What is the Objective?

      $$\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}=\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k}\left|S_{i}\right| \operatorname{Var} S_{i}$$

      where \(\boldsymbol{\mu}_i\) is the mean of points in \(S_i\).

  5. Description of the Algorithm:
    1. Choose \(k\) random points, call them “Centroids”
    2. Assign each point to its closest Centroid (Euclidean-wise)
    3. Compute the mean of each “group”/cluster of points
    4. Re-Assign the centroids to the newly computed Means ↑
    5. REPEAT until the assignments stop changing (a NumPy sketch of this procedure appears at the end of this section)
  6. What is the Optimization method used? What class does it belong to?
    Coordinate Descent; it belongs to the class of Expectation-Maximization (EM) algorithms.
  7. What is the Complexity of the algorithm?
    The original formulation of the problem is NP-Hard; however, EM algorithms (specifically, Coordinate-Descent) can be used as efficient heuristic algorithms that converge quickly to good local minima.
  8. Describe the convergence and prove it:
    Guaranteed to converge after a finite number of iterations to a local minimum.
    1. Proof:
      Each step (re-assignment and mean update) does not increase the objective, so the algorithm produces a monotonically non-increasing, non-negative sequence of objective values over a finite domain (there are finitely many partitions):
      By the Monotone Convergence Theorem, the objective value converges.

  9. Describe the Optimality of the Algorithm:
    1. Locally optimal:
      due to convergence property
    2. Non-Globally optimal:
      1. The objective function is non-convex
      2. Moreover, Coordinate Descent is not guaranteed to converge to the global minimum on non-convex functions.
  10. Derive the estimated parameters of the algorithm:
    1. Objective Function:

      $$J(S, \mu)= \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}} \| \mathbf{x} -\mu_i \|^{2}$$

    2. Optimization Objective:

      $$\min _{\mu} \min _{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} -\mu_{i}\right\|^{2}$$

    3. Derivation:
    4. Fix \(S = \hat{S}\), optimize \(\mu\):

      $$\begin{aligned} & \min _{\mu} \sum_{i=1}^{k} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x}-\mu_{i}\right\|^{2}\\ =& \sum_{i=1}^{k} \min _{\mu_i} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x} - \mu_{i}\right\|^{2} \end{aligned}$$

      1. MLE:

        $$\min _{\mu_i} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x} - \mu_{i}\right\|^{2}$$

        \(\implies\)

        $${\displaystyle \hat{\mu_i} = \dfrac{\sum_{\mathbf{x} \in \hat{S}_ {i}} \mathbf{x}}{\vert\hat{S}_ i\vert}}$$

    5. Fix \(\mu_i = \hat{\mu_i}, \forall i\), optimize \(S\):

      $$\arg \min _{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} - \hat{\mu_{i}}\right\|^{2}$$

      \(\implies\)

      $$S_{i}^{(t)}=\left\{x_{p} :\left\|x_{p}-m_{i}^{(t)}\right\|^{2} \leq\left\|x_{p}-m_{j}^{(t)}\right\|^{2} \forall j, 1 \leq j \leq k\right\}$$

      1. MLE:
  11. When does K-Means fail to give good results?
    K-means clustering algorithm fails to give good results when:
    1. the data contains outliers
    2. the density spread of data points across the data space is different
    3. and the data points follow nonconvex shapes.
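
  A minimal NumPy sketch of the alternating procedure described above (assign each point to its nearest centroid, then move each centroid to the mean of its cluster); the initialization and toy data are illustrative:

  ```python
  import numpy as np

  def kmeans(X, k, n_iters=100, seed=0):
      """Lloyd's algorithm: locally minimizes the within-cluster sum of squares."""
      rng = np.random.default_rng(seed)
      centroids = X[rng.choice(len(X), size=k, replace=False)]    # k random points as initial centroids
      for _ in range(n_iters):
          # Assignment step: each point goes to its nearest centroid (Euclidean distance)
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Update step: each centroid moves to the mean of its assigned points
          new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centroids[j] for j in range(k)])
          if np.allclose(new_centroids, centroids):               # converged (local minimum)
              break
          centroids = new_centroids
      return centroids, labels

  # Example usage on two well-separated blobs
  X = np.vstack([np.random.randn(50, 2) + 5, np.random.randn(50, 2) - 5])
  centroids, labels = kmeans(X, k=2)
  ```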

Naive Bayes

  1. Define:
    1. Naive Bayes:
      It is a simple technique used for constructing classifiers.
    2. Naive Bayes Classifiers:
      A family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
    3. Bayes Theorem:
      \(p(x\vert y) = \dfrac{p(y\vert x) p(x)}{p(y)}\)
  2. List the assumptions of Naive Bayes:
    1. Conditional Independence: the features are conditionally independent from each other given a class \(C_k\)
    2. Bag-of-Words: the relative positions (ordering) of the features do not matter
  3. List some properties of Naive Bayes:
    1. Not a Bayesian method
    2. It’s a Bayes Classifier: minimizes the probability of misclassification

  4. Define the Probabilistic Model for the method:
    Naive Bayes is a conditional probability model:
    given a problem instance to be classified represented by a vector \(\boldsymbol{x} = (x_1, \cdots, x_n)\) of \(n\) features/words (independent variables), it assigns to this instance probabilities:

    $$P(C_k\vert \boldsymbol{x}) = p(C_k\vert x_1, \cdots, x_n)$$

    for each of the \(k\) classes.

    Using Bayes theorem to decompose the conditional probability:

    $$p(C_k\vert \boldsymbol{x}) = \dfrac{p(\boldsymbol{x}\vert C_k) p(C_k)}{p(\boldsymbol{x})}$$

    Notice that the numerator is equivalent to the joint probability distribution:

    $$p\left(C_{k}\right) p\left(\mathbf{x} | C_{k}\right) = p\left(C_{k}, x_{1}, \ldots, x_{n}\right)$$

    Using the Chain-Rule for repeated application of the conditional probability, the joint probability model can be rewritten as:

    $$p(C_{k},x_{1},\dots ,x_{n})\, = p(x_{1}\mid x_{2},\dots ,x_{n},C_{k})p(x_{2}\mid x_{3},\dots ,x_{n},C_{k})\dots p(x_{n-1}\mid x_{n},C_{k})p(x_{n}\mid C_{k})p(C_{k})$$

    Using the Naive Conditional Independence assumptions:

    $$p\left(x_{i} | x_{i+1}, \ldots, x_{n}, C_{k}\right)=p\left(x_{i} | C_{k}\right)$$

    Thus, we can write the joint model as:

    $${\displaystyle {\begin{aligned}p(C_{k}\mid x_{1},\dots ,x_{n})&\varpropto p(C_{k},x_{1},\dots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}}$$

    Finally, the conditional distribution over the class variable \(C\) is:

    $${\displaystyle p(C_{k}\mid x_{1},\dots ,x_{n})={\frac {1}{Z}}p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})}$$

    where \({\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})}\) is a constant scaling factor that depends only on the known feature variables \(x_i\).

  5. Construct the classifier. What are its components? Formally define it.
    The classifier is made of (1) the conditional probability model (above) \(\:\) and (2) A decision rule.

    The decision rule used is the MAP rule: i.e. pick the hypothesis (class) that is most probable a posteriori (maximize the posterior \(\propto\) likelihood \(\times\) prior).

    The classifier is the function that assigns a class label \(\hat{y} = C_k\) for some \(k\) as follows:

    $$\hat{y}=\underset{k \in\{1, \ldots, K\}}{\operatorname{argmax}} p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} | C_{k}\right)$$

  6. What are the parameters to be estimated for the classifier?:
    1. The class-conditional probability (likelihood) of each feature/word given a class
    2. The prior probability of each class
  7. What method do we use to estimate the parameters?:
    Maximum Likelihood Estimation (MLE).
  8. What are the estimates for each of the following parameters?:
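
  For reference, a minimal sketch of the standard count-based (MLE) estimates for a multinomial Naive Bayes text classifier: the prior \(\hat{p}(C_k)\) is the fraction of documents in class \(k\), and \(\hat{p}(x_i \mid C_k)\) is the relative frequency of word \(x_i\) among all words in class \(k\); the Laplace-smoothing constant \(\alpha\) is an added assumption to avoid zero probabilities:

  ```python
  import numpy as np

  def fit_multinomial_nb(X_counts, y, n_classes, alpha=1.0):
      """MLE estimates (with additive smoothing alpha) for multinomial Naive Bayes.
      X_counts: (n_docs, vocab) word-count matrix; y: labels in {0, ..., n_classes - 1}."""
      n_docs, vocab = X_counts.shape
      log_prior = np.zeros(n_classes)
      log_likelihood = np.zeros((n_classes, vocab))
      for k in range(n_classes):
          X_k = X_counts[y == k]
          log_prior[k] = np.log(len(X_k) / n_docs)                     # p(C_k) = N_k / N
          word_counts = X_k.sum(axis=0) + alpha                        # count(x_i, C_k) + alpha
          log_likelihood[k] = np.log(word_counts / word_counts.sum())  # p(x_i | C_k)
      return log_prior, log_likelihood

  def predict(X_counts, log_prior, log_likelihood):
      """MAP rule: argmax_k [ log p(C_k) + sum_i count_i * log p(x_i | C_k) ]."""
      return (X_counts @ log_likelihood.T + log_prior).argmax(axis=1)
  ```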

CNNs

  1. What is a CNN?
    A convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery.

    Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

    1. What kind of data does it work on? What is the mathematical property?
      In general, it works on data that have grid-like topology.
  2. What are the layers of a CNN?

    $$\text{Input} \longrightarrow \left[\text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Pooling} \rightarrow \text{Norm}\right] \times N \longrightarrow \text{FC}$$

  3. What are the four important ideas and their benefits that the convolution affords CNNs:
    1. Sparse Interactions/Connectivity/Weights:
      Unlike FNNs, where every input unit is connected to every output unit, CNNs have sparse interactions. This is accomplished by making the kernel smaller than the input.

      Benefits:

      1. This means that we need to store fewer parameters, which both,
        1. Reduces the memory requirements of the model and
        2. Improves its statistical efficiency
      2. Also, Computing the output requires fewer operations
      3. In deep CNNs, the units in the deeper layers interact indirectly with large subsets of the input which allows modelling of complex interactions through sparse connections.
    2. Parameter Sharing:
      refers to using the same parameter for more than one function in a model.

      Benefits:

      1. This means that rather than learning a separate set of parameters for every location, we learn only one set of parameters.
        1. This does not affect the runtime of forward propagation—it is still \(\mathcal{O}(k \times n)\)
        2. But it does further reduce the storage requirements of the model to \(k\) parameters (\(k\) is usually several orders of magnitude smaller than \(m\))
    3. Equivariant Representations:
      For convolutions, the particular form of parameter sharing causes the layer to have a property called equivariance to translation.

      Thus, if we move the object in the input, its representation will move the same amount in the output.

      Benefits:

      1. It is most useful when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations (e.g. edge detection)
      2. Shifting the position of an object in the input doesn’t confuse the NN
      3. Robustness against translated inputs/images
    4. Accepts inputs of variable sizes

  4. What is the inspirational model for CNNs:
    Convolutional networks were inspired by biological processes in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex.
    Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
  5. Describe the connectivity pattern of the neurons in a layer of a CNN:
    The neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner.
  6. Describe the process of a ConvNet:
    ConvNets transform the original image layer by layer from the original pixel values to the final class scores.
  7. Convolution Operation:
    1. Define:

    2. Formula (continuous):
    3. Formula (discrete):
    4. Define the following:
      1. Feature Map:
    5. Does the operation commute?
  8. Cross Correlation:
    1. Define:
    2. Formulae:
    3. What are the differences/similarities between convolution and cross-correlation:
  9. Write down the Convolution operation and the cross-correlation over two axes and:
    1. Convolution:
    2. Convolution (commutative):
    3. Cross-Correlation:
  10. The Convolutional Layer:
    1. What are the parameters and how do we choose them?
    2. Describe what happens in the forward pass:
    3. What is the output of the forward pass:
    4. How is the output configured?
  11. Spatial Arrangements:
    1. List the Three Hyperparameters that control the output volume:
    2. How to compute the spatial size of the output volume?
    3. How can you ensure that the input & output volume are the same?
    4. In the output volume, how do you compute the \(d\)-th depth slice:
  12. Calculate the number of parameters for the following config:

    Given:
    1. Input Volume: \(64\times64\times3\)
    2. Filters: \(15\) filters of size \(7\times7\)
    3. Stride: \(2\)
    4. Pad: \(3\)

    (A parameter-count sketch appears at the end of this section.)

  13. Definitions:
    1. Receptive Field:
  14. Suppose the input volume has size \([ 32 × 32 × 3 ]\) and the receptive field (or the filter size) is \(5 × 5\) , then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
  15. How can we achieve the greatest reduction in the spatial dims of the network (for classification):
  16. Pooling Layer:
    1. Define:
    2. List key ideas/properties and benefits:
    3. List the different types of Pooling:
    4. List variations of pooling and their definitions:
    5. List the hyperparams of Pooling Layer:
    6. How to calculate the size of the output volume:
    7. How many parameters does the pooling layer have:
    8. What are other ways to perform downsampling:
  17. Weight Priors:
    1. Define “Prior Prob Distribution on the parameters”:
    2. Define “Weight Prior” and its types/classes:
    3. Describe the Conv Layer as a FC Layer using priors:
    4. What are the key insights of using this view:
  18. When do multi-channel convolutions commute?
  19. Why do we use several different kernels in a given conv-layer?
  20. Strided Convolutions
    1. Define:
    2. What are they used for?
    3. What are they equivalent to?
    4. Formula:
  21. Zero-Padding:
    1. Definition/Usage:
    2. List the types of padding:
  22. Locally Connected Layers/Unshared Convolutions:
  23. Bias Parameter:
    1. How many bias terms are used per output channel in the traditional convolution:
  24. Dilated Convolutions
    1. Define:
    2. What are they used for?
  25. Stacked Convolutions
    1. Define:
    2. What are they used for?
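
  A minimal sketch of the standard conv-layer bookkeeping referenced above: the output spatial size \(\lfloor (W - F + 2P)/S \rfloor + 1\) and the parameter count \(K \cdot (F \cdot F \cdot D + 1)\) with one bias per filter, applied to the configuration from the parameter-count question (the floor is how frameworks typically handle strides that do not divide evenly):

  ```python
  def conv_output_size(w_in, f, stride, pad):
      """Spatial output size of a conv layer: floor((W - F + 2P) / S) + 1."""
      return (w_in - f + 2 * pad) // stride + 1

  def conv_num_params(n_filters, f, depth_in):
      """Parameters of a standard conv layer: each filter has F*F*D weights plus 1 bias."""
      return n_filters * (f * f * depth_in + 1)

  # Config from the question above: 64x64x3 input, 15 filters of 7x7, stride 2, pad 3
  print(conv_output_size(64, 7, 2, 3))   # output spatial size (per side)
  print(conv_num_params(15, 7, 3))       # 15 * (7*7*3 + 1) = 2220 parameters
  ```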

Theory


RNNs

  1. What is an RNN?
    1. Definition:
      Recurrent Neural Networks (RNNs) are a family of neural networks for processing sequential data.
    2. What machine-type is the standard RNN? What is its function?
      The standard RNN is a nonlinear dynamical system that maps sequences to sequences.
    3. Describe the connections in an RNN? Why does it matter?
      In an RNN, the connections between units form a directed cycle, allowing it to exhibit dynamic temporal behavior.
  2. What is the big idea behind RNNs?
    RNNs share parameters across different time-steps of the sequence, which allows it to generalize well to examples of different sequence length.
    Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence.
  3. Dynamical Systems:
    1. Standard Form (equation and graph):

      $$\boldsymbol{s}^{(t)}=f\left(\boldsymbol{s}^{(t-1)} ; \boldsymbol{\theta}\right) \tag{10.1}$$

      where \(\boldsymbol{s}^{(t)}\) is called the state of the system.

    2. Standard Form with an external signal:

      $$\boldsymbol{s}^{(t)}=f\left(\boldsymbol{s}^{(t-1)}, \boldsymbol{x}^{(t)} ; \boldsymbol{\theta}\right) \tag{10.4}$$

      where \(\boldsymbol{x}^{(t)}\) is an external signal, and the state now contains information about the whole past sequence.

    3. RNN as a Dynamical System (eq+graph):

      $$\boldsymbol{h}^{(t)}=f\left(\boldsymbol{h}^{(t-1)}, \boldsymbol{x}^{(t)} ; \boldsymbol{\theta}\right) \tag{10.5}$$

      where the variable \(\mathbf{h}\) represents the state.
      img

    4. What type of functions can be considered RNNs? (what property)
      Basically, any function containing recurrence can be considered an RNN.
  4. Unfolding Computational Graphs
    1. Definition:
      Unfolding maps the left to the right in the figure below (from figure 10.2) (both are computational graphs of a RNN without output \(\mathbf{o}\)):
      img
      where the black square indicates that an interaction takes place with a delay of \(1\) time step, from the state at time \(t\) to the state at time \(t + 1\).
    2. List the Advantages introduced by unfolding and the benefits:
      1. Regardless of the sequence length, the learned model always has the same input size.
        Because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.
      2. It is possible to use the same transition function \(f\) with the same parameters at every time step.
        Thus, we can learn a single shared model \(f\) that operates on all time steps and all sequence lengths, rather than needing to learn a separate model \(g^{(t)}\) for all possible time steps;
        Benefits:
      • Allows generalization to sequence lengths that did not appear in the training set
      • Enables the model to be estimated with far fewer training examples than would be required without parameter sharing.
    3. Graph and write the equations of Unfolding hidden recurrence:
      We can represent the unfolded recurrence after \(t\) steps with a function \(g^{(t)}\):

      $$\begin{aligned} \boldsymbol{h}^{(t)} &=g^{(t)}\left(\boldsymbol{x}^{(t)}, \boldsymbol{x}^{(t-1)}, \boldsymbol{x}^{(t-2)}, \ldots, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(1)}\right) \\ &=f\left(\boldsymbol{h}^{(t-1)}, \boldsymbol{x}^{(t)} ; \boldsymbol{\theta}\right) \end{aligned}$$

      The function \(g^{(t)}\) takes the whole past sequence \(\left(\boldsymbol{x}^{(t)}, \boldsymbol{x}^{(t-1)}, \boldsymbol{x}^{(t-2)}, \ldots, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(1)}\right)\) as input and produces the current state, but the unfolded recurrent structure allows us to factorize \(g^{(t)}\) into repeated applications of a function \(f\).

  5. Describe the State of the RNN, its usage, and extreme cases of the usage:
    The network typically learns to use \(\mathbf{h}^{(t)}\) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to \(t\).
    This summary is, in general, necessarily lossy, since it maps an arbitrary length sequence \(\left(\boldsymbol{x}^{(t)}, \boldsymbol{x}^{(t-1)}, \boldsymbol{x}^{(t-2)}, \ldots, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(1)}\right)\) to a fixed length vector \(h^{(t)}\).

    The most demanding situation (the extreme) is when we ask \(h^{(t)}\) to be rich enough to allow one to approximately recover/reconstruct the input sequence, as in AutoEncoders.

  6. RNN Architectures:
    1. List the three standard architectures of RNNs:
      1. Graph:
      2. Architecture:
      3. Equations:
      4. Total Loss:
      5. Complexity:
      6. Properties:
  7. Teacher Forcing:
    1. Definition:
      Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output \(y^{(t)}\) as input at time \(t + 1\).
    2. Motivation:
      Maximum Likelihood.
    3. Application:
      Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.
      Teacher forcing may still be applied to models that have hidden-to-hidden connections as long as they have connections from the output at one time step to values computed in the next time step. As soon as the hidden units become a function of earlier time steps, however, the BPTT algorithm is necessary. Some models may thus be trained with both teacher forcing and BPTT.
    4. Disadvantages:
      The disadvantage of strict teacher forcing arises if the network is going to be later used in a closed-loop mode, with the network outputs (or samples from the output distribution) fed back as input. In this case, the fed-back inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time.
    5. Possible Solutions (Methods) for Mitigation:
      1. Train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths[^3].
      2. Another approach (Bengio et al., 2015b) to mitigate the gap between the inputs seen at training time and the inputs seen at test time randomly chooses to use generated values or actual data values as input. This approach exploits a curriculum learning strategy to gradually use more of the generated values as input.
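
  A minimal NumPy sketch of the shared-parameter recurrence \(\boldsymbol{h}^{(t)}=f\left(\boldsymbol{h}^{(t-1)}, \boldsymbol{x}^{(t)} ; \boldsymbol{\theta}\right)\) unfolded over a sequence; the \(\tanh\) transition and the shapes are illustrative assumptions:

  ```python
  import numpy as np

  def rnn_forward(x_seq, h0, W, U, b):
      """Unfold h_t = tanh(W h_{t-1} + U x_t + b); theta = (W, U, b) is shared across steps."""
      h, states = h0, []
      for x_t in x_seq:                        # one application of f per time step
          h = np.tanh(W @ h + U @ x_t + b)     # same transition function, same parameters
          states.append(h)
      return np.stack(states)                  # (seq_len, hidden_dim): the unfolded states

  # Example usage: hidden size 4, input size 3, sequence length 5
  rng = np.random.default_rng(0)
  W, U, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
  print(rnn_forward(rng.normal(size=(5, 3)), np.zeros(4), W, U, b).shape)   # (5, 4)
  ```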

Optimization

  1. Define the sigmoid function and some of its properties:
    \(\sigma(x) = \dfrac{1}{1 + e^{-x}}\); properties: \(\sigma(-x) = 1 - \sigma(x)\) and \(\sigma^{\prime}(x) = \sigma(x)(1 - \sigma(x))\)
  2. Backpropagation:
    1. Definition:
      Backpropagation algorithms are a family of methods used to efficiently train artificial neural networks (ANNs) following a gradient descent approach that exploits the chain rule.
    2. Derive Gradient Descent Update:
    3. Explain the different kinds of gradient-descent optimization procedures:
      1. Batch Gradient Descent AKA Vanilla Gradient Descent, computes the gradient of the objective wrt. the parameters \(\theta\) for the entire dataset:

        $$\theta=\theta-\epsilon \cdot \nabla_{\theta} J(\theta)$$

      2. SGD performs a parameter update for each data-point:

        $$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i)} ; y^{(i)}\right)$$

      3. Mini-batch Gradient Descent: a hybrid approach that performs updates for a pre-specified mini-batch of \(n\) training examples (see the sketch at the end of this section):

        $$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i : i+n)} ; y^{(i : i+n)}\right)$$

    4. List the different optimizers and their properties:
  3. Error-Measures:
    1. Define what an error measure is:
      Error Measures aim to answer the question:
      “What does it mean for \(h\) to approximate \(f\) (\(h \approx f\))?”
      The Error Measure: \(E(h, f)\)
      It is almost always defined point-wise: \(\mathrm{e}(h(\mathbf{X}), f(\mathbf{X}))\).
    2. List the 5 most common error measures and where they are used:

    3. Specific Questions:
  4. Show that the weight vector of a linear signal is orthogonal to the decision boundary?
    The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
    Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:

    $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$

  5. What does it mean for a function to be well-behaved from an optimization pov?
    The well-behaved property from an optimization standpoint, implies that \(f''(x)\) doesn’t change too much or too rapidly, leading to a nearly quadratic function that is easy to optimize by gradient methods.
  6. Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation

    $$\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{n}-y_{n}\right)^{2} = \frac{1}{N}\|\mathrm{Xw}-\mathrm{y}\|^{2}$$

  7. Compute:
    1. \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
      \(\dfrac{\partial}{\partial y} \vert{x-y}\vert = - \text{sign}(x-y)\)
  8. State the difference between SGD and GD?
    Gradient Descent’s cost-function iterates over ALL training samples.
    Stochastic Gradient Descent’s cost-function only accounts for ONE training sample, chosen at random.
  9. When would you use GD over SDG, and vice-versa?
    GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large.
    That means GD is preferable for small datasets while SGD is preferable for larger ones.
  10. What is convex hull ?
    In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are created, the maximum margin hyperplane (MMH) is obtained as a perpendicular bisector between the two convex hulls; the MMH is the line that attempts to create the greatest separation between the two groups.
  11. OLS vs MLE
    Both estimate the parameters of a model. OLS and MLE yield the same estimates when the errors are normally distributed (Gaussian noise).
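
  A minimal sketch contrasting the three update rules above (full-batch, per-example SGD, and mini-batch) for a generic gradient function; `grad_J` and the batching details are illustrative assumptions:

  ```python
  import numpy as np

  def batch_gd_step(theta, X, y, grad_J, lr):
      """Vanilla/batch GD: one update from the gradient over the entire dataset."""
      return theta - lr * grad_J(theta, X, y)

  def sgd_epoch(theta, X, y, grad_J, lr):
      """SGD: one update per (randomly ordered) training example."""
      for i in np.random.permutation(len(X)):
          theta = theta - lr * grad_J(theta, X[i:i + 1], y[i:i + 1])
      return theta

  def minibatch_epoch(theta, X, y, grad_J, lr, batch_size=32):
      """Mini-batch GD: one update per mini-batch of `batch_size` examples."""
      idx = np.random.permutation(len(X))
      for start in range(0, len(X), batch_size):
          b = idx[start:start + batch_size]
          theta = theta - lr * grad_J(theta, X[b], y[b])
      return theta

  # Example gradient: for J(w) = (1/N)||Xw - y||^2, grad_J = (2/N) X^T (Xw - y)
  grad_J = lambda w, X, y: 2.0 * X.T @ (X @ w - y) / len(X)
  ```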

ML Theory

  1. Explain intuitively why Deep Learning works?
    Circuit Theory: There are functions you can compute with a “small” \(L\)-layer deep NN that shallower networks require exponentially more hidden units to compute. (This comes from viewing networks as circuits of logic gates.)
    1. Example:
      Computing \(x_1 \text{ XOR } x_2 \text{ XOR } \ldots \text{ XOR } x_n\) takes:
      1. \(\mathcal{O}(\log n)\) in a tree representation.
        img
      2. \(\mathcal{O}(2^n)\) in a one-hidden-layer network because you need to exhaustively enumerate all possible \(2^N\) configurations of the input bits that result in the \(\text{XOR}\) being \(\{1, 0\}\).
        img
  2. List the different types of Learning Tasks and their definitions:
    1. Multi-Task Learning: general term for training on multiple tasks
      1. Joint Learning: by choosing mini-batches from two different tasks simultaneously/alternately
      2. Pre-Training: first train on one task, then train on another

        widely used for word embeddings

    2. Transfer Learning:
      a type of multi-task learning where we are focused on one task; by learning on another task then applying those models to our main task
    3. Domain Adaptation:
      a type of transfer learning, where the output is the same, but we want to handle different inputs/topics/genres
    4. Zero-Shot Learning: is a form of extending supervised learning to a setting of solving for example a classification problem when not enough labeled examples are available for all classes.

      “Zero-shot learning is being able to solve a task despite not having received any training examples of that task.” - Goodfellow


  3. Describe the relationship between supervised and unsupervised learning?
    Many ml algorithms can be used to perform both tasks. E.g., the chain rule of probability states that for a vector \(x \in \mathbb{R}^n\), the joint distribution can be decomposed as:
    \(p(\mathbf{x})=\prod_{i=1}^{n} p\left(\mathrm{x}_{i} | \mathrm{x}_{1}, \ldots, \mathrm{x}_{i-1}\right)\)
    which implies that we can solve the Unsupervised problem of modeling \(p(x)\) by splitting it into \(n\) supervised learning problems.
    Alternatively, we can solve the supervised learning problem of learning \(p(y \vert x)\) by using traditional unsupervised learning technologies to learn the joint distribution \(p(x, y)\), then inferring:
    \(p(y | \mathbf{x})=\frac{p(\mathbf{x}, y)}{\sum_{y^{\prime}} p\left(\mathbf{x}, y^{\prime}\right)}\)
  4. Describe the differences between Discriminative and Generative Models?
    1. Generative Model: learns the joint probability distribution, “probability of \(\boldsymbol{x}\) and \(y\)”:

      $$P(\mathbf{x}, y)$$

    2. Discriminative Model: learns the conditional probability distribution, “probability of \(y\) given \(\boldsymbol{x}\)”:

      $$P(y \vert \mathbf{x})$$

  5. Describe the curse of dimensionality and its effects on problem solving:
    It is a phenomena where many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.

    The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases \(\rightarrow\) Sparsity:

    1. Common Theme - Sparsity: When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
    2. Sparsity and Statistical Significance: This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
    3. Sparsity and Clustering: Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties (clustering); in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
    4. Statistical Challenge: the number of possible configurations of \(x\) is much larger than the number of training examples
    5. Statistical Sampling: The sampling density is proportional to \(N^{1/p}\), where \(p\) is the dimension of the input space and \(N\) is the sample size. Thus, if \(N_1 = 100\) represents a dense sample for a single-input problem, then \(N_{10} = 100^{10}\) is the sample size required for the same sampling density with \(10\) inputs. Thus, in high dimensions all feasible training samples sparsely populate the input space.
    6. Further Reading
  6. How to deal with curse of dimensionality?
    1. Feature Selection
    2. Feature Extraction
  7. Describe how to initialize a NN and any concerns w/ reasons:
    1. Don’t initialize the weights to zero: the symmetry of the hidden units results in identical computations (and identical gradients) for every hidden unit, so all the rows of the weight matrix remain equal (by induction).
    2. It’s OK to initialize the bias term to zero.
  8. Describe the difference between Learning and Optimization in ML:
    1. The problem of Reducing the training error on the training set is one of optimization.
    2. The problem of Reducing the training error, as well as, the generalization (test) error is one of learning.
  9. List the 12 Standard Tasks in ML:
    1. Clustering
    2. Forecasting
    3. Dimension reduction
    4. Sequence Labeling
  10. What is the difference between inductive and deductive learning?
    1. Inductive learning is the process of using observations to draw conclusions
      1. It is a method of reasoning in which the premises are viewed as supplying some evidence for the truth of the conclusion.
      2. It goes from specific to general (“bottom-up logic”).
      3. The truth of the conclusion of an inductive argument may be probable, based upon the evidence given.
    2. Deductive learning is the process of using conclusions to form observations.
      1. It is the process of reasoning from one or more statements (premises) to reach a logically certain conclusion.
      2. It goes from general to specific (“top-down logic”).
      3. The conclusions reached (“observations”) are necessarily True.

Statistical Learning Theory

  1. Define Statistical Learning Theory:
    It is a framework for machine learning drawing from the fields of statistics and functional analysis that allows us, under certain assumptions, to study the question:

    How can we affect performance on the test set when we can only observe the training set?

    It is a statistical approach to Computational Learning Theory.

  2. What assumptions are made by the theory?
    1. The training and test data are generated by an unknown data generating distribution (over the product space \(Z = X \times Y\), denoted: \(p_{\text{data}}(z) = p(x,y)\)) called the data-generating process.
    2. The i.i.d assumptions:
      1. The data-points in each dataset are independent from each other
      2. The training and testing are both identically distributed (drawn from the same distribution)

    1. Define the i.i.d assumptions?
      A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
    2. Why assume a joint probability distribution \(p(x,y)\)?
      Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because \({\displaystyle y}\) is not a deterministic function of \({\displaystyle x}\), but rather a random variable with conditional distribution \({\displaystyle P(y|x)}\) for a fixed \({\displaystyle x}\).
    3. Why do we need to model \(y\) as a target-distribution and not a target-function?
      The ‘Target Function’ is not always a function because two ‘identical’ input points can be mapped to two different outputs (i.e. they have different labels).
  3. Give the Formal Definition of SLT:
  4. Define Empirical Risk Minimization:
    It is a (learning) principle in statistical learning theory that is based on approximating the Expected/True Risk (Generalization Error) by measuring the Empirical Risk (Training Error); i.e. the performance on the training-data.

    A learning algorithm that chooses the function \(f_S\) which minimizes the empirical risk is called Empirical Risk Minimization:

    $$R_{\text{emp}} = I_S[f] = \dfrac{1}{n} \sum_{i=1}^n V(f(\vec{x}_ i), y_i)$$

    $$f_{S} = \hat{h} = \arg \min _{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

  5. What is the Complexity of ERM?
    NP-Hard for classification with \(0-1\) loss function, even for linear classifiers
    1. It can be solved efficiently when the minimal empirical risk is ZERO; i.e. the data is linearly separable

  6. Definitions:
    1. Generalization:
      The ability to do well on previously unobserved data:

      $$I[f] \approx I_S[f]$$

      “Good” generalization is achieved when the empirical risk approximates the expected risk well.

    2. Generalization Error:
      AKA: Expected Risk, Out-of-sample Error, \(E_{\text{out}}\)
      It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data defined as the expected value of the error on a new input, measured w.r.t. the data-generating probability distribution.

    3. Generalization Gap:
      It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data, defined as the difference between the expected risk and empirical risk:

      $$G =I\left[f_{n}\right]-I_{S}\left[f_{n}\right]$$

      1. Computing the Generalization Gap:
        Since the expected risk/generalization error \(I[f_n]\) cannot be computed for an unknown data-generating distribution, the generalization gap cannot be computed either.
      2. What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
        Instead, the goal of SLT is to bound/characterize the gap in probability:

        $$P_{G}=P\left(I\left[f_{n}\right]-I_{S}\left[f_{n}\right] \leq \epsilon\right) \geq 1-\delta_{n}$$

        I.E. goal is to characterize the probability \(1- \delta_n\) that the generalization gap is less than some error bound \(\epsilon\) (known as the learning rate)

    4. Achieving (“good”) Generalization:
      An algorithm is said to generalize when the expected risk is well approximated by the empirical risk:

      $$I\left[f_{n}\right] \approx I_{S}\left[f_{n}\right]$$

      Equivalently:

      $$E_{\text {out}}(g) \approx E_{\text {in}}(g)$$

      I.E. when the generalization gap approaches zero in the limit of data-points:

      $$\lim _{n \rightarrow \infty} G_{n}=\lim _{n \rightarrow \infty} I\left[f_{n}\right]-I_{S}\left[f_{n}\right]=0$$

    5. Empirical Distribution:
      It is the discrete, uniform, joint distribution \(\hat{p}_{\text{data}} = \hat{p}(x,y)\) over the sample points (the empirical estimate of the data-generating distribution).
  7. Describe the difference between Learning and Optimization in ML:
    1. Optimization: is concerned with the problem of reducing the training error on the training set
    2. Learning: is concerned with the problem of reducing the training error, as well as, the generalization (test) error
  8. Describe the difference between Generalization and Learning in ML:
    1. Generalization guarantee: tells us that it is likely that the following condition holds:

      $$E_{\mathrm{out}}(\hat{h}) \approx E_{\mathrm{in}}(\hat{h})$$

      I.E. that the empirical error tracks/approximates the expected/generalization error well.

    2. Learning: corresponds to the condition that \(\hat{h} \approx f\) the chosen hypothesis approximates the target function well, which in-turn corresponds to the condition:

      $$E_{\mathrm{out}}(\hat{h}) \approx 0$$

  9. How to achieve Learning?
    To achieve learning we need to achieve the condition \(E_{\mathrm{out}}(\hat{h}) \approx 0\), which we do by:
    1. \(E_{\mathrm{out}}(\hat{h}) \approx E_{\mathrm{in}}(\hat{h})\)
      A theoretical result achieved through the Hoeffding Inequality (see the sketch at the end of this section)
    2. \(E_{\mathrm{in}}(\hat{h}) \approx 0\)
      A practical result of Empirical Error Minimization
  10. What does the (VC) Learning Theory Achieve?
    1. Characterizing the feasibility of learning for infinite hypotheses
    2. Characterizing the Approximation-Generalization Tradeoff
  11. Why do we need the probabilistic framework?
  12. What is the Approximation-Generalization Tradeoff:
    It is a tradeoff between (1) How well we can approximate the target function \(f \:\:\) and (2) How well we can generalize to unseen data.
    Given the goal: Small \(E_{\text{out}}\), good approximation of \(f\) out of sample (not in-sample).
    The tradeoff is characterized by the complexity of the hypothesis space \(\mathcal{H}\):
    1. More Complex \(\mathcal{H}\): Better chance of approximating \(f\)
    2. Less Complex \(\mathcal{H}\): Better chance of generalizing out-of-sample
  13. What are the factors determining how well an ML-algo will perform?
    1. Ability to Approximate \(f\) well, in-sample | Make the training error small &
    2. Decrease the gap between \(E_{\text{in}}\) and \(E_{\text{out}}\) | Make gap between training and test error small
  14. Define the following and their usage/application & how they relate to each other:
    1. Underfitting:
      Occurs when the model cannot fit the training data well; high \(E_{\text{in}}\).
    2. Overfitting:
      Occurs when the gap between the training error and test error is too large.
    3. Capacity:
      a models ability to fit a high variety of functions (complexity).
      It allows us to control the amount of overfitting and underfitting
      1. Models with Low-Capacity:
        Underfitting. High-Bias. Struggle to fit training-data.
      2. Models with High-Capacity:
        Overfitting. High-Variance. Memorizes noise.
    4. Hypothesis Space:
      The set of functions that the learning algorithm is allowed to select as being the target function.
      Allows us to control the capacity of a model.
    5. VC-Dimension:
      The largest possible value of \(m\) for which there exists a training set of \(m\) different points that the classifier can label arbitrarily.
      Quantifies a models capacity.
      1. What does it measure?
        Measures the capacity of a binary classifier.
    6. Graph the relation between Error, and Capacity in the ctxt of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
      img
  15. What is the most important result in SLT that show that learning is feasible?
    Shows that the discrepancy between training error and generalization error is bounded above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.
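
  A small sketch of the kind of bound referred to above, using the finite-hypothesis-set (union + Hoeffding) form \(P\left[\vert E_{\text{in}} - E_{\text{out}}\vert > \epsilon\right] \leq 2M e^{-2\epsilon^{2}N}\) and solving for the sample size \(N\) that makes the right-hand side at most \(\delta\); the finite-\(M\) form is an assumption for illustration (the VC bound replaces \(M\) with a growth function):

  ```python
  import math

  def hoeffding_bound(M, N, eps):
      """Upper bound on P[|E_in - E_out| > eps] for M hypotheses and N i.i.d. samples."""
      return 2 * M * math.exp(-2 * eps ** 2 * N)

  def samples_needed(M, eps, delta):
      """Smallest N with 2M exp(-2 eps^2 N) <= delta, i.e. N >= ln(2M / delta) / (2 eps^2)."""
      return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

  print(samples_needed(M=1000, eps=0.05, delta=0.05))   # ~2120 samples for this setting
  ```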

Bias-Variance Decomposition Theory

  1. What is the Bias-Variance Decomposition Theory:
    It is an approach for the quantification of the Approximation-Generalization Tradeoff.
  2. What are the Assumptions made by the theory?
    1. Analysis is done over the entire data-distribution
    2. Real-Valued inputs, targets (can be extended)
    3. Target function \(f\) is known
    4. Uses MSE (can be extended)
  3. What is the question that the theory tries to answer? What assumption is important? How do you achieve the answer/goal?
    1. “How can \(\mathcal{H}\) approximate \(f\) overall? not just on our sample/training-data.”.
    2. We assume that the target function \(f\) is known.
    3. By taking the expectation over all possible realization of \(N\) data-points.
  4. What is the Bias-Variance Decomposition:
    It is the decomposition of the expected error as a sum of three concepts: bias, variance and irreducible error, each quantifying an aspect of the error.
  5. Define each term w.r.t. source of the error:
    1. Bias: the error from the erroneous/simplifying assumptions in the learning algorithm
    2. Variance: the error from sensitivity to small fluctuations in the training set
    3. Irreducible Error: the error resulting from the noise in the problem itself
  6. What does each of the following measure (error in)? Describe this measured quantity in words, mathematically. Describe Bias&Variance in Words as a question statement. Give their AKA in statistics.
    1. Bias:
      1. AKA: Approximation Error
      2. Measures the error in approximating the target function with the best possible hypothesis in \(\mathcal{H}\)
      3. The expected deviation from the true value of the function (or parameter)
      4. How well can \(\mathcal{H}\) approximate the target function \(f\)
    2. Variance:
      1. AKA: Estimation Error
      2. Measures the error in estimating the best possible hypothesis in \(\mathcal{H}\) with a particular hypothesis resulting from a specific training-set
      3. The deviation from the expected estimator value that any particular sampling of the data is likely to cause
      4. How well can we zoom in on a good \(h \in \mathcal{H}\)
    3. Irreducible Error: measures the inherent noise in the target \(y\)
  7. Give the Formal Definition of the Decomposition (Formula):
    Given any hypothesis \(\hat{f} = g^{\mathcal{D}}\) we select, we can decompose its expected risk on an unseen sample \(x\) as:

    $$\mathbb{E}\left[(y-\hat{f}(x))^{2}\right]=(\operatorname{Bias}[\hat{f}(x)])^{2}+\operatorname{Var}[\hat{f}(x)]+\sigma^{2}$$

    Where:

    1. Bias:

      $$\operatorname{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$

    2. Variance:

      $$\operatorname{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x))^2\right] - \mathbb{E}\left[\hat{f}(x)\right]^2$$

    3. What is the Expectation over?
      The expectation is over all different samplings of the data \(\mathcal{D}\)
  8. Define the Bias-Variance Tradeoff:
    It is the property of a set of predictive models whereby, models with a lower bias have a higher variance and vice-versa.
    1. Effects of Bias:
      1. High Bias: simple models \(\rightarrow\) underfitting
      2. Low Bias: complex models \(\rightarrow\) overfitting
    2. Effects of Variance:
      1. High Variance: complex models \(\rightarrow\) overfitting
      2. Low Variance: simple models \(\rightarrow\) underfitting
    3. Draw the Graph of the Tradeoff (wrt model capacity):
      img
  9. Derive the Bias-Variance Decomposition with explanations:

    $${\displaystyle {\begin{aligned}\mathbb{E}_{\mathcal{D}} {\big [}I[g^{(\mathcal{D})}]{\big ]}&=\mathbb{E}_{\mathcal{D}} {\big [}\mathbb{E}_{x}{\big [}(g^{(\mathcal{D})}-y)^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x} {\big [}\mathbb{E}_{\mathcal{D}}{\big [}(g^{(\mathcal{D})}-y)^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}}{\big [}(g^{(\mathcal{D})}- f -\varepsilon)^{2}{\big ]}{\big ]} \\&=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}} {\big [}(f+\varepsilon -g^{(\mathcal{D})}+\bar{g}-\bar{g})^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}} {\big [}(\bar{g}-f)^{2}{\big ]}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}(\bar{g}-f)\varepsilon {\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}\varepsilon (g^{(\mathcal{D})}-\bar{g}){\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})(\bar{g}-f){\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}+2(\bar{g}-f)\mathbb{E}_{\mathcal{D}} [\varepsilon ]\: +2\: \mathbb{E}_{\mathcal{D}} [\varepsilon ]\: \mathbb{E}_{\mathcal{D}} {\big [}g^{(\mathcal{D})}-\bar{g}{\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}g^{(\mathcal{D})}-\bar{g}{\big ]}(\bar{g}-f){\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\operatorname {Var} [y]+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\operatorname {Bias} [g^{(\mathcal{D})}]^{2}+\operatorname {Var} [y]+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\operatorname {Bias} [g^{(\mathcal{D})}]^{2}+\sigma ^{2}+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\\end{aligned}}}$$

    where:
    \(\overline{g}(\mathbf{x})=\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]\) is the average hypothesis over all realization of \(N\) data-points \(\mathcal{D}_ i\), and \({\displaystyle \varepsilon }\) and \({\displaystyle {\hat {f}}} = g^{(\mathcal{D})}\) are independent.

  10. What are the key Takeaways from the Tradeoff?
    Match the “Model Capacity/Complexity” to the “Data Resources”, NOT to the Target Complexity.
  11. What are the most common ways to negotiate the Tradeoff?
    1. Cross-Validation
    2. MSE of the Estimates
  12. How does the decomposition relate to Classification?
    A similar decomposition exists for:
    1. Classification with \(0-1\) loss
    2. Probabilistic Classification with MSE
  13. Increasing/Decreasing Bias&Variance:
    1. Adding Good Feature:
      1. Decrease Bias
    2. Adding Bad Feature:
      1. No effect
    3. Adding ANY Feature:
      1. Increase Variance
    4. Adding more Data:
      1. Decrease Variance
      2. May decrease Bias (if \(f \in \mathcal{H}\) | \(h\) can fit \(f\) exactly)
    5. Noise in Test Set:
      1. Affects Only Irreducible Err
    6. Noise in Training Set:
      1. Affects Bias and Variance
    7. Dimensionality Reduction:
      1. Decrease Variance
    8. Feature Selection:
      1. Decrease Variance
    9. Regularization:
      1. Increase Bias
      2. Decrease Variance
    10. Increasing # of Hidden Units in ANNs:
      1. Decrease Bias
      2. Increase Variance
    11. Increasing # of Hidden Layers in ANNs:
      1. Decrease Bias
      2. Increase Variance
    12. Increasing \(k\) in K-NN:
      1. Increase Bias
      2. Decrease Variance
    13. Increasing Depth in Decision-Trees:
      1. Increase Variance
    14. Boosting:
      1. Decrease Bias
    15. Bagging:
      1. Decrease Variance
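
  A minimal simulation sketch of the decomposition above: resample many datasets \(\mathcal{D}\) from a known target plus noise, fit one hypothesis \(g^{(\mathcal{D})}\) per dataset, and estimate \(\text{Bias}^2\) and \(\text{Var}\) at held-out points; the \(\sin\) target, the polynomial model, and the sizes are illustrative assumptions:

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  f, sigma = np.sin, 0.3                         # "known" target function and noise level
  x_test = np.linspace(0, 3, 50)                 # points at which the decomposition is evaluated
  degree, n_train, n_datasets = 3, 20, 500

  preds = np.empty((n_datasets, len(x_test)))
  for d in range(n_datasets):                    # each d is one realization of the dataset D
      x = rng.uniform(0, 3, n_train)
      y = f(x) + rng.normal(0, sigma, n_train)
      coeffs = np.polyfit(x, y, degree)          # g^(D): least-squares polynomial fit
      preds[d] = np.polyval(coeffs, x_test)

  g_bar = preds.mean(axis=0)                     # average hypothesis over all datasets
  bias2 = np.mean((g_bar - f(x_test)) ** 2)      # E_x[(g_bar - f)^2]
  variance = np.mean(preds.var(axis=0))          # E_x[E_D[(g^(D) - g_bar)^2]]
  print(bias2, variance, sigma ** 2)             # expected error ≈ bias^2 + variance + sigma^2
  ```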

Activation Functions

  1. Describe the Desirable Properties for activation functions:
    1. Non-Linearity:
      When the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
    2. Range:
      When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.
    3. Continuously Differentiable:
      This property is desirable for enabling gradient-based optimization methods. The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it.
    4. Monotonicity:
      When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex.
    5. Smoothness with Monotonic Derivatives:
      These have been shown to generalize better in some cases.
    6. Approximating Identity near Origin:
      Equivalent to \({\displaystyle f(0)=0}\) and \({\displaystyle f'(0)=1}\), and \({\displaystyle f'}\) is continuous at \(0\).
      When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values. When the activation function does not approximate identity near the origin, special care must be used when initializing the weights.
    7. Zero-Centered Range:
      Has effects of centering the data (zero mean) by centering the activations. Makes learning easier.

  2. Describe the NON-Desirable Properties for activation functions:
    1. Saturation:
      An activation functions output, with finite range, may saturate near its tail or head (e.g. \(\{0, 1\}\) for sigmoid). This leads to a problem called vanishing gradient.
    2. Vanishing Gradients:
      Happens when the gradient of an activation function is very small/zero. This usually happens when the activation function saturates at either of its tails.
      The chain-rule will multiply the local gradient (of activation function) with the whole objective. Thus, when gradient is small/zero, it will “kill” the gradient \(\rightarrow\) no signal will flow through the neuron to its weights or to its data.
      Slows/Stops learning completely.
    3. Range Not Zero-Centered:
      This is undesirable since neurons in later layers of processing in a Neural Network would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. \(x>0\) elementwise in \(f=w^Tx+b\)), then the gradient on the weights \(w\) will during backpropagation become either all be positive, or all negative (depending on the gradient of the whole expression \(f\)). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
      Makes optimization harder.

  3. List the different activation functions used in ML?
    Identity, Sigmoid, Tanh, ReLU, L-ReLU, ELU, SoftPlus
    img
    Names, Definitions, Properties (pros&cons), Derivatives, Applications, pros/cons:
    1. Sigmoid:

      $$S(z)=\frac{1}{1+e^{-z}} \\ S^{\prime}(z)=S(z) \cdot(1-S(z))$$

      img

      1. Properties:
        Never use as activation, use as an output unit for binary classification.
        1. Pros:
          1. Has a nice interpretation as the firing rate of a neuron
        2. Cons:
          1. They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
            1. Happens when initialization weights are too large
            2. or sloppy with data preprocessing
            3. Neurons Activation saturates at either tail of \(0\) or \(1\)
          2. Output NOT Zero-Centered \(\rightarrow\) Gradient updates go too far in different directions \(\rightarrow\) makes optimization harder
          3. The local gradient \((z * (1-z))\) achieves its maximum of \(0.25\) when \(z = 0.5\) \(\rightarrow\) every time the gradient signal flows through a sigmoid gate, its magnitude diminishes by a quarter (or more) \(\rightarrow\) with basic SGD, the lower layers of a network train much slower than the higher ones
    2. Tanh:

      $$\tanh (z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \\ \tanh ^{\prime}(z)=1-\tanh (z)^{2}$$

      img
      Properties:
      Strictly superior to Sigmoid (scaled version of sigmoid | stronger gradient). Good for activation.

      1. Pros:
        1. Zero Mean/Centered
      2. Cons:
        1. They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
    3. Relu:

      $$R(z)=\left\{\begin{array}{cc}{z} & {z>0} \\ {0} & {z<=0}\end{array}\right\} \\ R^{\prime}(z)=\left\{\begin{array}{ll}{1} & {z>0} \\ {0} & {z<0}\end{array}\right\}$$

      img
      Properties:
      The best for activation (Better gradients).

      1. Pros:
        1. Non-saturation of gradients which accelerates convergence of SGD
        2. Sparsity effects and induced regularization
        3. Not computationally expensive
      2. Cons:
        1. ReLU not zero-centered problem:
          The problem that ReLU is not zero-centered can be solved/mitigated by using batch normalization, which normalizes the signal before activation:

          From paper: We add the BN transform immediately before the nonlinearity, by normalizing \(x = Wu + b\); normalizing it is likely to produce activations with a stable distribution.

        2. Dying ReLUs (Dead Neurons):
          If a neuron gets clamped to zero in the forward pass (it doesn’t “fire” / \(x<0\)), then its weights will get zero gradient. Thus, if a ReLU neuron is unfortunately initialized such that it never fires, or if a neuron’s weights ever get knocked off with a large update during training into this regime (usually as a symptom of aggressive learning rates), then this neuron will remain permanently dead.
          1. cs231n Explanation
        3. Infinite Range:
          Can blow up the activation.
    4. Leaky Relu:

      $$R(z)=\left\{\begin{array}{cc}{z} & {z>0} \\ {\alpha z} & {z \leq 0}\end{array}\right\} \\ R^{\prime}(z)=\left\{\begin{array}{ll}{1} & {z>0} \\ {\alpha} & {z<0}\end{array}\right\}$$

      img
      Properties:
      Sometimes useful. Worth trying.

      1. Pros:
        1. Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).
      2. Cons:
        The consistency of the benefit across tasks is presently unclear.
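
    As a concrete reference, here is a small NumPy sketch (illustrative code, not part of the original notes) of the activations above and their derivatives:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)              # S'(z) = S(z)(1 - S(z)); maximum 0.25 at z = 0

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2      # tanh'(z) = 1 - tanh(z)^2

    def relu(z):
        return np.maximum(0.0, z)

    def d_relu(z):
        return (z > 0).astype(float)      # 1 for z > 0, 0 otherwise (undefined at exactly 0)

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)

    def d_leaky_relu(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)

    print(d_sigmoid(0.0))                 # 0.25, the maximum of the sigmoid's local gradient
    ```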


Kernels

  1. Define “Local Kernel” and give an analogy to describe it:
  2. Write the following kernels:
    1. Polynomial Kernel of degree, up to, \(d\):
    2. Gaussian Kernel:
    3. Sigmoid Kernel:
    4. Polynomial Kernel of degree, exactly, \(d\):

Math

  1. What is a metric?
    A Metric (distance function) \(d\) is a function that defines a distance between each pair of elements of a set \(X\).
    A Metric induces a topology on a set; BUT, not all topologies can be generated by a metric.
    Mathematically, it is a function:
    \({\displaystyle d:X\times X\to [0,\infty )},\)
    that must satisfy the following properties:
    1. \({\displaystyle d(x,y)\geq 0}\) \(\:\:\:\:\:\:\:\) non-negativity or separation axiom
    2. \({\displaystyle d(x,y)=0\Leftrightarrow x=y}\) \(\:\:\:\:\:\:\:\) identity of indiscernibles
    3. \({\displaystyle d(x,y)=d(y,x)}\) \(\:\:\:\:\:\:\:\) symmetry
    4. \({\displaystyle d(x,z)\leq d(x,y)+d(y,z)}\) \(\:\:\:\:\:\:\:\) subadditivity or triangle inequality

      The first condition is implied by the others.
      Metric

  2. Describe Binary Relations and their Properties?
    A binary relation on a set \(A\) is a set of ordered pairs of elements of \(A\). In other words, it is a subset of the Cartesian product \(A^2 = A \times A\).
    The number of binary relations on a set of \(N\) elements is \(2^{N^2}\). Examples:
    1. “is greater than”
    2. “is equal to”
    3. A function \(f(x)\)
      Properties: (for a relation \(R\) and set \(X\))
    4. Reflexive: for all \(x\) in \(X\) it holds that \(xRx\)
    5. Symmetric: for all \(x\) and \(y\) in \(X\) it holds that if \(xRy\) then \(yRx\)
    6. Transitive: for all \(x\), \(y\) and \(z\) in \(X\) it holds that if \(xRy\) and \(yRz\) then \(xRz\)
      answer
  3. Formulas:
    1. Set theory:
      1. Number of subsets of a set of \(N\) elements:
        \(= 2^N\)
      2. Number of pairs \((a,b)\) of a set of N elements:
        \(= N^2\)
    2. Binomial Theorem:

      $$(x+y)^{n}=\sum_{k=0}^{n}{n \choose k}x^{n-k}y^{k}=\sum_{k=0}^{n}{n \choose k}x^{k}y^{n-k} \\={n \choose 0}x^{n}y^{0}+{n \choose 1}x^{n-1}y^{1}+{n \choose 2}x^{n-2}y^{2}+\cdots +{n \choose n-1}x^{1}y^{n-1}+{n \choose n}x^{0}y^{n},$$

    3. Binomial Coefficient:

      $${\binom {n}{k}}={\frac {n!}{k!(n-k)!}} = {\binom {n}{n-k}}$$

    4. Expansion of \(x^n - y^n =\)

      $$x^n - y^n = (x-y)(x^{n-1} + x^{n-2} y + ... + x y^{n-2} + y^{n-1})$$

    5. Number of ways to partition \(N\) data points into \(k\) clusters:
      1. Number of pairs (e.g. \((a,b)\)) of a set of \(N\) elements \(= N^2\)
      2. There are at most \(k^N\) ways to assign \(N\) data points to \(k\) clusters (each point independently picks one of \(k\) labels); the exact number of partitions of \(N\) points into \(k\) non-empty clusters is the Stirling number of the second kind \(S(N, k)\)
    6. \(\log_x(y) =\)

      $$\log_x(y) = \dfrac{\ln(y)}{\ln(x)}$$

    7. The length of a vector \(\mathbf{x}\) along a direction (projection):
      1. Along a unit-length vector \(\hat{\mathbf{w}}\):
        \(\text{comp}_ {\hat{\mathbf{w}}}(\mathbf{x}) = \hat{\mathbf{w}}^T\mathbf{x}\)
      2. Along an unnormalized vector \(\mathbf{w}\):
        \(\text{comp}_ {\mathbf{w}}(\mathbf{x}) = \dfrac{1}{\|\mathbf{w}\|} \mathbf{w}^T\mathbf{x}\)
    8. \(\sum_{i=1}^{n} 2^{i}=\)
      \(\sum_{i=1}^{n} 2^{i}=2^{n+1}-2\)
  4. List the main proof methods:
    1. Direct Proof
    2. Mathematical Induction
      1. Strong Induction
      2. Infinite Descent
    3. Contradiction
    4. Contraposition (\((p \implies q) \iff (\neg q \implies \neg p)\))
    5. Construction
    6. Combinatorial
    7. Exhaustion
    8. Non-Constructive proof (existence proofs) answer
  5. Important Formulas
    1. Projection \(\tilde{\mathbf{x}}\) of a vector \(\mathbf{x}\) onto another vector \(\mathbf{u}\):

      $$\tilde{\mathbf{x}} = \left(\mathbf{x} \cdot \dfrac{\mathbf{u}}{\|\mathbf{u}\|_ 2}\right) \dfrac{\mathbf{u}}{\|\mathbf{u}\|_ 2} = \dfrac{\mathbf{x}^T \mathbf{u}}{\|\mathbf{u}\|_ 2^2} \, \mathbf{u}$$
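
    A quick NumPy check of the component/projection formulas above (the vectors are arbitrary toy values, not from the notes):

    ```python
    import numpy as np

    x = np.array([3.0, 4.0])
    u = np.array([2.0, 0.0])                      # unnormalized direction
    u_hat = u / np.linalg.norm(u)                 # unit-length direction

    comp = u_hat @ x                              # scalar component of x along u_hat -> 3.0
    proj = (x @ u) / np.linalg.norm(u) ** 2 * u   # vector projection of x onto u -> [3., 0.]

    assert np.isclose(comp, (1 / np.linalg.norm(u)) * (u @ x))
    assert np.allclose(proj, comp * u_hat)
    print(comp, proj)
    ```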


Statistics

  1. ROC curve:
    1. Definition:
      A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
    2. Purpose:
      A way to quantify how good a binary classifier separates two classes.
    3. How do you create the plot?
      The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
      - \(\mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{P}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\)
      - \(\mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{N}}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}\)
    4. How to identify a good classifier:
      A Good classifier has a ROC curve that hugs the top-left corner of the plot.
    5. How to identify a bad classifier:
      A Bad Classifier has a ROC curve that is close to the diagonal line.
    6. What is its application in tuning the model?
      It allows you to set the classification threshold:
      1. You can minimize False-positive rate or maximize the True-Positive Rate
  2. AUC - AUROC:
    1. Definition:
      When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).
    2. Range:
      Range \(= 0.5 - 1.0\), from poor to perfect, with an uninformative classifier yielding \(0.5\)
    3. What does it measure:
      It is a measure of aggregated classification performance.
    4. Usage in ML:
      For model comparison.
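
    A minimal sketch tying the two items above together: sweeping a threshold to get (FPR, TPR) points for the ROC curve, and computing AUC as the probability that a random positive is ranked above a random negative. The labels and scores below are made-up toy values:

    ```python
    import numpy as np

    y = np.array([1, 1, 0, 1, 0, 0, 1, 0])                        # true labels (toy data)
    s = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])      # classifier scores (toy data)

    def roc_points(y, s):
        # One (FPR, TPR) point per threshold: predict positive when score >= threshold.
        pts = []
        for t in np.unique(np.concatenate(([np.inf], s))):
            pred = (s >= t).astype(int)
            tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
            fp = np.sum((pred == 1) & (y == 0)); tn = np.sum((pred == 0) & (y == 0))
            pts.append((fp / (fp + tn), tp / (tp + fn)))
        return sorted(pts)

    def auc_by_ranking(y, s):
        # P(score of a random positive > score of a random negative); ties count as 1/2.
        pos, neg = s[y == 1], s[y == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    print(roc_points(y, s))
    print(auc_by_ranking(y, s))
    ```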
  3. Define Statistical Efficiency (of an estimator)?
    Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance.
    Efficiencies are often defined using the variance or mean square error as the measure of desirability.
    An efficient estimator is also the minimum variance unbiased estimator (MVUE).

    1. An Efficient Estimator has lower variance than an inefficient one
    2. The use of an inefficient estimator gives results equivalent to those obtainable from a subset of the data; it is, therefore, wasteful of data
  4. Whats the difference between Errors and Residuals:
    The Error of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest.

    The Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest.

    1. Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
      1. Statistical Errors:

        $$e_{i}=X_{i}-\mu$$

      2. Residuals:

        $$r_{i}=X_{i}-\overline {X}$$

      3. Example in Univariate Distributions
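
    A small numeric illustration (toy sample, assumed values) of the distinction: errors are measured against the true \(\mu\), residuals against the sample mean \(\overline{X}\); the residuals sum to zero by construction:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n = 5.0, 2.0, 10
    X = rng.normal(mu, sigma, size=n)     # X_1, ..., X_n ~ N(mu, sigma^2)

    errors    = X - mu                    # deviations from the (usually unobservable) true mean
    residuals = X - X.mean()              # deviations from the estimated (sample) mean

    print(errors.sum())                   # generally nonzero
    print(residuals.sum())                # zero (up to floating point) by construction
    ```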
  5. What is a biased estimator?
    We define the Bias of an estimator as:

    $$ \operatorname{bias}\left(\hat{\boldsymbol{\theta}}_{m}\right)=\mathbb{E}\left(\hat{\boldsymbol{\theta}}_{m}\right)-\boldsymbol{\theta} $$

    A Biased Estimator is an estimator \(\hat{\boldsymbol{\theta}}_ {m}\) such that:

    $$ \operatorname{bias}\left(\hat{\boldsymbol{\theta}}_ {m}\right) \neq 0$$

    1. Why would we prefer biased estimators in some cases?
      Mainly, due to the Bias-Variance Decomposition. The MSE takes into account both the bias and the variance and sometimes the biased estimator might have a lower variance than the unbiased one, which results in a total decrease in the MSE.
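
    A quick simulation sketch of that trade-off (toy Gaussian data, assumed parameters): the \(1/n\) variance estimator stands in for the biased estimator, the \(1/(n-1)\) estimator for the unbiased one; for this setup the biased estimator has lower variance and lower MSE:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    true_var, n, trials = 4.0, 5, 100_000
    samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

    var_biased   = samples.var(axis=1, ddof=0)    # divides by n (biased, lower variance)
    var_unbiased = samples.var(axis=1, ddof=1)    # divides by n - 1 (unbiased)

    for name, est in [("1/n", var_biased), ("1/(n-1)", var_unbiased)]:
        bias = est.mean() - true_var
        mse  = np.mean((est - true_var) ** 2)
        print(f"{name}: bias={bias:+.3f}, variance={est.var():.3f}, MSE={mse:.3f}")
    ```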
  6. What is the difference between “Probability” and “Likelihood”:
    Probabilities are the areas under a fixed distribution
    \(pr(\)data\(|\)distribution\()\)
    i.e. probability of some data (left hand side) given a distribution (described by the right hand side)
    Likelihoods are the y-axis values for fixed data points with distributions that can be moved.
    \(L(\)distribution\(|\)observation/data\()\)
    It is the likelihood of the parameter \(\theta\) for the data \(\mathcal{D}\).

    Likelihood is, basically, a specific probability that can only be calculated after the fact (of observing some outcomes). It is not normalized to \(1\) (it is not a probability). It is just a way to quantify how likely a set of observations is to occur given some distribution with some parameters; you can then manipulate the parameters to make the realization of the data more “likely” (it is precisely meant for that purpose of estimating the parameters); it is a function of the parameters.
    Probability, on the other hand, is absolute for all possible outcomes. It is a function of the Data.

  7. Estimators:
    1. Define:
      A Point Estimator or statistic is any function of the data.
    2. Formula:

      $$\hat{\boldsymbol{\theta}}_{m}=g\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\right)$$

    3. Whats a good estimator?
      A good estimator is a function whose output is close to the true underlying \(\theta\) that generated the training data.
    4. What are the Assumptions made regarding the estimated parameter:
      We assume that the true \(\boldsymbol{\theta}\) is fixed, and that \(\hat{\boldsymbol{\theta}}\) is a function of the data, which is drawn from a random process, making \(\hat{\boldsymbol{\theta}}\) a random variable.
  8. What is Function Estimation:
    Function Estimation/Approximation refers to estimation of the relationship between input and target data.
    I.E. We are trying to predict a variable \(y\) given an input vector \(x\), and we assume that there is a function \(f(x)\) that describes the approximate relationship between \(y\) and \(x\).
    If we assume that: \(y = f(x) + \epsilon\), where \(\epsilon\) is the part of \(y\) that is not predictable from \(x\); then we are interested in approximating \(f\) with a model or estimate \(\hat{f}\).
    1. Whats the relation between the Function Estimator \(\hat{f}\) and Point Estimator:
      Function estimation is really just the same as estimating a parameter \(\boldsymbol{\theta}\); the function estimator \(\hat{f}\) is simply a point estimator in function space.
  9. Define “marginal likelihood” (wrt naive bayes):
    Marginal likelihood is the probability of the evidence itself, e.g. the probability that the word ‘FREE’ is used in any message, not conditioned on the class.

(Statistics) - MLE

  1. Clearly Define MLE and derive the final formula:
    MLE is a method/principle from which we can derive specific functions that are good estimators for different models.
    Likelihood in Parametric Models:
    Suppose we have a parametric model \(\{p(y ; \theta) | \theta \in \Theta\}\) and a sample \(D=\left\{y_{1}, \ldots, y_{n}\right\}\):
    1. The likelihood of parameter estimate \(\hat{\theta} \in \Theta\) for sample \(\mathcal{D}\) is:

      $$\begin{aligned} {\displaystyle {\mathcal {L}}(\theta \,;\mathcal{D} )} &= p(\mathcal{D} ; \theta) \\&=\prod_{i=1}^{n} p\left(y_{i} ; \theta\right)\end{aligned}$$

    2. In practice, we prefer to work with the log-likelihood. Same maximum but

      $$\log {\displaystyle {\mathcal {L}}(\theta \,;\mathcal{D} )}=\sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right)$$

      and sums are easier to work with than products.

    MLE for Parametric Models:
    The maximum likelihood estimator (MLE) for \(\theta\) in the (parametric) model \(\{p(y ; \theta) | \theta \in \Theta\}\) is:

    $$\begin{aligned} \hat{\theta} &=\underset{\theta \in \Theta}{\arg \max } \log p(\mathcal{D} ; \theta) \\ &=\underset{\theta \in \Theta}{\arg \max } \sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right) \end{aligned}$$

    1. Write MLE as an expectation wrt the Empirical Distribution:
      Because the \(\text {arg max }\) does not change when we rescale the cost function, we can divide by \(m\) to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution \(\hat{p}_ {\text {data}}\) defined by the training data:

      $${\displaystyle \boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \:\: \mathbb{E}_{\mathbf{x} \sim \hat{p}_ {\text {data}}} \log p_{\text {model}}(\boldsymbol{x} ; \boldsymbol{\theta})} \tag{5.59}$$

    2. Describe formally the relationship between MLE and the KL-divergence:
      MLE as Minimizing KL-Divergence between the Empirical dist. and the model dist.:
      We can interpret maximum likelihood estimation as minimizing the dissimilarity between the empirical distribution \(\hat{p}_ {\text {data}}\), defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence.
      1. The KL-divergence is given by:

        $${\displaystyle D_{\mathrm{KL}}\left(\hat{p}_{\text {data}} \| p_{\text {model}}\right)=\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}}\left[\log \hat{p}_{\text {data}}(\boldsymbol{x})-\log p_{\text {model}}(\boldsymbol{x})\right]} \tag{5.60}$$

        The term on the left is a function only of the data-generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize:

      $${\displaystyle -\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}}\left[\log p_{\text {model}}(\boldsymbol{x})\right]} \tag{5.61}$$

      which is of course the same as the maximization in equation \(5.59\).

    3. Extend the argument to show the link between MLE and Cross-Entropy. Give an example:
      Minimizing this KL-divergence corresponds exactly to minimizing the cross-entropy between the distributions.
      Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
      E.g. MSE is the cross-entropy between the empirical distribution and a Gaussian model.

      Maximum likelihood thus becomes minimization of the negative log-likelihood(NLL), or equivalently, minimization of the cross-entropy.

    4. How does MLE relate to the model distribution and the empirical distribution?
      We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution \(\hat{p} _ {\text {data}}\).
    5. What is the intuition behind using MLE?
      If I choose a hypothesis \(h\) under which the observed data is very plausible, then the hypothesis is very likely.
    6. What does MLE find/result in?
      It finds the value of the parameter \(\theta\) that, if used (in the model) to generate the probability of the data, would make the data most “likely” to occur.
    7. What kind of problem is MLE and how to solve for it?
      It is an optimization problem that is solved by calculus for problems with closed-form solutions or with numerical methods (e.g. SGD).
    8. How does it relate to SLT:
      It corresponds to Empirical Risk Minimization.
    9. Explain clearly why we maximize the natural log of the likelihood
      1. Numerical Stability: change products to sums
      2. The logarithm of a member of the family of exponential probability distributions (which includes the ubiquitous normal) is polynomial in the parameters (i.e. max-likelihood reduces to least-squares for normal distributions)
        \(\log\left(\exp\left(-\frac{1}{2}x^2\right)\right) = -\frac{1}{2}x^2\)
      3. The latter form is both more numerically stable and symbolically easier to differentiate than the former. It increases the dynamic range of the optimization algorithm (allowing it to work with extremely large or small values in the same way).
      4. The logarithm is a monotonic transformation that preserves the locations of the extrema (in particular, the estimated parameters in max-likelihood are identical for the original and the log-transformed formulation)
      5. Gradient methods generally work better optimizing \(\log p(x)\) than \(p(x)\) because the gradient of \(\log p(x)\) is generally better scaled. link
        Justification: the gradient of the original term will include a \(e^{\vec{x}}\) multiplicative term that scales very quickly one way or another, requiring the step-size to equally scale/stretch in the opposite direction.
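
    As a concrete end-to-end illustration (a Bernoulli coin-flip model is assumed here; it is not part of the notes), minimizing the negative log-likelihood numerically recovers the closed-form MLE \(\hat{\theta} = \bar{y}\):

    ```python
    import numpy as np

    y = np.array([1, 0, 1, 1, 0, 1, 1, 1])            # observed data D (toy coin flips)

    def neg_log_likelihood(theta, y):
        # -log L(theta; D) = -sum_i log p(y_i ; theta) for a Bernoulli(theta) model
        return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

    thetas = np.linspace(0.01, 0.99, 999)
    nlls = np.array([neg_log_likelihood(t, y) for t in thetas])

    theta_numeric = thetas[np.argmin(nlls)]           # numerical argmin over a grid
    theta_closed  = y.mean()                          # analytic MLE for the Bernoulli model
    print(theta_numeric, theta_closed)                # both approximately 0.75
    ```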

Text-Classification | Classical

  1. List some Classification Methods:
    1. (Hand-Coded) Rules-Based Algorithms: use rules based on combinations of words or other features.
      1. Can have high accuracy if the rules are carefully refined and maintained by experts.
      2. However, building and maintaining these rules is very hard.
    2. Supervised Machine Learning: using an ML algorithm that learns from a training set of (document, class) pairs to produce a classifier.
      1. Types of Classifiers:
        1. Naive Bayes
        2. Logistic Regression
        3. SVMs
        4. K-NNs
  2. List some Applications of Text Classification:
    1. Spam Filtering: discerning spam emails from legitimate emails.
    2. Email Routing: sending an email sent to a general address to a specific address based on the topic.
    3. Language Identification: automatically determining the language of a piece of text.
    4. Readability Assessment: determining the degree of readability of a piece of text.
    5. Sentiment Analysis: determining the general emotion/feeling/attitude of the author of a piece of text.
    6. Authorship Attribution: determining which author wrote which piece of text.
    7. Age/Gender Identification: determining the age and/or gender of the author of a piece of text.

NLP

  1. List some problems in NLP:
    1. Question Answering (QA)
    2. Information Extraction (IE)
    3. Sentiment Analysis
    4. Machine Translation (MT)
    5. Spam Detection
    6. Parts-of-Speech (POS) Tagging
    7. Named Entity Recognition (NER)
    8. Coreference Resolution
    9. Word Sense Disambiguation (WSD)
    10. Parsing
    11. Paraphrasing
    12. Summarization
    13. Dialog
  2. List the Solved Problems in NLP:
    1. Spam Detection
    2. Parts-of-Speech (POS) Tagging
    3. Named Entity Recognition (NER)
  3. List the “within reach” problems in NLP:
    1. Sentiment Analysis
    2. Coreference Resolution
    3. Word Sense Disambiguation (WSD)
    4. Parsing
    5. Machine Translation (MT)
    6. Information Extraction (IE)
  4. List the Open Problems in NLP:
    1. Question Answering (QA)
    2. Paraphrasing
    3. Summarization
    4. Dialog
  5. Why is NLP hard? List Issues:
    1. Non-Standard English: “Great Job @ahmed_badary! I luv u 2!! were SOO PROUD of dis.”
    2. Segmentation Issues: “New York-New Haven” vs “New-York New-Haven”
    3. Idioms: “dark horse”, “getting cold feet”, “losing face”
    4. Neologisms: “unfriend”, “retweet”, “google”, “bromance”
    5. World Knowledge: “Ahmed and Zach are brothers”, “Ahmed and Zach are fathers”
    6. Tricky Entity Names: “Where is Life of Pie playing tonight?”, “Let it be was a hit song!”
  6. Define:
    1. Morphology:
      The study of words, how they are formed, and their relationship to other words in the same language.
    2. Morphemes:
      the small meaningful units that make up words.
    3. Stems:
      the core meaning-bearing units of words.
    4. Affixes:
      the bits and pieces that adhere to stems (often with grammatical functions).
    5. Stemming:
      is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
      The stem need not map to a valid root in the language.
    6. Lemmatization:
      reducing inflections or variant forms to base form.
  7. Topic Modeling vs Document Classification:
    • Text Classification is a form of supervised learning, hence the set of possible classes are known/defined in advance, and won’t change.
    • Topic Modeling is a form of unsupervised learning (akin to clustering), so the set of possible topics is unknown a priori. They’re defined as part of generating the topic models.

Language Modeling

  1. What is a Language Model?
  2. List some Applications of LMs:
  3. Traditional LMs:
    1. How are they setup?
    2. What do they depend on?
    3. What is the Goal of the LM task? (in the ctxt of the problem setup)
    4. What assumptions are made by the problem setup? Why?
    5. What are the MLE Estimates for probabilities of the following:
      1. Bi-Grams:

        $$p(w_2\vert w_1) = $$

      2. Tri-Grams:

        $$p(w_3\vert w_1, w_2) = $$

    6. What are the issues w/ Traditional Approaches?
  4. What+How can we setup some NLP tasks as LM tasks:
  5. How does the LM task relate to Reasoning/AGI:
  6. Evaluating LM models:
  7. LM DATA:
    1. How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
    2. How should we choose a subset of articles for testing:
  8. List three approaches to Parametrizing LMs:
  9. What’s the main issue in LM modeling?
    1. The Bias-Variance Tradeoff of the following:
      1. N-Gram Models:
      2. RNNs:
      3. An Estimate s.t. it predicts the probability of a sentence by how many times it has seen it before:
        1. What happens in the limit of infinite data?
  10. What are the advantages of sub-word level LMs:
  11. What are the disadvantages of sub-word level LMs:
  12. What is a “Conditional LM”?
  13. Write the decomposition of the probability for the Conditional LM:
  14. Describe the Computational Bottleneck for Language Models:
  15. Describe/List some solutions to the Bottleneck:
  16. Complexity Comparison of the different solutions:

Regularization

  1. Define Regularization both intuitively and formally:
    Regularization can be, loosely, defined as: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

    Formally, it is a set of techniques that impose certain restrictions on the hypothesis space (by adding information) in order to solve an ill-posed problem or to prevent overfitting.

  2. Define “well-posedness”:
    Hadamard defines Well-Posed Problems as having the properties: (1) a solution exists, (2) it is unique, (3) its behavior changes continuously with the initial conditions.

  3. Give four aspects of justification for regularization (theoretical):
  4. Describe an overview of regularization in DL. How does it usually work?
    In the context of DL, most regularization strategies are based on regularizing estimators, which usually works by trading increased bias for reduced variance.
    1. Intuitively, how can a regularizer be effective?
      An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
  5. Describe the relationship between regularization and capacity:
    Regularization is a (more general) way of controlling a model's capacity by allowing us to express a preference for one function over another in the same hypothesis space; instead of including or excluding members from the hypothesis space completely.
  6. Describe the different approaches to regularization:
    1. Parameter Norm Penalties: \(L^p\) norms, early stopping
    2. Data Augmentation: noise robustness/injection, dropout
    3. Semi-supervised Learning
    4. Multi-task Learning
    5. Ensemble Learning: bagging, etc.
    6. Adversarial Training
    7. Infinite Priors: parameter tying and sharing
  7. List the main regularization techniques:
    1. \(L^2\) regularization
    2. \(L^1\) regularization
    3. Dataset Augmentation
    4. Noise Injection
    5. Semi-supervised Learning
    6. Multi-task Learning
    7. Early Stopping
    8. Parameter Tying, Sharing
    9. Sparse Representations
    10. Ensemble Methods, Bagging, etc.
    11. Dropout
    12. Adversarial Training
    13. Tangent Prop and the Manifold Tangent Classifier

  1. Describe Parameter Norm Penalties (PNPs):
    Many regularization approaches are based on limiting the capacity of models by adding a parameter norm penalty \(\Omega(\boldsymbol{\theta})\) to the objective function \(J\). We denote the regularized objective function by \(\tilde{J}\):

    $$\tilde{J}(\boldsymbol{\theta} ; \boldsymbol{X}, \boldsymbol{y})=J(\boldsymbol{\theta} ; \boldsymbol{X}, \boldsymbol{y})+\alpha \Omega(\boldsymbol{\theta}) \tag{7.1}$$

    where \(\alpha \in[0, \infty)\) is a HP that weights the relative contribution of the norm penalty term, \(\Omega\), relative to the standard objective function \(J\).

  2. How do we deal with the Bias parameter in PNPs? Explain.
    In NN, we usually penalize only the weights of the affine transformation at each layer and we leave the biases unregularized.
    Biases typically require less data than the weights to fit accurately. The reason is that each weight specifies how TWO variables interact, so fitting the weights well requires observing both variables in a variety of conditions. However, each bias controls only a single variable; thus, we don’t induce too much variance by leaving the biases unregularized. If anything, regularizing the biases can introduce a significant amount of underfitting.
  3. Describe the tuning of the \(\alpha\) HP in NNs for different hidden layers:
    In the context of neural networks, it is sometimes desirable to use a separate penalty with a different \(\alpha\) coefficient for each layer of the network. Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the size of search space.
  4. Formally describe the \(L^2\) parameter regularization:
    It is a regularization strategy that drives the weights closer to the origin by adding a regularization term:

    $$\Omega(\mathbf{\theta}) = \frac{1}{2}\|\boldsymbol{w}\|_ {2}^{2}$$

    to the objective function.

    1. AKA:
      In statistics, \(L^2\) regularization is also known as Ridge Regression or Tikhonov Regularization.
      In ML, it is known as weight decay.
    2. Describe the regularization contribution to the gradient in a single step.
      The addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update.

      $$\boldsymbol{w} \leftarrow(1-\epsilon \alpha) \boldsymbol{w}-\epsilon \nabla_{\boldsymbol{w}} J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y}) \tag{7.5}$$

    3. Describe the regularization contribution to the gradient. How does it scale?
      The effect of weight decay is to rescale \(\boldsymbol{w}^{\ast}\) along the axes defined by the eigenvectors of \(\boldsymbol{H}\). Specifically, the component of \(\boldsymbol{w}^{\ast}\) that is aligned with the \(i\)-th eigenvector of \(\boldsymbol{H}\) is rescaled by a factor of \(\frac{\lambda_{i}}{\lambda_{i}+\alpha}\).
    4. How does weight decay relate to shrinking the individual weight wrt their size? What is the measure/comparison used?
      Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.

      | Condition | Effect of Regularization |
      | --- | --- |
      | \(\lambda_{i} \gg \alpha\) | Not much |
      | \(\lambda_{i} \ll \alpha\) | The weight value is shrunk to nearly \(0\) |
  5. Draw a graph describing the effects of \(L^2\) regularization on the weights:
    img
  6. Describe the effects of applying weight decay to linear regression
  7. Derivation:
    We place a Gaussian Prior on the weights, with zero mean and equal variance \(\tau^2\):

    $$\begin{aligned} \hat{\theta}_ {\mathrm{MAP}} &=\arg \max_{\theta} \log P(y | \theta)+\log P(\theta) \\ &=\arg \max _{\boldsymbol{w}}\left[\log \prod_{i=1}^{n} \dfrac{1}{\sigma \sqrt{2 \pi}} e^{-\dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}}+\log \prod_{j=0}^{p} \dfrac{1}{\tau \sqrt{2 \pi}} e^{-\dfrac{w_{j}^{2}}{2 \tau^{2}}} \right] \\ &=\arg \max _{\boldsymbol{w}} \left[-\sum_{i=1}^{n} \dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}-\sum_{j=0}^{p} \dfrac{w_{j}^{2}}{2 \tau^{2}}\right] \\ &=\arg \min_{\boldsymbol{w}} \dfrac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\dfrac{\sigma^{2}}{\tau^{2}} \sum_{j=0}^{p} w_{j}^{2}\right] \\ &=\arg \min_{\boldsymbol{w}} \left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\lambda \sum_{j=0}^{p} w_{j}^{2}\right] \\ &= \arg \min_{\boldsymbol{w}} \left[ \|XW - \boldsymbol{y}\|^2 + \lambda {\|\boldsymbol{w}\|_ 2}^2\right]\end{aligned}$$
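
    A small sketch (toy data; shapes, \(\lambda\), and step size are assumptions, not from the notes) connecting the two views above: the closed-form ridge/MAP solution and gradient descent with the multiplicative weight-decay shrinkage of Eq. 7.5 converge to (approximately) the same weights:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam, eps = 50, 3, 1.0, 0.002
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Closed-form ridge / MAP solution: (X^T X + lambda I)^{-1} X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Gradient descent on J(w) = ||Xw - y||^2 with weight decay:
    # with penalty lambda * ||w||^2, the shrinkage factor is (1 - 2 * eps * lambda).
    w = np.zeros(d)
    for _ in range(5000):
        grad_J = 2 * X.T @ (X @ w - y)
        w = (1 - 2 * eps * lam) * w - eps * grad_J    # shrink, then take the usual GD step

    print(w_ridge, w)                                 # the two should (approximately) agree
    ```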

  8. List the properties of \(L^2\) regularization:
    1. Notice that L2-regularization has a rotational invariance. This actually makes it more sensitive to irrelevant features.
    2. Adding L2-regularization to a convex function gives a strongly-convex function. So L2-regularization can make gradient descent converge much faster.
  9. Formally describe the \(L^1\) parameter regularization:
    \(L^1\) Regularization is another way to regulate the model by penalizing the size of its parameters; the technique adds a regularization term:

    $$\Omega(\boldsymbol{\theta})=\|\boldsymbol{w}\|_{1}=\sum_{i}\left|w_{i}\right| \tag{7.18}$$

    which is a sum of absolute values of the individual parameters.

    1. AKA:
      LASSO.
    2. Whats the regularized objective function?

      $$\tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha\|\boldsymbol{w}\|_ {1}+J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y}) \tag{7.19}$$

    3. What is its gradient?

      $$\nabla_{\boldsymbol{w}} \tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha \operatorname{sign}(\boldsymbol{w})+\nabla_{\boldsymbol{w}} J(\boldsymbol{X}, \boldsymbol{y} ; \boldsymbol{w}) \tag{7.20}$$

    4. Describe the regularization contribution to the gradient compared to L2. How does it scale?
      The regularization contribution to the gradient no longer scales linearly with each \(w_i\); instead it is a constant factor with a sign = \(\text{sign}(w_i)\).
  10. List the properties and applications of \(L^1\) regularization:
    1. Induces Sparser Solutions
    2. Solutions may be non-unique

  11. Derivation:

    $$\begin{aligned} \hat{\theta}_ {\mathrm{MAP}} &=\arg \max_{\theta} \log P(y | \theta)+\log P(\theta) \\ &=\arg \max _{\boldsymbol{w}}\left[\log \prod_{i=1}^{n} \dfrac{1}{\sigma \sqrt{2 \pi}} e^{-\dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}}+\log \prod_{j=0}^{p} \dfrac{1}{2 b} e^{-\dfrac{\left|\theta_{j}\right|}{2 b}} \right] \\ &=\arg \max _{\boldsymbol{w}} \left[-\sum_{i=1}^{n} \dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}-\sum_{j=0}^{p} \dfrac{\left|w_{j}\right|}{2 b}\right] \\ &=\arg \min_{\boldsymbol{w}} \dfrac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\dfrac{\sigma^{2}}{b} \sum_{j=0}^{p}\left|w_{j}\right|\right] \\ &=\arg \min_{\boldsymbol{w}} \left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\lambda \sum_{j=0}^{p}\left|w_{j}\right|\right] \\ &= \arg \min_{\boldsymbol{w}} \left[ \|XW - \boldsymbol{y}\|^2 + \lambda \|\boldsymbol{w}\|_ 1\right]\end{aligned}$$

  12. Analyze \(L^1\) vs \(L^2\) regularization:
    They both shrink the weights towards zero.

    | \(L^2\) | \(L^1\) |
    | --- | --- |
    | Sensitive to irrelevant features | Robust to irrelevant features |
    | Computationally efficient | Non-differentiable at \(0\) |
    | No sparse solutions | Produces sparse solutions |
    | No built-in feature selection | Built-in feature selection |
    | Unique solution | Possibly multiple solutions |
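
    A hedged illustration of the sparsity row of this comparison, using scikit-learn's Ridge and Lasso as stand-ins on toy data where only a few features are informative (the data and \(\alpha\) values are assumptions):

    ```python
    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.normal(size=(n, d))
    w_true = np.zeros(d); w_true[:3] = [3.0, -2.0, 1.5]        # only 3 informative features
    y = X @ w_true + 0.5 * rng.normal(size=n)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    print("ridge nonzeros:", np.sum(np.abs(ridge.coef_) > 1e-6))   # typically keeps all 20 nonzero
    print("lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-6))   # typically keeps only a handful
    ```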

  13. Describe Elastic Net Regularization. Why was it devised? Any properties?

    $$\Omega = \lambda\left(\alpha\|w\|_{1}+(1-\alpha)\|w\|_{2}^{2}\right), \alpha \in[0,1]$$

    1. Combines both \(L^1\) and \(L^2\)
    2. Used to produce sparse solutions, but to avoid the problem of \(L^1\) solutions being sometimes Non-Unique
      1. The problem mainly arises with correlated features
    3. Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights.
  14. Motivate Regularization for ill-posed problems:
    1. What is the property that needs attention?
      Under-Constrained Problems. Under-determined systems. Specifically, when \(X^TX\) is singular.
    2. What would the regularized solution correspond to in this case?
      Many forms of regularization correspond to inverting \(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\) instead.
    3. Are there any guarantees for the solution to be well-posed? How/Why?
      This regularized matrix, \(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\), is guaranteed to be invertible.
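
    A tiny numerical check (illustrative toy matrix) that a singular \(X^TX\) becomes invertible once \(\alpha I\) is added:

    ```python
    import numpy as np

    X = np.array([[1.0, 2.0],
                  [2.0, 4.0]])            # second column = 2 * first, so X^T X is singular
    alpha = 0.1

    print(np.linalg.matrix_rank(X.T @ X))                  # 1 (rank-deficient)
    print(np.linalg.det(X.T @ X))                          # ~0, not invertible
    print(np.linalg.det(X.T @ X + alpha * np.eye(2)))      # > 0, so it is invertible
    ```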

  15. Describe dataset augmentation and its techniques:
    Having more data is the most desirable thing for improving a machine learning model’s performance. In many cases, it is relatively easy to artificially generate data.
  16. When is dataset augmentation applicable?
    For certain problems like classification this approach is readily usable. E.g. for a classification task, we require the model to be invariant to certain types of transformations, of which we can generate data by applying them on our current dataset.
  17. When is it not?
    This approach is not applicable to many problems, especially those that require us to learn the true data-distribution first E.g. Density Estimation.
  18. Motivate the Noise Robustness property:
    Noise Robustness is an important property for ML models:
    1. For many classification and (some) regression tasks: the task should be possible to solve even if small random noise is added to the input (Local Constancy)
    2. Moreover, NNs prove not to be very robust to noise.
  19. How can Noise Robustness motivate a regularization technique?
    To increase noise robustness, we will need to somehow teach our models to ignore noise. That can be done by teaching it to model data that is, perhaps, more noisy than what we actually have; thus, regularizing it from purely and completely fitting the training data.
  20. How can we enhance noise robustness in NN?
    By Noise Injection in different parts of the network.

  21. Define “Semi-Supervised Learning”:
    Semi-Supervised Learning is a class of ML tasks and techniques that makes use of both unlabeled examples from \(P(\mathbf{x})\) and labeled examples from \(P(\mathbf{x}, \mathbf{y})\) to estimate \(P(\mathbf{y} | \mathbf{x})\) or predict \(\mathbf{y}\) from \(\mathbf{x}\).
    1. What does it refer to in the context of DL:
      In the context of Deep Learning, Semi-Supervised Learning usually refers to learning a representation \(\boldsymbol{h}=f(\boldsymbol{x})\).
    2. What is its goal?
      The goal is to learn a representation such that examples from the same class have similar representations.
    3. Give an example in classical ML:
      Using unsupervised learning to build representations: PCA, as a preprocessing step before applying a classifier, is a long-standing variant of this approach.
  22. Describe an approach to applying semi-supervised learning:
    Instead of separating the supervised and unsupervised criteria, we can instead have a generative model of \(P(\mathbf{x})\) (or \(P(\mathbf{x}, \mathbf{y})\)) which shares parameters with a discriminative model \(P(\mathbf{y} \vert \mathbf{x})\).
    The idea is to share the unsupervised/generative criterion with the supervised criterion to express a prior belief that the structure of \(P(\mathbf{x})\) (or \(P(\mathbf{x}, \mathbf{y})\)) is connected to the structure of \(P(\mathbf{y} \vert \mathbf{x})\), which is captured by the shared parameters.
    By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008).
  23. How can we interpret dropout wrt data augmentation?
    Dropout can be seen as a process of constructing new inputs by multiplying by noise.

  24. Describe Multitask Learning as regularization:
    The idea is to improve the generalization error by pooling together examples from multiple tasks. Similar to how more data leads to more generalization, using a part of the model for different tasks constrains that part to learn good values.
  25. List the types:

Misc.

  1. Explain Latent Dirichlet Allocation (LDA)
    Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter.
    LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words.
    The “Dirichlet” distribution is simply a distribution of distributions. In LDA, documents are distributions of topics that are distributions of words.
  2. How to deal with curse of dimensionality
    1. Feature Selection
    2. Feature Extraction
  3. How to detect correlation of “categorical variables”?
    1. Chi-Squared test
  4. Define “marginal likelihood” (wrt naive bayes):
    Marginal likelihood is the probability of the evidence itself, e.g. the probability that the word ‘FREE’ is used in any message, not conditioned on the class.
  5. KNN VS K-Means
    1. KNN: Supervised, Classification/Regression Algorithm
    2. K-Means: Unsupervised Clustering Algorithm
  6. When is Ridge regression favorable over Lasso regression for correlated features?
    If there exists a subset consisting of a few variables with medium/large-sized effects, use lasso regression. In the presence of many variables with small/medium-sized effects, use ridge regression. (ESL authors)

    If features are correlated, it is hard to determine which variables to drop, so it is often better not to drop variables. Thus, use ridge over lasso, since the latter produces non-unique solutions and might drop features somewhat arbitrarily, while the former spreads the weight more evenly.

  7. What is convex hull ?
    In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are created, we get the maximum margin hyperplane (MMH) as a perpendicular bisector of the shortest segment between the two hulls. The MMH is the line that attempts to create the greatest separation between the two groups.
  8. Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?
    For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature.
  9. OLS vs MLE
    Both estimate the parameters of a model. For linear regression they coincide when the errors are normally distributed (OLS is the MLE under a Gaussian noise model).
  10. What are collinearity and multicollinearity?
    1. Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have some correlation.
    2. Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
  11. Describe ways to overcome scaling (scalability) issues:
    Nyström methods / low-rank kernel matrix approximations, random features, local-by-query / nearest-neighbor methods

FeedForward Neural Network


Multilayer Perceptron


Deep Feedforward Neural Networks


AutoEncoders

  1. What is an AutoEncoder? What is its goal? (draw a diagram)
    An Autoencoder is an ANN used for unsupervised learning of efficient codings/representations.
    It aims to learn a good representation (encoding) for a set of data, typically lower-dimensional.
    img
  2. What type of NN is the Autoencoder?
    A Feedforward NN.
  3. Give Motivation for AutoEncoders:
    Autoencoders are a generalization of PCA to non-linear projections.
    Auto-Encoders allow us to deal with curved manifolds in the input space by using deep layers, where the code is a non-linear function of the input, and the reconstruction of the data from the code is, also, a non-linear function of the code.
  4. Why Deep AutoEncoders? What do they allow us to do?
    They provide a really nice way to do non-linear dimensionality reduction:
    • They provide flexible mappings both ways
    • The learning time is linear (or better) in the number of training examples
    • The final encoding model/Encoder is fairly compact and fast
  5. List the Advantages of Depth in AutoEncoders:
    • Computational Efficiency: Depth can exponentially reduce the computational cost of representing some functions
    • Statistical Efficiency: Depth can exponentially decrease the amount of training data needed to learn some functions
    • Representational Efficiency: Experimentally, deep Autoencoders yield better compression compared to shallow or linear Autoencoders
  6. List the Applications of AutoEncoders (and historical information):
    The applications of auto-encoders have changed over time.
    This is due to advances in the fields that auto-encoders were applied in, or to shortcomings of the auto-encoders themselves.
    • Historically:
      • Information Retrieval
      • Anomaly Detection
    • Recently, auto-encoders are applied to:
      • Data-Denoising
      • Dimensionality Reduction (for data visualization)
      • Drug discovery
  7. Describe the Training of Deep AutoEncoders:

  8. Describe the Architecture of AutoEncoders:
    • An auto-encoder consists of:
      • An Encoding Function
      • A Decoding Function
      • A Distance Function
    • We choose the encoder and decoder to be parametric functions (typically neural networks), and to be differentiable with respect to the distance function, so the parameters of the encoding/decoding functions can be optimized to minimize the reconstruction loss, using Stochastic Gradient Descent.
    1. What is the simplest form of an AE:
      The simplest form of an Autoencoder is a feedforward neural network (similar to the multilayer perceptron (MLP)) – having an input layer, an output layer and one or more hidden layers connecting them – but with the output layer having the same number of nodes as the input layer, and with the purpose of reconstructing its own inputs (instead of predicting the target value \({\displaystyle Y}\) given inputs \({\displaystyle X}\)).
    2. What realm of “Learning” is employed for AEs?
      Self-Supervised Learning.
  9. Mathematical Description of the Structure of AutoEncoders:
    The encoder and the decoder in an auto-encoder can be defined as transitions \(\phi\) and \({\displaystyle \psi ,}\) such that:

    $$ {\displaystyle \phi :{\mathcal {X}}\rightarrow {\mathcal {F}}} \\ {\displaystyle \psi :{\mathcal {F}}\rightarrow {\mathcal {X}}} \\ {\displaystyle \phi ,\psi =\arg \min_{\phi ,\psi }\|X-(\psi \circ \phi )X\|^{2}}$$

    where \({\mathcal {X} = \mathbf{R}^d}\) is the input space, and \({\mathcal {F} = \mathbf{R}^p}\) is the latent (feature) space, and \(p < d\).

    The encoder takes the input \({\displaystyle \mathbf {x} \in \mathbb {R} ^{d}={\mathcal {X}}}\) and maps it to \({\displaystyle \mathbf {z} \in \mathbb {R} ^{p}={\mathcal {F}}}\):

    $${\displaystyle \mathbf {z} =\sigma (\mathbf {Wx} +\mathbf {b} )}$$

    • The image \(\mathbf{z}\) is referred to as code, latent variables, or latent representation.
    • \({\displaystyle \sigma }\) is an element-wise activation function such as a sigmoid function or a rectified linear unit.
    • \({\displaystyle \mathbf {W} }\) is a weight matrix
    • \({\displaystyle \mathbf {b} }\) is the bias.

    The Decoder maps \({\displaystyle \mathbf {z} }\) to the reconstruction \({\displaystyle \mathbf {x'} }\) of the same shape as \({\displaystyle \mathbf {x} }\):

    $${\displaystyle \mathbf {x'} =\sigma '(\mathbf {W'z} +\mathbf {b'} )}$$

    where \({\displaystyle \mathbf {\sigma '} ,\mathbf {W'} ,{\text{ and }}\mathbf {b'} }\) for the decoder may differ in general from those of the encoder.

    Autoencoders minimize reconstruction errors, such as the L-2 loss:

    $${\displaystyle {\mathcal {L}}(\mathbf {x} ,\psi ( \phi (\mathbf {x} ) ) ) = {\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )=\|\mathbf {x} -\mathbf {x'} \|^{2}=\|\mathbf {x} -\sigma '(\mathbf {W'} (\sigma (\mathbf {Wx} +\mathbf {b} ))+\mathbf {b'} )\|^{2}}$$

    where \({\displaystyle \mathbf {x} }\) is usually averaged over some input training set.
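
    The equations above map directly onto a few lines of code. The following is a minimal sketch (PyTorch is assumed here; the layer sizes, activations, and training loop are illustrative choices, not prescriptions from the notes):

    ```python
    import torch
    import torch.nn as nn

    d, p = 784, 32          # input dim (X = R^d) and code dim (F = R^p), with p < d

    class AutoEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(d, p), nn.Sigmoid())   # z  = sigma(Wx + b)
            self.decoder = nn.Sequential(nn.Linear(p, d), nn.Sigmoid())   # x' = sigma'(W'z + b')

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = AutoEncoder()
    criterion = nn.MSELoss()                              # reconstruction (L2) loss ||x - x'||^2
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.rand(64, d)                                 # a dummy mini-batch standing in for real data
    for _ in range(10):                                   # a few gradient steps on the reconstruction loss
        x_hat = model(x)
        loss = criterion(x_hat, x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```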

  10. Compare AutoEncoders and PCA (wrt what they learn):
  11. List the different Types of AEs
    • Vanilla Auto-Encoder
    • Sparse Auto-Encoder
    • Denoising Auto-Encoder
    • Variational Auto-Encoder (VAE)
    • Contractive Auto-Encoder.
  12. How can we use AEs for Initialization?
    After training an auto-encoder, we can use the encoder to compress the input data into its latent representation (which we can view as features) and feed those to a neural-net (e.g. a classifier) for prediction.
    img
  13. Describe the Representational Power of AEs:
    • The universal approximator theorem guarantees that a feedforward neural network with at least one hidden layer can represent an approximation of any function (within a broad class) to an arbitrary degree of accuracy, provided that it has enough hidden units. This means that an Autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well.
    • However, the mapping from input to code is shallow. This means that we are not able to enforce arbitrary constraints, such as that the code should be sparse.
    • A deep Autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units.

  14. Describe the progression (stages) of AE Architectures in CV:
    • Originally: Linear + nonlinearity (sigmoid).
    • Later: Deep, fully-connected.
    • Later: ReLU CNN (UpConv).
  15. List the Types of AEs wrt the Bottleneck/Code-Length:
    • Undercomplete:
    • Overcomplete:
    • Complete:
  16. What are Undercomplete AutoEncoders?
    An Undercomplete Autoencoder is one whose code dimension is less than the input dimension.
  17. What’s the motivation behind Undercomplete AEs?
    Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.
  18. List the Challenges of Training AEs using the Auto-Encoding Objective (case-by-case):
    • If an Autoencoder succeeds in simply learning to set \(\psi(\phi (x)) = x\) everywhere, then it is not especially useful.
      Instead, Autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data.
    • In Undercomplete Autoencoders If the encoder and decoder are allowed too much capacity, the Autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data.
      Theoretically, one could imagine that an Autoencoder with a one-dimensional code but a very powerful nonlinear encoder could learn to represent each training example \(x^{(i)}\) with the code \(i\). This specific scenario does not occur in practice, but it illustrates clearly that an Autoencoder trained to perform the copying task can fail to learn anything useful about the dataset if the capacity of the Autoencoder is allowed to become too great.
    • A similar problem occurs in Complete AEs
    • As well as in the Overcomplete case, in which the hidden code has dimension greater than the input.
      In complete and overcomplete cases, even a linear encoder and linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.
  19. What is the Main Method/Approach of addressing the Challenges above (Training AEs)?
    Regularized Autoencoders.



  1. What is an AutoEncoder? Why do we “Auto-Encode”? (Hint: it’s really a misnomer) Why is “AutoEncoder” a misnomer?
    AutoEncoders are FFNs with a narrow, low-dimensional bottleneck layer in the middle that are used to learn efficient data codings/representations.
    The objective for an AE is to learn to copy its input to its output. However, performing the copying task per se would be meaningless, and this is why autoencoders are usually restricted in ways that force them to reconstruct the input only approximately, prioritizing the most relevant aspects of the data to be copied.
    i.e. It’s not really a misnomer, since the general objective is in the form of “auto-encoding”; however, the main goal is not just to learn to reconstruct the data from the code but to learn useful information about the distribution of the data.
    The goal is not to perform the copying task, but to perform the copying task for the purpose of extracting useful information about the distribution of the data.
    The Goal: capture important information and learn richer representations.
  2. What’s the Big-Idea/Main-Point behind R-AEs and AEs in general?
    A regularized Autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution even if the model capacity is great enough to learn a trivial identity function.
    • What do DAEs learn to represent?
      They learn to represent a probability distribution.
    • How do Regularized AEs learn Manifolds?
      Regularized Autoencoders learn manifolds by balancing two opposing forces.
    • How can we Learn Manifolds w/ AEs?