Gradient-Based Optimization

  1. Define Gradient Methods:
    Gradient Methods are algorithms for solving (optimization) minimization problems of the form:

    $$\min_{x \in \mathbb{R}^{n}} f(x)$$

    by doing local search (~ hill-climbing) with the search directions defined by the gradient of the function at the current point.

  2. Give examples of Gradient-Based Algorithms:
    1. Gradient Descent: minimizes arbitrary differentiable functions
    2. Conjugate Gradient: solves (sparse) linear systems with symmetric, positive-definite matrices (equivalently, minimizes the associated quadratic form)
    3. Coordinate Descent: minimizes a function by successively minimizing along one coordinate direction at a time
  3. What is Gradient Descent:
    GD is a first-order, iterative algorithm for minimizing an objective function \(J(\theta)\), parametrized by a model's parameters \(\theta \in \mathbb{R}^d\), by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_{\theta} J(\theta)\) w.r.t. the parameters.
  4. Explain it intuitively:
    1) Local Search from a starting location on a hill
    2) Feel around how a small movement/step around your location would change the height of the surrounding hill (is the ground higher or lower?)
    3) Make the movement/step consistent as a small fixed step along some direction
    4) Measure the steepness of the hill at the new location in the chosen direction
    5) Do so by Approximating the steepness with some local information
    6) Measure the change in steepness from your current point to the new point
    7) Find the direction that decreases the steepness the most
    \(\iff\)

    1) Local Search from an initial point \(x_0\) on a function
    2) Explore the value of the function at different small nudges around \(x_0\)
    3) Make the nudges consistent as a small fixed step \(\delta\) along a normalized direction \(\hat{\boldsymbol{u}}\)
    4) Evaluate the function at the new location \(x_0 + \delta \hat{\boldsymbol{u}}\)
    5) Do so by Approximating the function w/ first-order information (Taylor expansion)
    6) Measure the change in value of the objective \(\Delta f\), from current point to the new point
    7) Find the direction \(\hat{\boldsymbol{u}}\) that minimizes the function the most

  5. Give its derivation:
    1) Start at \(\boldsymbol{x} = x_0\)
    2) We would like to know how a small change in \(\boldsymbol{x}\), namely \(\Delta \boldsymbol{x}\), would affect the value of the function \(f(\boldsymbol{x})\). This will allow us to evaluate the function:

    $$f(\mathbf{x}+\Delta \mathbf{x})$$

    to find the direction that makes \(f\) decrease the fastest
    3) Let’s set up \(\Delta \boldsymbol{x}\), the change in \(\boldsymbol{x}\), as a fixed step \(\delta\) along some normalized direction \(\hat{\boldsymbol{u}}\):

    $$\Delta \boldsymbol{x} = \delta \hat{\boldsymbol{u}}$$

    4) Evaluate the function at the new location:

    $$f(\mathbf{x}+\Delta \mathbf{x}) = f(\mathbf{x}+\delta \hat{\boldsymbol{u}})$$

    5) Using the first-order approximation:

    $$\begin{aligned} f(\boldsymbol{x}_0+\delta \hat{\boldsymbol{u}}) &\approx f({\boldsymbol{x}}_0) + \left(({\boldsymbol{x}}_0 + \delta \hat{\boldsymbol{u}}) - {\boldsymbol{x}}_0 \right)^T \nabla_{\boldsymbol{x}} f({\boldsymbol{x}}_0) \\ &= f(\boldsymbol{x}_0) + \delta \hat{\boldsymbol{u}}^T \nabla_x f(\boldsymbol{x}_0) \end{aligned}$$

    6) The change in \(f\) is:

    $$\begin{aligned} \Delta f &= f(\boldsymbol{x}_ 0 + \Delta \boldsymbol{x}) - f(\boldsymbol{x}_ 0) \\ &= f(\boldsymbol{x}_ 0 + \delta \hat{\boldsymbol{u}}) - f(\boldsymbol{x}_ 0)\\ &= \delta \nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} + \mathcal{O}(\delta^2) \\ &\approx \delta \nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} \\ &\geq -\delta\|\nabla_x f(\boldsymbol{x}_ 0)\|_ 2 \end{aligned}$$

    where

    $$\nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} \in \left[-\|\nabla f(\boldsymbol{x}_ 0)\|_ 2, \|\nabla f(\boldsymbol{x}_ 0)\|_ 2\right]$$

    since \(\hat{\boldsymbol{u}}\) is a unit vector: by Cauchy–Schwarz the dot product is bounded in magnitude by \(\|\nabla_x f(\boldsymbol{x}_ 0)\|_ 2\), and the extremes are attained exactly when \(\hat{\boldsymbol{u}}\) is aligned with, or opposite to, \(\nabla_x f(\boldsymbol{x}_ 0)\).
    7) So, the \(\hat{\boldsymbol{u}}\) that turns the above inequality into an equality achieves the largest possible decrease (moves the most downhill). That vector \(\hat{\boldsymbol{u}}\) is the one pointing in the negative direction of \(\nabla_x f(\boldsymbol{x}_ 0)\); the direction opposite to the gradient.
    \(\implies\)

    $$\hat{\boldsymbol{u}} = - \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2}$$

    Finally, the method of steepest descent (gradient descent) proposes a new point to decrease the value of \(f\):

    $$\boldsymbol{x}^{\prime}=\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$$

    where \(\epsilon\) is the learning rate, defined as:

    $$\epsilon = \dfrac{\delta}{\left\|\nabla_{x} f(x)\right\|_ {2}}$$
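    To make the update rule concrete, here is a minimal sketch (the quadratic objective, the fixed learning rate, the tolerance, and the iteration cap are assumptions for the example, not values prescribed by the derivation):

    ```python
    import numpy as np

    def gradient_descent(grad_f, x0, lr=0.1, tol=1e-6, max_iters=1000):
        """Minimize f by repeatedly stepping opposite to its gradient: x' = x - lr * grad f(x)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iters):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:   # gradient ~ 0  =>  (approximately) stationary point
                break
            x = x - lr * g
        return x

    # Example: f(x) = ||x - 3||^2 has gradient 2(x - 3) and minimum at [3, 3].
    print(gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0, 0.0]))  # ~ [3. 3.]
    ```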

  6. What is the learning rate?
    Learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the gradient of the loss; defined as:

    $$\epsilon = \dfrac{\delta}{\left\|\nabla_{x} f(x)\right\|_ {2}}$$

    1. Where does it come from?
      The learning rate comes from a modification of the step-size in the GD derivation.
      We get the learning rate by employing a simple idea:
      We have a fixed step-size \(\delta\) that dictates how much we should move in the direction of steepest descent. However, we would like to keep the step from being too small or from overshooting. The idea is to make the step-size proportional to the magnitude of the gradient (i.e. some constant multiplied by the magnitude of the gradient):

      $$\delta = \epsilon \left\|\nabla_{x} f(x)\right\|_ {2}$$

      If we do so, we get a nice cancellation as follows:

      $$\begin{aligned}\Delta \boldsymbol{x} &= \delta \hat{\boldsymbol{u}} \\ &= -\delta \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2} \\ &= - \epsilon \left\|\nabla_{x} f(x)\right\|_ {2} \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2} \\ &= - \dfrac{\epsilon \left\|\nabla_{x} f(x)\right\|_ {2}}{\|\nabla_x f(x)\|_ 2} \nabla_x f(x) \\ &= - \epsilon \nabla_x f(x) \end{aligned}$$

      where now we have a fixed learning rate instead of a fixed step-size.

    1. What is its range?
      \([0.0001, 0.4]\)
    2. How do we choose the learning rate?
      1. Choose a starting lr from the range \([0.0001, 0.4]\) and adjust using cross-validation (lr is a HP).
      2. Do
        1. Set it to a small constant
        2. Smooth Functions (non-NN):
          1. Line Search: evaluate \(f\left(\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right)\) for several values of \(\epsilon\) and choose the one that results in the smallest objective value.
            E.g. Secant Method, Newton-Raphson Method (may need Hessian, hard for large dims)
          2. Trust Region Method
        3. Non-Smooth Functions (non-NN):
          1. Direct Line Search
            E.g. golden section search
        4. Grid Search: is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. It is guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
        5. Population-Based Training (PBT): is an elegant implementation of using a genetic algorithm for hyper-parameter choice.
        6. Bayesian Optimization: is a global optimization method for noisy black-box functions.
      3. Use a larger lr together with a learning rate schedule that decays it over training (see the sketch below)
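      As one illustration of such a schedule, a minimal step-decay sketch (the decay factor and drop interval are assumed example values, not prescribed ones):

      ```python
      def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
          """Step decay: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
          return initial_lr * (drop ** (epoch // epochs_per_drop))

      # Starting from 0.1: epochs 0-9 -> 0.1, epochs 10-19 -> 0.05, epochs 20-29 -> 0.025, ...
      print([step_decay_lr(0.1, e) for e in (0, 9, 10, 25)])  # [0.1, 0.1, 0.05, 0.025]
      ```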

  7. Describe the convergence of the algorithm:
    Gradient Descent converges when every element of the gradient is zero, or very close to zero within some threshold.

    With certain assumptions on \(f\) (convex, \(\nabla f\) lipschitz) and particular choices of \(\epsilon\) (chosen via line-search etc.), convergence to a local minimum can be guaranteed.
    Moreover, if \(f\) is convex, all local minima are global minima, so convergence is to the global minimum.

  8. How does GD relate to Euler?
    Gradient descent can be viewed as applying Euler’s method for solving ordinary differential equations \({\displaystyle x'(t)=-\nabla f(x(t))}\) to a gradient flow.
  9. List the variants of GD:
    1. Batch Gradient Descent
    2. Stochastic GD
    3. Mini-Batch GD
    4. How do they differ? Why?:
      There are three variants of gradient descent, which differ in the amount of data used to compute the gradient. The amount of data imposes a trade-off between the accuracy of the parameter updates and the time it takes to perform the update.
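      A minimal sketch of how the three variants differ only in how many examples feed each gradient estimate (the linear least-squares objective, the random data, and the batch size are assumptions for the example):

      ```python
      import numpy as np

      def grad_step(w, X_batch, y_batch, lr=0.01):
          """One update for linear least-squares, f(w) = mean((X w - y)^2)."""
          grad = 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)
          return w - lr * grad

      rng = np.random.default_rng(0)
      X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
      w = np.zeros(5)

      # Batch GD: every step uses the full dataset (accurate gradient, slow per step).
      w = grad_step(w, X, y)
      # Stochastic GD: one random example per step (noisy gradient, cheap per step).
      i = rng.integers(len(y))
      w = grad_step(w, X[i:i+1], y[i:i+1])
      # Mini-Batch GD: a small random batch per step (the usual compromise).
      idx = rng.choice(len(y), size=32, replace=False)
      w = grad_step(w, X[idx], y[idx])
      ```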

  10. What is the problem of vanilla approaches to GD?
    All the variants described above, however, do not guarantee “good” convergence due to some challenges.

  11. List the different strategies for optimizing GD:
    1. Adapt our updates to the slope of our error function
    2. Adapt our updates to the importance of each individual parameter
  12. List the different variants for optimizing GD:
    1. Adapt our updates to the slope of our error function:
      1. Momentum
      2. Nesterov Accelerated Gradient
    2. Adapt our updates to the importance of each individual parameter:
      1. Adagrad
      2. Adadelta
      3. RMSprop
      4. Adam
      5. AdaMax
      6. Nadam
      7. AMSGrad
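    As an illustration of the first family (adapting the update to the slope of the error function), a minimal sketch of the classical momentum update (the coefficient \(\beta = 0.9\) and the learning rate are assumed example values):

    ```python
    import numpy as np

    def momentum_step(w, v, grad, lr=0.01, beta=0.9):
        """Classical momentum: accumulate an exponentially-weighted velocity, then step along it."""
        v = beta * v + grad      # velocity accumulates past gradients
        w = w - lr * v           # the update uses the velocity, not just the current gradient
        return w, v

    w, v = np.zeros(3), np.zeros(3)
    w, v = momentum_step(w, v, grad=np.array([1.0, -2.0, 0.5]))
    ```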


Maximum Margin Classifiers

  1. Define Margin Classifiers:
    A margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example.
  2. What is a Margin for a linear classifier?
    The margin of a linear classifier is the distance from the decision boundary to the nearest sample point.
  3. Give the motivation for margin classifiers:
    Non-margin classifiers (e.g. Centroid, Perceptron, LR) will converge to a correct classifier on linearly separable data; however, the classifier they converge to is not unique nor the best.

  4. Define the notion of the “best” possible classifier
    We assume that if we can maximize the distance between the data points to be classified and the hyperplane that classifies them, then we have reached a boundary that allows for the “best-fit”, i.e. allows for the most room for error.
  5. How can we achieve the “best” classifier?
    We enforce a constraint that achieves a classifier that has a maximum-margin.
  6. What unique vector is orthogonal to the hp? Prove it:
    The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
    Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:

    $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$

  7. What do we mean by “signed distance”? Derive its formula:
    The signed distance is the perpendicular distance from a point to a hyperplane, with a sign that indicates which side of the hyperplane the point lies on. Solving for it yields (in absolute value): \(d = \dfrac{\vert w \cdot \mathbf{x}_ n + b \vert}{\|w\|},\)
    where we have an n-dimensional hyperplane: \(w \cdot x + b = 0\) and a point \(\mathbf{x}_ n\).
    Derivation:
    1. Suppose we have an affine hyperplane defined by \(w \cdot x + b\) and a point \(\mathbf{x}_ n\).
    2. Suppose that \(\mathbf{x} \in \mathbf{R}^n\) is a point satisfying \(w \cdot \mathbf{x} + b = 0\), i.e. it is a point on the plane.
    3. We construct the vector \(\mathbf{x}_ n−\mathbf{x}\) which points from \(\mathbf{x}\) to \(\mathbf{x}_ n\), and then, project (scalar projection==signed distance) it onto the unique vector perpendicular to the plane, i.e. \(w\),

      $$d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \left| \frac{(\mathbf{x}_ n-\mathbf{x})\cdot w}{\|w\|} \right| = \frac{|\mathbf{x}_ n \cdot w - \mathbf{x} \cdot w|}{\|w\|}.$$

    4. Since \(\mathbf{x}\) is a vector on the plane, it must satisfy \(w\cdot \mathbf{x}=-b\) so we get

      $$d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \frac{|\mathbf{x}_ n \cdot w +b|}{\|w\|}$$

    Thus, we conclude that if \(\|w\| = 1\), then the signed distance from a datapoint \(X_i\) to the hyperplane is simply \(\vert w \cdot X_i + b \vert\).

  8. Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
    Let \(X_n\) be the point that is closest to the plane, and let \(\hat{w} = \dfrac{w}{\|w\|}\).
    Take any point \(X\) on the plane, and let \(\vec{v}\) be the vector \(\vec{v} = X_n - X\).
    Now, the distance, \(d\) is equal to

    $$\begin{align} d & \ = \vert\hat{w}^{\top}\vec{v}\vert \\ & \ = \vert\hat{w}^{\top}(X_n - X)\vert \\ & \ = \vert\hat{w}^{\top}X_n - \hat{w}^{\top}X\vert \\ & \ = \dfrac{1}{\|w\|}\vert w^{\top}X_n + b - w^{\top}X - b\vert , & \text{we add and subtract the bias } b\\ & \ = \dfrac{1}{\|w\|}\vert (w^{\top}X_n + b) - (w^{\top}X + b)\vert \\ & \ = \dfrac{1}{\|w\|}\vert (w^{\top}X_n + b) - (0)\vert , & \text{from the eq. of the plane on a point on the plane} \\ & \ = \dfrac{\vert (w^{\top}X_n + b)\vert}{\|w\|} \end{align} $$

  9. Use geometric properties of the hp to Simplify the expression for the distance of the closest point to the hp, above
    First, we notice that for any given plane \(w^Tx = 0\), the equations \(\gamma \, w^Tx = 0\), where \(\gamma \in \mathbf{R}\) is a non-zero scalar, all characterize the same plane, not different planes.
    This is because \(w^Tx = 0 \iff \gamma \, w^Tx = \gamma \cdot 0 \iff \gamma \, w^Tx = 0\).
    The above implies that any model that takes input \(w\) and produces a margin, will have to be Scale Invariant.
    To get around this and simplify the analysis, I am going to consider all the representations of the same plane, and I am going to pick the one where we normalize (re-scale) the weight \(w\) such that the signed distance of the point closest to the plane is equal to one:

    $$|w^Tx_n| > 0 \rightarrow |w^Tx_n| = 1$$

    , where \(x_n\) is the point closest to the plane.
    We constrain the hyperplane by normalizing \(w\) so that, for the closest point \(x_n\), \(|w^Tx_n| = 1\) or, with the bias added, \(|w^Tx_n + b| = 1\).
    \(\implies\)

    $$\begin{align} d & \ = \dfrac{\vert (w^{\top}X_n + b)\vert}{\|w\|} \\ & \ = \dfrac{\vert (1)\vert}{\|w\|} , & \text{from the constraint on the distance of the closest point} \\ & \ = \dfrac{1}{\|w\|} \end{align} $$

  10. Characterize the margin, mathematically:
    we can characterize the margin, with its size, as the distance, \(\frac{1}{\|\mathbf{w}\|}\), between the hyperplane/boundary and the closest point to the plane \(\mathbf{x}_ n\), in both directions (multiply by 2) \(= \frac{2}{\|\mathbf{w}\|}\) ; given the condition we specified earlier \(\left|\mathbf{w}^{\top} \mathbf{x}_ {n} + b\right|=1\) for the closest point \(\mathbf{x}_ n\).
  11. Characterize the “Slab Existence”:
    The analysis done above allows us to conclude that there exists a slab of width \(\dfrac{2}{\|w\|}\) containing no sample points, with the hyperplane running through (bisecting) its center.
  12. Formulate the optimization problem of maximizing the margin wrt analysis above:
    We formulate the optimization problem of maximizing the margin by maximizing the distance, subject to the condition on how we derived the distance:

    $$\max_{\mathbf{w}} \dfrac{2}{\|\mathbf{w}\|} \:\:\: : \:\: \min _{n=1,2, \ldots, N}\left|\mathbf{w}^{\top} \mathbf{x}_{n}+b\right|=1$$

  13. Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
    We can reformulate by (1) flipping the \(\max\) to a \(\min\) of the reciprocal, (2) taking a square, since it is monotonic and yields a convex objective, and (3) noticing that \(\left|\mathbf{w}^{T} \mathbf{x}_ {n}+b\right|=y_{n}\left(\mathbf{w}^{T} \mathbf{x}_ {n}+b\right)\) (since the signal and the label must agree, their product will always be positive), so the \(\min\) operator can be replaced by requiring that the condition \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1\) holds for all points (shown by contradiction):

    $$\min_{\mathbf{w}, b} \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall n \in [1,N]$$

    Now when we solve the “friendly” equation above, we will get the separating plane with the best possible margin (best=biggest).

    1. Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:

      $$\min_{\mathbf{w}, b} \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall n \in [1,N]$$

    2. What kind of formulation is it (wrt optimization)? What are the parameters?
      The above problem is a Quadratic Program, in \(d + 1\)-dimensions and \(n\)-constraints, in standard form.
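      As a concrete illustration, a minimal sketch of handing this QP to a generic convex solver (the toy dataset and the use of CVXPY are assumptions for the example, not part of the formulation):

      ```python
      import cvxpy as cp
      import numpy as np

      # Toy linearly-separable data: samples in rows of X, labels y in {-1, +1}.
      X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
      y = np.array([1.0, 1.0, -1.0, -1.0])

      w = cp.Variable(X.shape[1])
      b = cp.Variable()

      # min (1/2) w^T w   subject to   y_n (w^T x_n + b) >= 1  for all n
      problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                           [cp.multiply(y, X @ w + b) >= 1])
      problem.solve()

      print(w.value, b.value)              # the maximum-margin hyperplane
      print(2 / np.linalg.norm(w.value))   # width of the slab, 2 / ||w||
      ```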

Hard-Margin SVMs

  1. Define:
    1. SVMs:
      **Support Vector Machines** (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
      The SVM is a Maximum Margin Classifier that aims to find the “maximum-margin hyperplane” that divides the group of points \({\vec {x}}_ {i}\) for which \({\displaystyle y_{i}=1}\) from the group of points for which \({\displaystyle y_{i}=-1}\).
    2. Support Vectors:
      **Support Vectors** are the data-points that lie exactly on the margin (i.e. on the boundary of the slab).
      They satisfy \(\vert w^TX' + b\vert = 1\) for every support vector \(X'\)
    3. Hard-Margin SVM:
      The Hard-Margin SVM is just a maximum-margin classifier with features and kernels (discussed later).
  2. Define the following wrt hard-margin SVM:
    1. Goal:
      Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classifies the data-points and, moreover, does so in the “best” possible way.
    2. Procedure:
      (1) Use a linear classifier (2) But, Maximize the Margin (3) Do so by Minimizing \(\|w\|\)
    3. Decision Function:

      $${\displaystyle f(x)={\begin{cases}1&{\text{if }}\ w\cdot X_i+b>0\\0&{\text{otherwise}}\end{cases}}}$$

    4. Constraints:

      $$y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$

    5. The Optimization Problem:
      Find weights ‘\(w\)’ and scalar ‘\(b\)’ that minimize

      $$ \dfrac{1}{2} w^Tw$$

      Subject to

      $$y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$

      Formally,

      $$\min_w \dfrac{1}{2}w^Tw \:\:\: : \:\: y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$

    6. The Optimization Method:
      The SVM optimization problem reduces to a Quadratic Program.
  3. Elaborate on the generalization analysis:
    We notice that, geometrically, the hyperplane (the maximum margin classifier) is completely characterized by the support vectors (the vectors that lie on the margin).
    A very important conclusion arises.
    The maximum margin classifier (SVM) depends only on the number of support vectors and not on the dimension of the problem.
    This implies that the computation doesn’t scale up with the dimension and, also implies, that the kernel trick works very well.
  4. List the properties:
    1. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
    2. The hyperplane is determined solely by its support vectors.
    3. The SVM always converges on linearly separable data.
    4. The Hard-Margin SVM fails if the data is not linearly separable.
    5. The Hard-Margin SVM is quite sensitive to outliers
  5. Give the solution to the optimization problem for H-M SVM:
    To solve the above problem, we need something that deals with inequality constraints; thus, we use the KKT method for solving a Lagrangian under inequality constraints.
    The Lagrange Formulation:
    1. Formulate the Lagrangian:
      1. Take each inequality constraint and put them in the zero-form (equality with Zero)
      2. Multiply each inequality by a Lagrange Multiplier \(\alpha_n\)
      3. Add them to the objective function \(\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w}\)
        The sign will be \(-\) (negative) simply because the inequality is \(\geq 0\)

      $$\min_{w, b} \max_{\alpha_n} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \mathbf{w}^T\mathbf{w} -\sum_{n=1}^{N} \alpha_{n}\left(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1\right) \:\:\: : \:\: \alpha_n \geq 0$$

    2. Optimize the objective independently, for each of the unconstrained variables:
      1. Gradient w.r.t. \(\mathbf{w}\):

        $$\nabla_{\mathrm{w}} \mathcal{L}=\mathrm{w}-\sum_{n=1}^{N} \alpha_{n} y_{n} \mathrm{x}_ {n}=0 \\ \implies \\ \mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}$$

      2. Derivative w.r.t. \(b\):

        $$\frac{\partial \mathcal{L}}{\partial b}=-\sum_{n=1}^{N} \alpha_{n} y_{n}=0 \\ \implies \\ \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$

    3. Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
      1. Substitute the above conditions into the original Lagrangian (so that the optimization w.r.t. \(\alpha_n\) becomes free of \(\mathbf{w}\) and \(b\)):

        $$\mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}$$

      2. Notice that the first condition, \(\mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}\), does not constrain \(\alpha_n\), so it is vacuous as a constraint; the second condition, \(\sum_{n=1}^{N} \alpha_{n} y_{n}=0\), however, does constrain \(\alpha_n\) and must be kept.
      3. Set the optimization objective and the constraints, a quadratic function in \(\alpha_n\):

      $$\max_{\alpha} \mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m} \\ \:\:\:\:\:\:\:\:\:\: : \:\: \alpha_n \geq 0 \:\: \forall \: n= 1, \ldots, N \:\: \wedge \:\: \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$

    4. Set the problem as a Quadratic Programming problem:
      1. Change the maximization to minimization by flipping the signs:

        $$\min _{\alpha} \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}-\sum_{n=1}^{N} \alpha_{n}$$

      2. Isolate the Coefficients from the \(\alpha_n\)s and set in matrix-form:

        $$\min _{\alpha} \frac{1}{2} \alpha^{\top} \underbrace{\begin{bmatrix} y_{1} y_{1} \mathbf{x}_{1}^{\top} \mathbf{x}_{1} & y_{1} y_{2} \mathbf{x}_{1}^{\top} \mathbf{x}_{2} & \ldots & y_{1} y_{N} \mathbf{x}_{1}^{\top} \mathbf{x}_{N} \\ y_{2} y_{1} \mathbf{x}_{2}^{\top} \mathbf{x}_{1} & y_{2} y_{2} \mathbf{x}_{2}^{\top} \mathbf{x}_{2} & \ldots & y_{2} y_{N} \mathbf{x}_{2}^{\top} \mathbf{x}_{N} \\ \ldots & \ldots & \ldots & \ldots \\ y_{N} y_{1} \mathbf{x}_{N}^{\top} \mathbf{x}_{1} & y_{N} y_{2} \mathbf{x}_{N}^{\top} \mathbf{x}_{2} & \ldots & y_{N} y_{N} \mathbf{x}_{N}^{\top} \mathbf{x}_{N} \end{bmatrix}}_{\text{quadratic coefficients}} \alpha+\underbrace{\left(-1^{\top}\right)}_ {\text {linear }} \alpha \\ \:\:\:\:\:\:\:\:\:\: : \:\: \underbrace{\mathbf{y}^{\top} \boldsymbol{\alpha}=0}_{\text {linear constraint }} \:\: \wedge \:\: \underbrace{0}_{\text {lower bounds }} \leq \alpha \leq \underbrace{\infty}_{\text {upper bounds }} $$

        The Quadratic Programming Package asks you for the Quadratic Term (Matrix) and the Linear Term, and for the Linear Constraint and the Range of \(\alpha_n\)s; and then, gives you back an \(\mathbf{\alpha}\).

      Equivalently:

      $$\min _{\alpha} \frac{1}{2} \boldsymbol{\alpha}^{\mathrm{T}} \mathrm{Q} \boldsymbol{\alpha}-\mathbf{1}^{\mathrm{T}} \boldsymbol{\alpha} \quad \text {subject to } \quad \mathbf{y}^{\mathrm{T}} \boldsymbol{\alpha}=0 ; \quad \boldsymbol{\alpha} \geq \mathbf{0}$$

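    A minimal sketch of handing this dual QP to a solver and recovering the primal solution \(\mathbf{w}=\sum_{n} \alpha_{n} y_{n} \mathbf{x}_ {n}\) (the toy dataset and the use of CVXPY are assumptions for the example; the points with \(\alpha_n > 0\) are the support vectors):

    ```python
    import cvxpy as cp
    import numpy as np

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    a = cp.Variable(len(y))  # the alpha_n's
    # (1/2) a^T Q a  with  Q_nm = y_n y_m x_n^T x_m, written as (1/2) || sum_n a_n y_n x_n ||^2
    quad = 0.5 * cp.sum_squares(X.T @ cp.multiply(a, y))
    problem = cp.Problem(cp.Minimize(quad - cp.sum(a)),
                         [a >= 0, y @ a == 0])
    problem.solve()

    alpha = a.value
    w = X.T @ (alpha * y)        # w = sum_n alpha_n y_n x_n
    sv = int(np.argmax(alpha))   # index of a support vector (alpha_n > 0)
    b = y[sv] - X[sv] @ w        # from y_sv (w^T x_sv + b) = 1 and y_sv in {-1, +1}
    print(alpha.round(3), w, b)
    ```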


Soft-Margin SVM

  1. Motivate the soft-margin SVM:
    The Hard-Margin SVM faces a few issues:
    1. The Hard-Margin SVM fails if the data is not linearly separable.
    2. The Hard-Margin SVM is quite sensitive to outliers

    The Soft-Margin SVM aims to fix/reconcile these problems.

  2. What is the main idea behind it?
    Allow some points to violate the margin, by introducing slack variables.
  3. Define the following wrt soft-margin SVM:
    1. Goal:
      Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classifies the data-points and, moreover, does so in the “best” possible way, but allow some points to violate the margin, by introducing slack variables.
    2. Procedure:
      (1) Use a linear classifier
      (2) But, Maximize the Margin
      (3) Do so by Minimizing \(\|w\|\)
      (4) But allow some points to penetrate the margin
    3. Decision Function:
    4. Constraints:

      $$y_i(wX_i + b) \geq 1 - \zeta_i, \forall i \in [1,n]$$

      where the \(\zeta_i\)s are slack variables.
      We, also, enforce the non-negativity constraint on the slack variables:

      $$\zeta_i \geq 0, \:\:\: \forall i \in [1, n]$$

      1. Why is there a non-negativity constraint?
        The non-negativity constraint forces the slack variables to be zero for all points that do not violate the original constraint:
        i.e. are not inside the slab.
    5. Objective/Cost Function:

      $$ R(w) = \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$

    6. The Optimization Problem:
      Find weights ‘\(w\)’, scalar ‘\(b\)’, and \(\zeta_i\)s that minimize

      $$ \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$

      Subject to

      $$y_i(wX_i + b) \geq 1 - \zeta_i, \zeta_i \geq 0, \forall i \in [1,n]$$

      Formally,

      $$\min_{w, b, \zeta} \dfrac{1}{2}w^Tw + C \sum_{i=1}^n \zeta_i \:\:\: : \:\: y_i(wX_i + b) \geq 1 - \zeta_i, \:\: \zeta_i \geq 0, \forall i \in [1,n]$$

    7. The Optimization Method:
      The SVM optimization problem reduces to a Quadratic Program in \(d + n + 1\)-dimensions and \(2n\)-constraints.
    8. Properties:
      1. The Soft-Margin SVM will converge on non-linearly separable data.
  4. Specify the effects of the regularization hyperparameter \(C\):

    |  | Small \(C\) | Large \(C\) |
    | --- | --- | --- |
    | Desire | Maximizing Margin \(= \dfrac{1}{\Vert w \Vert}\) | keep most slack variables zero or small |
    | Danger | underfitting (high misclassification) | overfitting (awesome training, awful test) |
    | Outliers | less sensitive | very sensitive |
    | Boundary | more “flat” | more sinuous |
    1. Describe the effect wrt over/under fitting:
      Increasing the \(C\) hyperparameter in an SVM pushes it toward overfitting (and decreasing it pushes it toward underfitting).
  5. How do we choose \(C\)?
    We choose ‘\(C\)’ with cross-validation.
  6. Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)
    In function estimation we prefer the standard-form objective to minimize (and trade-off); the loss + penalty form.
    We introduce a loss function to moderate the use of the slack variables (i.e. to avoid abusing the slack variables), namely, Hinge Loss:

    $${\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}+b)\right)}$$

    The motivation: We motivate it by comparing it to the traditional \(0-1\) Loss function.
    Notice that the \(0-1\) loss is non-convex and discontinuous at \(0\), with zero gradient almost everywhere, which makes it hard to optimize directly. The hinge loss, on the other hand, is convex (and upper-bounds the \(0-1\) loss).

    Analysis wrt maximum margin:
    This function is zero if the constraint, \(y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}+b)\geq 1\), is satisfied, in other words, if \({\vec {x}}_ {i}\) lies on the correct side of the margin.
    For data on the wrong side of the margin, the function’s value is proportional to the distance from the margin.

    Modified Objective Function:

    $$R(w) = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}+b)\right)}$$

    Proof of equivalence:

    $$\begin{align} y_if\left(x_i\right) & \ \geq 1-\zeta_i, & \text{from the 1st constraint} \\ \implies \zeta_i & \ \geq 1-y_if\left(x_i\right) \\ \zeta_i & \ \geq 0, & \text{from the 2nd (non-negativity) constraint on } \zeta_i \\ \iff \zeta_i & \ \geq \max \{0, 1-y_if\left(x_i\right)\} \\ \zeta_i & \ = \max \{0, 1-y_if\left(x_i\right)\}, & \text{at the minimum, } \zeta_i \text{ hits its lower bound}\\ \implies R(w) & \ = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}+b)\right)}, & \text{plugging in and scaling by } \lambda = \dfrac{1}{C} \end{align}$$

    The (reformulated) Optimization Problem:

    $$\min_{w, b}\dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}+b)\right)}$$
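    As a quick illustration of the trade-off controlled by \(C\), a sketch using scikit-learn's linear SVC on assumed toy data (the specific \(C\) values are arbitrary):

    ```python
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two overlapping Gaussian blobs: not perfectly linearly separable.
    X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        slab_width = 2 / np.linalg.norm(clf.coef_)   # 2 / ||w||
        print(f"C={C:<6} slab width={slab_width:.2f} support vectors={len(clf.support_)}")
    # Small C -> wider margin, more margin violations; large C -> narrower margin, fewer violations.
    ```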


Loss Functions

  1. Define:
    1. Loss Functions - Abstractly and Mathematically:
      Abstractly, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number, intuitively, representing some “cost” associated with the event.

      Formally, a loss function is a function \(L :(\hat{y}, y) \in \mathbb{R} \times Y \longmapsto L(\hat{y}, y) \in \mathbb{R}\) that takes as inputs the predicted value \(\hat{y}\) corresponding to the real data value \(y\) and outputs how different they are.

    2. Distance-Based Loss Functions:
      A Loss function \(L(\hat{y}, y)\) is called distance-based if it:
      1. Only depends on the residual:

        $$L(\hat{y}, y) = \psi(y-\hat{y}) \:\: \text{for some } \psi : \mathbb{R} \longmapsto \mathbb{R}$$

      2. Loss is \(0\) when residual is \(0\):

        $$\psi(0) = 0$$

      3. What are they used for?
        Regression.
      4. Describe an important property of dist-based losses:
        Translation Invariance:
        Distance-based losses are translation-invariant:

        $$L(\hat{y}+a, y+a) = L(\hat{y}, y)$$

    3. Relative Error - What does it lack?
      Relative-Error \(\dfrac{\hat{y}-y}{y}\) is a more natural loss but it is NOT translation-invariant.
  2. List 3 Regression Loss Functions
    1. MSE
    2. MAE
    3. Huber

  1. List 7 Classification Loss Functions
    1. \(0-1\) Loss
    2. Square Loss
    3. Hinge Loss
    4. Logistic Loss
    5. Cross-Entropy
    6. Exponential Loss
    7. Perceptron Loss
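    A minimal NumPy sketch of a few of these losses (vectorized over examples; the classification losses assume labels \(y \in \{-1,+1\}\) and raw scores \(w \cdot x + b\)):

    ```python
    import numpy as np

    def mse(y_hat, y):                # Mean Squared Error (regression)
        return np.mean((y_hat - y) ** 2)

    def mae(y_hat, y):                # Mean Absolute Error (regression)
        return np.mean(np.abs(y_hat - y))

    def huber(y_hat, y, delta=1.0):   # Huber: quadratic near zero, linear in the tails
        r = np.abs(y_hat - y)
        return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

    def hinge(scores, y):             # Hinge loss, y in {-1, +1}
        return np.mean(np.maximum(0.0, 1.0 - y * scores))

    def logistic(scores, y):          # Logistic loss, y in {-1, +1}
        return np.mean(np.log1p(np.exp(-y * scores)))
    ```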


Information Theory

  1. What is Information Theory? In the context of ML?
    Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal.

    In the context of machine learning, we also apply information theory to continuous variables, where some of the original message-length interpretations do not apply; instead, we mostly use a few key ideas from information theory to characterize probability distributions or to quantify similarity between probability distributions.

  2. Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
    The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

    Thus, information theory quantifies information in a way that formalizes this intuition:

    1. Likely events should have low information content - in the extreme case, guaranteed events have no information at all
    2. Less likely events should have higher information content
    3. Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
  3. Measuring Information - Definitions and Formulas:
    1. In Shannons Theory, how do we quantify “transmitting 1 bit of information”?
      To transmit \(1\) bit of information means to divide the recipient's uncertainty by a factor of \(2\).
    2. What is the amount of information transmitted?
      The amount of information transmitted is the logarithm (base \(2\)) of the uncertainty reduction factor.
    3. What is the uncertainty reduction factor?
      It is the inverse of the probability of the event being communicated.
    4. What is the amount of information in an event \(x\)?
      The amount of information in an event \(\mathbf{x} = x\), called the Self-Information is:

      $$I(x) = \log (1/p(x)) = -\log(p(x))$$

  4. Define the Self-Information - Give the formula:
    The Self-Information or surprisal quantifies the amount of surprise associated with a particular outcome when a random variable is sampled.
    The Self-Information of an event \(\mathrm{x} = x\):

    $$I(x) = - \log P(x)$$

    1. What is it defined with respect to?
      Self-information deals only with a single outcome.
  5. Define Shannon Entropy - What is it used for?
    Shannon Entropy is defined as the average amount of information produced by a stochastic source of data.
    Equivalently, the amount of information that you get from one sample drawn from a given probability distribution \(p\).

    To quantify the amount of uncertainty in an entire probability distribution, we use Shannon Entropy.

    $$H(x) = {\displaystyle \operatorname {E}_{x \sim P} [I(x)]} = - {\displaystyle \operatorname {E}_{x \sim P} [\log P(x)] = -\sum_{i=1}^{n} p\left(x_{i}\right) \log p\left(x_{i}\right)}$$

    1. Describe how Shannon Entropy relate to distributions with a graph:
      (figure omitted: plot of entropy as a function of the underlying distribution)
  6. Define Differential Entropy:
    Differential Entropy is Shannon's entropy of a continuous random variable \(x\) (with the sum replaced by an integral).
  7. How does entropy characterize distributions?
    Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy.
  8. Define Relative Entropy - Give its formula:
    The Kullback–Leibler divergence (Relative Entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution.

    Mathematically:

    $${\displaystyle D_{\text{KL}}(P\parallel Q)=\operatorname{E}_{x \sim P} \left[\log \dfrac{P(x)}{Q(x)}\right]=\operatorname{E}_{x \sim P} \left[\log P(x) - \log Q(x)\right]}$$

    1. Discrete:

    $${\displaystyle D_{\text{KL}}(P\parallel Q)=\sum_{i}P(i)\log \left({\frac {P(i)}{Q(i)}}\right)}$$

    1. Continuous:

    $${\displaystyle D_{\text{KL}}(P\parallel Q)=\int_{-\infty }^{\infty }p(x)\log \left({\frac {p(x)}{q(x)}}\right)\,dx,}$$

    1. Give an interpretation:
      1. Discrete variables:
        It is the extra amount of information needed to send a message containing symbols drawn from probability distribution \(P\), when we use a code that was designed to minimize the length of messages drawn from probability distribution \(Q\).
    2. List the properties:
      1. Non-Negativity:
        \({\displaystyle D_{\mathrm {KL} }(P\|Q) \geq 0}\)
      2. \({\displaystyle D_{\mathrm {KL} }(P\|Q) = 0 \iff}\) \(P\) and \(Q\) are:
        1. Discrete Variables:
          the same distribution
        2. Continuous Variables:
          equal “almost everywhere”
      3. Additivity of Independent Distributions:
        \({\displaystyle D_{\text{KL}}(P\parallel Q)=D_{\text{KL}}(P_{1}\parallel Q_{1})+D_{\text{KL}}(P_{2}\parallel Q_{2}).}\)
      4. \({\displaystyle D_{\mathrm {KL} }(P\|Q) \neq D_{\mathrm {KL} }(Q\|P)}\)

        This asymmetry means that there are important consequences to the choice of the ordering

      5. Convexity in the pair of PMFs \((p, q)\) (i.e. \({\displaystyle (p_{1},q_{1})}\) and \({\displaystyle (p_{2},q_{2})}\) are two pairs of PMFs):
        \({\displaystyle D_{\text{KL}}(\lambda p_{1}+(1-\lambda )p_{2}\parallel \lambda q_{1}+(1-\lambda )q_{2})\leq \lambda D_{\text{KL}}(p_{1}\parallel q_{1})+(1-\lambda )D_{\text{KL}}(p_{2}\parallel q_{2}){\text{ for }}0\leq \lambda \leq 1.}\)
    3. Describe it as a distance:
      Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions.
      However, it is not a true distance measure because it is not symmetric.

      KL-div is, however, a divergence (a premetric): it is non-negative and vanishes iff the two distributions are equal, but it satisfies neither symmetry nor the triangle inequality.

    4. List the applications of relative entropy:
      Characterizing:
      1. Relative (Shannon) entropy in information systems
      2. Randomness in continuous time-series
      3. Information gain when comparing statistical models of inference
    5. How does the direction of minimization affect the optimization:
      Suppose we have a distribution \(p(x)\) and we wish to approximate it with another distribution \(q(x)\).
      We have a choice of minimizing either:
      1. \({\displaystyle D_{\text{KL}}(p\|q)} \implies q^\ast = \operatorname {arg\,min}_q {\displaystyle D_{\text{KL}}(p\|q)}\)
        Produces an approximation that usually places high probability anywhere that the true distribution places high probability.
      2. \({\displaystyle D_{\text{KL}}(q\|p)} \implies q^\ast = \operatorname {arg\,min}_q {\displaystyle D_{\text{KL}}(q\|p)}\)
        Produces an approximation that rarely places high probability anywhere that the true distribution places low probability.

        which are different due to the asymmetry of the KL-divergence

  9. Define Cross Entropy - Give its formula:
    The Cross Entropy between two probability distributions \({\displaystyle p}\) and \({\displaystyle q}\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution \({\displaystyle q}\), rather than the “true” distribution \({\displaystyle p}\).

    $$H(p,q) = \operatorname{E}_{p}[-\log q]= H(p) + D_{\mathrm{KL}}(p\|q) =-\sum_{x }p(x)\,\log q(x)$$

    1. What does it measure?
      The average number of bits that need to be transmitted using a different probability distribution \(q\) (for encoding) than the “true” distribution \(p\), to convey the information in \(p\).
    2. How does it relate to relative entropy?
      It is similar to KL-Div but with an additional quantity - the entropy of \(p\).
    3. When are they equivalent?
      Minimizing the cross-entropy with respect to \(Q\) is equivalent to minimizing the KL divergence, because \(Q\) does not participate in the omitted term.
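    To make these quantities concrete, a small NumPy sketch on two assumed discrete distributions, checking the identity \(H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)\):

    ```python
    import numpy as np

    p = np.array([0.7, 0.2, 0.1])   # "true" distribution
    q = np.array([0.5, 0.3, 0.2])   # distribution the code was designed for

    entropy       = -np.sum(p * np.log2(p))       # H(p), in bits
    kl_divergence =  np.sum(p * np.log2(p / q))   # D_KL(p || q)
    cross_entropy = -np.sum(p * np.log2(q))       # H(p, q)

    print(entropy, kl_divergence, cross_entropy)
    print(np.isclose(cross_entropy, entropy + kl_divergence))  # True
    ```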

Recommendation Systems

  1. Describe the different algorithms for recommendation systems:

Ensemble Learning

  1. What are the two paradigms of ensemble methods?
    1. Parallel
    2. Sequential
  2. Random Forest VS GBM?
    The fundamental difference is that Random Forest uses the bagging technique (trees grown independently on bootstrap samples, then averaged), while GBM uses boosting (trees fit sequentially, each one correcting the errors of the ensemble so far).

Data Processing and Analysis

  1. What are 3 data preprocessing techniques to handle outliers?
    1. Winsorizing/Winsorization (cap at threshold).
    2. Transform to reduce skew (using Box-Cox or similar).
    3. Remove outliers if you’re certain they are anomalies or measurement errors.
  2. Describe the strategies to dimensionality reduction?
    1. Feature Selection
    2. Feature Projection/Extraction
  3. What are 3 ways of reducing dimensionality?
    1. Removing Collinear Features
    2. Performing PCA, ICA, etc.
    3. Feature Engineering
    4. AutoEncoder
    5. Non-negative matrix factorization (NMF)
    6. LDA
    7. MDS (Multidimensional Scaling)
  4. List methods for Feature Selection
    1. Variance Threshold: normalize first (variance depends on scale)
    2. Correlation Threshold: remove the one with larger mean absolute correlation with other features.
    3. Genetic Algorithms
    4. Stepwise Search: bad performance, regularization much better, it’s a greedy algorithm (can’t account for future effects of each change)
    5. LASSO, Elastic-Net
  5. List methods for Feature Extraction
    1. PCA, ICA, CCA
    2. AutoEncoders
    3. LDA: LDA is a supervised linear transformation technique since the dependent variable (or the class label) is considered in the model. It Extracts the k new independent variables that maximize the separation between the classes of the dependent variable.
      1. Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
      2. Unlike PCA, LDA extracts the k new independent variables that maximize the separation between the classes of the dependent variable. LDA is a supervised linear transformation technique since the dependent variable (or the class label) is considered in the model.
    4. Latent Semantic Analysis
    5. Isomap
  6. How to detect correlation of “categorical variables”?
    1. Chi-Squared test: it is a statistical test applied to the groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
  7. Feature Importance
    1. Use linear regression and select variables based on \(p\) values
    2. Use Random Forest, Xgboost and plot variable importance chart
    3. Lasso
    4. Measure information gain for the available set of features and select top \(n\) features accordingly.
    5. Use Forward Selection, Backward Selection, Stepwise Selection
    6. Remove the correlated variables prior to selecting important variables
    7. In linear models, feature importance can be calculated by the scale of the coefficients
    8. In tree-based methods (such as random forest), important features are likely to appear closer to the root of the tree. We can get a feature’s importance for random forest by computing the average depth at which it appears across all trees in the forest
  8. Capturing the correlation between continuous and categorical variable? If yes, how?
    Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.
  9. What cross validation technique would you use on time series data set?
    Forward chaining strategy with k folds.
  10. How to deal with missing features? (Imputation?)
    1. Assign a unique category to missing values; the missingness itself might reveal some trend.
    2. Remove them entirely (drop the affected rows/columns)
    3. Check their distribution against the target variable; if a pattern is found, keep those missing values and assign them a new category, while removing the rest.
  11. Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?
    For better predictions, a categorical variable should be treated as a continuous variable only when it is ordinal in nature.
  12. What are collinearity and multicollinearity?
    1. Collinearity occurs when two predictor variables (e.g., \(x_1\) and \(x_2\)) in a multiple regression have some correlation.
    2. Multicollinearity occurs when more than two predictor variables (e.g., \(x_1, x_2, \text{ and } x_3\)) are inter-correlated.

ML/Statistical Models

  1. What are parametric models?
    Parametric models are those with a finite number of parameters. To predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs.
  2. What is a classifier?
    A function that maps…

K-NN


PCA

  1. What is PCA?
    It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  2. What is the Goal of PCA?
    Given points \(\mathbf{x}_ i \in \mathbf{R}^d\), find k-directions that capture most of the variation.
  3. List the applications of PCA:
    1. Find a small basis for representing variations in complex things.

      e.g. faces, genes.

    2. Reducing the number of dimensions makes some computations cheaper.
    3. Remove irrelevant dimensions to reduce over-fitting in learning algorithms.

      Like “subset selection” but the features are not axis aligned.
      They are linear combinations of input features.

    4. Represent the data with fewer parameters (dimensions)
  4. Give formulas for the following:
    1. Assumptions on \(X\):
      The formulas below are valid only if (1) \(X\) has samples in rows and variables in columns, and (2) \(X\) is centered (each column has mean \(0\))
    2. SVD of \(X\):
      \(X = USV^{T}\)
    3. Principal Directions/Axes:
      \(V\)
    4. Principal Components (scores):
      \(US\)
    5. The \(j\)-th principal component:
      \(Xv_j = \sigma_j u_j\) (the \(j\)-th column of \(US\))
  5. Define the transformation, mathematically:
    Mathematically, the transformation is defined by a set of \(p\)-dimensional vectors of weights or coefficients \({\displaystyle \mathbf {v}_ {(k)}=(v_{1},\dots ,v_{p})_ {(k)}}\) that map each row vector \({\displaystyle \mathbf {x}_ {(i)}}\) of \(X\) to a new vector of principal component scores \({\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_ {(i)}}\), given by:

    $${\displaystyle {t_{k}}_{(i)}=\mathbf {x}_ {(i)}\cdot \mathbf {v}_ {(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l}$$

    in such a way that the individual variables \({\displaystyle t_{1},\dots ,t_{l}}\) of \(t\) considered over the data set successively inherit the maximum possible variance from \(X\), with each coefficient vector \(v\) constrained to be a unit vector (where \(l\) is usually selected to be less than \({\displaystyle p}\) to reduce dimensionality).

  6. What does PCA produce/result in?
  7. Describe the PCA algorithm:
    1. Data Preprocessing:
      1. Training set: \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\)
      2. Preprocessing (feature scaling + mean normalization):
        1. mean normalization:
          \(\mu_{j}=\frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}\)
          Replace each \(x_{j}^{(i)}\) with \(x_j^{(i)} - \mu_j\)
        2. feature scaling:
          If different features on different, scale features to have comparable range
          \(s_j = S.D(X_j)\) (the standard deviation of feature \(j\))
          Replace each \(x_{j}^{(i)}\) with \(\dfrac{x_j^{(i)} - \mu_j}{s_j}\)
    2. Computing the Principal Components:
      1. Compute the SVD of the matrix \(X = U S V^T\)
      2. Compute the Principal Components:

        $$T = US = XV$$

        Note: The \(j\)-th principal component is: \(Xv_j\)

      3. Keep the top \(k\) singular values of \(S\), giving the truncated \(S_k\) (a runnable sketch of these steps is given at the end of this section)
      4. Compute the Truncated Principal Components:

        $$T_k = US_k$$

    3. Computing the Low-rank Approximation Matrix \(X_k\):
      1. Compute the reconstruction matrix:

        $$X_k = T_kV^T = US_kV^T$$

  8. Describe the Optimality of PCA:
    Optimal for Finding a lower dimensional subspace (PCs) that Minimizes the RSS of projection errors.
  9. List limitations of PCA:
    1. PCA is highly sensitive to the (relative) scaling of the data; no consensus on best scaling.
  10. Intuition:
    1. PCA can be thought of as fitting a \(p\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information.
    2. Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data.

  11. Should you remove correlated features b4 PCA?
    Yes. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.
  12. How can we measure the “Total Variance” of the data?
    **The Total Variance** of the data can be expressed as the sum of all the eigenvalues:

    $$\mathbf{Tr} \Sigma = \mathbf{Tr} (U \Lambda U^T) = \mathbf{Tr} (U^T U \Lambda) = \mathbf{Tr} \Lambda = \lambda_1 + \ldots + \lambda_n.$$

  13. How can we measure the “Total Variance” of the projected data?
    **The Total Variance** of the Projected data is:

    $$\mathbf{Tr} (P \Sigma P^T ) = \lambda_1 + \lambda_2 + \cdots + \lambda_k. $$

  14. How can we measure the “Error in the Projection”?
    **The Error in the Projection** could be measured with respect to variance.
    1. We define the ratio of variance “explained” by the projected data (equivalently, the ratio of information “retained”) as:

    $$\dfrac{\lambda_1 + \ldots + \lambda_k}{\lambda_1 + \ldots + \lambda_n}. $$

    1. What does it mean when this ratio is high?
      If the ratio is high, we can say that much of the variation in the data can be observed on the projected plane.
  15. How does PCA relate to CCA?
    1. CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while
    2. PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.
  16. How does PCA relate to ICA?
    Independent component analysis (ICA) is directed to similar problems as principal component analysis, but finds additively separable components rather than successive approximations.
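    A minimal NumPy sketch of the SVD-based algorithm described in item 7 above (centering, SVD, truncation, reconstruction, and the explained-variance ratio; the toy data and the choice of \(k\) are assumptions for the example):

    ```python
    import numpy as np

    def pca(X, k):
        """PCA via SVD. X has samples in rows; returns scores, principal axes, rank-k reconstruction, and explained-variance ratios."""
        X_centered = X - X.mean(axis=0)                            # center each column (mean = 0)
        U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)  # X = U S V^T
        T = U * s                                                  # principal component scores, T = US = XV
        T_k = T[:, :k]                                             # keep the top-k components
        X_k = T_k @ Vt[:k]                                         # rank-k reconstruction (of the centered data)
        explained = s[:k] ** 2 / np.sum(s ** 2)                    # ratio of variance "explained" per component
        return T_k, Vt[:k].T, X_k, explained

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    scores, axes, X_rank2, explained = pca(X, k=2)
    print(explained, explained.sum())
    ```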

The Centroid Method


K-Means

  1. What is K-Means?
    It is a clustering algorithm. It aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean. It results in a partitioning of the data space into Voronoi Cells.
  2. What is the idea behind K-Means?
    1. Minimize the aggregate intra-cluster distance
    2. Equivalent to minimizing the variance
    3. Thus, it finds \(k-\)clusters with minimum aggregate variance
  3. What does K-Mean find?
    It finds \(k-\)clusters with minimum aggregate variance.
  4. Formal Description of the Model:
    Given a set of observations \(\left(\mathbf{x}_{1}, \mathbf{x} _{2}, \ldots, \mathbf{x}_{n}\right)\), \(\mathbf{x}_ i \in \mathbb{R}^d\), the algorithm aims to partition the \(n\) observations into \(k\) sets \(\mathbf{S}=\left\{S_{1}, S_{2}, \ldots, S_{k}\right\}\) so as to minimize the intra-cluster Sum-of-Squares (i.e. variance).
    1. What is the Objective?

      $$\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}=\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k}\left|S_{i}\right| \operatorname{Var} S_{i}$$

      where \(\boldsymbol{\mu}_i\) is the mean of points in \(S_i\).

  5. Description of the Algorithm:
    1. Choose \(k\) random points, call them “Centroids”
    2. Assign each data point to its closest centroid (Euclidean-wise)
    3. Compute the mean of each “group”/cluster of points
    4. Re-Assign the centroids to the newly computed Means ↑
    5. REPEAT until the assignments no longer change! (a code sketch is given at the end of this section)
  6. What is the Optimization method used? What class does it belong to?
    Coordinate descent. Expectation-Maximization.
  7. What is the Complexity of the algorithm?
    The original formulation of the problem is NP-Hard; however, EM algorithms (specifically, Coordinate-Descent) can be used as efficient heuristic algorithms that converge quickly to good local minima.
  8. Describe the convergence and prove it:
    Guaranteed to converge after a finite number of iterations to a local minimum.
    1. Proof:
      Each iteration does not increase the non-negative objective, and there are only finitely many possible partitions of the data into \(k\) clusters:
      the objective values form a monotonically non-increasing sequence bounded below by \(0\), so by the Monotone Convergence Theorem the objective value converges (and, the domain being finite, the algorithm terminates after finitely many iterations).

  9. Describe the Optimality of the Algorithm:
    1. Locally optimal: due to convergence property
    2. Non-Globally optimal:
      1. The objective function is non-convex
      2. Moreover, coordinate Descent doesn’t converge to global minimum on non-convex functions.
  10. Derive the estimated parameters of the algorithm:
    1. Objective Function:

      $$J(S, \mu)= \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}} \| \mathbf{x} -\mu_i \|^{2}$$

    2. Optimization Objective:

      $$\min _{\mu} \min _{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} -\mu_{i}\right\|^{2}$$

    3. Derivation:
    4. Fix \(S = \hat{S}\), optimize \(\mu\):

      $$\begin{aligned} & \min _{\mu} \sum_{i=1}^{k} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x}-\mu_{i}\right\|^{2}\\ =& \sum_{i=1}^{k} \min _{\mu_i} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x} - \mu_{i}\right\|^{2} \end{aligned}$$

      1. MLE:

        $$\min _{\mu_i} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x} - \mu_{i}\right\|^{2}$$

        \(\implies\)

        $${\displaystyle \hat{\mu_i} = \dfrac{\sum_{\mathbf{x} \in \hat{S}_ {i}} \mathbf{x}}{\vert\hat{S}_ i\vert}}$$

    5. Fix \(\mu_i = \hat{\mu_i}, \forall i\), optimize \(S\):

      $$\arg \min _{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} - \hat{\mu_{i}}\right\|^{2}$$

      \(\implies\)

      $$S_{i}=\left\{\mathbf{x}_{p} :\left\|\mathbf{x}_{p}-\hat{\mu}_{i}\right\|^{2} \leq\left\|\mathbf{x}_{p}-\hat{\mu}_{j}\right\|^{2} \quad \forall j, 1 \leq j \leq k\right\}$$

      1. MLE:
  11. When does K-Means fail to give good results?
    K-means clustering algorithm fails to give good results when:
    1. the data contains outliers
    2. the density spread of data points across the data space is different
    3. and the data points follow nonconvex shapes.
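
    The derivation in item 10 above alternates two coordinate-descent steps (assignment and mean update). Below is a minimal numpy sketch of that loop (Lloyd's algorithm); the function name `kmeans`, the random initialization scheme, and the toy blobs are illustrative assumptions, not part of these notes.

    ```python
    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Lloyd's algorithm: alternate the two coordinate-descent steps derived above."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]      # init: k random data points
        for _ in range(n_iters):
            # Assignment step (fix mu, optimize S): each point goes to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Update step (fix S, optimize mu): each centroid becomes the mean of its cluster.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):                 # assignments stabilized
                break
            centroids = new_centroids
        return centroids, labels

    # Usage: two well-separated blobs should be recovered almost exactly.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
    centroids, labels = kmeans(X, k=2)
    ```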

Naive Bayes

  1. Define:
    1. Naive Bayes:
      It is a simple technique used for constructing classifiers.
    2. Naive Bayes Classifiers:
      A family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.
    3. Bayes Theorem:
      \(p(x\vert y) = \dfrac{p(y\vert x) p(x)}{p(y)}\)
  2. List the assumptions of Naive Bayes:
    1. Conditional Independence: the features are conditionally independent from each other given a class \(C_k\)
    2. Bag-of-words: The relative importance (positions) of the features do not matter
  3. List some properties of Naive Bayes:
    1. Not a Bayesian method
    2. It’s a Bayes Classifier: minimizes the probability of misclassification

  4. Define the Probabilistic Model for the method:
    Naive Bayes is a conditional probability model:
    given a problem instance to be classified represented by a vector \(\boldsymbol{x} = (x_1, \cdots, x_n)\) of \(n\) features/words (independent variables), it assigns to this instance probabilities:

    $$P(C_k\vert \boldsymbol{x}) = p(C_k\vert x_1, \cdots, x_n)$$

    for each of the \(k\) classes.

    Using Bayes theorem to decompose the conditional probability:

    $$p(C_k\vert \boldsymbol{x}) = \dfrac{p(\boldsymbol{x}\vert C_k) p(C_k)}{p(\boldsymbol{x})}$$

    Notice that the numerator is equivalent to the joint probability distribution:

    $$p\left(C_{k}\right) p\left(\mathbf{x} | C_{k}\right) = p\left(C_{k}, x_{1}, \ldots, x_{n}\right)$$

    Using the Chain-Rule for repeated application of the conditional probability, the joint probability model can be rewritten as:

    $$p(C_{k},x_{1},\dots ,x_{n})\, = p(x_{1}\mid x_{2},\dots ,x_{n},C_{k})p(x_{2}\mid x_{3},\dots ,x_{n},C_{k})\dots p(x_{n-1}\mid x_{n},C_{k})p(x_{n}\mid C_{k})p(C_{k})$$

    Using the Naive Conditional Independence assumptions:

    $$p\left(x_{i} | x_{i+1}, \ldots, x_{n}, C_{k}\right)=p\left(x_{i} | C_{k}\right)$$

    Thus, we can write the joint model as:

    $${\displaystyle {\begin{aligned}p(C_{k}\mid x_{1},\dots ,x_{n})&\varpropto p(C_{k},x_{1},\dots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}}$$

    Finally, the conditional distribution over the class variable \(C\) is:

    $${\displaystyle p(C_{k}\mid x_{1},\dots ,x_{n})={\frac {1}{Z}}p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})}$$

    where \({\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})}\) is a constant scaling factor that depends only on the (known) feature values \(x_i\).

  5. Construct the classifier. What are its components? Formally define it.
    The classifier is made of (1) the conditional probability model (above) \(\:\) and (2) A decision rule.

    The decision rule used is the MAP hypothesis: i.e. pick the class that is most probable a posteriori (maximize the posterior, which is proportional to likelihood \(\times\) prior).

    The classifier is the function that assigns a class label \(\hat{y} = C_k\) for some \(k\) as follows:

    $$\hat{y}=\underset{k \in\{1, \ldots, K\}}{\operatorname{argmax}} p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} | C_{k}\right)$$

  6. What are the parameters to be estimated for the classifier?:
    1. The conditional probability (likelihood) of each feature/word given a class
    2. The prior probability of each class
  7. What method do we use to estimate the parameters?:
    Maximum Likelihood Estimation (MLE).
  8. What are the estimates for each of the following parameters?:
    1. Class prior: \(\hat{p}(C_k) = \dfrac{\text{count}(C_k)}{N}\), the fraction of training examples belonging to class \(C_k\)
    2. Feature likelihood: \(\hat{p}(x_i \vert C_k) = \dfrac{\text{count}(x_i, C_k)}{\sum_{j} \text{count}(x_j, C_k)}\), the relative frequency of feature/word \(x_i\) among all features/words observed in class \(C_k\) (in practice with Laplace smoothing to avoid zero counts)
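
    A minimal sketch of a multinomial Naive Bayes classifier built from these estimates, assuming bag-of-words count features; the class name, the Laplace smoothing constant `alpha`, and the toy corpus are illustrative assumptions, not from these notes.

    ```python
    import numpy as np

    class MultinomialNB:
        """Minimal multinomial Naive Bayes: MLE estimates (with Laplace smoothing) for the
        class priors p(C_k) and word likelihoods p(x_i | C_k), plus the MAP decision rule
        argmax_k [ log p(C_k) + sum_i x_i * log p(x_i | C_k) ]."""

        def fit(self, X, y, alpha=1.0):
            # X: (n_docs, n_words) count matrix, y: (n_docs,) class labels
            self.classes = np.unique(y)
            self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
            counts = np.array([X[y == c].sum(axis=0) for c in self.classes]) + alpha
            self.log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))
            return self

        def predict(self, X):
            # Work in log space; the evidence p(x) is a constant and can be ignored.
            log_post = X @ self.log_lik.T + self.log_prior
            return self.classes[log_post.argmax(axis=1)]

    # Usage on a tiny toy corpus (columns = word counts).
    X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
    y = np.array([0, 0, 1, 1])
    print(MultinomialNB().fit(X, y).predict(np.array([[1, 0, 0], [0, 2, 1]])))  # -> [0 1]
    ```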

CNNs

  1. What is a CNN?
    A convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery.

    Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

    1. What kind of data does it work on? What is the mathematical property?
      In general, it works on data that have grid-like topology.
  2. What are the layers of a CNN?

    $$\text{Input} \longrightarrow \left[\text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Pooling} \rightarrow \text{Norm}\right] \times N \longrightarrow \text{FC}$$

  3. What are the four important ideas and their benefits that the convolution affords CNNs:
    1. Sparse Interactions/Connectivity/Weights:
      Unlike FNNs, where every input unit is connected to every output unit, CNNs have sparse interactions. This is accomplished by making the kernel smaller than the input.

      Benefits:

      1. This means that we need to store fewer parameters, which both,
        1. Reduces the memory requirements of the model and
        2. Improves its statistical efficiency
      2. Also, Computing the output requires fewer operations
      3. In deep CNNs, the units in the deeper layers interact indirectly with large subsets of the input which allows modelling of complex interactions through sparse connections.
    2. Parameter Sharing:
      refers to using the same parameter for more than one function in a model.

      Benefits:

      1. This means that rather than learning a separate set of parameters for every location, we learn only one set of parameters.
        1. This does not affect the runtime of forward propagation—it is still \(\mathcal{O}(k \times n)\)
        2. But it does further reduce the storage requirements of the model to \(k\) parameters (\(k\) is usually several orders of magnitude smaller than the input size \(m\))
    3. Equivariant Representations:
      For convolutions, the particular form of parameter sharing causes the layer to have a property called equivariance to translation.

      Thus, if we move the object in the input, its representation will move the same amount in the output.

      Benefits:

      1. It is most useful when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations (e.g. edge detection)
      2. Shifting the position of an object in the input doesn’t confuse the NN
      3. Robustness against translated inputs/images
    4. Accepts inputs of variable sizes

  4. What is the inspirational model for CNNs:
    Convolutional networks were inspired by biological processes in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex.
    Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
  5. Describe the connectivity pattern of the neurons in a layer of a CNN:
    The neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner.
  6. Describe the process of a ConvNet:
    ConvNets transform the original image layer by layer from the original pixel values to the final class scores.
  7. Convolution Operation:
    1. Define:

    2. Formula (continuous):
    3. Formula (discrete):
    4. Define the following:
      1. Feature Map:
    5. Does the operation commute?
  8. Cross Correlation:
    1. Define:
    2. Formulae:
    3. What are the differences/similarities between convolution and cross-correlation:
  9. Write down the Convolution operation and the cross-correlation over two axes and:
    1. Convolution:
    2. Convolution (commutative):
    3. Cross-Correlation:
  10. The Convolutional Layer:
    1. What are the parameters and how do we choose them?
    2. Describe what happens in the forward pass:
    3. What is the output of the forward pass:
    4. How is the output configured?
  11. Spatial Arrangements:
    1. List the Three Hyperparameters that control the output volume:
    2. How to compute the spatial size of the output volume?
    3. How can you ensure that the input & output volume are the same?
    4. In the output volume, how do you compute the \(d\)-th depth slice:
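
    The sub-questions above are left unanswered in these notes. As a reference point, here is a minimal sketch (the helper names are illustrative) using the standard output-size formula \((W - F + 2P)/S + 1\), floored the way most frameworks do when the stride does not tile the input exactly, together with the parameter count of a conv layer; it reproduces the configuration of the next question.

    ```python
    def conv_output_size(w_in, f, stride=1, pad=0):
        """Spatial output size of a conv/pool layer: floor((W - F + 2P) / S) + 1."""
        return (w_in - f + 2 * pad) // stride + 1

    def conv_num_params(f, c_in, n_filters):
        """Each filter spans the full input depth and carries one bias term."""
        return (f * f * c_in + 1) * n_filters

    # Config from the question below: 64x64x3 input, 15 filters of size 7x7, stride 2, pad 3.
    print(conv_output_size(64, 7, stride=2, pad=3))   # -> 32, i.e. a 32x32x15 output volume
    print(conv_num_params(7, 3, 15))                  # -> 2220 parameters
    ```
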
  12. Calculate the number of parameters for the following config:

    Given:
    1. Input Volume: \(64\times64\times3\)
    2. Filters: \(15\) filters of size \(7\times7\)
    3. Stride: \(2\)
    4. Pad: \(3\)

    Each filter extends through the full input depth and has one bias term, so it has \(7\times 7\times 3 + 1 = 148\) parameters; with \(15\) filters the layer has \(148 \times 15 = 2220\) parameters.

  13. Definitions:
    1. Receptive Field:
  14. Suppose the input volume has size \([ 32 × 32 × 3 ]\) and the receptive field (or the filter size) is \(5 × 5\) , then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
  15. How can we achieve the greatest reduction in the spatial dims of the network (for classification):
  16. Pooling Layer:
    1. Define:
    2. List key ideas/properties and benefits:
    3. List the different types of Pooling:
      Answer
    4. List variations of pooling and their definitions:
    5. List the hyperparams of Pooling Layer:
    6. How to calculate the size of the output volume:
    7. How many parameters does the pooling layer have:
    8. What are other ways to perform downsampling:
  17. Weight Priors:
    1. Define “Prior Prob Distribution on the parameters”:
    2. Define “Weight Prior” and its types/classes:
    3. Describe the Conv Layer as a FC Layer using priors:
    4. What are the key insights of using this view:
  18. When do multi-channel convolutions commute?
    Answer
  19. Why do we use several different kernels in a given conv-layer?
  20. Strided Convolutions
    1. Define:
    2. What are they used for?
    3. What are they equivalent to?
    4. Formula:
  21. Zero-Padding:
    1. Definition/Usage:
    2. List the types of padding:
  22. Locally Connected Layers/Unshared Convolutions:
  23. Bias Parameter:
    1. How many bias terms are used per output channel in the tradional convolution:
  24. Dilated Convolutions
    1. Define:
    2. What are they used for?
  25. Stacked Convolutions
    1. Define:
    2. What are they used for?

Theory


RNNs

  1. What is an RNN?
    1. Definition:
    2. What machine-type is the standard RNN:
  2. What is the big idea behind RNNs?
  3. Dynamical Systems:
    1. Standard Form:
    2. RNN as a Dynamical System:
  4. Unfolding Computational Graphs
    1. Definition:
    2. List the Advantages introduced by unfolding and the benefits:
    3. Graph and write the equations of Unfolding hidden recurrence:
  5. Describe the State of the RNN, its usage, and extreme cases of the usage:
  6. RNN Architectures:
    1. List the three standard architectures of RNNs:
      1. Graph:
      2. Architecture:
      3. Equations:
      4. Total Loss:
      5. Complexity:
      6. Properties:
  7. Teacher Forcing:
    1. Definition:
    2. Application:
    3. Disadvantages:
    4. Possible Solutions for Mitigation:

Optimization

  1. Define the sigmoid function and some of its properties:
    \(\sigma(x) = \dfrac{1}{1+e^{-x}}\)
    1. \(\sigma(-x) = 1 - \sigma(x)\)
    2. \(\sigma^{\prime}(x) = \sigma(x)(1 - \sigma(x))\)
  2. Backpropagation:
    1. Definition:
      Backpropagation algorithms are a family of methods used to efficiently train artificial neural networks (ANNs) following a gradient descent approach that exploits the chain rule.
    2. Derive Gradient Descent Update:
      Answer
    3. Explain the different kinds of gradient-descent optimization procedures (a code sketch follows this list):
      1. Batch Gradient Descent AKA Vanilla Gradient Descent, computes the gradient of the objective wrt. the parameters \(\theta\) for the entire dataset:

        $$\theta=\theta-\epsilon \cdot \nabla_{\theta} J(\theta)$$

      2. SGD performs a parameter update for each data-point:

        $$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i)} ; y^{(i)}\right)$$

      3. Mini-batch Gradient Descent: a hybrid approach that performs updates for a pre-specified mini-batch of \(n\) training examples:

        $$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i : i+n)} ; y^{(i : i+n)}\right)$$
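
      A minimal numpy sketch of all three procedures in one function (the function name, the gradient callback `grad_J`, and the least-squares example are illustrative assumptions): `batch_size=len(X)` recovers Batch GD, `batch_size=1` recovers SGD, and anything in between is Mini-batch GD.

      ```python
      import numpy as np

      def gradient_descent(grad_J, theta, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
          """Mini-batch gradient descent; batch_size=len(X) is batch GD, batch_size=1 is SGD."""
          rng = np.random.default_rng(seed)
          n = len(X)
          for _ in range(epochs):
              idx = rng.permutation(n)                       # reshuffle the data every epoch
              for start in range(0, n, batch_size):
                  batch = idx[start:start + batch_size]
                  theta = theta - lr * grad_J(theta, X[batch], y[batch])
          return theta

      # Usage with a least-squares objective: grad J = (2/m) X^T (X theta - y).
      def grad_lsq(theta, Xb, yb):
          return 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)

      rng = np.random.default_rng(1)
      X = rng.normal(size=(200, 3))
      true_theta = np.array([1.0, -2.0, 0.5])
      y = X @ true_theta + 0.01 * rng.normal(size=200)
      print(gradient_descent(grad_lsq, np.zeros(3), X, y, lr=0.1, epochs=50))   # ~ [1, -2, 0.5]
      ```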

    4. List the different optimizers and their properties:
      Answer
  3. Error-Measures:
    1. Define what an error measure is:
      Error Measures aim to answer the question:
      “What does it mean for \(h\) to approximate \(f\) (\(h \approx f\))?”
      The Error Measure: \(E(h, f)\)
      It is almost always defined point-wise: \(\mathrm{e}(h(\mathbf{X}), f(\mathbf{X}))\).
    2. List the 5 most common error measures and where they are used:

    3. Specific Questions:
  4. Show that the weight vector of a linear signal is orthogonal to the decision boundary?
    The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
    Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:

    $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$
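
    A quick numerical check of this argument (the sampling helper below is an illustrative construction): two points are placed on the plane \(\mathbf{w}^T\mathbf{x} + b = 0\) and their difference is verified to be orthogonal to \(\mathbf{w}\).

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    w, b = rng.normal(size=3), 0.7

    def point_on_plane(rng, w, b):
        """Sample x with w^T x + b = 0: pick two coordinates freely, solve for the third
        (assumes w[2] != 0, which holds almost surely for a random w)."""
        x = rng.normal(size=3)
        x[2] = -(b + w[0] * x[0] + w[1] * x[1]) / w[2]
        return x

    x1, x2 = point_on_plane(rng, w, b), point_on_plane(rng, w, b)
    print(np.isclose(w @ (x1 - x2), 0.0))   # True: w is orthogonal to any in-plane direction
    ```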

  5. What does it mean for a function to be well-behaved from an optimization pov?
    The well-behaved property from an optimization standpoint, implies that \(f''(x)\) doesn’t change too much or too rapidly, leading to a nearly quadratic function that is easy to optimize by gradient methods.
  6. Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation

    $$\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{n}-y_{n}\right)^{2} = \frac{1}{N}\|\mathrm{Xw}-\mathrm{y}\|^{2}$$

  7. Compute:
    1. \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
      \(\dfrac{\partial}{\partial y} \vert{x-y}\vert = - \text{sign}(x-y)\)
  8. State the difference between SGD and GD?
    Gradient Descent’s cost-function iterates over ALL training samples.
    Stochastic Gradient Descent’s cost-function only accounts for ONE training sample, chosen at random.
  9. When would you use GD over SGD, and vice-versa?
    GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large.
    That means GD is preferable for small datasets while SGD is preferable for larger ones.
  10. What is convex hull ?
    In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are constructed, we obtain the maximum margin hyperplane (MMH) as the perpendicular bisector of the shortest line segment connecting the two hulls. The MMH is the line that attempts to create the greatest separation between the two groups.
  11. OLS vs MLE
    Both estimate parameters of a model. They yield the same estimates when the errors/noise are assumed to be normally distributed (e.g. in linear regression with Gaussian noise, maximizing the likelihood reduces to minimizing the squared error).

ML Theory

  1. Explain intuitively why Deep Learning works?
    Circuit Theory: There are functions you can compute with a “small” \(L\)-layer deep NN that shallower networks require exponentially more hidden units to compute (this comes from viewing networks as logic-gate circuits).
    1. Example:
      Computing \(x_1 \text{ XOR } x_2 \text{ XOR } \cdots \text{ XOR } x_n\) takes:
      1. \(\mathcal{O}(\log n)\) in a tree (deep) representation.
        img
      2. \(\mathcal{O}(2^n)\) units in a one-hidden-layer network, because you need to exhaustively enumerate all \(2^n\) configurations of the input bits that result in the \(\text{XOR}\) being \(1\) or \(0\).
        img
  2. List the different types of Learning Tasks and their definitions:
    1. Multi-Task Learning: general term for training on multiple tasks
      1. Joint Learning: by choosing mini-batches from two different tasks simultaneously/alternately
      2. Pre-Training: first train on one task, then train on another

        widely used for word embeddings

    2. Transfer Learning:
      a type of multi-task learning where we are focused on one task; by learning on another task then applying those models to our main task
    3. Domain Adaptation:
      a type of transfer learning, where the output is the same, but we want to handle different inputs/topics/genres
    4. Zero-Shot Learning: a form of extending supervised learning to settings where labeled examples are not available for all classes (e.g. classifying instances of classes never seen during training).

      “Zero-shot learning is being able to solve a task despite not having received any training examples of that task.” - Goodfellow


  3. Describe the relationship between supervised and unsupervised learning?
    Many ML algorithms can be used to perform both tasks. E.g., the chain rule of probability states that for a vector \(x \in \mathbb{R}^n\), the joint distribution can be decomposed as:
    \(p(\mathbf{x})=\prod_{i=1}^{n} p\left(\mathrm{x}_{i} | \mathrm{x}_{1}, \ldots, \mathrm{x}_{i-1}\right)\)
    which implies that we can solve the Unsupervised problem of modeling \(p(x)\) by splitting it into \(n\) supervised learning problems.
    Alternatively, we can solve the supervised learning problem of learning \(p(y \vert x)\) by using traditional unsupervised learning technologies to learn the joint distribution \(p(x, y)\), then inferring:
    \(p(y | \mathbf{x})=\dfrac{p(\mathbf{x}, y)}{\sum_{y^{\prime}} p\left(\mathbf{x}, y^{\prime}\right)}\)
  4. Describe the differences between Discriminative and Generative Models?
    1. Generative Model: learns the joint probability distribution, “probability of \(\boldsymbol{x}\) and \(y\)”:

      $$P(\mathbf{x}, y)$$

    2. Discriminative Model: learns the conditional probability distribution, “probability of \(y\) given \(\boldsymbol{x}\)”:

      $$P(y \vert \mathbf{x})$$

  5. Describe the curse of dimensionality and its effects on problem solving:
    It is a phenomenon whereby many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.

    The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases \(\rightarrow\) Sparsity:

    1. Common Theme - Sparsity: When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
    2. Sparsity and Statistical Significance: This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
    3. Sparsity and Clustering: Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties (clustering); in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
    4. Statistical Challenge: the number of possible configurations of \(x\) is much larger than the number of training examples
    5. Statistical Sampling: The sampling density is proportional to \(N^{1/p}\), where \(p\) is the dimension of the input space and \(N\) is the sample size. Thus, if \(N_1 = 100\) represents a dense sample for a single input problem, then \(N_{10} = 100^{10}\) is the sample size required for the same sampling density with \(10\) inputs. Thus, in high dimensions all feasible training samples sparsely populate the input space.
    6. Further Reading
  6. How to deal with curse of dimensionality?
    1. Feature Selection
    2. Feature Extraction
  7. Describe how to initialize a NN and any concerns w/ reasons:
    1. Don’t initialize the weights to Zero. The symmetry of hidden units results in identical computations for each hidden unit, causing all rows of the weight matrix to remain equal throughout training (by induction over the gradient updates).
    2. It’s OK to initialize the bias term to zero.
  8. Describe the difference between Learning and Optimization in ML:
    1. The problem of Reducing the training error on the training set is one of optimization.
    2. The problem of Reducing the training error, as well as, the generalization (test) error is one of learning.
  9. List the 12 Standard Tasks in ML:
    1. Clustering
    2. Forecasting
    3. Dimension reduction
    4. Sequence Labeling
  10. What is the difference between inductive and deductive learning?
    1. Inductive learning is the process of using observations to draw conclusions
      1. It is a method of reasoning in which the premises are viewed as supplying some evidence for the truth of the conclusion.
      2. It goes from specific to general (“bottom-up logic”).
      3. The truth of the conclusion of an inductive argument may be probable, based upon the evidence given.
    2. Deductive learning is the process of using conclusions to form observations.
      1. It is the process of reasoning from one or more statements (premises) to reach a logically certain conclusion.
      2. It goes from general to specific (“top-down logic”).
      3. The conclusions reached (“observations”) are necessarily True.

Statistical Learning Theory

  1. Define Statistical Learning Theory:
    It is a framework for machine learning drawing from the fields of statistics and functional analysis that allows us, under certain assumptions, to study the question:

    How can we affect performance on the test set when we can only observe the training set?

    It is a statistical approach to Computational Learning Theory.

  2. What assumptions are made by the theory?
    1. The training and test data are generated by an unknown data-generating distribution (over the product space \(Z = X \times Y\), denoted: \(p_{\text{data}}(z) = p(x,y)\)) called the data-generating process.
    2. The i.i.d assumptions:
      1. The data-points in each dataset are independent from each other
      2. The training and testing are both identically distributed (drawn from the same distribution)

    1. Define the i.i.d assumptions?
      A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
    2. Why assume a joint probability distribution \(p(x,y)\)?
      Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because \({\displaystyle y}\) is not a deterministic function of \({\displaystyle x}\), but rather a random variable with conditional distribution \({\displaystyle P(y|x)}\) for a fixed \({\displaystyle x}\).
    3. Why do we need to model \(y\) as a target-distribution and not a target-function?
      The ‘Target Function’ is not always a function because two ‘identical’ input points can be mapped to two different outputs (i.e. they have different labels).
  3. Give the Formal Definition of SLT:
  4. Define Empirical Risk Minimization:
    It is a (learning) principle in statistical learning theory that is based on approximating the Expected/True Risk (Generalization Error) by measuring the Empirical Risk (Training Error); i.e. the performance on the training-data.

    A learning algorithm that chooses the function \(f_S\) which minimizes the empirical risk is called Empirical Risk Minimization:

    $$R_{\text{emp}} = I_S[f] = \dfrac{1}{n} \sum_{i=1}^n V(f(\vec{x}_ i), y_i)$$

    $$f_{S} = \hat{h} = \arg \min _{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$

  5. What is the Complexity of ERM?
    NP-Hard for classification with \(0-1\) loss function, even for linear classifiers
    1. It can be solved efficiently when the minimal empirical risk is ZERO; i.e. the data is linearly separable

  6. Definitions:
    1. Generalization:
      The ability to do well on previously unobserved data:

      $$I[f] \approx I_S[f]$$

      “Good” generalization is achieved when the empirical risk approximates the expected risk well.

    2. Generalization Error:
      AKA: Expected Risk, Out-of-sample Error, \(E_{\text{out}}\)
      It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data defined as the expected value of the error on a new input, measured w.r.t. the data-generating probability distribution.

    3. Generalization Gap:
      It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data, defined as the difference between the expected risk and empirical risk:

      $$G =I\left[f_{n}\right]-I_{S}\left[f_{n}\right]$$

      1. Computing the Generalization Gap:
        Since the expected risk/generalization error \(I[f_n]\) cannot be computed for an unknown distribution, the generalization gap cannot be computed either.
      2. What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
        Instead, the goal of SLT is to bound/characterize the gap in probability:

        $$P_{G}=P\left(I\left[f_{n}\right]-I_{S}\left[f_{n}\right] \leq \epsilon\right) \geq 1-\delta_{n}$$

        I.E. goal is to characterize the probability \(1- \delta_n\) that the generalization gap is less than some error bound \(\epsilon\) (known as the learning rate)

    4. Achieving (“good”) Generalization:
      An algorithm is said to generalize when the expected risk is well approximated by the empirical risk:

      $$I\left[f_{n}\right] \approx I_{S}\left[f_{n}\right]$$

      Equivalently:

      $$E_{\text {out}}(g) \approx E_{\text {in}}(g)$$

      I.E. when the generalization gap approaches zero in the limit of data-points:

      $$\lim _{n \rightarrow \infty} G_{n}=\lim _{n \rightarrow \infty} I\left[f_{n}\right]-I_{S}\left[f_{n}\right]=0$$

    5. Empirical Distribution:
      AKA: \(\hat{p}_ {\text{data}}\)
      is the discrete, uniform, joint distribution \(\hat{p}_ {\text{data}}(x,y)\) defined over the sample points; it serves as a proxy for the true data-generating distribution.
  7. Describe the difference between Learning and Optimization in ML:
    1. Optimization: is concerned with the problem of reducing the training error on the training set
    2. Learning: is concerned with the problem of reducing the training error, as well as, the generalization (test) error
  8. Describe the difference between Generalization and Learning in ML:
    1. Generalization guarantee: tells us that it is likely that the following condition holds:

      $$E_{\mathrm{out}}(\hat{h}) \approx E_{\mathrm{in}}(\hat{h})$$

      I.E. that the empirical error tracks/approximates the expected/generalization error well.

    2. Learning: corresponds to the condition that \(\hat{h} \approx f\) the chosen hypothesis approximates the target function well, which in-turn corresponds to the condition:

      $$E_{\mathrm{out}}(\hat{h}) \approx 0$$

  9. How to achieve Learning?
    To achieve learning we need to achieve the condition \(E_{\mathrm{out}}(\hat{h}) \approx 0\), which we do by:
    1. \(E_{\mathrm{out}}(\hat{h}) \approx E_{\mathrm{in}}(\hat{h})\)
      A theoretical result achieved through Hoeffding Inequality
    2. \(E_{\mathrm{in}}(\hat{h}) \approx 0\)
      A practical result of Empirical Error Minimization
  10. What does the (VC) Learning Theory Achieve?
    1. Characterizing the feasibility of learning for infinite hypotheses
    2. Characterizing the Approximation-Generalization Tradeoff
  11. Why do we need the probabilistic framework?
  12. What is the Approximation-Generalization Tradeoff:
    It is a tradeoff between (1) How well we can approximate the target function \(f \:\:\) and (2) How well we can generalize to unseen data.
    Given the goal: Small \(E_{\text{out}}\), good approximation of \(f\) out of sample (not in-sample).
    The tradeoff is characterized by the complexity of the hypothesis space \(\mathcal{H}\):
    1. More Complex \(\mathcal{H}\): Better chance of approximating \(f\)
    2. Less Complex \(\mathcal{H}\): Better chance of generalizing out-of-sample
  13. What are the factors determining how well an ML-algo will perform?
    1. Ability to Approximate \(f\) well, in-sample | Make the training error small &
    2. Decrease the gap between \(E_{\text{in}}\) and \(E_{\text{out}}\) | Make gap between training and test error small
  14. Define the following and their usage/application & how they relate to each other:
    1. Underfitting:
      Occurs when the model cannot fit the training data well; high \(E_{\text{in}}\).
    2. Overfitting:
      Occurs when the gap between the training error and test error is too large.
    3. Capacity:
      a models ability to fit a high variety of functions (complexity).
      It allows us to control the amount of overfitting and underfitting
      1. Models with Low-Capacity:
        Underfitting. High-Bias. Struggle to fit training-data.
      2. Models with High-Capacity:
        Overfitting. High-Variance. Memorizes noise.
    4. Hypothesis Space:
      The set of functions that the learning algorithm is allowed to select as being the target function.
      Allows us to control the capacity of a model.
    5. VC-Dimension:
      The largest possible value of \(m\) for which there exists a training set of \(m\) different points that the classifier can label arbitrarily.
      Quantifies a models capacity.
      1. What does it measure?
        Measures the capacity of a binary classifier.
    6. Graph the relation between Error, and Capacity in the ctxt of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
      img
  15. What is the most important result in SLT that show that learning is feasible?
    Shows that the discrepancy between training error and generalization error is bounded above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.

Bias-Variance Decomposition Theory

  1. What is the Bias-Variance Decomposition Theory:
    It is an approach for the quantification of the Approximation-Generalization Tradeoff.
  2. What are the Assumptions made by the theory?
    1. Analysis is done over the entire data-distribution
    2. Real-Valued inputs, targets (can be extended)
    3. Target function \(f\) is known
    4. Uses MSE (can be extended)
  3. What is the question that the theory tries to answer? What assumption is important? How do you achieve the answer/goal?
    1. “How can \(\mathcal{H}\) approximate \(f\) overall? not just on our sample/training-data.”.
    2. We assume that the target function \(f\) is known.
    3. By taking the expectation over all possible realization of \(N\) data-points.
  4. What is the Bias-Variance Decomposition:
    It is the decomposition of the expected error as a sum of three concepts: bias, variance and irreducible error, each quantifying an aspect of the error.
  5. Define each term w.r.t. source of the error:
    1. Bias: the error from the erroneous/simplifying assumptions in the learning algorithm
    2. Variance: the error from sensitivity to small fluctuations in the training set
    3. Irreducible Error: the error resulting from the noise in the problem itself
  6. What does each of the following measure? Describe it in Words? Give their AKA in statistics?
    1. Bias:
      1. AKA: Approximation Error
      2. Measures the error in approximating the target function with the best possible hypothesis in \(\mathcal{H}\)
      3. The expected deviation from the true value of the function (or parameter)
      4. How well can \(\mathcal{H}\) approximate the target function \(f\)
    2. Variance:
      1. AKA: Estimation Error
      2. Measures the error in estimating the best possible hypothesis in \(\mathcal{H}\) with a particular hypothesis resulting from a specific training-set
      3. The deviation from the expected estimator value that any particular sampling of the data is likely to cause
      4. How well can we zoom in on a good \(h \in \mathcal{H}\)
    3. Irreducible Error: measures the inherent noise in the target \(y\)
  7. Give the Formal Definition of the Decomposition (Formula):
    Given any hypothesis \(\hat{f} = g^{\mathcal{D}}\) we select, we can decompose its expected risk on an unseen sample \(x\) as:

    $$\mathbb{E}\left[(y-\hat{f}(x))^{2}\right]=(\operatorname{Bias}[\hat{f}(x)])^{2}+\operatorname{Var}[\hat{f}(x)]+\sigma^{2}$$

    Where:

    1. Bias:

      $$\operatorname{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$

    2. Variance:

      $$\operatorname{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x))^2\right] - \mathbb{E}\left[\hat{f}(x)\right]^2$$

    3. What is the Expectation over?
      The expectation is over all different samplings of the data \(\mathcal{D}\)
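
    The decomposition can be checked numerically. Below is a minimal Monte Carlo sketch (the target \(\sin(\pi x)\), the degree-1 fit, and all constants are illustrative assumptions): many datasets \(\mathcal{D}\) are drawn, a hypothesis \(g^{(\mathcal{D})}\) is fit to each, and \(\text{Bias}^2 + \text{Var} + \sigma^2\) at a test point is compared against a direct estimate of the expected error.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x)                  # true target function
    sigma, n_train, n_datasets, x0 = 0.1, 20, 2000, 0.8

    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, n_train)
        y = f(x) + sigma * rng.normal(size=n_train)  # one realization D of the data
        coeffs = np.polyfit(x, y, deg=1)             # hypothesis g^(D): a line fit by least squares
        preds[d] = np.polyval(coeffs, x0)            # g^(D)(x0)

    bias_sq = (preds.mean() - f(x0)) ** 2            # (E_D[g(x0)] - f(x0))^2
    var = preds.var()                                # E_D[(g(x0) - E_D[g(x0)])^2]
    y_test = f(x0) + sigma * rng.normal(size=n_datasets)   # independent noisy targets at x0
    expected_err = ((y_test - preds) ** 2).mean()    # direct estimate of E[(y - g(x0))^2]
    print(bias_sq + var + sigma ** 2, expected_err)  # the two numbers should roughly agree
    ```
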
  8. Define the Bias-Variance Tradeoff:
    It is the property of a set of predictive models whereby models with a lower bias have a higher variance, and vice-versa.
    1. Effects of Bias:
      1. High Bias: simple models \(\rightarrow\) underfitting
      2. Low Bias: complex models \(\rightarrow\) overfitting
    2. Effects of Variance:
      1. High Variance: complex models \(\rightarrow\) overfitting
      2. Low Variance: simple models \(\rightarrow\) underfitting
    3. Draw the Graph of the Tradeoff (wrt model capacity):
      img
  9. Derive the Bias-Variance Decomposition with explanations:

    $${\displaystyle {\begin{aligned}\mathbb{E}_{\mathcal{D}} {\big [}I[g^{(\mathcal{D})}]{\big ]}&=\mathbb{E}_{\mathcal{D}} {\big [}\mathbb{E}_{x}{\big [}(g^{(\mathcal{D})}-y)^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x} {\big [}\mathbb{E}_{\mathcal{D}}{\big [}(g^{(\mathcal{D})}-y)^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}}{\big [}(g^{(\mathcal{D})}- f -\varepsilon)^{2}{\big ]}{\big ]} \\&=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}} {\big [}(f+\varepsilon -g^{(\mathcal{D})}+\bar{g}-\bar{g})^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}} {\big [}(\bar{g}-f)^{2}{\big ]}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}(\bar{g}-f)\varepsilon {\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}\varepsilon (g^{(\mathcal{D})}-\bar{g}){\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})(\bar{g}-f){\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}+2(\bar{g}-f)\mathbb{E}_{\mathcal{D}} [\varepsilon ]\: +2\: \mathbb{E}_{\mathcal{D}} [\varepsilon ]\: \mathbb{E}_{\mathcal{D}} {\big [}g^{(\mathcal{D})}-\bar{g}{\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}g^{(\mathcal{D})}-\bar{g}{\big ]}(\bar{g}-f){\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\operatorname {Var} [y]+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\operatorname {Bias} [g^{(\mathcal{D})}]^{2}+\operatorname {Var} [y]+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\operatorname {Bias} [g^{(\mathcal{D})}]^{2}+\sigma ^{2}+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\\end{aligned}}}$$

    where:
    \(\overline{g}(\mathbf{x})=\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]\) is the average hypothesis over all realizations of \(N\) data-points \(\mathcal{D}_ i\), and \({\displaystyle \varepsilon }\) and \({\displaystyle {\hat {f}}} = g^{(\mathcal{D})}\) are independent.

  10. What are the key Takeaways from the Tradeoff?
    Match the “Model Capacity/Complexity” to the “Data Resources”, NOT to the Target Complexity.
  11. What are the most common ways to negotiate the Tradeoff?
    1. Cross-Validation
    2. MSE of the Estimates
  12. How does the decomposition relate to Classification?
    A similar decomposition exists for:
    1. Classification with \(0-1\) loss
    2. Probabilistic Classification with MSE
  13. Increasing/Decreasing Bias&Variance:
    1. Adding Good Feature:
      1. Decrease Bias
    2. Adding Bad Feature:
      1. No effect
    3. Adding ANY Feature:
      1. Increase Variance
    4. Adding more Data:
      1. Decrease Variance
      2. May decrease Bias (if \(f \in \mathcal{H}\) | \(h\) can fit \(f\) exactly)
    5. Noise in Test Set:
      1. Affects Only Irreducible Err
    6. Noise in Training Set:
      1. Affects Bias and Variance
    7. Dimensionality Reduction:
      1. Decrease Variance
    8. Feature Selection:
      1. Decrease Variance
    9. Regularization:
      1. Increase Bias
      2. Decrease Variance
    10. Increasing # of Hidden Units in ANNs:
      1. Decrease Bias
      2. Increase Variance
    11. Increasing # of Hidden Layers in ANNs:
      1. Decrease Bias
      2. Increase Variance
    12. Increasing \(k\) in K-NN:
      1. Increase Bias
      2. Decrease Variance
    13. Increasing Depth in Decision-Trees:
      1. Increase Variance
    14. Boosting:
      1. Decrease Bias
    15. Bagging:
      1. Decrease Variance

Activation Functions

  1. Describe the Desirable Properties for activation functions:
    1. Non-Linearity:
      When the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
    2. Range:
      When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.
    3. Continuously Differentiable:
      This property is desirable for enabling gradient-based optimization methods. The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it.
    4. Monotonicity:
      When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex.
    5. Smoothness with Monotonic Derivatives:
      These have been shown to generalize better in some cases.
    6. Approximating Identity near Origin:
      Equivalent to \({\displaystyle f(0)=0}\) and \({\displaystyle f'(0)=1}\), and \({\displaystyle f'}\) is continuous at \(0\).
      When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values. When the activation function does not approximate identity near the origin, special care must be used when initializing the weights.
    7. Zero-Centered Range:
      Has effects of centering the data (zero mean) by centering the activations. Makes learning easier.

  2. Describe the NON-Desirable Properties for activation functions:
    1. Saturation:
      An activation function’s output, with finite range, may saturate near its tails (e.g. at \(0\) and \(1\) for the sigmoid). This leads to a problem called vanishing gradients.
    2. Vanishing Gradients:
      Happens when the gradient of an activation function is very small/zero. This usually happens when the activation function saturates at either of its tails.
      The chain-rule multiplies the local gradient (of the activation function) with the upstream gradient of the objective. Thus, when the local gradient is small/zero, it will “kill” the gradient \(\rightarrow\) no signal will flow through the neuron to its weights or to its data.
      Slows/Stops learning completely.
    3. Range Not Zero-Centered:
      This is undesirable since neurons in later layers of processing in a Neural Network would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. \(x>0\) elementwise in \(f=w^Tx+b\)), then the gradient on the weights \(w\) will during backpropagation become either all be positive, or all negative (depending on the gradient of the whole expression \(f\)). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
      Makes optimization harder.

  3. List the different activation functions used in ML?
    Identity, Sigmoid, Tanh, ReLU, L-ReLU, ELU, SoftPlus
    img
    Names, Definitions, Properties (pros&cons), Derivatives, Applications, pros/cons:
    1. Sigmoid:

      $$S(z)=\frac{1}{1+e^{-z}} \\ S^{\prime}(z)=S(z) \cdot(1-S(z))$$

      img

      1. Properties:
        Never use as activation, use as an output unit for binary classification.
        1. Pros:
          1. Has a nice interpretation as the firing rate of a neuron
        2. Cons:
          1. They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
            1. Happens when initialization weights are too large
            2. or sloppy with data preprocessing
            3. Neurons Activation saturates at either tail of \(0\) or \(1\)
          2. Output NOT Zero-Centered \(\rightarrow\) Gradient updates go too far in different directions \(\rightarrow\) makes optimization harder
          3. The local gradient \(S(z) \cdot (1-S(z))\) achieves its maximum of \(0.25\) when \(S(z) = 0.5\) (i.e. at \(z = 0\)) \(\rightarrow\) every time the gradient signal flows through a sigmoid gate, its magnitude diminishes to one quarter of its value (or less) \(\rightarrow\) with basic SGD, the lower layers of a network train much slower than the higher ones
    2. Tanh:

      $$\tanh (z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \\ \tanh ^{\prime}(z)=1-\tanh (z)^{2}$$

      img
      Properties:
      Strictly superior to Sigmoid (scaled version of sigmoid | stronger gradient). Good for activation.

      1. Pros:
        1. Zero Mean/Centered
      2. Cons:
        1. They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
    3. Relu:

      $$R(z)=\left\{\begin{array}{cc}{z} & {z>0} \\ {0} & {z<=0}\end{array}\right\} \\ R^{\prime}(z)=\left\{\begin{array}{ll}{1} & {z>0} \\ {0} & {z<0}\end{array}\right\}$$

      img
      Properties:
      The best for activation (Better gradients).

      1. Pros:
        1. Non-saturation of gradients which accelerates convergence of SGD
        2. Sparsity effects and induced regularization. discussion
        3. Not computationally expensive
      2. Cons:
        1. ReLU not zero-centered problem:
          The problem that ReLU is not zero-centered can be solved/mitigated by using batch normalization, which normalizes the signal before activation:

          From paper: We add the BN transform immediately before the nonlinearity, by normalizing \(x = Wu + b\); normalizing it is likely to produce activations with a stable distribution.

        2. Dying ReLUs (Dead Neurons):
          If a neuron gets clamped to zero in the forward pass (it doesn’t “fire” / \(x<0\)), then its weights will get zero gradient. Thus, if a ReLU neuron is unfortunately initialized such that it never fires, or if a neuron’s weights ever get knocked off with a large update during training into this regime (usually as a symptom of aggressive learning rates), then this neuron will remain permanently dead.
          1. cs231n Explanation
        3. Infinite Range:
          Can blow up the activation.
    4. Leaky Relu:

      $$R(z)=\left\{\begin{array}{cc}{z} & {z>0} \\ {\alpha z} & {z<=0}\end{array}\right\} \\ R^{\prime}(z)=\left\{\begin{array}{ll}{1} & {z>0} \\ {\alpha} & {z<0}\end{array}\right\}$$

      img
      Properties:
      Sometimes useful. Worth trying.

      1. Pros:
        1. Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).
      2. Cons:
        The consistency of the benefit across tasks is presently unclear.
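
    For reference, here is a minimal numpy sketch of the activations listed above and their derivatives (the helper names are illustrative assumptions); it also illustrates the saturation behaviour discussed: the sigmoid's local gradient never exceeds \(0.25\), while ReLU's gradient is exactly \(1\) for positive inputs.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def tanh(z):
        return np.tanh(z)

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2

    def relu(z):
        return np.maximum(0.0, z)

    def d_relu(z):
        return (z > 0).astype(float)

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)

    def d_leaky_relu(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)

    z = np.linspace(-5.0, 5.0, 11)
    print(d_sigmoid(z).max())   # <= 0.25: the sigmoid's gradient saturates at the tails
    print(d_relu(z))            # 0 for z <= 0, 1 for z > 0: no saturation for positive inputs
    ```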


Kernels

  1. Define “Local Kernel” and give an analogy to describe it:
  2. Write the following kernels:
    1. Polynomial Kernel of degree, up to, \(d\):
      \(K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d, \:\: c \geq 0\)
    2. Gaussian Kernel:
      \(K(\mathbf{x}, \mathbf{z}) = \exp\left(-\dfrac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\right)\)
    3. Sigmoid Kernel:
      \(K(\mathbf{x}, \mathbf{z}) = \tanh\left(\alpha\, \mathbf{x}^T\mathbf{z} + c\right)\)
    4. Polynomial Kernel of degree, exactly, \(d\):
      \(K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^d\)

Math

  1. What is a metric?
    A Metric (distance function) \(d\) is a function that defines a distance between each pair of elements of a set \(X\).
    A Metric induces a topology on a set; BUT, not all topologies can be generated by a metric.
    Mathematically, it is a function:
    \({\displaystyle d:X\times X\to [0,\infty )},\)
    that must satisfy the following properties:
    1. \({\displaystyle d(x,y)\geq 0}\) \(\:\:\:\:\:\:\:\) non-negativity or separation axiom
    2. \({\displaystyle d(x,y)=0\Leftrightarrow x=y}\) \(\:\:\:\:\:\:\:\) identity of indiscernibles
    3. \({\displaystyle d(x,y)=d(y,x)}\) \(\:\:\:\:\:\:\:\) symmetry
    4. \({\displaystyle d(x,z)\leq d(x,y)+d(y,z)}\) \(\:\:\:\:\:\:\:\) subadditivity or triangle inequality

      The first condition is implied by the others.
      Metric

  2. Describe Binary Relations and their Properties?
    A binary relation on a set \(A\) is a set of ordered pairs of elements of \(A\). In other words, it is a subset of the Cartesian product \(A^2 = A \times A\).
    The number of binary relations on a set of \(N\) elements is \(2^{N^2}\).
    Examples:
    1. “is greater than”
    2. “is equal to”
    3. A function \(f(x)\)

    Properties (for a relation \(R\) and set \(X\)):
    1. Reflexive: for all \(x\) in \(X\) it holds that \(xRx\)
    2. Symmetric: for all \(x\) and \(y\) in \(X\) it holds that if \(xRy\) then \(yRx\)
    3. Transitive: for all \(x\), \(y\) and \(z\) in \(X\) it holds that if \(xRy\) and \(yRz\) then \(xRz\)
  3. Formulas:
    1. Set theory:
      1. Number of subsets of a set of \(N\) elements:
        \(= 2^N\)
      2. Number of pairs \((a,b)\) of a set of N elements:
        \(= N^2\)
    2. Binomial Theorem:

      $$(x+y)^{n}=\sum_{k=0}^{n}{n \choose k}x^{n-k}y^{k}=\sum_{k=0}^{n}{n \choose k}x^{k}y^{n-k} \\={n \choose 0}x^{n}y^{0}+{n \choose 1}x^{n-1}y^{1}+{n \choose 2}x^{n-2}y^{2}+\cdots +{n \choose n-1}x^{1}y^{n-1}+{n \choose n}x^{0}y^{n},$$

    3. Binomial Coefficient:

      $${\binom {n}{k}}={\frac {n!}{k!(n-k)!}} = {\binom {n}{n-k}}$$

    4. Expansion of \(x^n - y^n =\)

      $$x^n - y^n = (x-y)(x^{n-1} + x^{n-2} y + ... + x y^{n-2} + y^{n-1})$$

    5. Number of ways to partition \(N\) data points into \(k\) clusters:
      1. Number of pairs (e.g. \((a,b)\)) of a set of \(N\) elements \(= N^2\)
      2. There are at most \(k^N\) ways to assign \(N\) data points to \(k\) clusters; the exact number of partitions into \(k\) non-empty clusters is the Stirling number of the second kind \(S(N, k)\)
    6. \(\log_x(y) =\)

      $$\log_x(y) = \dfrac{\ln(y)}{\ln(x)}$$

    7. The length of a vector \(\mathbf{x}\) along a direction (projection):
      1. Along a unit-length vector \(\hat{\mathbf{w}}\):
        \(\text{comp}_ {\hat{\mathbf{w}}}(\mathbf{x}) = \hat{\mathbf{w}}^T\mathbf{x}\)
      2. Along an unnormalized vector \(\mathbf{w}\):
        \(\text{comp}_ {\mathbf{w}}(\mathbf{x}) = \dfrac{1}{\|\mathbf{w}\|} \mathbf{w}^T\mathbf{x}\)
    8. \(\sum_{i=1}^{n} 2^{i}=\)
      \(\sum_{i=1}^{n} 2^{i}=2^{n+1}-2\)
  4. List 6 proof methods:
    1. Direct Proof
    2. Mathematical Induction
      1. Strong Induction
      2. Infinite Descent
    3. Contradiction
    4. Contraposition (\((p \implies q) \iff (\neg q \implies \neg p)\))
    5. Construction
    6. Combinatorial
    7. Exhaustion
    8. Non-Constructive proof (existence proofs)
  5. Important Formulas
    1. Projection \(\tilde{\mathbf{x}}\) of a vector \(\mathbf{x}\) onto another vector \(\mathbf{u}\):

      $$\tilde{\mathbf{x}} = \left(\mathbf{x}\cdot \dfrac{\mathbf{u}}{\|\mathbf{u}\|_2}\right) \dfrac{\mathbf{u}}{\|\mathbf{u}\|_2} = \dfrac{\mathbf{x}^{\top}\mathbf{u}}{\|\mathbf{u}\|_2^{2}}\, \mathbf{u}$$
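
      A quick numerical check of this formula (the toy vectors are illustrative): the projection is parallel to \(\mathbf{u}\), its length equals the component of \(\mathbf{x}\) along \(\mathbf{u}\), and the residual is orthogonal to \(\mathbf{u}\).

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      x, u = rng.normal(size=3), rng.normal(size=3)

      comp = (u / np.linalg.norm(u)) @ x          # signed length of x along the direction of u
      proj = (x @ u) / (u @ u) * u                # projection of x onto u

      print(np.isclose(np.linalg.norm(proj), abs(comp)))                         # lengths agree
      print(np.allclose(np.cross(proj, u), 0.0), np.isclose((x - proj) @ u, 0))  # proj parallel to u, residual orthogonal
      ```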


Statistics

  1. ROC curve:
    1. Definition:
      A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
    2. Purpose:
      A way to quantify how good a binary classifier separates two classes.
    3. How do you create the plot?
      The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
    4. How to identify a good classifier:
      A Good classifier has a ROC curve that hugs the top-left corner of the plot (far above the diagonal).
    5. How to identify a bad classifier:
      A Bad Classifier has a ROC curve that is close to the diagonal line.
    6. What is its application in tuning the model?
      It allows you to set the classification threshold:
      1. You can minimize False-positive rate or maximize the True-Positive Rate
  2. AUC - AUROC:
    1. Definition:
      When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).
    2. Range:
      Range \(= 0.5 - 1.0\), from poor to perfect, with an uninformative classifier yielding \(0.5\)
    3. What does it measure:
      It is a measure of aggregated classification performance.
    4. Usage in ML:
      For model comparison.
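
    A minimal numpy sketch of both ideas (the function names and the two-Gaussian toy scores are illustrative assumptions): it sweeps the threshold over the sorted scores to obtain the (FPR, TPR) pairs and integrates the curve with the trapezoid rule to get the AUC.

    ```python
    import numpy as np

    def roc_curve(scores, labels):
        """TPR and FPR at every threshold, sweeping from the highest score down."""
        order = np.argsort(-scores)
        labels = labels[order]
        tps = np.cumsum(labels)                  # true positives as the threshold is lowered
        fps = np.cumsum(1 - labels)              # false positives
        tpr = tps / labels.sum()
        fpr = fps / (1 - labels).sum()
        return np.r_[0.0, fpr], np.r_[0.0, tpr]

    def auc(fpr, tpr):
        return np.trapz(tpr, fpr)                # area under the curve (trapezoid rule)

    # Usage: positive scores are shifted up by one standard deviation, so AUC should be well above 0.5.
    rng = np.random.default_rng(0)
    labels = np.r_[np.ones(500), np.zeros(500)]
    scores = np.r_[rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500)]
    print(auc(*roc_curve(scores, labels)))       # ~0.76 for these two unit-variance Gaussians
    ```
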
  3. Define Statistical Efficiency (of an estimator)?
    Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance.
    Efficiencies are often defined using the variance or mean square error as the measure of desirability.
    An efficient estimator is also the minimum variance unbiased estimator (MVUE).

    1. An Efficient Estimator has lower variance than an inefficient one
    2. The use of an inefficient estimator gives results equivalent to those obtainable from a subset of the data; it is, therefore, wasteful of data
  4. Whats the difference between Errors and Residuals:
    The Error of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest.

    The Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest.

    1. Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
      1. Statistical Errors:

        $$e_{i}=X_{i}-\mu$$

      2. Residuals:

        $$r_{i}=X_{i}-\overline {X}$$

      3. Example in Univariate Distributions
  5. What is a biased estimator?
    We define the Bias of an estimator as:

    $$ \operatorname{bias}\left(\hat{\boldsymbol{\theta}}_{m}\right)=\mathbb{E}\left(\hat{\boldsymbol{\theta}}_{m}\right)-\boldsymbol{\theta} $$

    A Biased Estimator is an estimator \(\hat{\boldsymbol{\theta}}_ {m}\) such that:

    $$ \operatorname{bias}\left(\hat{\boldsymbol{\theta}}_ {m}\right) \neq 0$$

    1. Why would we prefer biased estimators in some cases?
      Mainly, due to the Bias-Variance Decomposition. The MSE takes into account both the bias and the variance and sometimes the biased estimator might have a lower variance than the unbiased one, which results in a total decrease in the MSE.
  6. What is the difference between “Probability” and “Likelihood”:
    Probabilities are the areas under a fixed distribution
    \(pr(\)data\(|\)distribution\()\)
    i.e. probability of some data (left hand side) given a distribution (described by the right hand side)
    Likelihoods are the y-axis values for fixed data points with distributions that can be moved.
    \(L(\)distribution\(|\)observation/data\()\)
    It is the likelihood of the parameter \(\theta\) for the data \(\mathcal{D}\).

    Likelihood is, basically, a specific probability that can only be calculated after the fact (of observing some outcomes). It is not normalized to \(1\) (it is not a probability). It is just a way to quantify how likely a set of observations is to occur given some distribution with some parameters; you can then manipulate the parameters to make the realization of the data more “likely” (it is precisely meant for that purpose of estimating the parameters); it is a function of the parameters.
    Probability, on the other hand, is absolute for all possible outcomes. It is a function of the Data.

  7. Estimators:
    1. Define:
      A Point Estimator or statistic is any function of the data.
    2. Formula:

      $$\hat{\boldsymbol{\theta}}_{m}=g\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\right)$$

    3. Whats a good estimator?
      A good estimator is a function whose output is close to the true underlying \(\theta\) that generated the training data.
    4. What are the Assumptions made regarding the estimated parameter:
      We assume that the true \(\boldsymbol{\theta}\) is fixed, and that \(\hat{\boldsymbol{\theta}}\) is a function of the data, which is drawn from a random process, making \(\hat{\boldsymbol{\theta}}\) a random variable.
  8. What is Function Estimation:
    Function Estimation/Approximation refers to estimation of the relationship between input and target data.
    I.E. We are trying to predict a variable \(y\) given an input vector \(x\), and we assume that there is a function \(f(x)\) that describes the approximate relationship between \(y\) and \(x\).
    If we assume that: \(y = f(x) + \epsilon\), where \(\epsilon\) is the part of \(y\) that is not predictable from \(x\); then we are interested in approximating \(f\) with a model or estimate \(\hat{f}\).
    1. Whats the relation between the Function Estimator \(\hat{f}\) and Point Estimator:
      Function estimation is really just the same as estimating a parameter \(\boldsymbol{\theta}\); the function estimator \(\hat{f}\) is simply a point estimator in function space.
  9. Define “marginal likelihood” (wrt naive bayes):
    The marginal likelihood is the probability of a feature irrespective of the class; e.g. the probability that the word ‘FREE’ is used in any message, regardless of whether the message is spam or not.

(Statistics) - MLE

  1. Clearly Define MLE and derive the final formula:
    MLE is a method/principle from which we can derive specific functions that are good estimators for different models.
    Likelihood in Parametric Models:
    Suppose we have a parametric model \(\{p(y ; \theta) | \theta \in \Theta\}\) and a sample \(D=\left\{y_{1}, \ldots, y_{n}\right\}\):
    1. The likelihood of parameter estimate \(\hat{\theta} \in \Theta\) for sample \(\mathcal{D}\) is:

      $$\begin{aligned} {\displaystyle {\mathcal {L}}(\theta \,;\mathcal{D} )} &= p(\mathcal{D} ; \theta) \\&=\prod_{i=1}^{n} p\left(y_{i} ; \theta\right)\end{aligned}$$

    2. In practice, we prefer to work with the log-likelihood. Same maximum but

      $$\log {\displaystyle {\mathcal {L}}(\theta \,;\mathcal{D} )}=\sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right)$$

      and sums are easier to work with than products.

    MLE for Parametric Models:
    The maximum likelihood estimator (MLE) for \(\theta\) in the (parametric) model \(\{p(y ; \theta) | \theta \in \Theta\}\) is:

    $$\begin{aligned} \hat{\theta} &=\underset{\theta \in \Theta}{\arg \max } \log p(\mathcal{D} ; \theta) \\ &=\underset{\theta \in \Theta}{\arg \max } \sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right) \end{aligned}$$

    1. Write MLE as an expectation wrt the Empirical Distribution:
      Because the \(\text {arg max }\) does not change when we rescale the cost function, we can divide by the number of training examples \(n\) to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution \(\hat{p}_ {\text {data}}\) defined by the training data:

      $${\displaystyle \boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \:\: \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}} \log p_{\text {model}}(\boldsymbol{x} ; \boldsymbol{\theta})} \tag{5.59}$$

    2. Describe formally the relationship between MLE and the KL-divergence:
      MLE as Minimizing KL-Divergence between the Empirical dist. and the model dist.:
      We can interpret maximum likelihood estimation as minimizing the dissimilarity between the empirical distribution \(\hat{p}_ {\text {data}}\), defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence.
      1. The KL-divergence is given by:

        $${\displaystyle D_{\mathrm{KL}}\left(\hat{p}_{\text {data}} \| p_{\text {model}}\right)=\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}}\left[\log \hat{p}_{\text {data}}(\boldsymbol{x})-\log p_{\text {model}}(\boldsymbol{x})\right]} \tag{5.60}$$

        The first term depends only on the data-generating process, not on the model. This means that when we train the model to minimize the KL divergence, we need only minimize:

      $${\displaystyle -\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}}\left[\log p_{\text {model}}(\boldsymbol{x})\right]} \tag{5.61}$$

      which is of course the same as the maximization in equation \(5.59\).

    3. Extend the argument to show the link between MLE and Cross-Entropy. Give an example:
      Minimizing this KL-divergence corresponds exactly to minimizing the cross-entropy between the distributions.
      Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
      E.g. MSE is the cross-entropy between the empirical distribution and a Gaussian model.

      Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy.

    4. How does MLE relate to the model distribution and the empirical distribution?
      We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution \(\hat{p} _ {\text {data}}\).
    5. What is the intuition behind using MLE?
      If we choose a hypothesis \(h\) under which the observed data is very plausible, then the hypothesis itself is very likely.
    6. What does MLE find/result in?
      It finds the value of the parameter \(\theta\) that, if used (in the model) to generate the probability of the data, would make the data most “likely” to occur.
    7. What kind of problem is MLE and how to solve for it?
      It is an optimization problem, solved analytically (via calculus) when a closed-form solution exists, or with numerical methods (e.g. SGD) otherwise; see the sketch at the end of this section.
    8. How does it relate to SLT:
      It corresponds to Empirical Risk Minimization, with the negative log-likelihood as the loss.
    9. Explain clearly why we maximize the natural log of the likelihood
      1. Numerical Stability: change products to sums
      2. The logarithm of a member of the family of exponential probability distributions (which includes the ubiquitous normal) is polynomial in the parameters (i.e. max-likelihood reduces to least-squares for normal distributions)
        \(\log\left(\exp\left(-\frac{1}{2}x^2\right)\right) = -\frac{1}{2}x^2\)
      3. The log form is both more numerically stable and symbolically easier to differentiate than the original product/exponential form. It increases the dynamic range of the optimization algorithm (allowing it to work with extremely large or small values in the same way).
      4. The logarithm is a monotonic transformation that preserves the locations of the extrema (in particular, the estimated parameters in max-likelihood are identical for the original and the log-transformed formulation)
      5. Gradient methods generally work better optimizing \(\log p(x)\) than \(p(x)\) because the gradient of \(\log p(x)\) is generally better scaled.
        Justification: the gradient of the original term includes an exponential multiplicative factor that blows up or vanishes very quickly, requiring the step-size to scale/stretch correspondingly in the opposite direction.
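
A minimal NumPy sketch (toy Gaussian sample; the size and true parameters are arbitrary choices) of the points above: the closed-form Gaussian MLE, and why we maximize the log-likelihood rather than the raw likelihood, which underflows even for moderate sample sizes.

```python
# MLE for a Gaussian, and log-likelihood vs. raw likelihood (illustration only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1_000)   # observed sample x^(1), ..., x^(n)

def log_likelihood(mu, sigma, x):
    # sum_i log p(x_i ; mu, sigma) for a Gaussian model
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Closed-form MLE, obtained by setting the gradient of the log-likelihood to zero:
mu_hat, sigma_hat = x.mean(), x.std(ddof=0)

densities = np.exp(-0.5 * ((x - mu_hat) / sigma_hat)**2) / (sigma_hat * np.sqrt(2 * np.pi))
print(np.prod(densities))                    # 0.0 -- the product of 1000 densities underflows
print(log_likelihood(mu_hat, sigma_hat, x))  # finite and well-scaled; same maximizer
```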

Text-Classification | Classical

  1. List some Classification Methods:
    1. (Hand-Coded) Rules-Based Algorithms: use rules based on combinations of words or other features.
      1. Can have high accuracy if the rules are carefully refined and maintained by experts.
      2. However, building and maintaining these rules is very hard.
    2. Supervised Machine Learning: use an ML algorithm to learn a classifier from a training set of (document, class) pairs (see the sketch at the end of this section).
      1. Types of Classifiers:
        1. Naive Bayes
        2. Logistic Regression
        3. SVMs
        4. K-NNs
  2. List some Applications of Text Classification:
    1. Spam Filtering: discerning spam emails from legitimate emails.
    2. Email Routing: routing an email sent to a general address to a specific address or department based on its topic.
    3. Language Identification: automatically determining the language a piece of text is written in.
    4. Readability Assessment: determining the degree of readability of a piece of text.
    5. Sentiment Analysis: determining the general emotion/feeling/attitude of the author of a piece of text.
    6. Authorship Attribution: determining which author wrote which piece of text.
    7. Age/Gender Identification: determining the age and/or gender of the author of a piece of text.
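
As referenced above, a minimal scikit-learn sketch of the supervised approach (the tiny corpus and labels are made up purely for illustration), using a bag-of-words representation with a Naive Bayes classifier:

```python
# Supervised text classification: bag-of-words features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["win a FREE prize now", "meeting moved to 3pm",
        "FREE cash offer inside", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]            # (document, class) pairs

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["claim your FREE offer"]))    # -> ['spam']
```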

NLP

  1. List some problems in NLP:
    1. Question Answering (QA)
    2. Information Extraction (IE)
    3. Sentiment Analysis
    4. Machine Translation (MT)
    5. Spam Detection
    6. Parts-of-Speech (POS) Tagging
    7. Named Entity Recognition (NER)
    8. Coreference Resolution
    9. Word Sense Disambiguation (WSD)
    10. Parsing
    11. Paraphrasing
    12. Summarization
    13. Dialog
  2. List the Solved Problems in NLP:
    1. Spam Detection
    2. Parts-of-Speech (POS) Tagging
    3. Named Entity Recognition (NER)
  3. List the “within reach” problems in NLP:
    1. Sentiment Analysis
    2. Coreference Resolution
    3. Word Sense Disambiguation (WSD)
    4. Parsing
    5. Machine Translation (MT)
    6. Information Extraction (IE)
  4. List the Open Problems in NLP:
    1. Question Answering (QA)
    2. Paraphrasing
    3. Summarization
    4. Dialog
  5. Why is NLP hard? List Issues:
    1. Non-Standard English: “Great Job @ahmed_badary! I luv u 2!! were SOO PROUD of dis.”
    2. Segmentation Issues: “New York-New Haven” vs “New-York New-Haven”
    3. Idioms: “dark horse”, “getting cold feet”, “losing face”
    4. Neologisms: “unfriend”, “retweet”, “google”, “bromance”
    5. World Knowledge: “Ahmed and Zach are brothers”, “Ahmed and Zach are fathers”
    6. Tricky Entity Names: “Where is Life of Pie playing tonight?”, “Let it be was a hit song!”
  6. Define:
    1. Morphology:
      The study of words, how they are formed, and their relationship to other words in the same language.
    2. Morphemes:
      the small meaningful units that make up words.
    3. Stems:
      the core meaning-bearing units of words.
    4. Affixes:
      the bits and pieces that adhere to stems (often with grammatical functions).
    5. Stemming:
      is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
      The stem need not map to a valid root in the language.
    6. Lemmatization:
      the process of reducing inflected or variant forms of a word to its base/dictionary form (the lemma), which, unlike a stem, is a valid word (e.g. “am, are, is” \(\rightarrow\) “be”).
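
A minimal NLTK sketch contrasting the two (the example word is arbitrary); note that the stem need not be a valid word, while the lemma is:

```python
# Stemming vs. lemmatization. Requires the WordNet data: nltk.download("wordnet").
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

print(PorterStemmer().stem("studies"))           # 'studi' -- not a valid English word
print(WordNetLemmatizer().lemmatize("studies"))  # 'study' -- a valid dictionary form (lemma)
```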

Language Modeling

  1. What is a Language Model?
  2. List some Applications of LMs:
  3. Traditional LMs:
    1. How are they setup?
    2. What do they depend on?
    3. What is the Goal of the LM task? (in the ctxt of the problem setup)
    4. What assumptions are made by the problem setup? Why?
    5. What are the MLE Estimates for probabilities of the following (see the sketch at the end of this section):
      1. Bi-Grams:

        $$p(w_2\vert w_1) = \frac{\operatorname{count}(w_1, w_2)}{\operatorname{count}(w_1)}$$

      2. Tri-Grams:

        $$p(w_3\vert w_1, w_2) = \frac{\operatorname{count}(w_1, w_2, w_3)}{\operatorname{count}(w_1, w_2)}$$

    6. What are the issues w/ Traditional Approaches?
  4. What+How can we setup some NLP tasks as LM tasks:
  5. How does the LM task relate to Reasoning/AGI:
  6. Evaluating LM models:
  7. LM DATA:
    1. How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
    2. How should we choose a subset of articles for testing:
  8. List three approaches to Parametrizing LMs:
  9. What’s the main issue in LM modeling?
    1. The Bias-Variance Tradeoff of the following:
      1. N-Gram Models:
      2. RNNs:
      3. An Estimate s.t. it predicts the probability of a sentence by how many times it has seen it before:
        1. What happens in the limit of infinite data?
  10. What are the advantages of sub-word level LMs:
  11. What are the disadvantages of sub-word level LMs:
  12. What is a “Conditional LM”?
  13. Write the decomposition of the probability for the Conditional LM:
  14. Describe the Computational Bottleneck for Language Models:
  15. Describe/List some solutions to the Bottleneck:
  16. Complexity Comparison of the different solutions:
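
As referenced above, a minimal sketch (toy corpus, made up for illustration) of the bigram MLE estimate as a ratio of counts:

```python
# Bigram MLE: p(w2 | w1) = count(w1, w2) / count(w1).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    # relative frequency of w2 among all continuations of w1
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("the", "cat"))   # count('the cat') / count('the') = 2/3
```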

Regularization

  1. Define Regularization both intuitively and formally:
    Regularization can be, loosely, defined as: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

    Formally, it is a set of techniques that impose certain restrictions on the hypothesis space (by adding information) in order to solve an ill-posed problem or to prevent overfitting.

  2. Define “well-posedness”:
    Hadamard defines Well-Posed Problems as having the properties: (1) a solution exists, (2) it is unique, and (3) its behavior changes continuously with the initial conditions/data.

  3. Give four aspects of justification for regularization (theoretical):
  4. Describe an overview of regularization in DL. How does it usually work?
    In the context of DL, most regularization strategies are based on regularizing estimators, which usually works by trading increased bias for reduced variance.
    1. Intuitively, how can a regularizer be effective?
      An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
  5. Describe the relationship between regularization and capacity:
    Regularization is a (more general) way of controlling a model's capacity by allowing us to express a preference for one function over another in the same hypothesis space, rather than including or excluding members of the hypothesis space outright.
  6. Describe the different approaches to regularization:
    1. Parameter Norm Penalties: \(L^p\) norms, early stopping
    2. Data Augmentation: noise robustness/injection, dropout
    3. Semi-supervised Learning
    4. Multi-task Learning
    5. Ensemble Learning: bagging, etc.
    6. Adversarial Training
    7. Infinitely Strong Priors: parameter tying and sharing
  7. List some regularization techniques:
    1. \(L^2\) regularization
    2. \(L^1\) regularization
    3. Dataset Augmentation
    4. Noise Injection
    5. Semi-supervised Learning
    6. Multi-task Learning
    7. Early Stopping
    8. Parameter Tying, Sharing
    9. Sparse Representations
    10. Ensemble Methods, Bagging, etc.
    11. Dropout
    12. Adversarial Training
    13. Tangent Prop and Manifold Tangent Classifier

  1. Add Answers from link below for L2 applied to linear regression and how it reduces variance:
    Link
  2. When is Ridge regression favorable over Lasso regression (e.g. for correlated features)?
    If there exists a small subset of variables with medium/large effects, use lasso regression. In the presence of many variables with small/medium effects, use ridge regression (ESL authors).

    If features are correlated, it is hard to determine which variables to drop, and it is often better not to drop any. Thus, prefer ridge over lasso: the latter can produce non-unique solutions and may drop features essentially at random, while the former spreads the weight more evenly across correlated features (see the sketch below).
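
A minimal scikit-learn sketch of the point above (synthetic data, arbitrary settings): with two nearly identical features, ridge tends to spread the weight across both, while lasso tends to keep one and drop the other.

```python
# Ridge vs. Lasso on two highly correlated features (illustration only).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)          # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_)        # weight split roughly evenly across x1, x2
print(Lasso(alpha=0.1).fit(X, y).coef_)        # one coefficient driven to (near) zero
```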


Misc.

  1. Explain Latent Dirichlet Allocation (LDA)
    Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter.
    LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words.
    The “Dirichlet” distribution is a distribution over probability distributions (probability vectors). In LDA, documents are distributions over topics, and topics are in turn distributions over words.
  2. How to deal with curse of dimensionality
    1. Feature Selection
    2. Feature Extraction
  3. How to detect correlation of “categorical variables”?
    1. Chi-Squared test (see the sketch at the end of this list)
  4. Define “marginal likelihood” (wrt naive bayes):
    Marginal likelihood is the probability that the word ‘FREE’ appears in any message, not conditioned on the class (e.g. spam vs. not-spam); it is obtained by marginalizing the class-conditional likelihood over the classes.
  5. KNN VS K-Means
    1. KNN: Supervised, Classification/Regression Algorithm
    2. K-Means: Unsupervised Clustering Algorithm
  6. When is Ridge regression favorable over Lasso regression for correlated features?
    If there exists a small subset of variables with medium/large effects, use lasso regression. In the presence of many variables with small/medium effects, use ridge regression (ESL authors).

    If features are correlated, it is hard to determine which variables to drop, and it is often better not to drop any. Thus, prefer ridge over lasso: the latter can produce non-unique solutions and may drop features essentially at random, while the former spreads the weight more evenly across correlated features.

  7. What is a convex hull?
    In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are constructed, the maximum margin hyperplane (MMH) is obtained as the perpendicular bisector of the shortest line segment connecting the two hulls; the MMH is the hyperplane that creates the greatest separation between the two groups.
  8. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
    Only when the variable is ordinal in nature (its categories have a natural ordering); in that case treating it as continuous can improve predictions.
  9. OLS vs MLE
    Both estimate the parameters of a model. For linear regression they coincide when the errors are normally distributed (OLS is the MLE under a Gaussian noise model).
  10. What are collinearity and multicollinearity?
    1. Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are highly correlated.
    2. Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
  11. Describe ways to overcome scaling (scalability) issues:
    1. Nyström methods / low-rank kernel matrix approximations
    2. Random features (e.g. random Fourier features)
    3. Local methods: estimate per query using nearest neighbors
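
As referenced above, a minimal SciPy sketch of the chi-squared test of independence for two categorical variables (the contingency table is made up for illustration):

```python
# Chi-squared test of independence on a 2x2 contingency table of counts.
from scipy.stats import chi2_contingency

table = [[30, 10],    # variable 1 = A: counts for variable 2 = X, Y
         [15, 25]]    # variable 1 = B: counts for variable 2 = X, Y
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # a small p-value suggests the two variables are associated
```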