Gradient-Based Optimization
- Define Gradient Methods:
Gradient Methods are algorithms for solving optimization (minimization) problems of the form:$$\min_{x \in \mathbb{R}^{n}} f(x)$$
by doing local search (~ hill-climbing) with the search directions defined by the gradient of the function at the current point.
- Give examples of Gradient-Based Algorithms:
- Gradient Descent: minimizes arbitrary differentiable functions
- Conjugate Gradient: solves (sparse) linear systems with symmetric, positive-definite matrices; equivalently, minimizes the corresponding quadratic form
- Coordinate Descent: minimizes multivariate functions by successively minimizing along one coordinate (or block of coordinates) at a time
- What is Gradient Descent:
GD is a first-order, iterative algorithm to minimize an objective function \(J(\theta)\) parametrized by a model's parameters \(\theta \in \mathbb{R}^d\) by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_{\theta} J(\theta)\) w.r.t. the parameters. -
Explain it intuitively:
1) Local Search from a starting location on a hill
2) Feel around how a small movement/step around your location would change the height of the surrounding hill (is the ground higher or lower)
3) Make the movement/step consistent as a small fixed step along some direction
4) Measure the steepness of the hill at the new location in the chosen direction
5) Do so by Approximating the steepness with some local information
6) Measure the change in steepness from your current point to the new point
7) Find the direction that decreases the steepness the most
\(\iff\)1) Local Search from an initial point \(x_0\) on a function
2) Explore the value of the function at different small nudges around \(x_0\)
3) Make the nudges consistent as a small fixed step \(\delta\) along a normalized direction \(\hat{\boldsymbol{u}}\)
4) Evaluate the function at the new location \(x_0 + \delta \hat{\boldsymbol{u}}\)
5) Do so by Approximating the function w/ first-order information (Taylor expansion)
6) Measure the change in value of the objective \(\Delta f\), from current point to the new point
7) Find the direction \(\hat{\boldsymbol{u}}\) that minimizes the function the most - Give its derivation:
1) Start at \(\boldsymbol{x} = x_0\)
2) We would like to know how a small change in \(\boldsymbol{x}\), namely \(\Delta \boldsymbol{x}\), would affect the value of the function \(f(x)\). This will allow us to evaluate the function:$$f(\mathbf{x}+\Delta \mathbf{x})$$
to find the direction that makes \(f\) decrease the fastest
3) Let’s set up \(\Delta \boldsymbol{x}\), the change in \(\boldsymbol{x}\), as a fixed step \(\delta\) along some normalized direction \(\hat{\boldsymbol{u}}\):$$\Delta \boldsymbol{x} = \delta \hat{\boldsymbol{u}}$$
4) Evaluate the function at the new location:
$$f(\mathbf{x}+\Delta \mathbf{x}) = f(\mathbf{x}+\delta \hat{\boldsymbol{u}})$$
5) Using the first-order approximation:
$$\begin{aligned} f(\mathbf{x}+\delta \hat{\boldsymbol{u}}) &\approx f({\boldsymbol{x}}_0) + \left(({\boldsymbol{x}}_0 + \delta \hat{\boldsymbol{u}}) - {\boldsymbol{x}}_0 \right)^T \nabla_{\boldsymbol{x}} f({\boldsymbol{x}}_0) \\ &= f(x_0) + \delta \hat{\boldsymbol{u}}^T \nabla_x f(x_0) \end{aligned}$$
6) The change in \(f\) is:
$$\begin{aligned} \Delta f &= f(\boldsymbol{x}_ 0 + \Delta \boldsymbol{x}) - f(\boldsymbol{x}_ 0) \\ &= f(\boldsymbol{x}_ 0 + \delta \hat{\boldsymbol{u}}) - f(\boldsymbol{x}_ 0)\\ &= \delta \nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} + \mathcal{O}(\delta^2) \\ &\approx \delta \nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} \\ &\geq -\delta\|\nabla f(\boldsymbol{x}_ 0)\|_ 2 \end{aligned}$$
where
$$\nabla_x f(\boldsymbol{x}_ 0)^T\hat{\boldsymbol{u}} \in \left[-\|\nabla f(\boldsymbol{x}_ 0)\|_ 2, \|\nabla f(\boldsymbol{x}_ 0)\|_ 2\right]$$
since \(\hat{\boldsymbol{u}}\) is a unit vector, the dot product is bounded in magnitude by the norm of the gradient (Cauchy–Schwarz); the extremes are attained when \(\hat{\boldsymbol{u}}\) is aligned with \(\nabla_x f(\boldsymbol{x}_ 0)\) or points in the exactly opposite direction.
7) So, the \(\hat{\boldsymbol{u}}\) that changes the above inequality to equality, achieves the largest negative value (moves the most downhill). That vector \(\hat{\boldsymbol{u}}\) is, then, the one in the negative direction of \(\nabla_x f(\boldsymbol{x}_ 0)\); the opposite direction of the gradient.
\(\implies\)$$\hat{\boldsymbol{u}} = - \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2}$$
Finally, The method of steepest/gradient descent proposes a new point to decrease the value of \(f\):
$$\boldsymbol{x}^{\prime}=\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$$
where \(\epsilon\) is the learning rate, defined as:
$$\epsilon = \dfrac{\delta}{\left\|\nabla_{x} f(x)\right\|_ {2}}$$
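As a concrete illustration of the update rule \(\boldsymbol{x}^{\prime}=\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\), here is a minimal sketch in Python/NumPy; the quadratic objective, starting point, learning rate, and step count are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def gradient_descent(grad_f, x0, epsilon=0.1, n_steps=100):
    """Repeatedly apply x' = x - epsilon * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - epsilon * grad_f(x)
    return x

# Illustrative objective: f(x) = ||x - 3||^2, so grad f(x) = 2 (x - 3).
grad_f = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approaches [3., 3.]
```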
- What is the learning rate?
Learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient; defined as:$$\epsilon = \dfrac{\delta}{\left\|\nabla_{x} f(x)\right\|_ {2}}$$
- Where does it come from?
The learning rate comes from a modification of the step-size in the GD derivation.
We get the learning rate by employing a simple idea:
We have a fixed step-size \(\delta\) that dictates how much we should be moving in the direction of steepest descent. However, we would like to keep the step from being too small (slow progress) or too large (overshooting). The idea is to make the step-size proportional to the magnitude of the gradient (i.e. some constant multiplied by the magnitude of the gradient):$$\delta = \epsilon \left\|\nabla_{x} f(x)\right\|_ {2}$$
If we do so, we get a nice cancellation as follows:
$$\begin{aligned}\Delta \boldsymbol{x} &= \delta \hat{\boldsymbol{u}} \\ &= -\delta \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2} \\ &= - \epsilon \left\|\nabla_{x} f(x)\right\|_ {2} \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|_ 2} \\ &= - \dfrac{\epsilon \left\|\nabla_{x} f(x)\right\|_ {2}}{\|\nabla_x f(x)\|_ 2} \nabla_x f(x) \\ &= - \epsilon \nabla_x f(x) \end{aligned}$$
where now we have a fixed learning rate instead of a fixed step-size.
- How does it relate to the step-size?
[Answer above] - We go from having a fixed step-size to [blank]:
a fixed learning rate.
- What is its range?
\([0.0001, 0.4]\) - How do we choose the learning rate?
- Choose a starting lr from the range \([0.0001, 0.4]\) and adjust using cross-validation (lr is a HP).
- Do
- Set it to a small constant
- Smooth Functions (non-NN):
- Line Search: evaluate \(f\left(\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right)\) for several values of \(\epsilon\) and choose the one that results in the smallest objective value (see the sketch after this list).
E.g. Secant Method, Newton-Raphson Method (may need the Hessian, which is hard for large dimensions)
- Trust Region Method
- Non-Smooth Functions (non-NN):
- Direct Line Search
E.g. golden section search
- Grid Search: is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. It is guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
- Population-Based Training (PBT): is an elegant implementation of using a genetic algorithm for hyper-parameter choice.
- Bayesian Optimization: is a global optimization method for noisy black-box functions.
- Use larger lr and use a learning rate schedule
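As a sketch of the line-search idea referenced in the list above (evaluating \(f\left(\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right)\) for several \(\epsilon\)), the snippet below tries a small set of candidate learning rates and keeps the best one; the candidate grid and the toy objective are illustrative assumptions.

```python
import numpy as np

def line_search_lr(f, grad_f, x, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    """Evaluate f(x - eps * grad_f(x)) for several eps and keep the best."""
    g = grad_f(x)
    return min(candidates, key=lambda eps: f(x - eps * g))

# Illustrative objective: f(x) = ||x||^2.
f = lambda x: float(np.dot(x, x))
grad_f = lambda x: 2.0 * x
x = np.array([5.0, -2.0])
eps = line_search_lr(f, grad_f, x)
print(eps, f(x - eps * grad_f(x)))  # picks the eps with the smallest objective
```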
- Compare Line Search vs Trust Region:
Trust-region methods are in some sense dual to line-search methods: trust-region methods first choose a step size (the size of the trust region) and then a step direction, while line-search methods first choose a step direction and then a step size. - Learning Rate Schedule:
- Define:
A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: decay and momentum. - List Types:
- Time-based learning schedules alter the learning rate depending on the learning rate of the previous time iteration. Factoring in the decay the mathematical formula for the learning rate is:
$${\displaystyle \eta_{n+1}={\frac {\eta_{n}}{1+dn}}}$$
where \(\eta\) is the learning rate, \(d\) is a decay parameter and \(n\) is the iteration step.
- Step-based learning schedules changes the learning rate according to some pre defined steps:
$${\displaystyle \eta_{n}=\eta_{0}d^{floor({\frac {1+n}{r}})}}$$
where \({\displaystyle \eta_{n}}\) is the learning rate at iteration \(n\), \(\eta_{0}\) is the initial learning rate, \(d\) is how much the learning rate should change at each drop (\(0.5\) corresponds to a halving) and \(r\) is the drop rate, i.e. how often the rate should be dropped (\(10\) corresponds to a drop every \(10\) iterations). The floor function rounds its argument down to the nearest integer, so the rate only changes once \(\frac{1+n}{r}\) crosses an integer value.
- Exponential learning schedules are similar to step-based but instead of steps a decreasing exponential function is used. The mathematical formula for factoring in the decay is:
$$ {\displaystyle \eta_{n}=\eta_{0}e^{-dn}}$$
where \(d\) is a decay parameter.
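A minimal sketch of the three schedules above in Python; the initial rate, decay, and drop-rate values are illustrative assumptions.

```python
import math

def time_based(eta_prev, d, n):
    """Time-based: eta_{n+1} = eta_n / (1 + d * n)."""
    return eta_prev / (1 + d * n)

def step_based(eta0, d, r, n):
    """Step-based: eta_n = eta0 * d ** floor((1 + n) / r)."""
    return eta0 * d ** math.floor((1 + n) / r)

def exponential(eta0, d, n):
    """Exponential: eta_n = eta0 * exp(-d * n)."""
    return eta0 * math.exp(-d * n)

# Illustrative values: start at 0.1, halve every 10 iterations, decay d = 0.01.
print(step_based(0.1, d=0.5, r=10, n=25))  # 0.1 * 0.5**2
print(exponential(0.1, d=0.01, n=100))     # 0.1 * e^{-1}
```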
-
Describe the convergence of the algorithm:
Gradient Descent converges when every element of the gradient is zero, or very close to zero within some threshold. With certain assumptions on \(f\) (convex, \(\nabla f\) Lipschitz) and particular choices of \(\epsilon\) (chosen via line-search etc.), convergence to a local minimum can be guaranteed.
Moreover, if \(f\) is convex, all local minima are global minima, so convergence is to the global minimum. - How does GD relate to Euler?
Gradient descent can be viewed as applying Euler’s method for solving ordinary differential equations \({\displaystyle x'(t)=-\nabla f(x(t))}\) to a gradient flow. - List the variants of GD:
- Batch Gradient Descent
- Stochastic GD
- Mini-Batch GD
- How do they differ? Why?:
There are three variants of gradient descent, which differ in the amount of data used to compute the gradient. The amount of data imposes a trade-off between the accuracy of the parameter updates and the time it takes to perform the update.
- BGD:
Batch Gradient Descent AKA Vanilla Gradient Descent, computes the gradient of the objective wrt. the parameters \(\theta\) for the entire dataset:$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J(\theta)$$
- Properties:
- Since we need to compute the gradient for the entire dataset for each update, this approach can be very slow and is intractable for datasets that can’t fit in memory.
- Moreover, batch-GD doesn’t allow for an online learning approach.
- SGD:
SGD performs a parameter update for each data-point:$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i)} ; y^{(i)}\right)$$
- Properties:
- SGD exhibits a lot of fluctuation and has a lot of variance in the parameter updates. Although SGD can potentially move in the wrong direction due to limited information, in practice, if we slowly decrease the learning rate, it shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
- Moreover, the fluctuations it exhibits enables it to jump to new and potentially better local minima.
- How should we handle the lr in this case? Why?
We should reduce the lr after every epoch:
This is due to the fact that the random sampling of batches acts as a source of noise which might make SGD keep oscillating around the minima without actually reaching it. - What conditions guarantee convergence of SGD?
The following conditions guarantee convergence under convexity conditions for SGD:$$\begin{array}{l}{\sum_{k=1}^{\infty} \epsilon_{k}=\infty, \quad \text { and }} \\ {\sum_{k=1}^{\infty} \epsilon_{k}^{2}<\infty}\end{array}$$ (e.g. the schedule \(\epsilon_k = \frac{1}{k}\) satisfies both conditions).
- M-BGD:
A hybrid approach that performs updates on a pre-specified mini-batch of \(n\) training examples:$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i : i+n)} ; y^{(i : i+n)}\right)$$
- What advantages does it have?
- Reduce the variance of the parameter updates \(\rightarrow\) more stable convergence
- Makes use of matrix-vector highly optimized libraries
- Explain the different kinds of gradient-descent optimization procedures:
- Batch Gradient Descent AKA Vanilla Gradient Descent, computes the gradient of the objective wrt. the parameters \(\theta\) for the entire dataset:
$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J(\theta)$$
- SGD performs a parameter update for each data-point:
$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i)} ; y^{(i)}\right)$$
- Mini-batch Gradient Descent is a hybrid approach that performs updates on a pre-specified mini-batch of \(n\) training examples:
$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i : i+n)} ; y^{(i : i+n)}\right)$$
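A minimal sketch contrasting the three update loops; `grad` is an assumed, user-supplied function returning \(\nabla_{\theta} J\) on whichever subset of the data it receives, and the function names and batch size are illustrative.

```python
import numpy as np

def batch_gd(theta, X, y, grad, lr, epochs):
    """Batch GD: one update per pass over the entire dataset."""
    for _ in range(epochs):
        theta = theta - lr * grad(theta, X, y)
    return theta

def sgd(theta, X, y, grad, lr, epochs):
    """SGD: one update per (shuffled) training example."""
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            theta = theta - lr * grad(theta, X[i:i + 1], y[i:i + 1])
    return theta

def minibatch_gd(theta, X, y, grad, lr, epochs, batch_size=32):
    """Mini-batch GD: one update per batch of `batch_size` examples."""
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            theta = theta - lr * grad(theta, X[b], y[b])
    return theta
```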
- State the difference between SGD and GD?
Gradient Descent’s cost-function iterates over ALL training samples.
Stochastic Gradient Descent's cost-function only accounts for ONE training sample, chosen at random. - When would you use GD over SGD, and vice-versa?
GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large.
That means GD is preferable for small datasets while SGD is preferable for larger ones.
-
What is the problem of vanilla approaches to GD?
All the variants described above, however, do not guarantee “good” convergence due to some challenges.- List the challenges that account for the problem above:
- Choosing a proper learning rate is usually difficult:
A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge. - Inability to adapt to the specific characteristics of the dataset:
Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics. - The learning rate is fixed for all parameter updates:
If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features. - Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
- List the different strategies for optimizing GD:
- Adapt our updates to the slope of our error function
- Adapt our updates to the importance of each individual parameter
- List the different variants for optimizing GD:
- Adapt our updates to the slope of our error function:
- Momentum
- Nesterov Accelerated Gradient
- Adapt our updates to the importance of each individual parameter:
- Adagrad
- Adadelta
- RMSprop
- Adam
- AdaMax
- Nadam
- AMSGrad
- Momentum:
- Motivation:
SGD has trouble navigating ravines (i.e. areas where the surface curves much more steeply in one dimension than in another) which are common around local optima.
In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. - Definitions/Algorithm:
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction \(\gamma\) of the update vector of the past time step to the current update vector:$$\begin{aligned} v_{t} &=\gamma v_{t-1}+\eta \nabla_{\theta} J(\theta) \\ \theta &=\theta-v_{t} \end{aligned}$$
-
Intuition:
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. \(\gamma < 1\)). The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.In this case we think of the equation as:
$$v_{t} =\underbrace{\gamma}_{\text{friction }} \: \underbrace{v_{t-1}}_{\text{velocity}}+\eta \underbrace{\nabla_{\theta} J(\theta)}_ {\text{acceleration}}$$
- Parameter Settings:
The momentum term \(\gamma\) is usually set to \(0.9\) or a similar value, and \(v_0 = 0\).
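A minimal sketch of one momentum update as defined above; the function name and default values are illustrative, and `grad` is an assumed callable returning \(\nabla_{\theta} J(\theta)\).

```python
def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """v_t = gamma * v_{t-1} + eta * grad(theta); theta = theta - v_t."""
    v = gamma * v + eta * grad(theta)
    return theta - v, v
```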
- Nesterov Accelerated Gradient (Momentum):
- Motivation:
Momentum is good, however, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. - Definitions/Algorithm:
NAG is a way to give our momentum term this kind of prescience. Since we know that we will use the momentum term \(\gamma v_{t-1}\) to move the parameters \(\theta\), we can compute a rough approximation of the next position of the parameters with \(\theta - \gamma v_{t-1}\) (w/o the gradient). This allows us to, effectively, look ahead by calculating the gradient not wrt. our current parameters \(\theta\) but wrt. the approximate future position of our parameters:$$\begin{aligned} v_{t} &=\gamma v_{t-1}+\eta \nabla_{\theta} J\left(\theta-\gamma v_{t-1}\right) \\ \theta &=\theta-v_{t} \end{aligned}$$
- Intuition:
While Momentum first computes the current gradient (small blue vector) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient and then makes a correction (red vector), which results in the complete NAG update (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks. - Parameter Settings:
\(\gamma = 0.9\), - Successful Applications:
This really helps the optimization of recurrent neural networks
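A corresponding sketch of the NAG update, differing from momentum only in where the gradient is evaluated (the look-ahead point \(\theta - \gamma v_{t-1}\)); names and defaults are illustrative.

```python
def nesterov_step(theta, v, grad, eta=0.01, gamma=0.9):
    """Evaluate the gradient at the look-ahead point theta - gamma * v."""
    v = gamma * v + eta * grad(theta - gamma * v)
    return theta - v, v
```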
- Adagrad
- Motivation:
Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance. - Definitions/Algorithm:
Adagrad is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. - Intuition:
Adagrad uses a different learning rate for every parameter \(\theta_i\) at every time step \(t\), so, we first show Adagrad’s per-parameter update.
Adagrad per-parameter update:
The SGD update for every parameter \(\theta_i\) at each time step \(t\) is:$$\theta_{t+1, i}=\theta_{t, i}-\eta \cdot g_{t, i}$$
where \(g_{t, i}=\nabla_{\theta} J\left(\theta_{t, i}\right)\) is the partial derivative of the objective function w.r.t. the parameter \(\theta_i\) at time step \(t\), and \(g_{t}\) is the gradient at time-step \(t\).
In its update rule, Adagrad modifies the general learning rate \(\eta\) at each time step \(t\) for every parameter \(\theta_i\) based on the past gradients that have been computed for \(\theta_i\):
$$\theta_{t+1, i}=\theta_{t, i}-\frac{\eta}{\sqrt{G_{t, i i}+\epsilon}} \cdot g_{t, i}$$
\(G_t \in \mathbb{R}^{d \times d}\) here is a diagonal matrix where each diagonal element \(i, i\) is the sum of the squares of the gradients wrt \(\theta_i\) up to time step \(t\), while \(\epsilon\) is a smoothing term that avoids division by zero (on the order of \(10^{-8}\)).
As \(G_t\) contains the sum of the squares of the past gradients w.r.t. all parameters \(\theta\) along its diagonal, we can now vectorize our implementation by performing an element-wise matrix-vector multiplication \(\odot\) between \(G_t\) and \(g_t\):
$$\theta_{t+1}=\theta_{t}-\frac{\eta}{\sqrt{G_{t}+\epsilon}} \odot g_{t}$$
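A minimal sketch of the (diagonal) Adagrad update above, keeping only the diagonal of \(G_t\) as a vector of accumulated squared gradients; names and defaults are illustrative assumptions.

```python
import numpy as np

def adagrad_step(theta, G_diag, grad, eta=0.01, eps=1e-8):
    """Accumulate squared gradients and scale the learning rate per parameter."""
    g = grad(theta)
    G_diag = G_diag + g ** 2                     # diagonal of G_t
    theta = theta - eta * g / np.sqrt(G_diag + eps)
    return theta, G_diag
```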
- Parameter Settings:
For the lr, Most implementations use a default value of \(0.01\) and leave it at that. - Successful Application:
Well-suited for dealing with sparse data (because it adapts the lr of each parameter wrt feature frequency) - Properties:
- Well-suited for dealing with sparse data (because it adapts the lr of each parameter wrt feature frequency)
- Eliminates need for manual tuning of lr:
One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of \(0.01\) and leave it at that. - Weakness \(\rightarrow\) Accumulation of the squared gradients in the denominator:
Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. - Without the sqrt, the algorithm performs much worse
- Adadelta
- Motivation:
Adagrad has a weakness: it suffers from an aggressive, monotonically decreasing lr caused by the accumulation of the squared gradients in the denominator. The following algorithm aims to resolve this flaw. - Definitions/Algorithm:
Adadelta is an extension of Adagrad that restricts the window of accumulated past gradients to a fixed size by replacing the sum \(G_t\) with an exponentially decaying (running) average of the squared gradients, \(E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2\), which is then used in the denominator of the update in place of \(G_t\).
- Intuition:
- Parameter Settings:
We set \(\gamma\) to a similar value as the momentum term, around \(0.9\). - Properties:
- Eliminates need for lr completely:
With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
- RMSprop
- Motivation:
RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad’s radically diminishing learning rates. - Definitions/Algorithm:
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class. - Intuition:
RMSprop in fact is identical to the first update vector of Adadelta that we derived above:$$\begin{align} \begin{split} E[g^2]_t &= 0.9 E[g^2]_{t-1} + 0.1 g^2_t \\ \theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t} \end{split} \end{align}$$
RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients.
- Parameter Settings:
Hinton suggests \(\gamma\) to be set to \(0.9\), while a good default value for the learning rate \(\eta\) is \(0.001\). - Properties:
Works well with RNNs.
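A minimal sketch of the RMSprop update above; the defaults follow the \(\gamma=0.9\), \(\eta=0.001\) values quoted in the text, and everything else is illustrative.

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """Divide the lr by an exponentially decaying average of squared gradients."""
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta = theta - eta * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```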
- Adam
- Motivation:
Adding momentum to Adadelta/RMSprop. - Definitions/Algorithm:
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients \(v_t\) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients \(m_t\), similar to momentum. - Intuition:
Adam VS Momentum:
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface. - Parameter Settings:
The authors propose default values of \(0.9\) for \(\beta_1\), \(0.999\) for \(\beta_2\), and \(10^{-8}\) for \(\epsilon\). - Properties:
- They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
- Kingma et al. show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.
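A minimal sketch of the Adam update (decaying averages \(m_t\), \(v_t\) plus the bias-correction mentioned above); names are illustrative, defaults follow the values quoted in the text.

```python
import numpy as np

def adam_step(theta, m, v, t, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """t is the 1-based step counter used for bias correction."""
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g              # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2         # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```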
- Which methods have trouble with saddle points?
SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope. - How should you choose your optimizer?
- As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam are most suitable and provide the best convergence for these scenarios.
- Sparse Input Data:
If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate but likely achieve the best results with the default value.
- Summarize the different variants listed above. How do they compare to each other?
In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. - What’s a common choice in many research papers?
Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. - List additional strategies for optimizing SGD:
- Shuffling and Curriculum Learning
- Batch Normalization
- Early Stopping
- Gradient Noise
Maximum Margin Classifiers
- Define Margin Classifiers:
A margin classifier is a classifier which is able to give an associated distance from the decision boundary for each example. - What is a Margin for a linear classifier?
The margin of a linear classifier is the distance from the decision boundary to the nearest sample point. -
Give the motivation for margin classifiers:
Non-margin classifiers (e.g. Centroid, Perceptron, LR) will converge to a correct classifier on linearly separable data; however, the classifier they converge to is neither unique nor the best. - Define the notion of the "best" possible classifier
We assume that if we can maximize the distance between the data points to be classified and the hyperplane that classifies them, then we have reached a boundary that allows for the “best-fit”, i.e. allows for the most room for error. - How can we achieve the “best” classifier?
We enforce a constraint that achieves a classifier that has a maximum-margin. - What unique vector is orthogonal to the hp? Prove it:
The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:$$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$
- What do we mean by “signed distance”? Derive its formula:
The signed distance is the minimum distance from a point to a hyperplane. We solve for the signed distance to achieve the following formula for it: \(d = \dfrac{\vert w \cdot \mathbf{x}_ n + b \vert}{\|w\|},\)
where we have an n-dimensional hyperplane: \(w \cdot x + b = 0\) and a point \(\mathbf{x}_ n\).
Derivation:- Suppose we have an affine hyperplane defined by \(w \cdot x + b\) and a point \(\mathbf{x}_ n\).
- Suppose that \(\mathbf{x} \in \mathbf{R}^n\) is a point satisfying \(w \cdot \mathbf{x} + b = 0\), i.e. it is a point on the plane.
- We construct the vector \(\mathbf{x}_ n−\mathbf{x}\) which points from \(\mathbf{x}\) to \(\mathbf{x}_ n\), and then, project (scalar projection==signed distance) it onto the unique vector perpendicular to the plane, i.e. \(w\),
$$d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \left| \frac{(\mathbf{x}_ n-\mathbf{x})\cdot w}{\|w\|} \right| = \frac{|\mathbf{x}_ n \cdot w - \mathbf{x} \cdot w|}{\|w\|}.$$
- Since \(\mathbf{x}\) is a vector on the plane, it must satisfy \(w\cdot \mathbf{x}=-b\) so we get
$$d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \frac{|\mathbf{x}_ n \cdot w +b|}{\|w\|}$$
Thus, we conclude that if \(\|w\| = 1\) then the signed distance from a datapoint \(X_i\) to the hyperplane is \(\vert w \cdot X_i + b\vert\).
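A quick numeric check of the distance formula; the plane and point are illustrative.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """|w . x + b| / ||w|| for the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Plane x1 + x2 - 1 = 0 and point (1, 1): distance = 1 / sqrt(2).
print(distance_to_hyperplane(np.array([1.0, 1.0]), -1.0, np.array([1.0, 1.0])))
```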
- Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
Let \(X_n\) be the point that is closest to the plane, and let \(\hat{w} = \dfrac{w}{\|w\|}\).
Take any point \(X\) on the plane, and let \(\vec{v}\) be the vector \(\vec{v} = X_n - X\).
Now, the distance, \(d\) is equal to$$\begin{align} d & \ = \vert\hat{w}^{\top}\vec{v}\vert \\ & \ = \vert\hat{w}^{\top}(X_n - X)\vert \\ & \ = \vert\hat{w}^{\top}X_n - \hat{w}^{\top}X\vert \\ & \ = \dfrac{1}{\|w\|}\vert w^{\top}X_n + b - w^{\top}X - b\vert , & \text{we add and subtract the bias } b\\ & \ = \dfrac{1}{\|w\|}\vert (w^{\top}X_n + b) - (w^{\top}X + b)\vert \\ & \ = \dfrac{1}{\|w\|}\vert (w^{\top}X_n + b) - (0)\vert , & \text{from the eq. of the plane for a point on the plane} \\ & \ = \dfrac{\vert (w^{\top}X_n + b)\vert}{\|w\|} \end{align} $$
- Use geometric properties of the hp to Simplify the expression for the distance of the closest point to the hp, above
First, we notice that for any given plane \(w^Tx = 0\), the equations, \(\gamma * w^Tx = 0\), where \(\gamma \in \mathbf{R}\) is a scalar, basically characterize the same plane and not many planes.
This is because \(w^Tx = 0 \iff \gamma * w^Tx = \gamma * 0 \iff \gamma * w^Tx = 0\).
The above implies that any model that takes input \(w\) and produces a margin, will have to be Scale Invariant.
To get around this and simplify the analysis, I am going to consider all the representations of the same plane, and I am going to pick one where we normalize (re-scale) the weight \(w\) such that, for the point closest to the plane, the quantity \(|w^Tx_n|\) is equal to one:$$|w^Tx_n| > 0 \rightarrow |w^Tx_n| = 1$$
, where \(x_n\) is the point closest to the plane.
We constrain the hyperplane by normalizing \(w\) to satisfy this equation \(|w^Tx_i| = 1\) or, with added bias, \(|w^Tx_i + b| = 1\).
\(\implies\)$$\begin{align} d & \ = \dfrac{\vert (w^{\top}X_n + b)\vert}{\|w\|} \\ & \ = \dfrac{\vert (1)\vert}{\|w\|} , & \text{from the constraint on the distance of the closest point} \\ & \ = \dfrac{1}{\|w\|} \end{align} $$
- Characterize the margin, mathematically:
we can characterize the margin, with its size, as the distance, \(\frac{1}{\|\mathbf{w}\|}\), between the hyperplane/boundary and the closest point to the plane \(\mathbf{x}_ n\), in both directions (multiply by 2) \(= \frac{2}{\|\mathbf{w}\|}\) ; given the condition we specified earlier \(\left|\mathbf{w}^{\top} \mathbf{x}_ {n} + b\right|=1\) for the closest point \(\mathbf{x}_ n\). - Characterize the “Slab Existence”:
The analysis done above allows us to conclusively prove that there exists a slab of width \(\dfrac{2}{\|w\|}\) containing no sample points where the hyperplane runs through (bisects) its center. - Formulate the optimization problem of maximizing the margin wrt analysis above:
We formulate the optimization problem of maximizing the margin by maximizing the distance, subject to the condition on how we derived the distance:$$\max_{\mathbf{w}} \dfrac{2}{\|\mathbf{w}\|} \:\:\: : \:\: \min _{n=1,2, \ldots, N}\left|\mathbf{w}^{\top} \mathbf{x}_{n}+b\right|=1$$
- Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
We can reformulate by (1) Flipping and Minimizing, (2) Taking a square since it's monotonic and convex, and (3) noticing that \(\left|\mathbf{w}^{T} \mathbf{x}_ {n}+b\right|=y_{n}\left(\mathbf{w}^{T} \mathbf{x}_ {n}+b\right)\) (since the signal and label must agree, their product will always be positive) and the \(\min\) operator can be replaced by ensuring that for all the points the condition \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1\) holds (provable by contradiction), as:$$\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall n \in [1,N]$$
Now when we solve the “friendly” equation above, we will get the separating plane with the best possible margin (best=biggest).
- Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:
$$\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall i \in [1,N]$$
- What kind of formulation is it (wrt optimization)? What are the parameters?
The above problem is a Quadratic Program, in \(d + 1\)-dimensions and \(n\)-constraints, in standard form.
Hard-Margin SVMs
- Define:
- SVMs:
**Support Vector Machines** (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
The SVM is a Maximum Margin Classifier that aims to find the "maximum-margin hyperplane" that divides the group of points \({\vec {x}}_{i}\) for which \({\displaystyle y_{i}=1}\) from the group of points for which \({\displaystyle y_{i}=-1}\). - Support Vectors:
**Support Vectors** are the data-points that lie exactly on the margin (i.e. on the boundary of the slab).
They satisfy \(\vert w^TX' + b\vert = 1, \forall\) support vectors \(X'\) - Hard-Margin SVM:
The Hard-Margin SVM is just a maximum-margin classifier with features and kernels (discussed later).
- Define the following wrt hard-margin SVM:
- Goal:
Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classifies the data-points and, moreover, does so in the “best” possible way. - Procedure:
(1) Use a linear classifier (2) But, Maximize the Margin (3) Do so by Minimizing \(\|w\|\) - Decision Function:
$${\displaystyle f(x)={\begin{cases}1&{\text{if }}\ w\cdot X_i+b>0\\0&{\text{otherwise}}\end{cases}}}$$
- Constraints:
$$y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$
- The Optimization Problem:
Find weights ‘\(w\)’ and scalar ‘\(b\)’ that minimize$$ \dfrac{1}{2} w^Tw$$
Subject to
$$y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$
Formally,
$$\min_w \dfrac{1}{2}w^Tw \:\:\: : \:\: y_i(wX_i + b) \geq 1, \forall i \in [1,n]$$
- The Optimization Method:
The SVM optimization problem reduces to a Quadratic Program.
- Elaborate on the generalization analysis:
We notice that, geometrically, the hyperplane (the maximum margin classifier) is completely characterized by the support vectors (the vectors that lie on the margin).
A very important conclusion arises.
The maximum margin classifier (SVM) depends only on the number of support vectors and not on the dimension of the problem.
This implies that the computation doesn’t scale up with the dimension and, also implies, that the kernel trick works very well. - List the properties:
- In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
- The hyperplane is determined solely by its support vectors.
- The SVM always converges on linearly separable data.
- The Hard-Margin SVM fails if the data is not linearly separable.
- The Hard-Margin SVM is quite sensitive to outliers
- Give the solution to the optimization problem for H-M SVM:
To solve the above problem, we need something that deals with inequality constraints; thus, we use the KKT method for solving a Lagrangian under inequality constraints.
The Lagrange Formulation:- Formulate the Lagrangian:
- Take each inequality constraint and put them in the zero-form (equality with Zero)
- Multiply each inequality by a Lagrange Multiplier \(\alpha_n\)
- Add them to the objective function \(\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w}\)
The sign will be \(-\) (negative) simply because the inequality is \(\geq 0\)
$$\min_{w, b} \max_{\alpha_n} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \mathbf{w}^T\mathbf{w} -\sum_{n=1}^{N} \alpha_{n}\left(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1\right) \:\:\: : \:\: \alpha_n \geq 0$$
- Optimize the objective independently, for each of the unconstrained variables:
- Gradient w.r.t. \(\mathbf{w}\):
$$\nabla_{\mathrm{w}} \mathcal{L}=\mathrm{w}-\sum_{n=1}^{N} \alpha_{n} y_{n} \mathrm{x}_ {n}=0 \\ \implies \\ \mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}$$
- Derivative w.r.t. \(b\):
$$\frac{\partial \mathcal{L}}{\partial b}=-\sum_{n=1}^{N} \alpha_{n} y_{n}=0 \\ \implies \\ \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$
- Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
- Substitute with the above conditions in the original lagrangian (such that the optimization w.r.t. \(\alpha_n\) will become free of \(\mathbf{w}\) and \(b\):
$$\mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}$$
- Notice that the first condition \(\mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}\) does not constrain \(\alpha_n\), so it is vacuous as a constraint. The second condition \(\sum_{n=1}^{N} \alpha_{n} y_{n}=0\), however, does constrain \(\alpha_n\) and must be kept.
- Set the optimization objective and the constraints, a quadratic function in \(\alpha_n\):
$$\max_{\alpha} \mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m} \\ \:\:\:\:\:\:\:\:\:\: : \:\: \alpha_n \geq 0 \:\: \forall \: n= 1, \ldots, N \:\: \wedge \:\: \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$
- Set the problem as a Quadratic Programming problem:
- Change the maximization to minimization by flipping the signs:
$$\min _{\alpha} \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}-\sum_{n=1}^{N} \alpha_{n}$$
- Isolate the Coefficients from the \(\alpha_n\)s and set in matrix-form:
$$\min _{\alpha} \frac{1}{2} \alpha^{\top} \underbrace{\begin{bmatrix} y_{1} y_{1} \mathbf{x}_{1}^{\top} \mathbf{x}_{1} & y_{1} y_{2} \mathbf{x}_{1}^{\top} \mathbf{x}_{2} & \ldots & y_{1} y_{N} \mathbf{x}_{1}^{\top} \mathbf{x}_{N} \\ y_{2} y_{1} \mathbf{x}_{2}^{\top} \mathbf{x}_{1} & y_{2} y_{2} \mathbf{x}_{2}^{\top} \mathbf{x}_{2} & \ldots & y_{2} y_{N} \mathbf{x}_{2}^{\top} \mathbf{x}_{N} \\ \ldots & \ldots & \ldots & \ldots \\ y_{N} y_{1} \mathbf{x}_{N}^{\top} \mathbf{x}_{1} & y_{N} y_{2} \mathbf{x}_{N}^{\top} \mathbf{x}_{2} & \ldots & y_{N} y_{N} \mathbf{x}_{N}^{\top} \mathbf{x}_{N} \end{bmatrix}}_{\text{quadratic coefficients}} \alpha+\underbrace{\left(-1^{\top}\right)}_ {\text {linear }} \alpha \\ \:\:\:\:\:\:\:\:\:\: : \:\: \underbrace{\mathbf{y}^{\top} \boldsymbol{\alpha}=0}_{\text {linear constraint }} \:\: \wedge \:\: \underbrace{0}_{\text {lower bounds }} \leq \alpha \leq \underbrace{\infty}_{\text {upper bounds }} $$
The Quadratic Programming Package asks you for the Quadratic Term (Matrix) and the Linear Term, and for the Linear Constraint and the Range of \(\alpha_n\)s; and then, gives you back an \(\mathbf{\alpha}\).
Equivalently:
$$\min _{\alpha} \frac{1}{2} \boldsymbol{\alpha}^{\mathrm{T}} \mathrm{Q} \boldsymbol{\alpha}-\mathbf{1}^{\mathrm{T}} \boldsymbol{\alpha} \quad \text {subject to } \quad \mathbf{y}^{\mathrm{T}} \boldsymbol{\alpha}=0 ; \quad \boldsymbol{\alpha} \geq \mathbf{0}$$
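As a rough sketch of handing this standard form to a solver, the snippet below builds the quadratic-coefficient matrix \(Q_{nm} = y_n y_m \mathbf{x}_n^T\mathbf{x}_m\) and solves the dual with SciPy's general-purpose SLSQP routine rather than a dedicated QP package; the toy data points are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative).
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_{nm} = y_n y_m x_n^T x_m
dual = lambda a: 0.5 * a @ Q @ a - a.sum()         # (1/2) a^T Q a - 1^T a

res = minimize(dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x
w = (alpha * y) @ X                                 # w = sum_n alpha_n y_n x_n
sv = int(np.argmax(alpha))                          # index of a support vector (alpha_n > 0)
b = y[sv] - w @ X[sv]                               # from y_sv (w . x_sv + b) = 1
print(w, b)                                         # roughly [0.5, 0.5], -1.0
```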
Soft-Margin SVM
- Motivate the soft-margin SVM:
The Hard-Margin SVM faces a few issues:- The Hard-Margin SVM fails if the data is not linearly separable.
- The Hard-Margin SVM is quite sensitive to outliers
The Soft-Margin SVM aims to fix/reconcile these problems.
- What is the main idea behind it?
Allow some points to violate the margin, by introducing slack variables. - Define the following wrt soft-margin SVM:
- Goal:
Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classifies the data-points and, moreover, does so in the “best” possible way, but allow some points to violate the margin, by introducing slack variables. - Procedure:
(1) Use a linear classifier
(2) But, Maximize the Margin
(3) Do so by Minimizing \(\|w\|\)
(4) But allow some points to penetrate the margin - Decision Function:
- Constraints:
$$y_i(wX_i + b) \geq 1 - \zeta_i, \forall i \in [1,n]$$
where the \(\zeta_i\)s are slack variables.
We, also, enforce the non-negativity constraint on the slack variables:$$\zeta_i \geq 0, \:\:\: \forall i \in [1, n]$$
- Why is there a non-negativity constraint?
The non-negativity constraint forces the slack variables to be zero for all points that do not violate the original constraint:
i.e. are not inside the slab.
- Objective/Cost Function:
$$ R(w) = \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$
- The Optimization Problem:
Find weights ‘\(w\)’, scalar ‘\(b\)’, and \(\zeta_i\)s that minimize$$ \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$
Subject to
$$y_i(wX_i + b) \geq 1 - \zeta_i, \zeta_i \geq 0, \forall i \in [1,n]$$
Formally,
$$\min_{w, b, \zeta} \dfrac{1}{2}w^Tw + C \sum_{i=1}^n \zeta_i \:\:\: : \:\: y_i(wX_i + b) \geq 1 - \zeta_i, \:\: \zeta_i \geq 0, \forall i \in [1,n]$$
- The Optimization Method:
The SVM optimization problem reduces to a Quadratic Program in \(d + n + 1\)-dimensions and \(2n\)-constraints. - Properties:
- The Soft-Margin SVM will converge on non-linearly separable data.
-
Specify the effects of the regularization hyperparameter \(C\):
|  | Small \(C\) | Large \(C\) |
| --- | --- | --- |
| Desire | Maximizing the margin \(= \dfrac{1}{\|w\|}\) | Keep most slack variables zero or small |
| Danger | Underfitting (high misclassification) | Overfitting (awesome training, awful test) |
| Outliers | Less sensitive | Very sensitive |
| Boundary | More "flat" | More sinuous |

- Describe the effect wrt over/under fitting:
Increase \(C\) hparam in SVM = causes overfitting
- How do we choose \(C\)?
We choose ‘\(C\)’ with cross-validation. - Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)
In function estimation we prefer the standard-form objective to minimize (and trade-off); the loss + penalty form.
We introduce a loss function to moderate the use of the slack variables (i.e. to avoid abusing the slack variables), namely, Hinge Loss:$${\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}$$
The motivation: We motivate it by comparing it to the traditional \(0-1\) Loss function.
Notice that the \(0-1\) loss is actually non-convex, with a discontinuity (an "infinite slope") at \(0\). On the other hand, the hinge loss is convex. Analysis wrt maximum margin:
This function is zero if the constraint, \(y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\geq 1\), is satisfied, in other words, if \({\vec {x}}_{i}\) lies on the correct side of the margin.
For data on the wrong side of the margin, the function’s value is proportional to the distance from the margin.Modified Objective Function:
$$R(w) = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}$$
Proof of equivalence:
$$\begin{align} y_if\left(x_i\right) & \ \geq 1-\zeta_i, & \text{from 1st constraint } \\ \implies \zeta_i & \ \geq 1-y_if\left(x_i\right) \\ \zeta_i & \ \geq 1-y_if\left(x_i\right) \:\text{ and }\: \zeta_i \geq 0, & \text{adding the 2nd (positivity) constraint on } \zeta_i \\ \iff \zeta_i & \ \geq \max \{0, 1-y_if\left(x_i\right)\} \\ \zeta_i & \ = \max \{0, 1-y_if\left(x_i\right)\}, & \text{minimizing drives } \zeta_i \text{ to its lower bound}\\ \implies R(w) & \ = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}, & \text{plugging in, and scaling by } \lambda = \dfrac{1}{C} \end{align}$$
The (reformulated) Optimization Problem:
$$\min_{w, b}\dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n {\displaystyle \max \left(0, 1-y_{i}({\vec {w}}\cdot {\vec {x}}_ {i}-b)\right)}$$
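As a rough sketch of minimizing this loss + penalty form directly, the snippet below runs plain subgradient descent on \(\frac{\lambda}{2}\|w\|^2\) plus the (averaged) hinge loss, using the \(w \cdot x + b\) sign convention used elsewhere in these notes; the step size, iteration count, and averaging of the hinge term are illustrative assumptions.

```python
import numpy as np

def soft_margin_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on (lam/2)||w||^2 + mean hinge loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                          # points that violate/penetrate the margin
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b
```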
Loss Functions
- Define:
-
Loss Functions - Abstractly and Mathematically:
Abstractly, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number, intuitively, representing some “cost” associated with the event.Formally, a loss function is a function \(L :(\hat{y}, y) \in \mathbb{R} \times Y \longmapsto L(\hat{y}, y) \in \mathbb{R}\) that takes as inputs the predicted value \(\hat{y}\) corresponding to the real data value \(y\) and outputs how different they are.
- Distance-Based Loss Functions:
A Loss function \(L(\hat{y}, y)\) is called distance-based if it:- Only depends on the residual:
$$L(\hat{y}, y) = \psi(y-\hat{y}) \:\: \text{for some } \psi : \mathbb{R} \longmapsto \mathbb{R}$$
- Loss is \(0\) when residual is \(0\):
$$\psi(0) = 0$$
- What are they used for?
Regression. - Describe an important property of dist-based losses:
Translation Invariance:
Distance-based losses are translation-invariant:$$L(\hat{y}+a, y+a) = L(\hat{y}, y)$$
- Relative Error - What does it lack?
Relative-Error \(\dfrac{\hat{y}-y}{y}\) is a more natural loss but it is NOT translation-invariant.
- List 3 Regression Loss Functions
- MSE
- MAE
- Huber
- MSE
- What does it minimize:
The MSE minimizes the sum of squared differences between the predicted values and the target values. - Formula:
$$L(\hat{y}, y) = \dfrac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_ {i}\right)^{2}$$
- Graph:
- Derivation:
- MAE
- What does it minimize:
The MAE minimizes the sum of absolute differences between the predicted values and the target values. - Formula:
$$L(\hat{y}, y) = \dfrac{1}{n} \sum_{i=1}^{n}\vert y_{i}-\hat{y}_ {i}\vert$$
- Graph:
- Derivation:
- List properties:
- Solution may be Non-unique
- Robustness to outliers
- Unstable Solutions:
The instability property of the method of least absolute deviations means that, for a small horizontal adjustment of a datum, the regression line may jump a large amount. The method has continuous solutions for some data configurations; however, by moving a datum a small amount, one could "jump past" a configuration which has multiple solutions that span a region. After passing this region of solutions, the least absolute deviations line has a slope that may differ greatly from that of the previous line. In contrast, the least squares solution is stable in that, for any small adjustment of a data point, the regression line will always move only slightly; that is, the regression parameters are continuous functions of the data. - Data-points "Latching" ref:
- Unique Solution:
If there are \(k\) features (including the constant), then at least one optimal regression surface will pass through \(k\) of the data points; unless there are multiple solutions. - Multiple Solutions:
The region of valid least absolute deviations solutions will be bounded by at least \(k\) lines, each of which passes through at least \(k\) data points.
- Huber Loss
- AKA:
Smooth Mean Absolute Error - What does it minimize:
- Formula:
$$L(\hat{y}, y) = \left\{\begin{array}{cc}{\frac{1}{2}(y-\hat{y})^{2}} & {\text {if }\vert y-\hat{y}\vert<\delta} \\ {\delta\vert y-\hat{y}\vert-\frac{1}{2} \delta^{2}} & {\text {otherwise }}\end{array}\right.$$
- Graph:
- List properties:
- It’s less sensitive to outliers than the MSE as it treats error as square only inside an interval.
-
Analyze MSE vs MAE ref:
| MSE | MAE |
| --- | --- |
| Sensitive to outliers | Robust to outliers |
| Differentiable Everywhere | Non-Differentiable at \(0\) |
| Stable Solutions | Unstable Solutions |
| Unique Solution | Possibly multiple solutions |

- Statistical Efficiency:
- “For normal observations MSE is about \(12\%\) more efficient than MAE” - Fisher
- \(1\%\) Error is enough to make MAE more efficient
- 2/1000 bad observations, make the median more efficient than the mean
- Subgradient methods are slower than gradient descent
- you get a lot better convergence rate guarantees for MSE
- List 7 Classification Loss Functions
- \(0-1\) Loss
- Square Loss
- Hinge Loss
- Logistic Loss
- Cross-Entropy
- Exponential Loss
- Perceptron Loss
- \(0-1\) loss
- What does it minimize:
It measures accuracy, and minimizes mis-classification error/rate. - Formula:
$$L(\hat{y}, y) = I(\hat{y} \neq y) = \left\{\begin{array}{ll}{0} & {\hat{y}=y} \\ {1} & {\hat{y} \neq y}\end{array}\right.$$
- Graph:
- Square Loss (MSE)
- Formula:
$$L(\hat{y}, y) = (1-y \hat{y})^{2}$$
- Graph:
- Derivation (for classification) - give assumptions:
We can write the loss in terms of the margin \(m = y\hat{y}\):
\(L(\hat{y}, y)=(y - \hat{y})^{2}=(1-y\hat{y})^{2}=(1-m)^{2}\), since \(y \in \{-1,1\} \implies y^2 = 1\)
- Properties:
- Convex
- Smooth
- Sensitive to outliers: Penalizes outliers excessively
- ^Slower Convergence Rate (wrt sample complexity) than logistic or hinge loss
- Functions which yield high values of \(f({\vec {x}})\) for some \(x\in X\) will perform poorly with the square loss function, since high values of \(yf({\vec {x}})\) will be penalized severely, regardless of whether the signs of \(y\) and \(f({\vec {x}})\) match.
- Hinge Loss
- What does it minimize:
It minimizes missclassification wrt penetrating a margin \(\rightarrow\) maximizes a margin. - Formula:
$$L(\hat{y}, y) = \max (0,1-y \hat{y})=|1-y \hat{y}|_ {+}$$
- Graph:
- Properties:
- Continuous, Convex, Non-Differentiable
- The hinge loss provides a relatively tight, convex upper bound on the \(0–1\) indicator function
- Describe the properties of the Hinge loss and why it is used?
- Hinge loss upper bounds 0-1 loss
- It is the tightest convex upper bound on the 0/1 loss
- Minimizing 0-1 loss is NP-hard in the worst-case
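A small sketch of the classification losses written as functions of the margin \(m = y\hat{y}\), which makes the bounds above easy to eyeball numerically; the sample margins are illustrative.

```python
import numpy as np

def zero_one(m):  return np.where(m > 0, 0.0, 1.0)   # margin form of the 0-1 loss
def square(m):    return (1.0 - m) ** 2
def hinge(m):     return np.maximum(0.0, 1.0 - m)
def logistic(m):  return np.log(1.0 + np.exp(-m))

m = np.array([-2.0, 0.0, 0.5, 2.0])
for loss in (zero_one, square, hinge, logistic):
    print(loss.__name__, loss(m))                     # hinge >= zero_one everywhere
```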
- Logistic Loss
- AKA:
Log-Loss, Logarithmic Loss - What does it minimize:
Minimizes the Kullback-Leibler divergence between the empirical distribution and the predicted distribution. - Formula:
$$L(\hat{y}, y) = \log{\left(1+e^{-y \hat{y}}\right)}$$
- Graph:
- Derivation:
We get the likelihood of the dataset \(\mathcal{D}=\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{N}, y_{N}\right)\):$$\prod_{n=1}^{N} P\left(y_{n} | \mathbf{x}_{n}\right) =\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\mathrm{T}} \mathbf{x}_ {n}\right)$$
- Maximize:
$$\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)$$
- Take the natural log to avoid products:
$$\ln \left(\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)\right)$$
Motivation:
- The inner quantity is non-negative and non-zero.
- The natural log is monotonically increasing (its max, is the max of its argument)
- Take the average (still monotonic):
$$\frac{1}{N} \ln \left(\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)\right)$$
- Take the negative and Minimize:
$$-\frac{1}{N} \ln \left(\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)\right)$$
- Simplify:
$$=\frac{1}{N} \sum_{n=1}^{N} \ln \left(\frac{1}{\theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)}\right)$$
- Substitute \(\left[\theta(s)=\frac{1}{1+e^{-s}}\right]\):
$$\frac{1}{N} \sum_{n=1}^{N} \underbrace{\ln \left(1+e^{-y_{n} \mathbf{w}^{\top} \mathbf{x}_{n}}\right)}_{e\left(h\left(\mathbf{x}_{n}\right), y_{n}\right)}$$
- Use this as the Cross-Entropy Error Measure:
$$E_{\mathrm{in}}(\mathrm{w})=\frac{1}{N} \sum_{n=1}^{N} \underbrace{\ln \left(1+e^{-y_{n} \mathrm{w}^{\top} \mathbf{x}_{n}}\right)}_{\mathrm{e}\left(h\left(\mathrm{x}_{n}\right), y_{n}\right)}$$
- Properties:
- Convex
- Grows only linearly for large negative margins, which makes it less sensitive to outliers than the square or exponential loss
- The logistic loss function does not assign zero penalty to any points; instead, points that are correctly classified with high confidence (i.e., with high values of \({\displaystyle \vert f({\vec {x}})\vert }\)) are simply penalized less. Because every point contributes to the loss, it is somewhat more sensitive to outliers than the hinge loss, which is exactly zero for well-classified points.
- Cross-Entropy
- What does it minimize:
It minimizes the Kullback-Leibler divergence between the empirical distribution and the predicted distribution. - Formula:
$$L(\hat{y}, y) = -\sum_{i} y_i \log \left(\hat{y}_ {i}\right)$$
- Binary Cross-Entropy:
$$L(\hat{y}, y) = -\left[y \log \hat{y}+\left(1-y\right) \log \left(1-\hat{y}\right)\right]$$
- Graph:
- CE and Negative-Log-Probability:
The Cross-Entropy is equal to the Negative-Log-Probability (of predicting the true class) in the case that the true distribution that we are trying to match is peaked at a single point and is identically zero everywhere else; this is usually the case in ML when we are using a one-hot encoded vector with one class \(y = [0 \: 0 \: \ldots \: 0 \: 1 \: 0 \: \ldots \: 0]\) peaked at the \(j\)-th position
\(\implies\)$$L(\hat{y}, y) = -\sum_{i} y_i \log \left(\hat{y}_ {i}\right) = - \log (\hat{y}_ {j})$$
- CE and Log-Loss:
Given \(p \in\{y, 1-y\}\) and \(q \in\{\hat{y}, 1-\hat{y}\}\):$$H(p,q)=-\sum_{x }p(x)\,\log q(x) = -y \log \hat{y}-(1-y) \log (1-\hat{y}) = L(\hat{y}, y)$$
-
Derivation:
Given:
- \(\hat{y} = \sigma(yf(x))\),
- \(y \in \{-1, 1\}\),
- \(\hat{y}' = \sigma(f(x))\),
- \[y' = (1+y)/2 = \left\{\begin{array}{ll}{1} & {\text {for }} y = 1 \\ {0} & {\text {for }} y = -1\end{array}\right. \in \{0, 1\}\]
- We start with the modified binary cross-entropy
\(\begin{aligned} -y' \log \hat{y}'-(1-y') \log (1-\hat{y}') &= \left\{\begin{array}{ll}{-\log\hat{y}'} & {\text {for }} y' = 1 \\ {-\log(1-\hat{y}')} & {\text {for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\sigma(f(x))} & {\text {for }} y' = 1 \\ {-\log(1-\sigma(f(x)))} & {\text {for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\sigma(1\times f(x))} & {\text {for }} y' = 1 \\ {-\log(\sigma((-1)\times f(x)))} & {\text {for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\sigma(yf(x))} & {\text {for }} y' = 1 \\ {-\log(\sigma(yf(x)))} & {\text {for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\hat{y}} & {\text {for }} y' = 1 \\ {-\log\hat{y}} & {\text {for }} y' = 0\end{array}\right. \\ \\ &= -\log\hat{y} \\ \\ &= \log\left[\dfrac{1}{\hat{y}}\right] \\ \\ &= \log\left[\hat{y}^{-1}\right] \\ \\ &= \log\left[\sigma(yf(x))^{-1}\right] \\ \\ &= \log\left[ \left(\dfrac{1}{1+e^{-yf(x)}}\right)^{-1}\right] \\ \\ &= \log \left(1+e^{-yf(x)}\right)\end{aligned}\)
-
- CE and KL-Div:
When comparing a distribution \({\displaystyle q}\) against a fixed reference distribution \({\displaystyle p}\), cross entropy and KL divergence are identical up to an additive constant (since \({\displaystyle p}\) is fixed): both take on their minimal values when \({\displaystyle p=q}\), which is \({\displaystyle 0}\) for KL divergence, and \({\displaystyle \mathrm {H} (p)}\) for cross entropy. Basically, minimizing either will result in the same solution.
- Exponential Loss
- Formula:
$$L(\hat{y}, y) = e^{-\beta y \hat{y}}$$
- Properties:
- Convex
- Grows Exponentially for negative values making it more sensitive to outliers
- It penalizes incorrect predictions more than Hinge loss and has a larger gradient.
- Used in AdaBoost algorithm
- Perceptron Loss
- Formula:
$$L(\hat{y}, y) = \begin{cases}0 & \text{if }\ y\hat{y} \geq 0\\ -y\hat{y} & \text{otherwise}\end{cases}$$
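A minimal sketch (the function names and margin grid are my own; \(\beta = 1\) is assumed for the exponential loss) that evaluates the classification losses above as functions of the margin \(m = y\hat{y}\), which makes their relative sensitivity to badly misclassified points easy to compare:

```python
import numpy as np

def zero_one(m):     return (m < 0).astype(float)      # 0-1 loss
def square(m):       return (1.0 - m) ** 2              # square loss
def hinge(m):        return np.maximum(0.0, 1.0 - m)    # hinge loss
def logistic(m):     return np.log1p(np.exp(-m))        # logistic / log loss
def exponential(m):  return np.exp(-m)                   # exponential loss (beta = 1)
def perceptron(m):   return np.maximum(0.0, -m)         # perceptron loss

margins = np.linspace(-3, 3, 7)
for name, fn in [("0-1", zero_one), ("square", square), ("hinge", hinge),
                 ("logistic", logistic), ("exponential", exponential), ("perceptron", perceptron)]:
    print(f"{name:12s}", np.round(fn(margins), 3))
# square/exponential blow up for large negative margins (outlier-sensitive),
# while hinge/logistic/perceptron grow only linearly.
```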
- Analysis
- Logistic vs Hinge Loss:
Both the logistic and hinge losses grow only linearly for large negative margins, so their sensitivity to outliers is comparable; the key difference is that the hinge loss is exactly zero once \(y\hat{y} \geq 1\), while the logistic loss assigns a small nonzero penalty to every point. - Cross-Entropy vs MSE:
Basically, CE > MSE because, with a sigmoid output \(z\), the gradient of the MSE contains the factor \(z(1-z)\), which saturates when \(z\) is near \(0\) or \(1\); the gradient becomes very small and training slows down.
CE > classification (\(0-1\)) loss because the \(0-1\) loss is binary and doesn’t take into account “how well” we are actually approximating the probabilities, as opposed to just having the target class be slightly higher than the rest (e.g. \([c_1=0.3, c_2=0.3, c_3=0.4]\)).
Information Theory
-
What is Information Theory? In the context of ML?
Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. In the context of machine learning, we can also apply information theory to continuous variables, where some of these message-length interpretations do not apply; instead, we mostly use a few key ideas from information theory to characterize probability distributions or to quantify similarity between probability distributions.
-
Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative. Thus, information theory quantifies information in a way that formalizes this intuition:
- Likely events should have low information content - in the extreme case, guaranteed events have no information at all
- Less likely events should have higher information content
- Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
- Measuring Information - Definitions and Formulas:
- In Shannon’s Theory, how do we quantify “transmitting 1 bit of information”?
To transmit \(1\) bit of information means to divide the recipient’s uncertainty by a factor of \(2\). - What is the amount of information transmitted?
The amount of information transmitted is the logarithm (base \(2\)) of the uncertainty reduction factor. - What is the uncertainty reduction factor?
It is the inverse of the probability of the event being communicated. - What is the amount of information in an event \(x\)?
The amount of information in an event \(\mathbf{x} = x\), called the Self-Information is:$$I(x) = \log (1/p(x)) = -\log(p(x))$$
- Define the Self-Information - Give the formula:
The Self-Information or surprisal is a synonym for the surprise when a random variable is sampled.
The Self-Information of an event \(\mathrm{x} = x\):$$I(x) = - \log P(x)$$
- What is it defined with respect to?
Self-information deals only with a single outcome.
-
Define Shannon Entropy - What is it used for?
Shannon Entropy is defined as the average amount of information produced by a stochastic source of data.
Equivalently, the amount of information that you get from one sample drawn from a given probability distribution \(p\). To quantify the amount of uncertainty in an entire probability distribution, we use Shannon Entropy.
$$H(x) = {\displaystyle \operatorname {E}_{x \sim P} [I(x)]} = - {\displaystyle \operatorname {E}_{x \sim P} [\log P(x)] = -\sum_{i=1}^{n} p\left(x_{i}\right) \log p\left(x_{i}\right)}$$
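A tiny sketch (base-\(2\) logs; the example distributions are made up) computing self-information and Shannon entropy:

```python
import numpy as np

def self_information(p):
    """I(x) = -log2 p(x): rarer events carry more information (in bits)."""
    return -np.log2(p)

def entropy(p):
    """H = E[I(x)] = -sum_i p_i log2 p_i: average information per sample."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log2(p)).sum())

print(self_information(0.5), self_information(1 / 1024))  # 1 bit vs 10 bits
print(entropy([0.5, 0.5]))      # fair coin: 1 bit (maximal for two outcomes)
print(entropy([0.99, 0.01]))    # nearly deterministic: low entropy
```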
- Describe how Shannon Entropy relates to distributions with a graph:
- Define Differential Entropy:
Differential Entropy is Shannon’s entropy of a continuous random variable \(x\). - How does entropy characterize distributions?
Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. -
Define Relative Entropy - Give its formula:
The Kullback–Leibler divergence (Relative Entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution. Mathematically:
$${\displaystyle D_{\text{KL}}(P\parallel Q)=\operatorname{E}_{x \sim P} \left[\log \dfrac{P(x)}{Q(x)}\right]=\operatorname{E}_{x \sim P} \left[\log P(x) - \log Q(x)\right]}$$
- Discrete:
$${\displaystyle D_{\text{KL}}(P\parallel Q)=\sum_{i}P(i)\log \left({\frac {P(i)}{Q(i)}}\right)}$$
- Continuous:
$${\displaystyle D_{\text{KL}}(P\parallel Q)=\int_{-\infty }^{\infty }p(x)\log \left({\frac {p(x)}{q(x)}}\right)\,dx,}$$
- Give an interpretation:
- Discrete variables:
It is the extra amount of information needed to send a message containing symbols drawn from probability distribution \(P\), when we use a code that was designed to minimize the length of messages drawn from probability distribution \(Q\).
- List the properties:
- Non-Negativity:
\({\displaystyle D_{\mathrm {KL} }(P\|Q) \geq 0}\) - \({\displaystyle D_{\mathrm {KL} }(P\|Q) = 0 \iff}\) \(P\) and \(Q\) are:
- Discrete Variables:
the same distribution - Continuous Variables:
equal “almost everywhere”
- Additivity for Independent (product) Distributions, i.e. \(P = P_1 \times P_2\) and \(Q = Q_1 \times Q_2\):
\({\displaystyle D_{\text{KL}}(P\parallel Q)=D_{\text{KL}}(P_{1}\parallel Q_{1})+D_{\text{KL}}(P_{2}\parallel Q_{2}).}\) - \({\displaystyle D_{\mathrm {KL} }(P\|Q) \neq D_{\mathrm {KL} }(Q\|P)}\)
This asymmetry means that there are important consequences to the choice of the ordering
- Convexity in the pair of PMFs \((p, q)\) (i.e. \({\displaystyle (p_{1},q_{1})}\) and \({\displaystyle (p_{2},q_{2})}\) are two pairs of PMFs):
\({\displaystyle D_{\text{KL}}(\lambda p_{1}+(1-\lambda )p_{2}\parallel \lambda q_{1}+(1-\lambda )q_{2})\leq \lambda D_{\text{KL}}(p_{1}\parallel q_{1})+(1-\lambda )D_{\text{KL}}(p_{2}\parallel q_{2}){\text{ for }}0\leq \lambda \leq 1.}\)
- Describe it as a distance:
Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions.
However, it is not a true distance metric: it is not symmetric, and it also fails to satisfy the triangle inequality. The KL-divergence is therefore a divergence rather than a metric; it is non-negative and vanishes iff the two distributions are equal, but the remaining metric axioms do not hold.
- List the applications of relative entropy:
Characterizing:
- Relative (Shannon) entropy in information systems
- Randomness in continuous time-series
- Information gain when comparing statistical models of inference
- How does the direction of minimization affect the optimization:
Suppose we have a distribution \(p(x)\) and we wish to approximate it with another distribution \(q(x)\).
We have a choice of minimizing either:- \({\displaystyle D_{\text{KL}}(p\|q)} \implies q^\ast = \operatorname {arg\,min}_q {\displaystyle D_{\text{KL}}(p\|q)}\)
Produces an approximation that usually places high probability anywhere that the true distribution places high probability. - \({\displaystyle D_{\text{KL}}(q\|p)} \implies q^\ast = \operatorname {arg\,min}_q {\displaystyle D_{\text{KL}}(q\|p)}\)
Produces an approximation that rarely places high probability anywhere that the true distribution places low probability. The two objectives differ due to the asymmetry of the KL-divergence.
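A rough discrete illustration of the two directions (every number here is hypothetical, and the single-bump family \(q\) is just a stand-in for a simple approximating model): fit a unimodal \(q\) to a bimodal \(p\) by grid search and compare the minimizers of \(D_{\text{KL}}(p\|q)\) and \(D_{\text{KL}}(q\|p)\):

```python
import numpy as np

def kl(p, q):
    """Discrete D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return float(np.sum(p * np.log(p / q)))

xs = np.arange(10)

# Bimodal "true" distribution p with well-separated modes near 1 and 8.
p = np.exp(-0.5 * ((xs - 1) / 0.6) ** 2) + np.exp(-0.5 * ((xs - 8) / 0.6) ** 2)
p = np.clip(p, 1e-12, None)
p /= p.sum()

def bump_q(mu, sigma):
    """Unimodal candidate q: a (clipped, normalized) Gaussian bump."""
    q = np.clip(np.exp(-0.5 * ((xs - mu) / sigma) ** 2), 1e-12, None)
    return q / q.sum()

grid = [(mu, sigma) for mu in np.linspace(0, 9, 46) for sigma in np.linspace(0.5, 6, 23)]
best_fwd = min(grid, key=lambda t: kl(p, bump_q(*t)))  # argmin D_KL(p||q): mode-covering
best_rev = min(grid, key=lambda t: kl(bump_q(*t), p))  # argmin D_KL(q||p): mode-seeking
print("argmin D_KL(p||q):", best_fwd)   # a wide q spreading over both modes
print("argmin D_KL(q||p):", best_rev)   # a narrow q locked onto a single mode
```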
- Define Cross Entropy - Give it’s formula:
The Cross Entropy between two probability distributions \({\displaystyle p}\) and \({\displaystyle q}\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution \({\displaystyle q}\), rather than the “true” distribution \({\displaystyle p}\).$$H(p,q) = \operatorname{E}_{p}[-\log q]= H(p) + D_{\mathrm{KL}}(p\|q) =-\sum_{x }p(x)\,\log q(x)$$
- What does it measure?
The average number of bits that need to be transmitted using a different probability distribution \(q\) (for encoding) than the “true” distribution \(p\), to convey the information in \(p\). - How does it relate to relative entropy?
It is similar to KL-Div but with an additional quantity - the entropy of \(p\). - When are they equivalent?
Minimizing the cross-entropy with respect to \(Q\) is equivalent to minimizing the KL divergence, because \(Q\) does not participate in the omitted term.
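A short numeric check (with hypothetical distributions) of the identity \(H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)\), which is why minimizing either quantity w.r.t. \(q\) gives the same solution:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # model / coding distribution

H_p   = -np.sum(p * np.log(p))       # entropy H(p)
KL_pq =  np.sum(p * np.log(p / q))   # relative entropy D_KL(p||q)
CE_pq = -np.sum(p * np.log(q))       # cross-entropy H(p, q)

print(CE_pq, H_p + KL_pq)   # identical; only the KL term depends on q
```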
Recommendation Systems
- Describe the different algorithms for recommendation systems:
Ensemble Learning
- What are the two paradigms of ensemble methods?
- Parallel
- Sequential
- Random Forest VS GBM?
The fundamental difference is that a random forest uses bagging (trees grown independently on bootstrap samples and aggregated by averaging/voting) to make predictions, while a GBM uses boosting (trees grown sequentially, each one fitting the errors of the current ensemble).
Data Processing and Analysis
- What are 3 data preprocessing techniques to handle outliers?
- Winsorizing/Winsorization (cap at threshold).
- Transform to reduce skew (using Box-Cox or similar).
- Remove outliers if you’re certain they are anomalies or measurement errors.
- Describe the strategies to dimensionality reduction?
- Feature Selection
- Feature Projection/Extraction
- What are 3 ways of reducing dimensionality?
- Removing Collinear Features
- Performing PCA, ICA, etc.
- Feature Engineering
- AutoEncoder
- Non-negative matrix factorization (NMF)
- LDA
- MDS (Multidimensional Scaling)
- List methods for Feature Selection
- Variance Threshold: normalize first (variance depends on scale)
- Correlation Threshold: remove the one with larger mean absolute correlation with other features.
- Genetic Algorithms
- Stepwise Search: bad performance, regularization much better, it’s a greedy algorithm (can’t account for future effects of each change)
- LASSO, Elastic-Net
- List methods for Feature Extraction
- PCA, ICA, CCA
- AutoEncoders
- LDA: LDA is a supervised linear transformation technique, since the dependent variable (the class label) is considered in the model. Unlike PCA, it extracts the \(k\) new independent variables that maximize the separation between the classes of the dependent variable.
  - Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
- Latent Semantic Analysis
- Isomap
- How to detect correlation of “categorical variables”?
- Chi-Squared test: it is a statistical test applied to the groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
- Feature Importance
- Use linear regression and select variables based on \(p\) values
- Use Random Forest, Xgboost and plot variable importance chart
- Lasso
- Measure information gain for the available set of features and select top \(n\) features accordingly.
- Use Forward Selection, Backward Selection, Stepwise Selection
- Remove the correlated variables prior to selecting important variables
- In linear models, feature importance can be calculated by the scale of the coefficients
- In tree-based methods (such as random forest), important features are likely to appear closer to the root of the tree. We can get a feature’s importance for random forest by computing the average depth at which it appears across all trees in the forest
- Can we capture the correlation between a continuous and a categorical variable? If yes, how?
Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables. - What cross validation technique would you use on time series data set?
Forward chaining strategy with k folds. - How to deal with missing features? (Imputation?)
- Assign a unique category to missing values, who knows the missing values might decipher some trend.
- Remove them outright
- We can check their distribution against the target variable, and if we find any pattern we keep those missing values and assign them a new category, while removing the rest.
- Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?
For better predictions, a categorical variable should be treated as a continuous variable only when the variable is ordinal in nature. - What are collinearity and multicollinearity?
- Collinearity occurs when two predictor variables (e.g., \(x_1\) and \(x_2\)) in a multiple regression have some correlation.
- Multicollinearity occurs when more than two predictor variables (e.g., \(x_1, x_2, \text{ and } x_3\)) are inter-correlated.
ML/Statistical Models
- What are parametric models?
Parametric models are those with a finite number of parameters. To predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs. - What is a classifier?
A function that maps…
K-NN
PCA
- What is PCA?
It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. - What is the Goal of PCA?
Given points \(\mathbf{x}_ i \in \mathbf{R}^d\), find k-directions that capture most of the variation. - List the applications of PCA:
- Find a small basis for representing variations in complex things.
e.g. faces, genes.
- Reducing the number of dimensions makes some computations cheaper.
- Remove irrelevant dimensions to reduce over-fitting in learning algorithms.
Like “subset selection” but the features are not axis aligned.
They are linear combinations of input features. - Represent the data with fewer parameters (dimensions)
- Give formulas for the following:
- Assumptions on \(X\):
The following formulas are valid only for (1) \(X\) with samples in rows and variables in columns, and (2) \(X\) centered (mean \(=0\)) - SVD of \(X\):
\(X = USV^{T}\) - Principal Directions/Axes:
\(V\) - Principal Components (scores):
\(US\) - The \(j\)-th principal component:
\(Xv_j = Us_j\)
- Define the transformation, mathematically:
Mathematically, the transformation is defined by a set of \(p\)-dimensional vectors of weights or coefficients \({\displaystyle \mathbf {v}_ {(k)}=(v_{1},\dots ,v_{p})_ {(k)}}\) that map each row vector \({\displaystyle \mathbf {x}_ {(i)}}\) of \(X\) to a new vector of principal component scores \({\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_ {(i)}}\), given by:$${\displaystyle {t_{k}}_{(i)}=\mathbf {x}_ {(i)}\cdot \mathbf {v}_ {(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l}$$
in such a way that the individual variables \({\displaystyle t_{1},\dots ,t_{l}}\) of \(t\) considered over the data set successively inherit the maximum possible variance from \(X\), with each coefficient vector \(v\) constrained to be a unit vector (where \(l\) is usually selected to be less than \({\displaystyle p}\) to reduce dimensionality).
- What does PCA produce/result in?
- Finds a lower dimensional subspace spanned by what?:
Finds a lower dimensional subspace spanned by PCs. - Finds a lower dimensional subspace that minimizes what?:
Finds a lower dimensional subspace (PCs) that Minimizes the RSS of projection errors. - What does each PC have (properties)?
Produces a vector (1st PC) with the highest possible variance, each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. - What does the procedure find in terms of a “basis”?
Results in an uncorrelated orthogonal basis set. - What does the procedure find in terms of axes? (where do they point?):
PCA constructs new axes that point to the directions of maximal variance (in the original variable space)
- Describe the PCA algorithm:
- Data Preprocessing:
- Training set: \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\)
- Preprocessing (feature scaling + mean normalization):
- mean normalization:
\(\mu_{j}=\frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}\)
Replace each \(x_{j}^{(i)}\) with \(x_j^{(i)} - \mu_j\) - feature scaling:
If different features are on different scales, scale the features to have a comparable range
\(s_j = S.D(X_j)\) (the standard deviation of feature \(j\))
Replace each \(x_{j}^{(i)}\) with \(\dfrac{x_j^{(i)} - \mu_j}{s_j}\)
- Computing the Principal Components:
- Compute the SVD of the matrix \(X = U S V^T\)
- Compute the Principal Components:
$$T = US = XV$$
Note: The \(j\)-th principal component is: \(Xv_j\)
- Choose the top \(k\) singular values in \(S\), giving \(S_k\)
- Compute the Truncated Principal Components:
$$T_k = US_k$$
- Computing the Low-rank Approximation Matrix \(X_k\):
- Compute the reconstruction matrix:
$$X_k = T_kV^T = US_kV^T$$
- Compute the reconstruction matrix:
- What Data Processing needs to be done?
- Preprocessing (feature scaling + mean normalization):
- mean normalization:
\(\mu_{j}=\frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}\)
Replace each \(x_{j}^{(i)}\) with \(x_j^{(i)} - \mu_j\) - feature scaling:
If different features are on different scales, scale the features to have a comparable range
\(s_j = S.D(X_j)\) (the standard deviation of feature \(j\))
Replace each \(x_{j}^{(i)}\) with \(\dfrac{x_j^{(i)} - \mu_j}{s_j}\)
- How to compute the Principal Components?
- Compute the SVD of the matrix \(X = U S V^T\)
- Compute the Principal Components:
$$T = US = XV$$
- How do you compute the Low-Rank Approximation Matrix \(X_k\)?
- Choose the top \(k\) singular values in \(S\), giving \(S_k\)
- Compute the Truncated Principal Components:
$$T_k = US_k$$
- Compute the reconstruction matrix:
$$X_k = T_kV^T = US_kV^T$$
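A minimal numpy sketch of the algorithm above (centering, SVD, scores, rank-\(k\) reconstruction, and the retained-variance ratio); the variable names and the toy data are my own, and the optional feature-scaling step is omitted:

```python
import numpy as np

def pca_svd(X, k):
    """PCA of X (samples in rows) via the SVD, following the steps above (centering only)."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # mean normalization (centering)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                                     # principal component scores, T = U S = Xc V
    T_k = T[:, :k]                                # truncated scores
    X_k = T_k @ Vt[:k, :] + mu                    # rank-k reconstruction X_k = U S_k V^T (mean added back)
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()   # ratio of variance "retained"
    return T_k, Vt[:k].T, X_k, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]]) + 5.0
scores, axes, X1, ratio = pca_svd(X, k=1)
print("explained variance ratio (k=1):", round(float(ratio), 3))
```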
- Describe the Optimality of PCA:
Optimal for Finding a lower dimensional subspace (PCs) that Minimizes the RSS of projection errors. - List limitations of PCA:
- PCA is highly sensitive to the (relative) scaling of the data; no consensus on best scaling.
- Intuition:
- PCA can be thought of as fitting a \(p\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information.
- Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data.
- What property of the internal structure of the data does PCA reveal/explain?
The variance in the data. - What object does it fit to the data?:
- PCA can be thought of as fitting a \(p\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component.
- Should you remove correlated features before PCA?
Yes. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated. - How can we measure the “Total Variance” of the data?
**The Total Variance** of the data can be expressed as the sum of all the eigenvalues:$$\mathbf{Tr}\, \Sigma = \mathbf{Tr} (U \Lambda U^T) = \mathbf{Tr} (U^T U \Lambda) = \mathbf{Tr}\, \Lambda = \lambda_1 + \ldots + \lambda_n.$$
- How can we measure the “Total Variance” of the projected data?
**The Total Variance** of the Projected data is:$$\mathbf{Tr} (P \Sigma P^T ) = \lambda_1 + \lambda_2 + \cdots + \lambda_k. $$
- How can we measure the “Error in the Projection”?
**The Error in the Projection** could be measured with respect to variance.
- We define the ratio of variance “explained” by the projected data (equivalently, the ratio of information “retained”) as:
$$\dfrac{\lambda_1 + \ldots + \lambda_k}{\lambda_1 + \ldots + \lambda_n}. $$
- What does it mean when this ratio is high?
If the ratio is high, we can say that much of the variation in the data can be observed on the projected plane.
- How does PCA relate to CCA?
- CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while
- PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.
- How does PCA relate to ICA?
Independent component analysis (ICA) is directed to similar problems as principal component analysis, but finds additively separable components rather than successive approximations.
The Centroid Method
-
Define “The Centroid”:
In mathematics and physics, the centroid or geometric center of a plane figure is the arithmetic mean (“average”) position of all the points in the shape. The definition extends to any object in n-dimensional space: its centroid is the mean position of all the points in all of the coordinate directions.
-
Describe the Procedure:
Compute the mean (\(\mu_c\)) of all the vectors in class \(C\) and the mean (\(\mu_x\)) of all the vectors not in \(C\) - What is the Decision Function:
$$f(x) = (\mu_c - \mu_x) \cdot \vec{x} - (\mu_c - \mu_x) \cdot \dfrac{\mu_c + \mu_x}{2}$$
- Describe the Decision Boundary:
The decision boundary is a hyperplane that bisects the line segment with endpoints \(\mu_c\) and \(\mu_x\)
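A small numpy sketch of the centroid decision function above (the 2-D toy data are hypothetical):

```python
import numpy as np

def centroid_classifier(X_pos, X_neg):
    """Build f(x) = (mu_c - mu_x).x - (mu_c - mu_x).(mu_c + mu_x)/2; f(x) > 0 predicts class C."""
    mu_c, mu_x = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = mu_c - mu_x
    b = -w @ ((mu_c + mu_x) / 2)     # hyperplane bisects the segment between the two means
    return lambda x: x @ w + b

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=[2, 2], size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], size=(50, 2))
f = centroid_classifier(X_pos, X_neg)
print(np.mean(f(X_pos) > 0), np.mean(f(X_neg) < 0))   # per-class training accuracy
```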
K-Means
- What is K-Means?
It is a clustering algorithm. It aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean. It results in a partitioning of the data space into Voronoi Cells. - What is the idea behind K-Means?
- Minimize the aggregate intra-cluster distance
- Equivalent to minimizing the variance
- Thus, it finds \(k-\)clusters with minimum aggregate variance
- What does K-Mean find?
It finds \(k-\)clusters with minimum aggregate variance. - Formal Description of the Model:
Given a set of observations \(\left(\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n}\right)\), \(\mathbf{x}_ i \in \mathbb{R}^d\), the algorithm aims to partition the \(n\) observations into \(k\) sets \(\mathbf{S}=\left\{S_{1}, S_{2}, \ldots, S_{k}\right\}\) so as to minimize the intra-cluster Sum-of-Squares (i.e. variance). - What is the Objective?
$$\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}=\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k}\left|S_{i}\right| \operatorname{Var} S_{i}$$
where \(\boldsymbol{\mu}_i\) is the mean of points in \(S_i\).
- Description of the Algorithm:
- Choose \(k\) random points; call them “Centroids”
- Assign each data point (Euclidean-wise) to its closest Centroid
- Compute the mean of each “group”/cluster of points
- Re-Assign the centroids to the newly computed Means ↑
- REPEAT (until the assignments stop changing)!
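A compact sketch of the procedure above (Lloyd-style iterations); initializing the centroids at \(k\) randomly chosen data points is an assumption, and the toy data are made up:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # k random points as centroids
    for _ in range(n_iter):
        # Assignment ("expectation") step: each point goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Update ("maximization") step: move each centroid to the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in ([0, 0], [4, 4], [0, 4])])
centers, labels = kmeans(X, k=3)
print(np.round(centers, 2))
```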
- What is the Optimization method used? What class does it belong to?
Coordinate descent. Expectation-Maximization.
- How does the optimization method relate to EM?
The “assignment” step is referred to as the “expectation step”, while the “update step” is a maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
- What is the Complexity of the algorithm?
The original formulation of the problem is NP-Hard; however, EM algorithms (specifically, Coordinate-Descent) can be used as efficient heuristic algorithms that converge quickly to good local minima. - Describe the convergence and prove it:
Guaranteed to converge after a finite number of iterations to a local minimum.-
Proof:
The algorithm monotonically decreases a non-negative energy (objective) function, and there are only finitely many possible partitions of the points:
By the Monotone Convergence Theorem, the objective value converges.
-
- Describe the Optimality of the Algorithm:
- Locally optimal: due to convergence property
- Non-Globally optimal:
- The objective function is non-convex
- Moreover, coordinate descent is not guaranteed to converge to the global minimum on non-convex functions.
- Derive the estimated parameters of the algorithm:
- Objective Function:
$$J(S, \mu)= \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}} \| \mathbf{x} -\mu_i \|^{2}$$
- Optimization Objective:
$$\min _{\mu} \min _{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} -\mu_{i}\right\|^{2}$$
- Derivation:
- Fix \(S = \hat{S}\), optimize \(\mu\):
$$\begin{aligned} & \min _{\mu} \sum_{i=1}^{k} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x}-\mu_{i}\right\|^{2}\\ =& \sum_{i=1}^{k} \min _{\mu_i} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x} - \mu_{i}\right\|^{2} \end{aligned}$$
- MLE:
$$\min _{\mu_i} \sum_{\mathbf{x} \in \hat{S}_{i}}\left\|\mathbf{x} - \mu_{i}\right\|^{2}$$
\(\implies\)
$${\displaystyle \hat{\mu_i} = \dfrac{\sum_{\mathbf{x} \in \hat{S}_ {i}} \mathbf{x}}{\vert\hat{S}_ i\vert}}$$
- Fix \(\mu_i = \hat{\mu_i}, \forall i\), optimize \(S\):
$$\arg \min _{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} - \hat{\mu_{i}}\right\|^{2}$$
\(\implies\)
$$S_{i}^{(t)}=\left\{\mathbf{x}_{p} :\left\|\mathbf{x}_{p}-\hat{\mu}_{i}\right\|^{2} \leq\left\|\mathbf{x}_{p}-\hat{\mu}_{j}\right\|^{2} \quad \forall j,\ 1 \leq j \leq k\right\}$$
- When does K-Means fail to give good results?
K-means clustering algorithm fails to give good results when:
- the data contains outliers
- the density spread of data points across the data space is different
- and the data points follow nonconvex shapes.
Naive Bayes
- Define:
- Naive Bayes:
It is a simple technique used for constructing classifiers. - Naive Bayes Classifiers:
A family of simple probabilistic classifiers based on applying bayes theorem with strong (naive) independence assumptions between the features. - Bayes Theorem:
\(p(x\vert y) = \dfrac{p(y\vert x) p(x)}{p(y)}\)
- Naive Bayes:
- List the assumptions of Naive Bayes:
- Conditional Independence: the features are conditionally independent from each other given a class \(C_k\)
- Bag-of-words: The relative importance (positions) of the features do not matter
- List some properties of Naive Bayes:
- Not a Bayesian method
- It’s a Bayes Classifier: minimizes the probability of misclassification
- Is it a Bayesian Method or Frequentest Method?
Not a Bayesian method - Is it a Bayes Classifier? What does that mean?:
It’s a Bayes Classifier: minimizes the probability of misclassification
- Define the Probabilistic Model for the method:
Naive Bayes is a conditional probability model:
given a problem instance to be classified represented by a vector \(\boldsymbol{x} = (x_1, \cdots, x_n)\) of \(n\) features/words (independent variables), it assigns to this instance probabilities:$$P(C_k\vert \boldsymbol{x}) = p(C_k\vert x_1, \cdots, x_n)$$
for each of the \(k\) classes.
Using Bayes theorem to decompose the conditional probability:
$$p(C_k\vert \boldsymbol{x}) = \dfrac{p(\boldsymbol{x}\vert C_k) p(C_k)}{p(\boldsymbol{x})}$$
Notice that the numerator is equivalent to the joint probability distribution:
$$p\left(C_{k}\right) p\left(\mathbf{x} | C_{k}\right) = p\left(C_{k}, x_{1}, \ldots, x_{n}\right)$$
Using the Chain-Rule for repeated application of the conditional probability, the joint probability model can be rewritten as:
$$p(C_{k},x_{1},\dots ,x_{n})\, = p(x_{1}\mid x_{2},\dots ,x_{n},C_{k})p(x_{2}\mid x_{3},\dots ,x_{n},C_{k})\dots p(x_{n-1}\mid x_{n},C_{k})p(x_{n}\mid C_{k})p(C_{k})$$
Using the Naive Conditional Independence assumptions:
$$p\left(x_{i} | x_{i+1}, \ldots, x_{n}, C_{k}\right)=p\left(x_{i} | C_{k}\right)$$
Thus, we can write the joint model as:
$${\displaystyle {\begin{aligned}p(C_{k}\mid x_{1},\dots ,x_{n})&\varpropto p(C_{k},x_{1},\dots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}}$$
Finally, the conditional distribution over the class variable \(C\) is:
$${\displaystyle p(C_{k}\mid x_{1},\dots ,x_{n})={\frac {1}{Z}}p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})}$$
where \({\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})}\) is a constant scaling factor, dependent only on the (known) feature variables \(x_i\).
- What kind of model is it?
Conditional Probability Model. - What is a conditional probability model?
A model that assigns probabilities to an input \(\boldsymbol{x}\) conditioned on being a member of each class in a set of \(k\) classes \(C_1, \cdots, C_k\). - Decompose the conditional probability w/ Bayes Theorem:
$$p(C_k\vert \boldsymbol{x}) = \dfrac{p(\boldsymbol{x}\vert C_k) p(C_k)}{p(\boldsymbol{x})}$$
- How does the new expression incorporate the joint probability model?
We notice that the numerator is equivalent to the joint probability distribution:$$p\left(C_{k}\right) p\left(\mathbf{x} | C_{k}\right) = p\left(C_{k}, x_{1}, \ldots, x_{n}\right)$$
- Use the chain rule to re-write the joint probability model:
$$p(C_{k},x_{1},\dots ,x_{n})\, = p(x_{1}\mid x_{2},\dots ,x_{n},C_{k})p(x_{2}\mid x_{3},\dots ,x_{n},C_{k})\dots p(x_{n-1}\mid x_{n},C_{k})p(x_{n}\mid C_{k})p(C_{k})$$
- Use the Naive Conditional Independence assumption to rewrite the joint model:
$${\displaystyle {\begin{aligned}p(C_{k}\mid x_{1},\dots ,x_{n})&\varpropto p(C_{k},x_{1},\dots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}}$$
- What is the conditional distribution over the class variable \(C_k\):
$${\displaystyle p(C_{k}\mid x_{1},\dots ,x_{n})={\frac {1}{Z}}p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})}$$
where \({\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})}\) is a constant scaling factor, dependent only on the (known) feature variables \(x_i\).
-
Construct the classifier. What are its components? Formally define it.
The classifier is made of (1) the conditional probability model (above) \(\:\) and (2) a decision rule. The decision rule used is the MAP hypothesis: i.e. pick the hypothesis that is most probable (maximize the posterior \(\propto\) likelihood \(\times\) prior).
The classifier is the function that assigns a class label \(\hat{y} = C_k\) for some \(k\) as follows:
$$\hat{y}=\underset{k \in\{1, \ldots, K\}}{\operatorname{argmax}} p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} | C_{k}\right)$$
- What’s the decision rule used?
we commonly use the Maximum A Posteriori (MAP) hypothesis, as the decision rule; i.e. pick the hypothesis that is most probable. - List the difference between the Naive Bayes Estimate and the MAP Estimate:
- MAP Estimate:
$${\displaystyle {\hat {y}_{\text{MAP}}}={\underset {k\in \{1,\dots ,K\}}{\operatorname {argmax} }}\ p(C_{k})\ p(\mathbf {x} \mid C_{k})}$$
- Naive Bayes Estimate:
$${\displaystyle {\hat {y}_{\text{NB}}}={\underset {k\in \{1,\dots ,K\}}{\operatorname {argmax} }}\ p(C_{k})\displaystyle \prod _{i=1}^{n}p(x_{i}\mid C_{k})}$$
- What are the parameters to be estimated for the classifier?:
- The conditional probability (likelihood) of each feature/word given a class
- The prior probability of each class
- What method do we use to estimate the parameters?:
Maximum Likelihood Estimation (MLE). - What are the estimates for each of the following parameters?:
- The prior probability of each class:
\(\hat{P}(C_k) = \dfrac{\text{doc-count}(C=C_k)}{N_\text{doc}}\), - The conditional probability of each feature (word) given a class:
\(\hat{P}(x_i | C_j) = \dfrac{\text{count}(x_i,C_j)}{\sum_{x \in V} \text{count}(x, C_j)}\)
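A toy sketch of the estimates above with the MAP decision rule (the word counts are hypothetical, and the \(+1\) Laplace smoothing is an assumption not stated in the formulas):

```python
import numpy as np
from collections import Counter

docs = [("buy cheap pills now", "spam"),
        ("cheap cheap buy", "spam"),
        ("meeting schedule for tomorrow", "ham"),
        ("tomorrow we discuss the schedule", "ham")]

classes = sorted({c for _, c in docs})
vocab = sorted({w for text, _ in docs for w in text.split()})

# Priors:  P(C_k) = doc-count(C_k) / N_doc
prior = {c: sum(1 for _, y in docs if y == c) / len(docs) for c in classes}
# Likelihoods:  P(x_i | C_j) = count(x_i, C_j) / sum_x count(x, C_j)   (+1 smoothing)
counts = {c: Counter(w for text, y in docs if y == c for w in text.split()) for c in classes}

def cond(w, c):
    return (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))

def predict(text):
    # argmax_k  log P(C_k) + sum_i log P(x_i | C_k)
    scores = {c: np.log(prior[c]) + sum(np.log(cond(w, c)) for w in text.split() if w in vocab)
              for c in classes}
    return max(scores, key=scores.get)

print(predict("cheap pills tomorrow"))   # "spam" under this toy model
```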
CNNs
-
What is a CNN?
A convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
- What kind of data does it work on? What is the mathematical property?
In general, it works on data that have grid-like topology.
- What are the layers of a CNN?
$$\text{Input} \longrightarrow \left[\text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Pooling} \rightarrow \text{Norm}\right] \times N \longrightarrow \text{FC}$$
- What are the four important ideas and their benefits that the convolution affords CNNs:
-
Sparse Interactions/Connectivity/Weights:
Unlike FNNs, where every input unit is connected to every output unit, CNNs have sparse interactions. This is accomplished by making the kernel smaller than the input. Benefits:
- This means that we need to store fewer parameters, which both,
- Reduces the memory requirements of the model and
- Improves its statistical efficiency
- Also, Computing the output requires fewer operations
- In deep CNNs, the units in the deeper layers interact indirectly with large subsets of the input which allows modelling of complex interactions through sparse connections.
-
Parameter Sharing:
refers to using the same parameter for more than one function in a model. Benefits:
- This means that rather than learning a separate set of parameters for every location, we learn only one set of parameters.
- This does not affect the runtime of forward propagation—it is still \(\mathcal{O}(k \times n)\)
- But it does further reduce the storage requirements of the model to \(k\) parameters (\(k\) is usually several orders of magnitude smaller than \(m\))
-
Equivariant Representations:
For convolutions, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. Thus, if we move the object in the input, its representation will move the same amount in the output.
Benefits:
- It is most useful when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations (e.g. edge detection)
- Shifting the position of an object in the input doesn’t confuse the NN
- Robustness against translated inputs/images
-
Accepts inputs of variable sizes
-
- What is the inspirational model for CNNs:
Convolutional networks were inspired by biological processes in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex.
Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. - Describe the connectivity pattern of the neurons in a layer of a CNN:
The neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. - Describe the process of a ConvNet:
ConvNets transform the original image layer by layer from the original pixel values to the final class scores. - Convolution Operation:
-
Define:
- Formula (continuous):
- Formula (discrete):
- Define the following:
- Feature Map:
- Does the operation commute?
-
- Cross Correlation:
- Define:
- Formulae:
- What are the differences/similarities between convolution and cross-correlation:
- Write down the Convolution operation and the cross-correlation over two axes and:
- Convolution:
- Convolution (commutative):
- Cross-Correlation:
- The Convolutional Layer:
- What are the parameters and how do we choose them?
- Describe what happens in the forward pass:
- What is the output of the forward pass:
- How is the output configured?
- Spatial Arrangements:
- List the Three Hyperparameters that control the output volume:
- How to compute the spatial size of the output volume?
- How can you ensure that the input & output volume are the same?
- In the output volume, how do you compute the \(d\)-th depth slice:
- Calculate the number of parameters for the following config:
Given:
1. Input Volume: \(64\times64\times3\)
1. Filters: \(15 7\times7\)
1. Stride: \(2\)
1. Pad: \(3\) - Definitions:
- Receptive Field:
- Suppose the input volume has size \([ 32 × 32 × 3 ]\) and the receptive field (or the filter size) is \(5 × 5\) , then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
- How can we achieve the greatest reduction in the spatial dims of the network (for classification):
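As a worked example for the parameter-count question above, a small helper using the standard conv-layer formulas (these formulas and the floor division for strides that don't divide evenly are general conventions, not taken from this document):

```python
def conv_layer_stats(w_in, h_in, c_in, n_filters, f, stride, pad):
    """params = n_filters * (f*f*c_in + 1); spatial size = floor((W - F + 2P)/S) + 1."""
    params = n_filters * (f * f * c_in + 1)      # +1 bias per filter / output channel
    w_out = (w_in - f + 2 * pad) // stride + 1
    h_out = (h_in - f + 2 * pad) // stride + 1
    return params, (w_out, h_out, n_filters)

# Config from the question: 64x64x3 input, 15 filters of size 7x7, stride 2, pad 3.
print(conv_layer_stats(64, 64, 3, n_filters=15, f=7, stride=2, pad=3))
# -> (2220, (32, 32, 15)):  15 * (7*7*3 + 1) = 2220 parameters.
```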
- Pooling Layer:
- Define:
- List key ideas/properties and benefits:
- List the different types of Pooling:
Answer - List variations of pooling and their definitions:
- What is “Learned Pooling”:
- What is “Dynamical Pooling”:
- List the hyperparams of Pooling Layer:
- How to calculate the size of the output volume:
- How many parameters does the pooling layer have:
- What are other ways to perform downsampling:
- Weight Priors:
- Define “Prior Prob Distribution on the parameters”:
- Define “Weight Prior” and its types/classes:
- Weak Prior:
- Strong Prior:
- Infinitely Strong Prior:
- Describe the Conv Layer as a FC Layer using priors:
- What are the key insights of using this view:
- When is the prior imposed by convolution INAPPROPRIATE:
- What happens when the priors imposed by convolution and pooling are not suitable for the task?
- What kind of other models should Convolutional models be compared to? Why?:
- When do multi-channel convolutions commute?
Answer - Why do we use several different kernels in a given conv-layer?
- Strided Convolutions
- Define:
- What are they used for?
- What are they equivalent to?
- Formula:
- Zero-Padding:
- Definition/Usage:
- List the types of padding:
- Locally Connected Layers/Unshared Convolutions:
- Bias Parameter:
- How many bias terms are used per output channel in the tradional convolution:
- Dilated Convolutions
- Define:
- What are they used for?
- Stacked Convolutions
- Define:
- What are they used for?
Theory
RNNs
- What is an RNN?
- Definition:
- What machine-type is the standard RNN:
- What is the big idea behind RNNs?
- Dynamical Systems:
- Standard Form:
- RNN as a Dynamical System:
- Unfolding Computational Graphs
- Definition:
- List the Advantages introduced by unfolding and the benefits:
- Graph and write the equations of Unfolding hidden recurrence:
- Describe the State of the RNN, its usage, and extreme cases of the usage:
- RNN Architectures:
- List the three standard architectures of RNNs:
- Graph:
- Architecture:
- Equations:
- Total Loss:
- Complexity:
- Properties:
- Teacher Forcing:
- Definition:
- Application:
- Disadvantages:
- Possible Solutions for Mitigation:
Optimization
- Define the sigmoid function and some of its properties:
\(\sigma(x) = \dfrac{1}{1+e^{-x}}\); \(\:\) \(\sigma(-x) = 1 - \sigma(x)\); \(\:\) \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) - Backpropagation:
- Definition:
Backpropagation algorithms are a family of methods used to efficiently train artificial neural networks (ANNs) following a gradient descent approach that exploits the chain rule. - Derive Gradient Descent Update:
Answer - Explain the difference kinds of gradient-descent optimization procedures:
- Batch Gradient Descent AKA Vanilla Gradient Descent, computes the gradient of the objective wrt. the parameters \(\theta\) for the entire dataset:
$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J(\theta)$$
- SGD performs a parameter update for each data-point:
$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i)} ; y^{(i)}\right)$$
- Mini-batch Gradient Descent a hybrid approach that perform updates for a, pre-specified, mini-batch of \(n\) training examples:
$$\theta=\theta-\epsilon \cdot \nabla_{\theta} J\left(\theta ; x^{(i : i+n)} ; y^{(i : i+n)}\right)$$
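A schematic sketch of the three update rules above (linear regression with a squared loss is an assumed running example; learning rate, epochs, and batch size are arbitrary choices):

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of J(theta) = mean((X theta - y)^2) / 2 w.r.t. theta."""
    return X.T @ (X @ theta - y) / len(y)

def train(X, y, mode="minibatch", lr=0.1, epochs=100, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        if mode == "batch":                        # full-dataset gradient per update
            theta -= lr * grad(theta, X, y)
        else:
            idx = rng.permutation(len(y))
            step = 1 if mode == "sgd" else batch   # one example vs. a mini-batch per update
            for s in range(0, len(y), step):
                b = idx[s:s + step]
                theta -= lr * grad(theta, X[b], y[b])
    return theta

rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=200)
for mode in ("batch", "sgd", "minibatch"):
    print(mode, np.round(train(X, y, mode=mode), 2))
```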
- List the different optimizers and their properties:
Answer
- Definition:
- Error-Measures:
- Define what an error measure is:
Error Measures aim to answer the question:
“What does it mean for \(h\) to approximate \(f\) (\(h \approx f\))?”
The Error Measure: \(E(h, f)\)
It is almost always defined point-wise: \(\mathrm{e}(h(\mathbf{X}), f(\mathbf{X}))\). -
List the 5 most common error measures and where they are used:
- Specific Questions:
- Derive MSE carefully:
- Derive the Binary Cross-Entropy Loss function:
It is the log-likelihood of a Bernoulli probability model:$$\begin{array}{c}{L(p)=p^{y}(1-p)^{1-y}} \\ {\log (L(p))=y \log p+(1-y) \log (1-p)}\end{array}$$
- Explain the difference between Cross-Entropy and MSE and which is better (for what task)?
- Describe the properties of the Hinge loss and why it is used?
- Hinge loss upper bounds 0-1 loss
- It is the tightest convex upper bound on the 0/1 loss
- Minimizing 0-1 loss is NP-hard in the worst-case
- Show that the weight vector of a linear signal is orthogonal to the decision boundary?
The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:$$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$
- What does it mean for a function to be well-behaved from an optimization pov?
The well-behaved property from an optimization standpoint, implies that \(f''(x)\) doesn’t change too much or too rapidly, leading to a nearly quadratic function that is easy to optimize by gradient methods. - Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation
$$\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{n}-y_{n}\right)^{2} = \frac{1}{N}\|\mathrm{Xw}-\mathrm{y}\|^{2}$$
- Compute:
- \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
\(\dfrac{\partial}{\partial y} \vert{x-y}\vert = - \text{sign}(x-y)\)
- \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
- State the difference between SGD and GD?
Gradient Descent computes the gradient of the cost function over ALL training samples for each update.
Stochastic Gradient Descent computes the gradient from only ONE training sample, chosen at random, for each update. - When would you use GD over SGD, and vice-versa?
GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large.
That means GD is preferable for small datasets while SGD is preferable for larger ones. - What is a convex hull?
In the case of linearly separable data, the convex hull represents the outer boundaries of the two groups of data points. Once the convex hulls are created, we get the maximum margin hyperplane (MMH) as a perpendicular bisector between the two convex hulls. The MMH is the line which attempts to create the greatest separation between the two groups. - OLS vs MLE
They both estimate the parameters of a model. They yield the same estimates when the noise/errors are normally distributed (least squares is then the maximum-likelihood solution).
ML Theory
- Explain intuitively why Deep Learning works?
Circuit Theory: There are functions you can compute with a “small” \(L\)-layer deep NN that shallower networks require exponentially more hidden units to compute (comes from looking at networks as logic gates). - Example:
Computing \(x_1 \text{ XOR } x_2 \text{ XOR } \ldots \text{ XOR } x_n\) takes: - \(\mathcal{O}(\log(n))\) in a tree representation.
- \(\mathcal{O}(2^n)\) in a one-hidden-layer network, because you need to exhaustively enumerate all possible \(2^n\) configurations of the input bits that result in the \(\text{XOR}\) being \(\{1, 0\}\).
- List the different types of Learning Tasks and their definitions:
- Multi-Task Learning: general term for training on multiple tasks
- Joint Learning: by choosing mini-batches from two different tasks simultaneously/alternately
- Pre-Training: first train on one task, then train on another
widely used for word embeddings
- Transfer Learning:
a type of multi-task learning where we are focused on one task; by learning on another task then applying those models to our main task - Domain Adaptation:
a type of transfer learning, where the output is the same, but we want to handle different inputs/topics/genres - Zero-Shot Learning: is a form of extending supervised learning to a setting of solving for example a classification problem when not enough labeled examples are available for all classes.
“Zero-shot learning is being able to solve a task despite not having received any training examples of that task.” - Goodfellow
- Describe the relationship between supervised and unsupervised learning?
answer
Many ml algorithms can be used to perform both tasks. E.g., the chain rule of probability states that for a vector \(x \in \mathbb{R}^n\), the joint distribution can be decomposed as:
\(p(\mathbf{x})=\prod_{i=1}^{n} p\left(\mathrm{x}_{i} | \mathrm{x}_{1}, \ldots, \mathrm{x}_{i-1}\right)\)
which implies that we can solve the Unsupervised problem of modeling \(p(x)\) by splitting it into \(n\) supervised learning problems.
Alternatively, we can solve the supervised learning problem of learning \(p(y \vert x)\) by using traditional unsupervised learning technologies to learn the joint distribution \(p(x, y)\), then inferring:
\(p(y | \mathbf{x})=\frac{p(\mathbf{x}, y)}{\sum_{y} p\left(\mathbf{x}, y^{\prime}\right)}\) - Describe the differences between Discriminative and Generative Models?
- Generative Model: learns the joint probability distribution, “probability of \(\boldsymbol{x}\) and \(y\)”:
$$P(\mathbf{x}, y)$$
- Discriminative Model: learns the conditional probability distribution, “probability of \(y\) given \(\boldsymbol{x}\)”:
$$P(y \vert \mathbf{x})$$
-
Describe the curse of dimensionality and its effects on problem solving:
It is a phenomenon where many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases \(\rightarrow\) Sparsity:
- Common Theme - Sparsity: When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
- Sparsity and Statistical Significance: This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
- Sparsity and Clustering: Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties (clustering); in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
- Statistical Challenge: the number of possible configurations of \(x\) is much larger than the number of training examples
- Statistical Sampling: The sampling density is proportional to \(N^{1/p}\), where \(p\) is the dimension of the input space and \(N\) is the sample size. Thus, if \(N_1 = 100\) represents a dense sample for a single input problem, then \(N_{10} = 100^{10}\) is the sample size required for the same sampling density with \(10\) inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.
- Further Reading
- How to deal with curse of dimensionality?
- Feature Selection
- Feature Extraction
- Describe how to initialize a NN and any concerns w/ reasons:
- Don’t initialize the weights to Zero. The symmetry of hidden units results in a similar computation for each hidden unit, making all the rows of the weight matrix equal (by induction).
- It’s OK to initialize the bias term to zero.
- Describe the difference between Learning and Optimization in ML:
- The problem of Reducing the training error on the training set is one of optimization.
- The problem of Reducing the training error, as well as, the generalization (test) error is one of learning.
- List the 12 Standard Tasks in ML:
Answer
- Clustering
- Forecasting
- Dimension reduction
- Sequence Labeling
- What is the difference between inductive and deductive learning?
- Inductive learning is the process of using observations to draw conclusions
- It is a method of reasoning in which the premises are viewed as supplying some evidence for the truth of the conclusion.
- It goes from specific to general (“bottom-up logic”).
- The truth of the conclusion of an inductive argument may be probable, based upon the evidence given.
- Deductive learning is the process of using conclusions to form observations.
- It is the process of reasoning from one or more statements (premises) to reach a logically certain conclusion.
- It goes from general to specific (“top-down logic”).
- The conclusions reached (“observations”) are necessarily True.
Statistical Learning Theory
- Define Statistical Learning Theory:
It is a framework for machine learning drawing from the fields of statistics and functional analysis that allows us, under certain assumptions, to study the question: How can we affect performance on the test set when we can only observe the training set?
It is a statistical approach to Computational Learning Theory.
- What assumptions are made by the theory?
- The training and test data are generated by an **unknown** data-generating distribution (over the product space \(Z = X \times Y\), denoted: \(p_{\text{data}}(z) = p(x,y)\)) called the data-generating process.
- The i.i.d assumptions:
- The data-points in each dataset are independent from each other
- The training and testing are both identically distributed (drawn from the same distribution)
- Define the i.i.d assumptions?
A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. - Why assume a joint probability distribution \(p(x,y)\)?
Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because \({\displaystyle y}\) is not a deterministic function of \({\displaystyle x}\), but rather a random variable with conditional distribution \({\displaystyle P(y|x)}\) for a fixed \({\displaystyle x}\). - Why do we need to model \(y\) as a target-distribution and not a target-function?
The ‘Target Function’ is not always a function because two ‘identical’ input points can be mapped to two different outputs (i.e. they have different labels).
- Give the Formal Definition of SLT:
- The Definitions:
- \(X\): \(\:\) Input (vector) Space
- \(Y\): \(\:\) Output (vector) Space
- \(Z = X \times Y\): \(\:\) Product space of (input, output) pairs
- \(n\): \(\:\) number of data points
- \(S = \left\{\left(\vec{x}_{1}, y_{1}\right), \ldots,\left(\vec{x}_{n}, y_{n}\right)\right\}=\left\{\vec{z}_{1}, \ldots, \vec{z}_{n}\right\}\): \(\:\) the training set
- \(\mathcal{H} = f : X \rightarrow Y\): \(\:\) the hypothesis space of all functions
- \(V(f(\vec{x}), y)\): \(\:\) an error/loss function
- The Assumptions:
- The training and testing sets are generated by an unknown data-generating distribution function (over \(Z\), denoted: \(p_{\text{data}} = p(z) = p(x,y)\)) called the data-generation process.
- The i.i.d assumptions:
- The examples in each dataset are generated independently from each other
- Both datasets are identically distributed
- The Inference Problem:
Find a function \(f : X \rightarrow Y\) such that \(f(\vec{x}) \sim y\). - The Expected Risk:
It is the overall average risk over the entire (data) probability-distribution:$$I[f] = \mathbf{E}[V(f(\vec{x}), y)]=\int_{X \times Y} V(f(\vec{x}), y) p(\vec{x}, y) d \vec{x} d y$$
- The Target Function:
It is the best possible function \(f\) that can be chosen, i.e. the minimizer of the expected risk over the hypothesis space:$$f = \underset{h \in \mathcal{H}}{\arg \min}\: I[h]$$
- The Empirical Risk:
It is a proxy measure to the expected risk based on the training set. It is necessary since the probability distribution \(p(\vec{x}, y)\) is unknown.$$I_{S}[f]=\frac{1}{n} \sum_{i=1}^{n} V\left(f\left(\vec{x}_{i}\right), y_{i}\right)$$
- Define Empirical Risk Minimization:
It is a (learning) principle in statistical learning theory that is based on approximating the Expected/True Risk (Generalization Error) by measuring the Empirical Risk (Training Error); i.e. the performance on the training-data.A learning algorithm that chooses the function \(f_S\) which minimizes the empirical risk is called Empirical Risk Minimization:
$$R_{\text{emp}} = I_S[f] = \dfrac{1}{n} \sum_{i=1}^n V(f(\vec{x}_ i, y_i))$$
$$f_{S} = \hat{h} = \arg \min _{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$$
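A minimal sketch of ERM in code (assuming a hypothetical toy dataset, a hypothesis space of simple threshold classifiers, and the \(0-1\) loss):

```python
import numpy as np

# Hypothetical 1-D inputs with binary labels
X = np.array([0.10, 0.35, 0.40, 0.70, 0.80, 0.90])
y = np.array([0, 0, 0, 1, 1, 1])

# Hypothesis space H: threshold classifiers h_t(x) = 1[x >= t]
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """I_S[h_t]: the 0-1 loss averaged over the training set."""
    predictions = (X >= t).astype(int)
    return np.mean(predictions != y)

# ERM: pick the hypothesis that minimizes the empirical risk
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.2f}")
```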
- What is the Complexity of ERM?
NP-Hard for classification with the \(0-1\) loss function, even for linear classifiers.
- It can be solved efficiently when the minimal empirical risk is zero; i.e. the data is linearly separable
- How do you Cope with the Complexity?
- Employ a convex surrogate (approximation) to the \(0-1\) loss: e.g. the hinge loss (SVMs)
- Impose assumptions on the data-generating distribution (thus, stop being an agnostic learning algorithm)
- Definitions:
- Generalization:
The ability to do well on previously unobserved data:$$I[f] \approx I_S[f]$$
“Good” generalization is achieved when the empirical risk approximates the expected risk well.
- Generalization Error:
AKA: Expected Risk, Out-of-sample Error, \(E_{\text{out}}\)
It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data defined as the expected value of the error on a new input, measured w.r.t. the data-generating probability distribution. - Generalization Gap:
It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data, defined as the difference between the expected risk and empirical risk:$$G =I\left[f_{n}\right]-I_{S}\left[f_{n}\right]$$
- Computing the Generalization Gap:
Since the expected risk / generalization error \(I\left[f_{n}\right]\) cannot be computed for an unknown distribution, the generalization gap cannot be computed either. - What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
Instead, the goal of SLT is to bound/characterize the gap in probability:$$P_{G}=P\left(I\left[f_{n}\right]-I_{S}\left[f_{n}\right] \leq \epsilon\right) \geq 1-\delta_{n}$$
I.E. goal is to characterize the probability \(1- \delta_n\) that the generalization gap is less than some error bound \(\epsilon\) (known as the learning rate)
- Achieving (“good”) Generalization:
An algorithm is said to generalize when the expected risk is well approximated by the empirical risk:$$I\left[f_{n}\right] \approx I_{S}\left[f_{n}\right]$$
Equivalently:
$$E_{\text {out}}(g) \approx E_{\text {in}}(g)$$
I.E. when the generalization gap approaches zero in the limit of data-points:
$$\lim _{n \rightarrow \infty} G_{n}=\lim _{n \rightarrow \infty} I\left[f_{n}\right]-I_{S}\left[f_{n}\right]=0$$
- Empirical Distribution:
AKA: \(\hat{p}_{\text{data}}\)
It is the discrete, uniform, joint distribution \(\hat{p}_{\text{data}}(x,y)\) over the sample points; it serves as an estimate of the (unknown) data-generating distribution \(p_{\text{data}}\).
- Describe the difference between Learning and Optimization in ML:
- Optimization: is concerned with the problem of reducing the training error on the training set
- Learning: is concerned with the problem of reducing the training error, as well as, the generalization (test) error
- Describe the difference between Generalization and Learning in ML:
- Generalization guarantee: tells us that it is likely that the following condition holds:
$$E_{\mathrm{out}}(\hat{h}) \approx E_{\mathrm{in}}(\hat{h})$$
I.E. that the empirical error tracks/approximates the expected/generalization error well.
- Learning: corresponds to the condition that \(\hat{h} \approx f\) the chosen hypothesis approximates the target function well, which in-turn corresponds to the condition:
$$E_{\mathrm{out}}(\hat{h}) \approx 0$$
- How to achieve Learning?
To achieve learning we need to achieve the condition \(E_{\mathrm{out}}(\hat{h}) \approx 0\), which we do by:
- \(E_{\mathrm{out}}(\hat{h}) \approx E_{\mathrm{in}}(\hat{h})\)
  A theoretical result achieved through the Hoeffding Inequality
- \(E_{\mathrm{in}}(\hat{h}) \approx 0\)
  A practical result of Empirical Error Minimization
- What does the (VC) Learning Theory Achieve?
- Characterizing the feasibility of learning for infinite hypotheses
- Characterizing the Approximation-Generalization Tradeoff
- Why do we need the probabilistic framework?
- What is the Approximation-Generalization Tradeoff:
It is a tradeoff between (1) How well we can approximate the target function \(f \:\:\) and (2) How well we can generalize to unseen data.
Given the goal: Small \(E_{\text{out}}\), good approximation of \(f\) out of sample (not in-sample).
The tradeoff is characterized by the complexity of the hypothesis space \(\mathcal{H}\):
- More Complex \(\mathcal{H}\): Better chance of approximating \(f\)
- Less Complex \(\mathcal{H}\): Better chance of generalizing out-of-sample
- What are the factors determining how well an ML-algo will perform?
- Ability to Approximate \(f\) well, in-sample | Make the training error small &
- Decrease the gap between \(E_{\text{in}}\) and \(E_{\text{out}}\) | Make gap between training and test error small
- Define the following and their usage/application & how they relate to each other:
- Underfitting:
Occurs when the model cannot fit the training data well; high \(E_{\text{in}}\). - Overfitting:
Occurs when the gap between the training error and test error is too large. - Capacity:
A model's ability to fit a wide variety of functions (complexity).
It allows us to control the amount of overfitting and underfitting- Models with Low-Capacity:
Underfitting. High-Bias. Struggle to fit training-data. - Models with High-Capacity:
Overfitting. High-Variance. Memorizes noise.
- Hypothesis Space:
The set of functions that the learning algorithm is allowed to select as being the target function.
Allows us to control the capacity of a model. - VC-Dimension:
The largest possible value of \(m\) for which there exists a training set of \(m\) different points that the classifier can label arbitrarily.
Quantifies a models capacity.- What does it measure?
Measures the capacity of a binary classifier.
- Graph the relation between Error, and Capacity in the ctxt of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
- What is the most important result in SLT that shows that learning is feasible?
Shows that the discrepancy between training error and generalization error is bounded above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.
Bias-Variance Decomposition Theory
- What is the Bias-Variance Decomposition Theory:
It is an approach for the quantification of the Approximation-Generalization Tradeoff. - What are the Assumptions made by the theory?
- Analysis is done over the entire data-distribution
- Real-Valued inputs, targets (can be extended)
- Target function \(f\) is known
- Uses MSE (can be extended)
- What is the question that the theory tries to answer? What assumption is important? How do you achieve the answer/goal?
- “How can \(\mathcal{H}\) approximate \(f\) overall? not just on our sample/training-data.”.
- We assume that the target function \(f\) is known.
- By taking the expectation over all possible realization of \(N\) data-points.
- What is the Bias-Variance Decomposition:
It is the decomposition of the expected error as a sum of three concepts: bias, variance and irreducible error, each quantifying an aspect of the error. - Define each term w.r.t. source of the error:
- Bias: the error from the erroneous/simplifying assumptions in the learning algorithm
- Variance: the error from sensitivity to small fluctuations in the training set
- Irreducible Error: the error resulting from the noise in the problem itself
- What does each of the following measure? Describe it in Words? Give their AKA in statistics?
- Bias:
- AKA: Approximation Error
- Measures the error in approximating the target function with the best possible hypothesis in \(\mathcal{H}\)
- The expected deviation from the true value of the function (or parameter)
- How well can \(\mathcal{H}\) approximate the target function \(f\)
- Variance:
- AKA: Estimation Error
- Measures the error in estimating the best possible hypothesis in \(\mathcal{H}\) with a particular hypothesis resulting from a specific training-set
- The deviation from the expected estimator value that any particular sampling of the data is likely to cause
- How well can we zoom in on a good \(h \in \mathcal{H}\)
- Irreducible Error: measures the inherent noise in the target \(y\)
- Give the Formal Definition of the Decomposition (Formula):
Given any hypothesis \(\hat{f} = g^{\mathcal{D}}\) we select, we can decompose its expected risk on an unseen sample \(x\) as:$$\mathbb{E}\left[(y-\hat{f}(x))^{2}\right]=(\operatorname{Bias}[\hat{f}(x)])^{2}+\operatorname{Var}[\hat{f}(x)]+\sigma^{2}$$
Where:
- Bias:
$$\operatorname{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$
- Variance:
$$\operatorname{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x))^2\right] - \mathbb{E}\left[\hat{f}(x)\right]^2$$
- What is the Expectation over?
The expectation is over all different samplings of the data \(\mathcal{D}\)
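A minimal numerical sketch of the decomposition (the target \(f(x)=\sin(2\pi x)\), the noise level, and the degree-1 polynomial hypothesis class are all hypothetical choices) that estimates each term by averaging over many sampled training sets:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)     # known target function
sigma = 0.1                             # noise std -> irreducible error sigma^2
x_test = np.linspace(0, 1, 50)          # points at which we evaluate the decomposition

def fit_predict(n=20, degree=1):
    """Sample one training set D and return the fitted model's predictions g^(D)(x_test)."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

preds = np.stack([fit_predict() for _ in range(2000)])   # many realizations of D
g_bar = preds.mean(axis=0)                               # average hypothesis E_D[g^(D)]

bias2    = np.mean((g_bar - f(x_test)) ** 2)             # Bias^2 averaged over x
variance = np.mean(preds.var(axis=0))                    # Var averaged over x
expected_mse = np.mean((preds - (f(x_test) + rng.normal(0, sigma, preds.shape))) ** 2)

print(f"bias^2 + var + sigma^2 = {bias2 + variance + sigma**2:.4f}")
print(f"expected test MSE      = {expected_mse:.4f}")
```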
- Define the Bias-Variance Tradeoff:
It is the property of a set of predictive models whereby, models with a lower bias have a higher variance and vice-versa.- Effects of Bias:
- High Bias: simple models \(\rightarrow\) underfitting
- Low Bias: complex models \(\rightarrow\) overfitting
- Effects of Variance:
- High Variance: complex models \(\rightarrow\) overfitting
- Low Variance: simple models \(\rightarrow\) underfitting
- Draw the Graph of the Tradeoff (wrt model capacity):
- Derive the Bias-Variance Decomposition with explanations:
$${\displaystyle {\begin{aligned}\mathbb{E}_{\mathcal{D}} {\big [}I[g^{(\mathcal{D})}]{\big ]}&=\mathbb{E}_{\mathcal{D}} {\big [}\mathbb{E}_{x}{\big [}(g^{(\mathcal{D})}-y)^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x} {\big [}\mathbb{E}_{\mathcal{D}}{\big [}(g^{(\mathcal{D})}-y)^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}}{\big [}(g^{(\mathcal{D})}- f -\varepsilon)^{2}{\big ]}{\big ]} \\&=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}} {\big [}(f+\varepsilon -g^{(\mathcal{D})}+\bar{g}-\bar{g})^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\mathbb{E}_{\mathcal{D}} {\big [}(\bar{g}-f)^{2}{\big ]}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}(\bar{g}-f)\varepsilon {\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}\varepsilon (g^{(\mathcal{D})}-\bar{g}){\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})(\bar{g}-f){\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}+2(\bar{g}-f)\mathbb{E}_{\mathcal{D}} [\varepsilon ]\: +2\: \mathbb{E}_{\mathcal{D}} [\varepsilon ]\: \mathbb{E}_{\mathcal{D}} {\big [}g^{(\mathcal{D})}-\bar{g}{\big ]}+2\: \mathbb{E}_{\mathcal{D}} {\big [}g^{(\mathcal{D})}-\bar{g}{\big ]}(\bar{g}-f){\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\mathbb{E}_{\mathcal{D}} [\varepsilon ^{2}]+\mathbb{E}_{\mathcal{D}} {\big [}(g^{(\mathcal{D})} - \bar{g})^{2}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}(\bar{g}-f)^{2}+\operatorname {Var} [y]+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\operatorname {Bias} [g^{(\mathcal{D})}]^{2}+\operatorname {Var} [y]+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\ &=\mathbb{E}_{x}{\big [}\operatorname {Bias} [g^{(\mathcal{D})}]^{2}+\sigma ^{2}+\operatorname {Var} {\big [}g^{(\mathcal{D})}{\big ]}{\big ]}\\\end{aligned}}}$$
where:
\(\overline{g}(\mathbf{x})=\mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]\) is the average hypothesis over all realization of \(N\) data-points \(\mathcal{D}_ i\), and \({\displaystyle \varepsilon }\) and \({\displaystyle {\hat {f}}} = g^{(\mathcal{D})}\) are independent. - What are the key Takeaways from the Tradeoff?
Match the “Model Capacity/Complexity” to the “Data Resources”, NOT to the Target Complexity. - What are the most common ways to negotiate the Tradeoff?
- Cross-Validation
- MSE of the Estimates
- How does the decomposition relate to Classification?
A similar decomposition exists for:- Classification with \(0-1\) loss
- Probabilistic Classification with MSE
- Increasing/Decreasing Bias&Variance:
- Adding Good Feature:
- Decrease Bias
- Adding Bad Feature:
- No effect
- Adding ANY Feature:
- Increase Variance
- Adding more Data:
- Decrease Variance
- May decrease Bias (if \(f \in \mathcal{H}\) | \(h\) can fit \(f\) exactly)
- Noise in Test Set:
- Affects Only Irreducible Err
- Noise in Training Set:
- Affects Bias and Variance
- Dimensionality Reduction:
- Decrease Variance
- Feature Selection:
- Decrease Variance
- Regularization:
- Increase Bias
- Decrease Variance
- Increasing # of Hidden Units in ANNs:
- Decrease Bias
- Increase Variance
- Increasing # of Hidden Layers in ANNs:
- Decrease Bias
- Increase Variance
- Increasing \(k\) in K-NN:
- Increase Bias
- Decrease Variance
- Increasing Depth in Decision-Trees:
- Increase Variance
- Boosting:
- Decrease Bias
- Bagging:
- Decrease Variance
Activation Functions
- Describe the Desirable Properties for activation functions:
- Non-Linearity:
When the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model. - Range:
When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary. - Continuously Differentiable:
This property is desirable for enabling gradient-based optimization methods. The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it. - Monotonicity:
When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex. - Smoothness with Monotonic Derivatives:
These have been shown to generalize better in some cases. - Approximating Identity near Origin:
Equivalent to \({\displaystyle f(0)=0}\) and \({\displaystyle f'(0)=1}\), and \({\displaystyle f'}\) is continuous at \(0\).
When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values. When the activation function does not approximate identity near the origin, special care must be used when initializing the weights. - Zero-Centered Range:
Has effects of centering the data (zero mean) by centering the activations. Makes learning easier.
- Describe the NON-Desirable Properties for activation functions:
- Saturation:
An activation function's output, with finite range, may saturate near its tail or head (e.g. \(\{0, 1\}\) for sigmoid). This leads to a problem called vanishing gradients. - Vanishing Gradients:
Happens when the gradient of an activation function is very small/zero. This usually happens when the activation function saturates at either of its tails.
The chain rule multiplies the local gradient (of the activation function) with the upstream gradient of the objective. Thus, when the local gradient is small/zero, it “kills” the gradient \(\rightarrow\) no signal will flow through the neuron to its weights or to its data.
Slows/Stops learning completely. - Range Not Zero-Centered:
This is undesirable since neurons in later layers of processing in a Neural Network would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. \(x>0\) elementwise in \(f=w^Tx+b\)), then the gradient on the weights \(w\) will, during backpropagation, become either all positive or all negative (depending on the gradient of the whole expression \(f\)). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
Makes optimization harder.
- List the different activation functions used in ML?
Identity, Sigmoid, Tanh, ReLU, L-ReLU, ELU, SoftPlus
Names, Definitions, Properties (pros&cons), Derivatives, Applications, pros/cons:- Sigmoid:
$$S(z)=\frac{1}{1+e^{-z}} \\ S^{\prime}(z)=S(z) \cdot(1-S(z))$$
- Properties:
Never use as activation, use as an output unit for binary classification.- Pros:
- Has a nice interpretation as the firing rate of a neuron
- Cons:
- They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
- Happens when initialization weights are too large
- or sloppy with data preprocessing
- Neurons Activation saturates at either tail of \(0\) or \(1\)
- Output NOT Zero-Centered \(\rightarrow\) Gradient updates go too far in different directions \(\rightarrow\) makes optimization harder
- The local gradient \((z * (1-z))\) achieves a maximum of \(0.25\) when \(z = 0.5\) \(\rightarrow\) every time the gradient signal flows through a sigmoid gate, its magnitude is diminished by a quarter (or more) \(\rightarrow\) with basic SGD, the lower layers of a network train much slower than the higher ones
- They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
- Tanh:
$$\tanh (z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \\ \tanh ^{\prime}(z)=1-\tanh (z)^{2}$$
Properties:
Strictly superior to Sigmoid (scaled version of sigmoid | stronger gradient). Good for activation.- Pros:
- Zero Mean/Centered
- Cons:
- They Saturate and kill gradients \(\rightarrow\) Gives rise to vanishing gradients \(\rightarrow\) Stop Learning
- ReLU:
$$R(z)=\left\{\begin{array}{cc}{z} & {z>0} \\ {0} & {z \leq 0}\end{array}\right\} \\ R^{\prime}(z)=\left\{\begin{array}{ll}{1} & {z>0} \\ {0} & {z<0}\end{array}\right\}$$
Properties:
The best for activation (Better gradients).- Pros:
- Non-saturation of gradients which accelerates convergence of SGD
- Sparsity effects and induced regularization.
- Not computationally expensive
- Cons:
- ReLU not zero-centered problem:
The problem that ReLU is not zero-centered can be solved/mitigated by using batch normalization, which normalizes the signal before activation. From the paper: “We add the BN transform immediately before the nonlinearity, by normalizing \(x = Wu + b\); normalizing it is likely to produce activations with a stable distribution.”
- Dying ReLUs (Dead Neurons):
If a neuron gets clamped to zero in the forward pass (it doesn’t “fire” / \(x<0\)), then its weights will get zero gradient. Thus, if a ReLU neuron is unfortunately initialized such that it never fires, or if a neuron’s weights ever get knocked off with a large update during training into this regime (usually as a symptom of aggressive learning rates), then this neuron will remain permanently dead. - Infinite Range:
Can blow up the activation.
- Leaky ReLU:
$$R(z)=\left\{\begin{array}{cc}{z} & {z>0} \\ {\alpha z} & {z \leq 0}\end{array}\right\} \\ R^{\prime}(z)=\left\{\begin{array}{ll}{1} & {z>0} \\ {\alpha} & {z<0}\end{array}\right\}$$
Properties:
Sometimes useful. Worth trying.- Pros:
- Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).
- Cons:
The consistency of the benefit across tasks is presently unclear.
- Fill in the following table:
- Tanh VS Sigmoid for activation?
- ReLU:
- What makes it superior/advantageous?
- What problems does it have?
- What solution do we have to mitigate the problem?
- Compute the derivatives of all activation functions:
- Graph all activation functions and their derivatives:
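A small numpy sketch of the activations above together with their analytic derivatives, checked against central finite differences (the Leaky-ReLU/ELU \(\alpha\) values are hypothetical defaults):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softplus(z):
    return np.log1p(np.exp(z))

def d_softplus(z):
    return sigmoid(z)   # the derivative of SoftPlus is the Sigmoid

# Sanity-check the smooth derivatives against central finite differences
z, eps = np.linspace(-3, 3, 13), 1e-6
for f, df in [(sigmoid, d_sigmoid), (np.tanh, d_tanh), (softplus, d_softplus)]:
    numeric = (f(z + eps) - f(z - eps)) / (2 * eps)
    assert np.allclose(numeric, df(z), atol=1e-5)
print("analytic derivatives match finite differences")
```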
Kernels
- Define “Local Kernel” and give an analogy to describe it:
- Write the following kernels:
- Polynomial Kernel of degree, up to, \(d\):
- Gaussian Kernel:
- Sigmoid Kernel:
- Polynomial Kernel of degree, exactly, \(d\):
Math
- What is a metric?
A Metric (distance function) \(d\) is a function that defines a distance between each pair of elements of a set \(X\).
A Metric induces a topology on a set; BUT, not all topologies can be generated by a metric.
Mathematically, it is a function:
\({\displaystyle d:X\times X\to [0,\infty )},\)
that must satisfy the following properties:- \({\displaystyle d(x,y)\geq 0}\) \(\:\:\:\:\:\:\:\) non-negativity or separation axiom
- \({\displaystyle d(x,y)=0\Leftrightarrow x=y}\) \(\:\:\:\:\:\:\:\) identity of indiscernibles
- \({\displaystyle d(x,y)=d(y,x)}\) \(\:\:\:\:\:\:\:\) symmetry
- \({\displaystyle d(x,z)\leq d(x,y)+d(y,z)}\) \(\:\:\:\:\:\:\:\) subadditivity or triangle inequality
The first condition is implied by the others.
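A quick numerical sanity check of these axioms (a sketch using the Euclidean distance on random points; the dimension and sample are arbitrary):

```python
import numpy as np

def d(x, y):
    """Euclidean metric on R^n."""
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))         # three random points in R^5

assert d(x, y) >= 0                       # non-negativity
assert np.isclose(d(x, x), 0)             # identity of indiscernibles (one direction)
assert np.isclose(d(x, y), d(y, x))       # symmetry
assert d(x, z) <= d(x, y) + d(y, z)       # triangle inequality
print("Euclidean distance satisfies the metric axioms on this sample")
```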
- Describe Binary Relations and their Properties?
A binary relation on a set \(A\) is a set of ordered pairs of elements of \(A\). In other words, it is a subset of the Cartesian product \(A^2 = A ×A\).
The number of binary relations on a set of \(N\) elements is \(= 2^{N^2}\) Examples:- “is greater than”
- “is equal to”
- A function \(f(x)\)
Properties: (for a relation \(R\) and set \(X\)) - Reflexive: for all \(x\) in \(X\) it holds that \(xRx\)
- Symmetric: for all \(x\) and \(y\) in \(X\) it holds that if \(xRy\) then \(yRx\)
- Transitive: for all \(x\), \(y\) and \(z\) in \(X\) it holds that if \(xRy\) and \(yRz\) then \(xRz\)
- Formulas:
- Set theory:
- Number of subsets of a set of \(N\) elements:
\(= 2^N\) - Number of pairs \((a,b)\) of a set of N elements:
\(= N^2\)
- Binomial Theorem:
$$(x+y)^{n}=\sum_{k=0}^{n}{n \choose k}x^{n-k}y^{k}=\sum_{k=0}^{n}{n \choose k}x^{k}y^{n-k} \\={n \choose 0}x^{n}y^{0}+{n \choose 1}x^{n-1}y^{1}+{n \choose 2}x^{n-2}y^{2}+\cdots +{n \choose n-1}x^{1}y^{n-1}+{n \choose n}x^{0}y^{n},$$
- Binomial Coefficient:
$${\binom {n}{k}}={\frac {n!}{k!(n-k)!}} = {\binom {n}{n-k}}$$
- Expansion of \(x^n - y^n =\)
$$x^n - y^n = (x-y)(x^{n-1} + x^{n-2} y + ... + x y^{n-2} + y^{n-1})$$
- Number of ways to partition \(N\) data points into \(k\) clusters:
- Number of pairs (e.g. \((a,b)\)) of a set of \(N\) elements \(= N^2\)
- There are at most \(k^N\) ways to assign \(N\) data points to \(k\) clusters; the exact number of ways to partition them into \(k\) non-empty clusters is the Stirling number of the second kind \(S(N, k)\)
- \(\log_x(y) =\)
$$\log_x(y) = \dfrac{\ln(y)}{\ln(x)}$$
- The length of a vector \(\mathbf{x}\) along a direction (projection):
- Along a unit-length vector \(\hat{\mathbf{w}}\):
\(\text{comp}_ {\hat{\mathbf{w}}}(\mathbf{x}) = \hat{\mathbf{w}}^T\mathbf{x}\) - Along an unnormalized vector \(\mathbf{w}\):
\(\text{comp}_ {\mathbf{w}}(\mathbf{x}) = \dfrac{1}{\|\mathbf{w}\|} \mathbf{w}^T\mathbf{x}\)
- \(\sum_{i=1}^{n} 2^{i}=\)
\(\sum_{i=1}^{n} 2^{i}=2^{n+1}-2\)
- List 10 proof methods:
- Direct Proof
- Mathematical Induction
- Strong Induction
- Infinite Descent
- Contradiction
- Contraposition (\((p \implies q) \iff (!q \implies !p)\))
- Construction
- Combinatorial
- Exhaustion
- Non-Constructive proof (existence proofs)
- Important Formulas
- Projection \(\tilde{\mathbf{x}}\) of a vector \(\mathbf{x}\) onto another vector \(\mathbf{u}\):
$$\tilde{\mathbf{x}} = \left(\mathbf{x}\cdot \dfrac{\mathbf{u}}{\|\mathbf{u}\|_ 2}\right) \dfrac{\mathbf{u}}{\|\mathbf{u}\|_ 2} = \dfrac{\mathbf{x}\cdot\mathbf{u}}{\|\mathbf{u}\|_ 2^2}\, \mathbf{u}$$
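A minimal numpy sketch of this projection formula:

```python
import numpy as np

def project(x, u):
    """Projection of x onto the direction of u: (x . u / ||u||^2) * u."""
    return (x @ u) / (u @ u) * u

x = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
print(project(x, u))   # [3. 0.] -> the component of x along u
```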
Statistics
- ROC curve:
- Definition:
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. - Purpose:
A way to quantify how good a binary classifier separates two classes. - How do you create the plot?
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. - How to identify a good classifier:
A Good classifier has a ROC curve that hugs the top-left corner of the plot (far from the diagonal). - How to identify a bad classifier:
A Bad Classifier has a ROC curve that is close to the diagonal line. - What is its application in tuning the model?
It allows you to set the classification threshold:- You can minimize False-positive rate or maximize the True-Positive Rate
- AUC - AUROC:
- Definition:
When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’). - Range:
Range \(= 0.5 - 1.0\), from poor to perfect, with an uninformative classifier yielding \(0.5\) - What does it measure:
It is a measure of aggregated classification performance. - Usage in ML:
For model comparison.
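A minimal scikit-learn sketch of both (the labels and scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores (e.g. predicted probabilities)
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # area under that curve
print(f"AUC = {auc:.3f}")

# Using the curve to pick a classification threshold, e.g. maximizing TPR - FPR
best = thresholds[np.argmax(tpr - fpr)]
print(f"threshold maximizing TPR - FPR: {best:.2f}")
```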
- Define Statistical Efficiency (of an estimator)?
Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance.
Efficiencies are often defined using the variance or mean square error as the measure of desirability.
An efficient estimator is also the minimum variance unbiased estimator (MVUE).- An Efficient Estimator has lower variance than an inefficient one
- The use of an inefficient estimator gives results equivalent to those obtainable from a subset of the data; and is, therefore, wasteful of data
- What's the difference between Errors and Residuals:
The Error of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest.
The Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest.
- Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
- Statistical Errors:
$$e_{i}=X_{i}-\mu$$
- Residuals:
$$r_{i}=X_{i}-\overline {X}$$
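A minimal numpy sketch of the distinction (\(\mu\) is assumed known here purely for illustration; in practice the true mean is unobservable):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
X = rng.normal(mu, sigma, size=10)

errors    = X - mu          # deviations from the (unobservable) true mean
residuals = X - X.mean()    # deviations from the estimated mean

print(f"sum of errors    = {errors.sum():.3f}   (generally non-zero)")
print(f"sum of residuals = {residuals.sum():.3f}   (always zero by construction)")
```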
- What is a biased estimator?
We define the Bias of an estimator as:$$ \operatorname{bias}\left(\hat{\boldsymbol{\theta}}_{m}\right)=\mathbb{E}\left(\hat{\boldsymbol{\theta}}_{m}\right)-\boldsymbol{\theta} $$
A Biased Estimator is an estimator \(\hat{\boldsymbol{\theta}}_ {m}\) such that:
$$ \operatorname{bias}\left(\hat{\boldsymbol{\theta}}_ {m}\right) \neq 0$$
- Why would we prefer biased estimators in some cases?
Mainly, due to the Bias-Variance Decomposition. The MSE takes into account both the bias and the variance and sometimes the biased estimator might have a lower variance than the unbiased one, which results in a total decrease in the MSE.
- What is the difference between “Probability” and “Likelihood”:
Probabilities are the areas under a fixed distribution
\(pr(\)data\(|\)distribution\()\)
i.e. probability of some data (left hand side) given a distribution (described by the right hand side)
Likelihoods are the y-axis values for fixed data points with distributions that can be moved.
\(L(\)distribution\(|\)observation/data\()\)
It is the likelihood of the parameter \(\theta\) for the data \(\mathcal{D}\). Likelihood is, basically, a specific probability that can only be calculated after the fact (of observing some outcomes). It is not normalized to \(1\) (it is not a probability). It is just a way to quantify how likely a set of observations is to occur given some distribution with some parameters; then you can manipulate the parameters to make the realization of the data more “likely” (it is precisely meant for that purpose of estimating the parameters); it is a function of the parameters.
Probability, on the other hand, is absolute for all possible outcomes. It is a function of the Data. - Estimators:
- Define:
A Point Estimator or statistic is any function of the data. - Formula:
$$\hat{\boldsymbol{\theta}}_{m}=g\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\right)$$
- Whats a good estimator?
A good estimator is a function whose output is close to the true underlying \(\theta\) that generated the training data. - What are the Assumptions made regarding the estimated parameter:
We assume that the true \(\boldsymbol{\theta}\) is fixed, and that \(\hat{\boldsymbol{\theta}}\) is a function of the data, which is drawn from a random process, making \(\hat{\boldsymbol{\theta}}\) a random variable.
- What is Function Estimation:
Function Estimation/Approximation refers to estimation of the relationship between input and target data.
I.E. We are trying to predict a variable \(y\) given an input vector \(x\), and we assume that there is a function \(f(x)\) that describes the approximate relationship between \(y\) and \(x\).
If we assume that: \(y = f(x) + \epsilon\), where \(\epsilon\) is the part of \(y\) that is not predictable from \(x\); then we are interested in approximating \(f\) with a model or estimate \(\hat{f}\).- Whats the relation between the Function Estimator \(\hat{f}\) and Point Estimator:
Function estimation is really just the same as estimating a parameter \(\boldsymbol{\theta}\); the function estimator \(\hat{f}\) is simply a point estimator in function space.
- Define “marginal likelihood” (wrt naive bayes):
Marginal likelihood is the probability of a feature regardless of the class; e.g. the probability that the word ‘FREE’ is used in any message (not conditioned on the message being spam or not).
(Statistics) - MLE
- Clearly Define MLE and derive the final formula:
MLE is a method/principle from which we can derive specific functions that are good estimators for different models.
Likelihood in Parametric Models:
Suppose we have a parametric model \(\{p(y ; \theta) | \theta \in \Theta\}\) and a sample \(D=\left\{y_{1}, \ldots, y_{n}\right\}\):- The likelihood of parameter estimate \(\hat{\theta} \in \Theta\) for sample \(\mathcal{D}\) is:
$$\begin{aligned} {\displaystyle {\mathcal {L}}(\theta \,;\mathcal{D} )} &= p(\mathcal{D} ; \theta) \\&=\prod_{i=1}^{n} p\left(y_{i} ; \theta\right)\end{aligned}$$
- In practice, we prefer to work with the log-likelihood. Same maximum but
$$\log {\displaystyle {\mathcal {L}}(\theta \,;\mathcal{D} )}=\sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right)$$
and sums are easier to work with than products.
MLE for Parametric Models:
The maximum likelihood estimator (MLE) for \(\theta\) in the (parametric) model \(\{p(y ; \theta) | \theta \in \Theta\}\) is:$$\begin{aligned} \hat{\theta} &=\underset{\theta \in \Theta}{\arg \max } \log p(\mathcal{D} ; \theta) \\ &=\underset{\theta \in \Theta}{\arg \max } \sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right) \end{aligned}$$
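A minimal sketch of the principle (a hypothetical Bernoulli model fit to coin-flip data), showing that maximizing the log-likelihood over a grid recovers the familiar closed-form estimate:

```python
import numpy as np

# Hypothetical sample of coin flips from a Bernoulli(theta) model
D = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def log_likelihood(theta, data):
    """log L(theta; D) = sum_i log p(y_i; theta) for the Bernoulli model."""
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmax([log_likelihood(t, D) for t in thetas])]

print(f"grid-search MLE:               {theta_mle:.2f}")
print(f"closed-form MLE (sample mean): {D.mean():.2f}")
```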
- Write MLE as an expectation wrt the Empirical Distribution:
Because the \(\text {arg max }\) does not change when we rescale the cost function, we can divide by \(m\) to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution \(\hat{p}_ {\text {data}}\) defined by the training data:$${\displaystyle \boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \:\: \mathbb{E}_{\mathbf{x} \sim \hat{p} \text {data}} \log p_{\text {model}}(\boldsymbol{x} ; \boldsymbol{\theta})} \tag{5.59}$$
- Describe formally the relationship between MLE and the KL-divergence:
MLE as Minimizing KL-Divergence between the Empirical dist. and the model dist.:
We can interpret maximum likelihood estimation as minimizing the dissimilarity between the empirical distribution \(\hat{p}_ {\text {data}}\), defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence.- The KL-divergence is given by:
$${\displaystyle D_{\mathrm{KL}}\left(\hat{p}_{\text {data}} \| p_{\text {model}}\right)=\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}}\left[\log \hat{p}_{\text {data}}(\boldsymbol{x})-\log p_{\text {model}}(\boldsymbol{x})\right]} \tag{5.60}$$
The term on the left is a function only of the data-generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize:
$${\displaystyle -\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data}}}\left[\log p_{\text {model}}(\boldsymbol{x})\right]} \tag{5.61}$$
which is of course the same as the maximization in equation \(5.59\).
- Extend the argument to show the link between MLE and Cross-Entropy. Give an example:
Minimizing this KL-divergence corresponds exactly to minimizing the cross-entropy between the distributions.
Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by model.
E.g. MSE is the cross-entropy between the empirical distribution and a Gaussian model.Maximum likelihood thus becomes minimization of the negative log-likelihood(NLL), or equivalently, minimization of the cross-entropy.
- How does MLE relate to the model distribution and the empirical distribution?
We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution \(\hat{p} _ {\text {data}}\). - What is the intuition behind using MLE?
If I choose a hypothesis \(h\) under which the observed data is very plausible, then the hypothesis is very likely. - What does MLE find/result in?
It finds the value of the parameter \(\theta\) that, if used (in the model) to generate the probability of the data, would make the data most “likely” to occur. - What kind of problem is MLE and how to solve for it?
It is an optimization problem that is solved by calculus for problems with closed-form solutions or with numerical methods (e.g. SGD). - How does it relate to SLT:
It corresponds to Empirical Risk Minimization. - Explain clearly why we maximize the natural log of the likelihood
- Numerical Stability: change products to sums
- The logarithm of a member of the family of exponential probability distributions (which includes the ubiquitous normal) is polynomial in the parameters (i.e. max-likelihood reduces to least-squares for normal distributions)
\(\log\left(\exp\left(-\frac{1}{2}x^2\right)\right) = -\frac{1}{2}x^2\) - The latter form is both more numerically stable and symbolically easier to differentiate than the former. It increases the dynamic range of the optimization algorithm (allowing it to work with extremely large or small values in the same way).
- The logarithm is a monotonic transformation that preserves the locations of the extrema (in particular, the estimated parameters in max-likelihood are identical for the original and the log-transformed formulation)
- Gradient methods generally work better optimizing \(\log p(x)\) than \(p(x)\) because the gradient of \(\log p(x)\) is generally more well-scaled.
Justification: the gradient of the original term will include a \(e^{\vec{x}}\) multiplicative term that scales very quickly one way or another, requiring the step-size to equally scale/stretch in the opposite direction.
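A small numerical illustration of the stability point: the product of many small per-example likelihoods underflows to zero in floating point, while the sum of their logs stays well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(1e-4, 1e-2, size=1000)   # 1000 small per-example likelihoods

print(np.prod(p))          # 0.0 -> the product underflows in double precision
print(np.sum(np.log(p)))   # a perfectly representable (large negative) number
```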
Text-Classification | Classical
- List some Classification Methods:
- (Hand-Coded)Rules-Based Algorithms: use rules based on combinations of words or other features.
- Can have high accuracy if the rules are carefully refined and maintained by experts.
- However, building and maintaining these rules is very hard.
- Supervised Machine Learning: using an ML algorithm that trains on a training set of (document, class) elements to train a classifier.
- Types of Classifiers:
- Naive Bayes
- Logistic Regression
- SVMs
- K-NNs
- List some Applications of Txt Classification:
- Spam Filtering: discerning spam emails form legitimate emails.
- Email Routing: sending an email sent to a general address to a specific address based on the topic.
- Language Identification: automatically determining the language a piece of text is written in.
- Readability Assessment: determining the degree of readability of a piece of text.
- Sentiment Analysis: determining the general emotion/feeling/attitude of the author of a piece of text.
- Authorship Attribution: determining which author wrote which piece of text.
- Age/Gender Identification: determining the age and/or gender of the author of a piece of text.
NLP
- List some problems in NLP:
- Question Answering (QA)
- Information Extraction (IE)
- Sentiment Analysis
- Machine Translation (MT)
- Spam Detection
- Parts-of-Speech (POS) Tagging
- Named Entity Recognition (NER)
- Coreference Resolution
- Word Sense Disambiguation (WSD)
- Parsing
- Paraphrasing
- Summarization
- Dialog
- List the Solved Problems in NLP:
- Spam Detection
- Parts-of-Speech (POS) Tagging
- Named Entity Recognition (NER)
- List the “within reach” problems in NLP:
- Sentiment Analysis
- Coreference Resolution
- Word Sense Disambiguation (WSD)
- Parsing
- Machine Translation (MT)
- Information Extraction (IE)
- List the Open Problems in NLP:
- Question Answering (QA)
- Paraphrasing
- Summarization
- Dialog
- Why is NLP hard? List Issues:
- Non-Standard English: “Great Job @ahmed_badary! I luv u 2!! were SOO PROUD of dis.”
- Segmentation Issues: “New York-New Haven” vs “New-York New-Haven”
- Idioms: “dark horse”, “getting cold feet”, “losing face”
- Neologisms: “unfriend”, “retweet”, “google”, “bromance”
- World Knowledge: “Ahmed and Zach are brothers”, “Ahmed and Zach are fathers”
- Tricky Entity Names: “Where is Life of Pie playing tonight?”, “Let it be was a hit song!”
- Define:
- Morphology:
The study of words, how they are formed, and their relationship to other words in the same language. - Morphemes:
the small meaningful units that make up words. - Stems:
the core meaning-bearing units of words. - Affixes:
the bits and pieces that adhere to stems (often with grammatical functions). - Stemming:
is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
The stem need not map to a valid root in the language. - Lemmatization:
reducing inflections or variant forms to base form.
Language Modeling
- What is a Language Model?
- List some Applications of LMs:
- Traditional LMs:
- How are they setup?
- What do they depend on?
- What is the Goal of the LM task? (in the ctxt of the problem setup)
- What assumptions are made by the problem setup? Why?
- What are the MLE Estimates for probabilities of the following:
- Bi-Grams:
$$p(w_2\vert w_1) = \dfrac{\operatorname{count}(w_1, w_2)}{\operatorname{count}(w_1)}$$
- Tri-Grams:
$$p(w_3\vert w_1, w_2) = \dfrac{\operatorname{count}(w_1, w_2, w_3)}{\operatorname{count}(w_1, w_2)}$$
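A minimal sketch of these count-based estimates (the toy corpus is hypothetical):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts  = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """MLE estimate p(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_bigram("cat", "the"))   # 2/3: "the" occurs 3 times and is followed by "cat" twice
```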
- What are the issues w/ Traditional Approaches?
- What+How can we setup some NLP tasks as LM tasks:
- How does the LM task relate to Reasoning/AGI:
- Evaluating LM models:
- List the Loss Functions (+formula) used to evaluate LM models? Motivate each:
- Which application of LM modeling does each loss work best for?
- Why Cross-Entropy:
- Which setting it used for?
- Why Perplexity:
- Which setting used for?
- If no surprise, what is the perplexity?
- How does having a good LM relate to Information Theory?
- LM DATA:
- How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
- How should we choose a subset of articles for testing:
- List three approaches to Parametrizing LMs:
- Describe “Count-Based N-gram Models”:
- What distributions do they capture?:
- Describe “Neural N-gram Models”:
- What do they replace the captured distribution with?
- What are they better at capturing:
- Describe “RNNs”:
- What do they replace/capture?
- How do they capture it?
- What are they best at capturing:
- What’s the main issue in LM modeling?
- How do N-gram models capture/approximate the history?:
- How do RNNs models capture/approximate the history?:
- The Bias-Variance Tradeoff of the following:
- N-Gram Models:
- RNNs:
- An Estimate s.t. it predicts the probability of a sentence by how many times it has seen it before:
- What happens in the limit of infinite data?
- What are the advantages of sub-word level LMs:
- What are the disadvantages of sub-word level LMs:
- What is a “Conditional LM”?
- Write the decomposition of the probability for the Conditional LM:
- Describe the Computational Bottleneck for Language Models:
- Describe/List some solutions to the Bottleneck:
- Complexity Comparison of the different solutions:
Regularization
- Define Regularization both intuitively and formally:
Regularization can be, loosely, defined as: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Formally, it is a set of techniques that impose certain restrictions on the hypothesis space (by adding information) in order to solve an ill-posed problem or to prevent overfitting.
- Define “well-posedness”:
Hadamard defines Well-Posed Problems as having the properties (1) A Solution Exists (2) It is Unique (3) Its behavior changes continuously with the initial conditions. - Give four aspects of justification for regularization (theoretical):
- From a philosophical pov:
It attempts to impose Occam’s razor on the solution. - From a probabilistic pov:
From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters. This is equivalent to making a MAP Estimate of the function we are trying to learn:$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}=\underset{\boldsymbol{\theta}}{\arg \min }-\left(\sum_{i=1}^{n} \log \left[p\left(y_{i} | \mathbf{x}_ {i}, \boldsymbol{\theta}\right)\right]\right)-\log [p(\boldsymbol{\theta})]$$
- From an SLT pov:
Regularization can be motivated as imposing a restriction on the complexity of a hypothesis space; called, the “capacity”. In SLT, this is called the Approximation-Generalization Tradeoff. Regularization, effectively, works as a way to maneuver that tradeoff by giving preference to certain solutions. This can be described in two ways:- VC-Theory:
The theory results in a very simple, yet informative, inequality that captures the tradeoff, called The Generalization Bound:$$E_{\text {out}}(h) \leq E_{\text {in}}(h)+\Omega(\mathcal{H})$$
But, if we formalize regularization as adding a regularizer (from the MAP estimate above): \(\Omega = \Omega(h)\) a function of the hypothesis \(h\), to the in-sample error \(E_{\text {in}}\), then we get a new “Augmented” Error:
$$E_{\text {aug}}(h) = E_{\text {in}} + \lambda \Omega(h)$$
Notice the correspondence between the form of the new augmented error \(E_{\text {aug}}\) and the VC Inequality:
$$\begin{aligned}E_{\text {aug}}(h) &= E_{\text {in}}(h) + \lambda \Omega(h) \\ &\downarrow \\ E_{\text {out}}(h) &\leq E_{\text {in}}(h)+\Omega(\mathcal{H})\end{aligned}$$
(we can relate the complexity of a single object to the complexity of a set of objects)
Interpreting the correspondence:
Consider the goal of ML: “to find an in-sample estimate of the out-sample error”. From that perspective, the Augmented Error \(E_{\text {aug}}(h)\) is a better proxy to (estimate of) the out-sample error. Thus, minimizing \(E_{\text {aug}}(h)\) corresponds, better, to minimizing \(E_{\text {out}}(h)\). Where \(\Omega(h)\) is an estimate of \(\Omega(\mathcal{H})\); the regularization term a minimizer of the complexity term. - Bias-Variance Decomposition:
If the variance term \(\operatorname{Var}[h]\) corresponds to the complexity term \(\Omega(\mathcal{H})\), and the regularization term \(\Omega(h)\), also, corresponds to \(\Omega(\mathcal{H})\); then, the regularization term must also correspond to the variance term.
From that perspective, adding the regularization term to the augmented error accounts for the variance of the hypothesis and acts as a measure for it. Thus, minimizing the regularization term \(\Omega(h)\) corresponds to minimizing the variance \(\operatorname{Var}[h]\). However, when the variance of a model goes down the bias tends to go up. Since this inverse-relationship is not linear, this analysis/view gives us a way to apply regularization effectively, by allowing us to measure the effect of regularization.
Basically, we would like our regularization term to reduce the variance more than it increases the bias. We can control that tradeoff by the hyperparameter \(\lambda\).
- From a practical pov (relating to the real-world):
Most applications of DL are to domains where the true data-generating process is almost certainly outside the model family (hypothesis space). Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe.
Thus, controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters; instead, the best fitting model (wrt. generalization error) is a large model that has been regularized appropriately.
- Describe an overview of regularization in DL. How does it usually work?
In the context of DL, most regularization strategies are based on regularizing estimators, which usually works by trading increased bias for reduced variance.- Intuitively, how can a regularizer be effective?
An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
- Describe the relationship between regularization and capacity:
Regularization is a (more general) way of controlling a model's capacity by allowing us to express preference for one function over another in the same hypothesis space; instead of including or excluding members from the hypothesis space completely. - Describe the different approaches to regularization:
- Parameter Norm Penalties: \(L^p\) norms, early stopping
- Data Augmentation: noise robustness/injection, dropout
- Semi-supervised Learning
- Multi-task Learning
- Ensemble Learning: bagging, etc.
- Adversarial Training
- Infinite Priors: parameter tying and sharing
- List 9 regularization techniques:
- \(L^2\) regularization
- \(L^1\) regularization
- Dataset Augmentation
- Noise Injection
- Semi-supervised Learning
- Multi-task Learning
- Early Stopping
- Parameter Tying, Sharing
- Sparse Representations
- Ensemble Methods, Bagging etc.
- Dropout
- Adversarial Training
- Tangent prop and Manifold Tangent Classifier
- Describe Parameter Norm Penalties (PNPs):
Many regularization approaches are based on limiting the capacity of models by adding a parameter norm penalty \(\Omega(\boldsymbol{\theta})\) to the objective function \(J\). We denote the regularized objective function by \(\tilde{J}\):$$\tilde{J}(\boldsymbol{\theta} ; \boldsymbol{X}, \boldsymbol{y})=J(\boldsymbol{\theta} ; \boldsymbol{X}, \boldsymbol{y})+\alpha \Omega(\boldsymbol{\theta}) \tag{7.1}$$
where \(\alpha \in[0, \infty)\) is a HP that weights the relative contribution of the norm penalty term, \(\Omega\), relative to the standard objective function \(J\).
- Define the regularized objective:
$$\tilde{J}(\boldsymbol{\theta} ; \boldsymbol{X}, \boldsymbol{y})=J(\boldsymbol{\theta} ; \boldsymbol{X}, \boldsymbol{y})+\alpha \Omega(\boldsymbol{\theta}) \tag{7.1}$$
- Describe the parameter \(\alpha\):
\(\alpha \in[0, \infty)\) is a HP that weights the relative contribution of the norm penalty term, \(\Omega\), relative to the standard objective function \(J\). - How does it influence the regularization:
- Effects of \(\alpha\):
- \(\alpha = 0\) results in NO regularization
- Larger values of \(\alpha\) correspond to MORE regularization
- Effects of \(\alpha\):
- What is the effect of minimizing the regularized objective?
The effect of minimizing the regularized objective function is that it will decrease, both, the original objective \(J\) on the training data and some measure of the size of the parameters \(\boldsymbol{\theta}\).
- Define the regularized objective:
- How do we deal with the Bias parameter in PNPs? Explain.
In NN, we usually penalize only the weights of the affine transformation at each layer and we leave the biases unregularized.
Biases typically require less data than the weights to fit accurately. The reason is that each weight specifies how TWO variables interact so fitting the weights well, requires observing both variables in a variety of conditions. However, each bias controls only a single variable, thus, we don’t induce too much variance by leaving the biases unregularized. If anything, regularizing the bias can introduce a significant amount of underfitting. - Describe the tuning of the \(\alpha\) HP in NNs for different hidden layers:
In the context of neural networks, it is sometimes desirable to use a separate penalty with a different \(\alpha\) coefficient for each layer of the network. Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the size of search space. - Formally describe the \(L^2\) parameter regularization:
It is a regularization strategy that drives the weights closer to the origin by adding a regularization term:$$\Omega(\mathbf{\theta}) = \frac{1}{2}\|\boldsymbol{w}\|_ {2}^{2}$$
to the objective function.
- AKA:
In statistics, \(L^2\) regularization is also known as Ridge Regression or Tikhonov Regularization.
In ML, it is known as weight decay. - Describe the regularization contribution to the gradient in a single step.
The addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update.$$\boldsymbol{w} \leftarrow(1-\epsilon \alpha) \boldsymbol{w}-\epsilon \nabla_{\boldsymbol{w}} J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y}) \tag{7.5}$$
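A minimal sketch of this update rule (plain gradient descent on a least-squares objective; the data, learning rate \(\epsilon\) and decay coefficient \(\alpha\) are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = rng.normal(size=5)
eps, alpha = 0.01, 0.1          # learning rate and weight-decay coefficient

for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the unregularized objective J(w)
    w = (1 - eps * alpha) * w - eps * grad   # multiplicative shrink, then the usual update

print(w)
```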
- Describe the regularization contribution to the gradient. How does it scale?
The effect of weight decay is to rescale \(\boldsymbol{w}^{\ast}\) along the axes defined by the eigenvector of \(\boldsymbol{H}\). Specifically, the component of \(\boldsymbol{w}^{\ast}\) that is aligned with the \(i\)-th eigenvector of \(\boldsymbol{H}\) is rescaled by a factor of \(\frac{\lambda_{i}}{\lambda_{i}+\alpha}\). -
How does weight decay relate to shrinking the individual weight wrt their size? What is the measure/comparison used?
Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.Condition Effect of Regularization \(\lambda_{i}>>\alpha\) Not much \(\lambda_{i}<<\alpha\) The weight value almost shrunk to \(0\)
- AKA:
- Draw a graph describing the effects of \(L^2\) regularization on the weights:
- Describe the effects of applying weight decay to linear regression
- Derivation:
We place a Gaussian Prior on the weights, with zero mean and equal variance \(\tau^2\):$$\begin{aligned} \hat{\theta}_ {\mathrm{MAP}} &=\arg \max_{\theta} \log P(y | \theta)+\log P(\theta) \\ &=\arg \max _{\boldsymbol{w}}\left[\log \prod_{i=1}^{n} \dfrac{1}{\sigma \sqrt{2 \pi}} e^{-\dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}}+\log \prod_{j=0}^{p} \dfrac{1}{\tau \sqrt{2 \pi}} e^{-\dfrac{w_{j}^{2}}{2 \tau^{2}}} \right] \\ &=\arg \max _{\boldsymbol{w}} \left[-\sum_{i=1}^{n} \dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}-\sum_{j=0}^{p} \dfrac{w_{j}^{2}}{2 \tau^{2}}\right] \\ &=\arg \min_{\boldsymbol{w}} \dfrac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\dfrac{\sigma^{2}}{\tau^{2}} \sum_{j=0}^{p} w_{j}^{2}\right] \\ &=\arg \min_{\boldsymbol{w}} \left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\lambda \sum_{j=0}^{p} w_{j}^{2}\right] \\ &= \arg \min_{\boldsymbol{w}} \left[ \|XW - \boldsymbol{y}\|^2 + \lambda {\|\boldsymbol{w}\|_ 2}^2\right]\end{aligned}$$
- What is \(L^2\) regularization equivalent to?
\(L^2\) regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. - What are we maximizing?
The MAP Estimate of the data. - Derive the MAP Estimate:
$$\begin{aligned} \hat{\theta}_ {\mathrm{MAP}} &=\arg \max_{\theta} P(\theta | y) \\ &=\arg \max_{\theta} \frac{P(y | \theta) P(\theta)}{P(y)} \\ &=\arg \max_{\theta} P(y | \theta) P(\theta) \\ &=\arg \max_{\theta} \log (P(y | \theta) P(\theta)) \\ &=\arg \max_{\theta} \log P(y | \theta)+\log P(\theta) \end{aligned}$$
- What kind of prior do we place on the weights? What are its parameters?
We place a Gaussian Prior on the weights, with zero mean and equal variance \(\tau^2\).
- What is \(L^2\) regularization equivalent to?
- List the properties of \(L^2\) regularization:
- Notice that L2-regularization has a rotational invariance. This actually makes it more sensitive to irrelevant features.
- Adding L2-regularization to a convex function gives a strongly-convex function. So L2-regularization can make gradient descent converge much faster.
- Formally describe the \(L^1\) parameter regularization:
\(L^1\) Regularization is another way to regulate the model by penalizing the size of its parameters; the technique adds a regularization term:$$\Omega(\boldsymbol{\theta})=\|\boldsymbol{w}\|_{1}=\sum_{i}\left|w_{i}\right| \tag{7.18}$$
which is a sum of absolute values of the individual parameters.
- AKA:
LASSO. - Whats the regularized objective function?
$$\tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha\|\boldsymbol{w}\|_ {1}+J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y}) \tag{7.19}$$
- What is its gradient?
$$\nabla_{\boldsymbol{w}} \tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha \operatorname{sign}(\boldsymbol{w})+\nabla_{\boldsymbol{w}} J(\boldsymbol{X}, \boldsymbol{y} ; \boldsymbol{w}) \tag{7.20}$$
- Describe the regularization contribution to the gradient compared to L2. How does it scale?
The regularization contribution to the gradient no longer scales linearly with each \(w_i\); instead, it is a constant factor \(\alpha\) with sign equal to \(\operatorname{sign}(w_i)\).
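Because the \(L^1\) term is non-differentiable at \(0\), a common way to exploit this constant-magnitude contribution in practice is a proximal (soft-thresholding) update rather than a plain gradient step. A minimal sketch with toy weights and an assumed step size (`soft_threshold` is an illustrative helper, not a library function):

```python
import numpy as np

def l2_penalty_grad(w, alpha):
    # L2 contribution scales linearly with each weight
    return 2 * alpha * w

def l1_penalty_subgrad(w, alpha):
    # L1 contribution has constant magnitude alpha with sign(w_i)  (cf. eq. 7.20)
    return alpha * np.sign(w)

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: shrinks each coordinate toward 0 and
    # sets it exactly to 0 when |w_i| <= t (this is what induces sparsity)
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.9, -0.05, 0.002])
alpha, step = 0.1, 1.0
print(l2_penalty_grad(w, alpha))        # proportional to each w_i
print(l1_penalty_subgrad(w, alpha))     # constant magnitude, sign of w_i
print(soft_threshold(w, step * alpha))  # small coordinates become exactly 0
```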
- List the properties and applications of \(L^1\) regularization:
- Induces Sparser Solutions
- Solutions may be non-unique
- How is it used as a feature selection mechanism?
LASSO: The Least Absolute Shrinkage and Selection Operator integrates an \(L^1\) penalty with a linear model and a least-squares cost function.
The \(L^1\) penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
- Derivation:
$$\begin{aligned} \hat{\theta}_ {\mathrm{MAP}} &=\arg \max_{\theta} \log P(y | \theta)+\log P(\theta) \\ &=\arg \max _{\boldsymbol{w}}\left[\log \prod_{i=1}^{n} \dfrac{1}{\sigma \sqrt{2 \pi}} e^{-\dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}}+\log \prod_{j=0}^{p} \dfrac{1}{2 b} e^{-\dfrac{\left|w_{j}\right|}{b}} \right] \\ &=\arg \max _{\boldsymbol{w}} \left[-\sum_{i=1}^{n} \dfrac{\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}}{2 \sigma^{2}}-\sum_{j=0}^{p} \dfrac{\left|w_{j}\right|}{b}\right] \\ &=\arg \min_{\boldsymbol{w}} \dfrac{1}{2 \sigma^{2}}\left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\dfrac{2\sigma^{2}}{b} \sum_{j=0}^{p}\left|w_{j}\right|\right] \\ &=\arg \min_{\boldsymbol{w}} \left[\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{w}^T\boldsymbol{x}_i\right)^{2}+\lambda \sum_{j=0}^{p}\left|w_{j}\right|\right] \\ &= \arg \min_{\boldsymbol{w}} \left[ \|XW - \boldsymbol{y}\|^2 + \lambda \|\boldsymbol{w}\|_ 1\right]\end{aligned}$$
- What is \(L^1\) regularization equivalent to?
MAP Bayesian inference with an isotropic Laplace distribution prior on the weights. - What kind of prior do we place on the weights? What are its parameters?
Isotropic Laplace distribution prior on the weights: \(\operatorname{Laplace}\left(w_{i} ; 0, \frac{1}{\alpha}\right)\).
- Analyze \(L^1\) vs \(L^2\) regularization:
They both shrink the weights towards zero.

| \(L^2\) | \(L^1\) |
| --- | --- |
| Sensitive to irrelevant features | Robust to irrelevant features |
| Computationally Efficient | Non-Differentiable at \(0\) |
| No Sparse Solutions | Produces Sparse Solutions |
| No Feature Selection | Built-in Feature Selection |
| Unique Solution | Possibly multiple solutions |

- For Sparsity:
The analysis above shows that \(L^1\) produces sparse solutions by driving certain coefficients/weights exactly to zero, while \(L^2\) does not have this property. - For correlated features (a numerical sketch follows the list below):
- Identical features:
- \(L^1\) regularization spreads weight arbitrarily (all weights same sign)
- \(L^2\) regularization spreads weight evenly
- Linearly related features:
- \(L^1\) regularization chooses the variable with the larger scale and gives \(0\) weight to the others
- \(L^2\) regularization also prefers variables with larger scale: it spreads the weight proportionally to scale
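A numerical sketch of this behaviour with two identical features (toy data; note that scikit-learn's `Lasso` uses a \(\frac{1}{2n}\) scaling of the squared error, so its `alpha` is not directly comparable to the \(\lambda\) used above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x])              # two identical copies of the same feature
y = 3 * x + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000).fit(X, y)

print(ridge.coef_)   # spreads the weight evenly, roughly [1.5, 1.5]
print(lasso.coef_)   # arbitrary split; typically most of the weight lands on one column
```

The Ridge split is unique and even; the Lasso split is not unique (any same-sign split has the same \(L^1\) penalty), so coordinate descent ends up picking one particular, often very uneven, split.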
- For optimization:
Adding \(L^2\) regularization to a convex function gives a strongly convex function, so \(L^2\) regularization can make gradient descent converge much faster. Moreover, it admits closed-form solutions (for linear models) and is differentiable everywhere, making it more computationally efficient. \(L^1\), being non-differentiable at \(0\), is harder to optimize with gradient-based methods and requires sub-gradient or proximal approximations. - Give an example that shows the difference wrt sparsity:
Let’s imagine we are estimating two coefficients in a regression. Under \(L^2\) regularization, the solution \(\boldsymbol{w}=(1,0)\) incurs the same penalty as \(\boldsymbol{w}=(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\), since both have unit \(L^2\) norm, so they are treated equally. Under \(L^1\) regularization, the same two solutions favor the sparse one:$$\|(1,0)\|_{1}=1<\left\|\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)\right\|_{1}=\sqrt{2}$$
So \(L^2\) regularization doesn’t have any specific built in mechanisms to favor zeroed out coefficients, while \(L^1\) regularization actually favors these sparser solutions.
- For sensitivity:
The rotational invariance of the \(L^2\) regularizer makes it more sensitive to irrelevant features.
- Describe Elastic Net Regularization. Why was it devised? Any properties?
$$\Omega = \lambda\left(\alpha\|w\|_{1}+(1-\alpha)\|w\|_{2}^{2}\right), \alpha \in[0,1]$$
- Combines both \(L^1\) and \(L^2\)
- Used to produce sparse solutions while avoiding the problem of \(L^1\) solutions sometimes being non-unique
- The problem mainly arises with correlated features
- Elastic net regularization tends to have a grouping effect, where correlated input features are assigned equal weights (a minimal sketch follows below).
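A rough illustration of the grouping effect versus plain Lasso on two highly correlated features (toy data; scikit-learn's `ElasticNet` uses an `alpha`/`l1_ratio` parameterization rather than the \(\lambda, \alpha\) convention above):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
# Two strongly correlated features carrying essentially the same signal
x1 = z + 0.01 * rng.normal(size=n)
x2 = z + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2 * z + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=50000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False,
                  max_iter=50000).fit(X, y)

print(lasso.coef_)  # tends to concentrate the weight on one of the two features
print(enet.coef_)   # grouping effect: roughly equal weights on both features
```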
- Motivate Regularization for ill-posed problems:
- What is the property that needs attention?
Under-Constrained Problems. Under-determined systems. Specifically, when \(X^TX\) is singular. - What would the regularized solution correspond to in this case?
Many forms of regularization correspond to inverting \(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\) instead. - Are there any guarantees for the solution to be well-posed? How/Why?
This regularized matrix, \(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\), is guaranteed to be invertible: \(\boldsymbol{X}^{\top}\boldsymbol{X}\) is positive semi-definite, so adding \(\alpha \boldsymbol{I}\) with \(\alpha>0\) shifts all of its eigenvalues up to at least \(\alpha > 0\).
- What is the Linear Algebraic property that needs attention?
\(X^{\top}X\) being singular. - What models are affected by this?
Many linear models (e.g. Linear Regression, PCA) depend on inverting \(\boldsymbol{X}^{\top}\boldsymbol{X}\). - What would the sol correspond to in terms of inverting \(X^{\top}X\):
Many forms of regularization correspond to inverting \(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\) instead. - When would \(X^{\top}X\) be singular?
- The data-generating distribution truly has no variance in some direction.
- No Variance is observed in some direction because there are fewer examples (rows of \(\boldsymbol{X}\)) than input features (columns).
- Describe the Linear Algebraic Perspective. What does it correspond to? [LAP]
Given that the Moore-Penrose pseudoinverse \(\boldsymbol{X}^{+}\) of a matrix \(\boldsymbol{X}\) can solve underdetermined linear equations:$$\boldsymbol{X}^{+}=\lim_{\alpha \searrow 0}\left(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\right)^{-1} \boldsymbol{X}^{\top} \tag{7.29}$$
The equation corresponds to performing linear regression with weight-decay.
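A quick check of this limit on an underdetermined system (a NumPy sketch with random data; `alpha` plays the role of the regularization coefficient \(\alpha\) in eq. 7.29):

```python
import numpy as np

rng = np.random.default_rng(0)
# Underdetermined system: fewer examples (rows) than features (columns),
# so X^T X is singular and cannot be inverted directly.
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)

alpha = 1e-6   # small regularization coefficient
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
w_pinv = np.linalg.pinv(X) @ y     # Moore-Penrose pseudoinverse solution

print(np.allclose(w_ridge, w_pinv, atol=1e-4))   # True: eq. 7.29 as alpha -> 0
print(np.linalg.norm(w_pinv))                    # the minimum-norm solution
```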
- Can models with no closed-form solution be underdetermined? Explain. [CFS]
Models with no closed-form solution can, also, be underdetermined:
Take logistic regression on a linearly separable dataset: if a weight vector \(\boldsymbol{w}\) is able to achieve perfect classification, then so does \(2\boldsymbol{w}\), but with even higher likelihood. Thus, an iterative optimization procedure (e.g. SGD) will continually increase the magnitude of \(\boldsymbol{w}\) and, in theory, will never halt. - What models are affected by this? [CFS]
Many models fit with MLE, i.e. by maximizing the log-likelihood (equivalently, minimizing cross-entropy); namely, Logistic Regression.
- Define the Moore-Penrose Pseudoinverse:
$$\boldsymbol{X}^{+}=\lim_{\alpha \searrow 0}\left(\boldsymbol{X}^{\top} \boldsymbol{X}+\alpha \boldsymbol{I}\right)^{-1} \boldsymbol{X}^{\top} \tag{7.29}$$
- What can it solve? How?
It can solve underdetermined linear equations.
When applied to underdetermined systems with non-unique solutions, it finds the minimum-norm solution to the linear system. - What does it correspond to in terms of regularization?
The equation corresponds to performing linear regression with weight-decay. - What is the limit wrt?
\(7.29\) is the limit of eq \(7.17\) as the regularization coefficient \(\alpha\) shrinks to zero. - How can we interpret the pseudoinverse wrt regularization?
We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
- Explain the problem with Logistic Regression:
Logistic regression on a linearly separable dataset: if a weight vector \(\boldsymbol{w}\) is able to achieve perfect classification, then so does \(2\boldsymbol{w}\), but with even higher likelihood. Thus, an iterative optimization procedure (e.g. SGD) will continually increase the magnitude of \(\boldsymbol{w}\) and, in theory, will never halt. - What are the possible solutions?
Regularization: e.g. weight decay. - Are there any guarantees that we achieve with regularization? How?
We can use regularization to guarantee the convergence of iterative methods applied to underdetermined problems.
Weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.
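A minimal 1-D sketch of this behaviour (toy separable data, a hypothetical learning rate and step count; only the trend matters, not the exact numbers):

```python
import numpy as np

# Linearly separable 1-D data: x < 0 -> class 0, x > 0 -> class 1
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def train(weight_decay, steps=20000, lr=0.1):
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * x))                  # sigmoid predictions
        grad = np.mean((p - y) * x) + weight_decay * w    # NLL gradient + decay
        w -= lr * grad
    return w

print(train(weight_decay=0.0))    # keeps growing: larger with more steps
print(train(weight_decay=0.01))   # converges to a finite weight value
```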
- Describe dataset augmentation and its techniques:
Having more data is the most desirable way to improve a machine learning model’s performance. In many cases, it is relatively easy to artificially generate data. - When is it applicable?
For certain problems, like classification, this approach is readily usable. E.g. for a classification task, we require the model to be invariant to certain types of transformations; we can generate new data by applying those transformations to our current dataset. - When is it not?
This approach is not applicable to many problems, especially those that require us to learn the true data-distribution first, e.g. Density Estimation. - Motivate the Noise Robustness property:
Noise Robustness is an important property for ML models:- For many classification and (some) regression tasks: the task should be possible to solve even if small random noise is added to the input (Local Constancy)
- Moreover, NNs prove not to be very robust to noise.
- How can Noise Robustness motivate a regularization technique?
To increase noise robustness, we will need to somehow teach our models to ignore noise. That can be done by teaching them to model data that is, perhaps, noisier than what we actually have; thus regularizing them away from purely and completely fitting the training data.
- How can we enhance noise robustness in NNs?
By Noise Injection in different parts of the network.- Give a motivation for Noise Injection:
It can be seen as a form of data augmentation. - Where can noise be injected?
- Input Layer
- Hidden Layer
- Weight Matrices
- Output Layer
- Give Motivation, Interpretation and Applications of injecting noise in the different components (from above):
Injecting Noise in the Input Layer:- Motivation:
We have motivated the injection of noise, to the inputs, as a dataset augmentation strategy. - Interpretation:
For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b).
Injecting Noise in the Hidden Layers:
- Motivation:
We can motivate it as a variation of data augmentation. - Interpretation:
It can be seen as doing data-augmentation at multiple levels of abstraction. - Applications:
The most successful application of this type of noise injection is Dropout.
It can be seen as a process of constructing new inputs by multiplying by noise.
Injecting Noise in the Weight Matrices:
- Interpretation:
- It can be interpreted as a stochastic implementation of Bayesian inference over the weights.
- The Bayesian View:
The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty.
- It can, also, be interpreted as equivalent to a more traditional form of regularization, encouraging stability of the function to be learned.
- Applications:
This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011).
Injecting Noise in the Output Layer:
- Motivation:
- Most datasets have some number of mistakes in the \(y\) labels. It can be harmful to maximize \(\log p(y | \boldsymbol{x})\) when \(y\) is a mistake. One way to prevent this is to explicitly model the noise on the labels.
One can assume that for some small constant \(\epsilon\), the training set label \(y\) is correct with probability \(1-\epsilon\).
This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples (e.g. label smoothing). - MLE with a softmax classifier and hard targets may never converge - the softmax can never predict a probability of exactly \(0\) or \(1\), so it will continue to learn larger and larger weights, making more extreme predictions forever.{: #bodyContents33mle}
- Interpretation:
Assuming the label is correct with probability \(1-\epsilon\) and incorporating that assumption analytically into the cost function, rather than explicitly drawing noise samples. - Applications:
Label Smoothing regularizes a model based on a softmax with \(k\) output values by replacing the hard \(0\) and \(1\) classification targets with targets of \(\dfrac{\epsilon}{k-1}\) and \(1-\epsilon\), respectively.- Applied to MLE problem: Label smoothing, compared to weight-decay, has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification.
- Application in modern NN: (Szegedy et al. 2015)
- Give an interpretation for injecting noise in the Input layer:
For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b). - Give an interpretation for injecting noise in the Hidden layers:
It can be seen as doing data-augmentation at multiple levels of abstraction. - What is the most successful application of this technique:
The most successful application of this type of noise injection is Dropout. - Describe the Bayesian View of learning:
The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty. - How does it motivate injecting noise in the weight matrices?
It can be interpreted as a stochastic implementation of Bayesian inference over the weights. - Describe a different, more traditional, interpretation of injecting noise to matrices. What are its effects on the function to be learned?
It can, also, be interpreted as equivalent to a more traditional form of regularization, encouraging stability of the function to be learned. - What's the biggest application for this kind of regularization?
This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). - Motivate injecting noise in the Output layer:
- Most datasets have some number of mistakes in the \(y\) labels. It can be harmful to maximize \(\log p(y | \boldsymbol{x})\) when \(y\) is a mistake. One way to prevent this is to explicitly model the noise on the labels.
One can assume that for some small constant \(\epsilon\), the training set label \(y\) is correct with probability \(1-\epsilon\).
This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples (e.g. label smoothing). - MLE with a softmax classifier and hard targets may never converge - the softmax can never predict a probability of exactly \(0\) or \(1\), so it will continue to learn larger and larger weights, making more extreme predictions forever.{: #bodyContents33mle}
- What is the biggest application of this technique?
Label Smoothing regularizes a model based on a softmax with \(k\) output values by replacing the hard \(0\) and \(1\) classification targets with targets of \(\dfrac{\epsilon}{k-1}\) and \(1-\epsilon\), respectively. - How does it compare to weight-decay when applied to MLE problems?
Applied to MLE problem: Label smoothing, compared to weight-decay, has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification.
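A minimal sketch of constructing such smoothed targets (the helper name and toy labels are illustrative):

```python
import numpy as np

def smooth_labels(y, k, eps=0.1):
    """Replace hard one-hot targets with label-smoothed targets:
    1 - eps for the correct class and eps / (k - 1) for the others."""
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

y = np.array([0, 2, 1])               # integer class labels, k = 3 classes
print(smooth_labels(y, k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]
#  [0.05 0.9  0.05]]
```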
- Define “Semi-Supervised Learning”:
Semi-Supervised Learning is a class of ML tasks and techniques that makes use of both unlabeled examples from \(P(\mathbf{x})\) and labeled examples from \(P(\mathbf{x}, \mathbf{y})\) to estimate \(P(\mathbf{y} | \mathbf{x})\) or predict \(\mathbf{y}\) from \(\mathbf{x}\).- What does it refer to in the context of DL:
In the context of Deep Learning, Semi-Supervised Learning usually refers to learning a representation \(\boldsymbol{h}=f(\boldsymbol{x})\). - What is its goal?
The goal is to learn a representation such that examples from the same class have similar representations. - Give an example in classical ML:
Using unsupervised learning to build representations: PCA, as a preprocessing step before applying a classifier, is a long-standing variant of this approach.
- Describe an approach to applying semi-supervised learning:
Instead of separating the supervised and unsupervised criteria, we can instead have a generative model of \(P(\mathbf{x})\) (or \(P(\mathbf{x}, \mathbf{y})\)) which shares parameters with a discriminative model \(P(\mathbf{y} \vert \mathbf{x})\).
The idea is to share the unsupervised/generative criterion with the supervised criterion to express a prior belief that the structure of \(P(\mathbf{x})\) (or \(P(\mathbf{x}, \mathbf{y})\)) is connected to the structure of \(P(\mathbf{y} \vert \mathbf{x})\), which is captured by the shared parameters.
By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008). - How can we interpret dropout wrt data augmentation?
Dropout can be seen as a process of constructing new inputs by multiplying by noise.
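A minimal sketch of this multiplicative-noise view, using the common "inverted dropout" scaling (the keep probability and activations are made up):

```python
import numpy as np

def dropout(h, p_keep, rng):
    """Multiply activations by a random binary mask (inverted dropout):
    each unit is kept with probability p_keep and scaled by 1 / p_keep
    so the expected activation is unchanged."""
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep

rng = np.random.default_rng(0)
h = np.ones((2, 5))                       # stand-in for a layer's activations
print(dropout(h, p_keep=0.8, rng=rng))    # some entries zeroed, the rest scaled to 1.25
```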
- Add Answers from link below for L2 applied to linear regression and how it reduces variance:
Link
- When is Ridge regression favorable over Lasso regression? What about for correlated features?
If there exists a subset of only a few variables with medium/large effects, use Lasso regression. In the presence of many variables with small/medium effects, use Ridge regression (per the ESL authors).
If features are correlated, it is hard to determine which variables to drop, so it is often better not to drop variables at all. Thus, use Ridge over Lasso, since the latter produces non-unique solutions and might drop features arbitrarily, while the former spreads the weight more evenly.
Misc.
- Explain Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter.
LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words.
The “Dirichlet” distribution is a distribution over distributions: in LDA, documents are distributions over topics, and topics are distributions over words. - How to deal with the curse of dimensionality?
- Feature Selection
- Feature Extraction
- How to detect correlation of “categorical variables”?
- Chi-Squared test
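For example, given a contingency table of counts for two categorical variables, SciPy's `chi2_contingency` returns the test statistic and p-value (the table below is made-up toy data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of two categorical variables,
# e.g. rows = group, columns = preferred product category (toy counts)
table = np.array([[30, 10, 20],
                  [20, 25, 15]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the variables are associated
```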
- Define “marginal likelihood” (wrt naive bayes):
Marginal likelihood is the probability that the word ‘FREE’ appears in any message, regardless of its class (i.e. not conditioned on spam/not-spam). - KNN VS K-Means
- KNN: Supervised, Classification/Regression Algorithm
- K-Means: Unsupervised Clustering Algorithm
- What is a convex hull?
In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are created, the maximum margin hyperplane (MMH) is obtained as the perpendicular bisector between the two convex hulls; the MMH is the line that attempts to create the greatest separation between the two groups. - Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature. - OLS vs MLE
They both estimate parameters in the model. They are the same in the case of normal distribution. - What are collinearity and multicollinearity?
- Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have some correlation.
- Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
- Describe ways to overcome scaling (scalability) issues:
Nyström methods / low-rank kernel matrix approximations, random features, and local-by-query methods (e.g. nearest neighbors).