
Introduction and Setup

  1. The Margin:
    The margin of a linear classifier is the distance from the decision boundary to the nearest sample point.

  2. The current Problem:
    All the classifiers discussed thus far (e.g. the Centroid and Perceptron classifiers) will converge to a correct classifier on linearly separable data; however, the classifier they converge to is neither unique nor necessarily the best.

    But what does it mean to be the “best” classifier?

    We assume that if we can maximize the distance between the data points to be classified and the hyperplane that classifies them, then we have reached the boundary that is the “best fit”, i.e. the one that allows for the most room for error.
  3. The Solution:
    We enforce a constraint under which the resulting classifier attains the maximum margin.
  4. The Signed Distance:
    The signed distance is the (perpendicular) distance from a point to a hyperplane, taken with a sign that indicates which side of the hyperplane the point lies on. Solving for it (in absolute value) yields the following formula: \(d = \dfrac{| w \cdot \mathbf{x}_ n + b |}{\|w\|},\)
    where the hyperplane in \(\mathbf{R}^n\) is \(w \cdot x + b = 0\) and \(\mathbf{x}_ n\) is the point in question.
    • Proof.
      • Suppose we have an affine hyperplane defined by \(w \cdot x + b = 0\) and a point \(\mathbf{x}_ n\).
      • Suppose that \(\mathbf{x} \in \mathbf{R}^n\) is a point satisfying \(w \cdot \mathbf{x} + b = 0\), i.e. it is a point on the plane.
      • We construct the vector \(\mathbf{x}_ n−\mathbf{x}\), which points from \(\mathbf{x}\) to \(\mathbf{x}_ n\), and then project it (the scalar projection is the signed distance) onto a vector perpendicular to the plane, namely \(w\),

        \[d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \left| \frac{(\mathbf{x}_ n-\mathbf{x})\cdot w}{\|w\|} \right| = \frac{|\mathbf{x}_ n \cdot w - \mathbf{x} \cdot w|}{\|w\|}.\]
      • Since \(\mathbf{x}\) is a point on the plane, it must satisfy \(w\cdot \mathbf{x}=-b\), so we get

        \[d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \frac{|\mathbf{x}_ n \cdot w +b|}{\|w\|}\]

    Thus, we conclude that if \(\|w\| = 1\), then the (unsigned) distance from a data point \(X_i\) to the hyperplane is simply \(|w \cdot X_i + b|\).
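    As a quick sanity check, the formula can be evaluated directly. Below is a minimal NumPy sketch (the hyperplane \(w, b\) and the sample points are made-up values, purely for illustration) that computes the signed and unsigned distances:

    ```python
    import numpy as np

    # Made-up hyperplane w.x + b = 0 and sample points (illustration only).
    w = np.array([3.0, 4.0])
    b = -5.0
    X = np.array([[1.0, 2.0],
                  [3.0, -1.0],
                  [0.0, 0.0]])

    signed = (X @ w + b) / np.linalg.norm(w)   # sign tells which side of the plane
    unsigned = np.abs(signed)                  # |w.x_i + b| / ||w||
    print(signed, unsigned)
    ```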

    (Caltech):
    So now we can characterize the size of the margin as the distance \(\frac{1}{\|\mathbf{w}\|}\) between the hyperplane/boundary and the closest point \(\mathbf{x}_ n\), counted in both directions (multiply by 2) to get \(\frac{2}{\|\mathbf{w}\|}\); this uses the condition we specified earlier, \(\left|\mathbf{w}^{\top} \mathbf{x}_ {n} + b\right|=1\) for the closest point \(\mathbf{x}_ n\).

    Thus, we formulate the optimization problem of maximizing the margin as maximizing this distance, subject to the normalization condition used in deriving it:

    $$\max_{\mathbf{w}} \dfrac{2}{\|\mathbf{w}\|} \:\:\: : \:\: \min _{n=1,2, \ldots, N}\left|\mathbf{w}^{\top} \mathbf{x}_{n}+b\right|=1$$

    We can reformulate this by (1) flipping the ratio and minimizing, (2) squaring the norm, since the square is monotonic and convex, and (3) noticing that \(\left|\mathbf{w}^{T} \mathbf{x}_ {n}+b\right|=y_{n}\left(\mathbf{w}^{T} \mathbf{x}_ {n}+b\right)\) (since the signal and the label must agree, their product is always positive), so the \(\min\) operator can be replaced by requiring that the condition \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1\) holds for all points (provable by contradiction; the argument is spelled out in the “Maximizing the Margin” item below). This gives:

    $$\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall n \in \{1, \ldots, N\}$$

    Now, when we solve the “friendly” problem above, we get the separating plane with the best possible margin (best = biggest).
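    To see that this problem is directly solvable with an off-the-shelf constrained optimizer, here is a minimal sketch using SciPy's SLSQP method on the primal form \(\min \frac{1}{2}\mathbf{w}^T\mathbf{w}\) s.t. \(y_n(\mathbf{w}^{\top}\mathbf{x}_ n + b) \geq 1\); the toy 2-D data is made up for illustration (this is only a quick check of the formulation, not the method derived below):

    ```python
    import numpy as np
    from scipy.optimize import minimize  # assumes SciPy is available

    # Toy, linearly separable data (made up for illustration).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    def objective(v):                     # v = [w_1, w_2, b]
        w = v[:-1]
        return 0.5 * w @ w

    # SLSQP expects inequality constraints in the form g(v) >= 0.
    constraints = [{'type': 'ineq',
                    'fun': lambda v, x=x_i, t=y_i: t * (v[:-1] @ x + v[-1]) - 1.0}
                   for x_i, y_i in zip(X, y)]

    res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),  # feasible start for this toy data
                   method='SLSQP', constraints=constraints)
    w, b = res.x[:-1], res.x[-1]
    print(w, b, 2.0 / np.linalg.norm(w))  # weights, bias, and the margin width 2/||w||
    ```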

    To solve the above problem, we need something that deals with inequality constraints; thus, we use the KKT conditions for solving a Lagrangian under inequality constraints.
    The Lagrange Formulation:

    • Formulate the Lagrangian:
      1. Take each inequality constraint and put it in zero-form (compared against zero): \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1 \geq 0\)
      2. Multiply each inequality by a Lagrange Multiplier \(\alpha_n\)
      3. Add them to the objective function \(\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w}\)
        The sign will be \(-\) (negative) simply because the inequality is \(\geq 0\)

      $$\min_{w, b} \max_{\alpha_n} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \mathbf{w}^T\mathbf{w} -\sum_{n=1}^{N} \alpha_{n}\left(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1\right) \:\:\: : \:\: \alpha_n \geq 0$$

    • Optimize the objective independently, for each of the unconstrained variables:
      1. Gradient w.r.t. \(\mathbf{w}\):

        $$\nabla_{\mathrm{w}} \mathcal{L}=\mathrm{w}-\sum_{n=1}^{N} \alpha_{n} y_{n} \mathrm{x}_ {n}=0 \\ \implies \\ \mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}$$

      2. Derivative w.r.t. \(b\):

        $$\frac{\partial \mathcal{L}}{\partial b}=-\sum_{n=1}^{N} \alpha_{n} y_{n}=0 \\ \implies \\ \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$

    • Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
      • Substitute the above conditions into the original Lagrangian (so that the optimization w.r.t. \(\alpha_n\) becomes free of \(\mathbf{w}\) and \(b\); the substitution is worked out below):

        $$\mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}$$

      • Notice that the first condition, \(\mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}\), does not constrain \(\alpha_n\) at all, so it is a vacuous constraint. The second condition, \(\sum_{n=1}^{N} \alpha_{n} y_{n}=0\), however, does constrain \(\alpha_n\) and must be kept.
      • Set the optimization objective and the constraints, a quadratic function in \(\alpha_n\):

      $$\max_{\alpha} \mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m} \\ \:\:\:\:\:\:\:\:\:\: : \:\: \alpha_n \geq 0 \:\: \forall \: n= 1, \ldots, N \:\: \wedge \:\: \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$
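      • For completeness, here is the substitution worked out: plugging \(\mathbf{w}=\sum_{n} \alpha_{n} y_{n} \mathbf{x}_ {n}\) and \(\sum_{n} \alpha_{n} y_{n}=0\) into the Lagrangian,

        $$\begin{aligned} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) &= \frac{1}{2}\mathbf{w}^{\top}\mathbf{w} - \sum_{n=1}^{N}\alpha_{n} y_{n} \mathbf{w}^{\top}\mathbf{x}_{n} - b\underbrace{\sum_{n=1}^{N}\alpha_{n} y_{n}}_{=\,0} + \sum_{n=1}^{N}\alpha_{n} \\ &= \frac{1}{2}\mathbf{w}^{\top}\mathbf{w} - \mathbf{w}^{\top}\mathbf{w} + \sum_{n=1}^{N}\alpha_{n} = \sum_{n=1}^{N}\alpha_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\top}\mathbf{x}_{m}, \end{aligned}$$

        where the second line uses \(\sum_{n}\alpha_n y_n \mathbf{w}^{\top}\mathbf{x}_ n = \mathbf{w}^{\top}\big(\sum_{n}\alpha_n y_n \mathbf{x}_ n\big) = \mathbf{w}^{\top}\mathbf{w}\).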

    • Set the problem as a Quadratic Programming problem:
      • Change the maximization to minimization by flipping the signs:

        $$\min _{\alpha} \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}-\sum_{n=1}^{N} \alpha_{n}$$

      • Isolate the Coefficients from the \(\alpha_n\)s and set in matrix-form:

        $$\min _{\alpha} \frac{1}{2} \alpha^{\top} \underbrace{\begin{bmatrix} y_{1} y_{1} \mathbf{x}_{1}^{\top} \mathbf{x}_{1} & y_{1} y_{2} \mathbf{x}_{1}^{\top} \mathbf{x}_{2} & \ldots & y_{1} y_{N} \mathbf{x}_{1}^{\top} \mathbf{x}_{N} \\ y_{2} y_{1} \mathbf{x}_{2}^{\top} \mathbf{x}_{1} & y_{2} y_{2} \mathbf{x}_{2}^{\top} \mathbf{x}_{2} & \ldots & y_{2} y_{N} \mathbf{x}_{2}^{\top} \mathbf{x}_{N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N} y_{1} \mathbf{x}_{N}^{\top} \mathbf{x}_{1} & y_{N} y_{2} \mathbf{x}_{N}^{\top} \mathbf{x}_{2} & \ldots & y_{N} y_{N} \mathbf{x}_{N}^{\top} \mathbf{x}_{N} \end{bmatrix}}_{\text{quadratic coefficients}} \alpha+\underbrace{\left(-\mathbf{1}^{\top}\right)}_ {\text { linear }} \alpha \\ \:\:\:\:\:\:\:\:\:\: : \:\: \underbrace{\mathbf{y}^{\top} \boldsymbol{\alpha}=0}_{\text { linear constraint }} \:\: \wedge \:\: \underbrace{0}_{\text { lower bounds }} \leq \alpha \leq \underbrace{\infty}_{\text { upper bounds }} $$

        The quadratic programming package asks you for the quadratic term (the matrix), the linear term, the linear constraint, and the range of the \(\alpha_n\)s; it then gives you back \(\boldsymbol{\alpha}\).

      Equivalently:

      $$\min _{\alpha} \frac{1}{2} \boldsymbol{\alpha}^{\mathrm{T}} \mathrm{Q} \boldsymbol{\alpha}-\mathbf{1}^{\mathrm{T}} \boldsymbol{\alpha} \quad \text { subject to } \quad \mathbf{y}^{\mathrm{T}} \boldsymbol{\alpha}=0 ; \quad \boldsymbol{\alpha} \geq \mathbf{0}$$
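      A minimal sketch of handing this QP to a solver, assuming the cvxopt package is available (its `solvers.qp` routine takes exactly the quadratic term, linear term, inequality bounds, and equality constraint listed above); this is a sketch, not a production implementation:

      ```python
      import numpy as np
      from cvxopt import matrix, solvers  # assumes cvxopt is installed

      def hard_margin_svm(X, y):
          """Solve  min_a 1/2 a^T Q a - 1^T a   s.t.  y^T a = 0,  a >= 0."""
          N = X.shape[0]
          Q = np.outer(y, y) * (X @ X.T)              # Q[n, m] = y_n y_m x_n^T x_m
          P = matrix(Q + 1e-8 * np.eye(N))            # quadratic term (tiny jitter for stability)
          q = matrix(-np.ones(N))                     # linear term (-1)
          G = matrix(-np.eye(N))                      # -a_n <= 0   <=>   a_n >= 0
          h = matrix(np.zeros(N))
          A = matrix(y.reshape(1, -1).astype(float))  # equality constraint y^T a = 0
          b = matrix(0.0)
          solvers.options['show_progress'] = False
          alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

          w = X.T @ (alpha * y)                       # stationarity: w = sum_n a_n y_n x_n
          sv = np.argmax(alpha)                       # any point with a_n > 0 is a support vector
          b_out = y[sv] - w @ X[sv]                   # y_sv (w^T x_sv + b) = 1  =>  b = y_sv - w^T x_sv
          return w, b_out, alpha
      ```

      On linearly separable data, `hard_margin_svm(X, y)` returns the maximum-margin \((\mathbf{w}, b)\) along with the multipliers \(\alpha_n\), which are (numerically) nonzero only for the support vectors.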

  5. Geometric Analysis:
    First, we notice that for any given plane \(w^Tx = 0\), the equations \(\gamma \, w^Tx = 0\), where \(\gamma \in \mathbf{R}\) is a nonzero scalar, all characterize the same plane rather than different planes.
    This is because \(w^Tx = 0 \iff \gamma \, w^Tx = \gamma \cdot 0 \iff \gamma \, w^Tx = 0\).
    The above implies that any model that takes the weights \(w\) as input and produces a margin has to be scale invariant.
    To get around this and simplify the analysis, I am going to consider all the representations of the same plane, and pick the one where the weight \(w\) is normalized (re-scaled) such that the signed distance (the distance to the point closest to the plane) is equal to one:

    $$|w^Tx_n| > 0 \rightarrow |w^Tx_n| = 1$$

    where \(x_n\) is the point closest to the plane.
    We constrain the hyperplane by normalizing \(w\) to satisfy \(|w^Tx_n| = 1\), or, with the added bias, \(|w^Tx_n + b| = 1\), for the closest point \(x_n\).
    This implies that there exists a “slab” of width \(\dfrac{2}{\|w\|}\).
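    A small numeric illustration of this re-scaling (with a made-up plane and point): scaling \((w, b)\) by any \(\gamma > 0\) leaves the plane and the geometric distance unchanged, and choosing \(\gamma = 1 / |w^Tx_n + b|\) for the closest point \(x_n\) produces the normalization \(|w^Tx_n + b| = 1\):

    ```python
    import numpy as np

    # Made-up plane and closest sample point (illustration only).
    w, b = np.array([3.0, 4.0]), -5.0
    x_n = np.array([1.0, 1.0])                         # pretend this is the closest point

    geom = abs(w @ x_n + b) / np.linalg.norm(w)        # geometric distance: 2/5 = 0.4

    gamma = 1.0 / abs(w @ x_n + b)                     # re-scale so that |w.x_n + b| = 1
    w2, b2 = gamma * w, gamma * b
    print(abs(w2 @ x_n + b2))                              # -> 1.0 (normalized signal)
    print(abs(w2 @ x_n + b2) / np.linalg.norm(w2), geom)   # same geometric distance
    print(2.0 / np.linalg.norm(w2))                        # slab width 2/||w|| = 2 * 0.4
    ```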

  6. The Margin, mathematically:
    Now, we can mathematically characterize the margin.
    By substituting the constraints, \(y_i(w^TX_i+ b) \geq 1, \forall i \in [1,n]\), into the signed distance, we get:

    $$\min_i \dfrac{1}{\|w\|} |w^TX_i + b| \geq \dfrac{1}{\|w\|}$$

  7. The distance of the point closest to the hyperplane:
    We find the distance of the point closest to the hyperplane.
    Let \(X_n\) be the point that is closest to the plane, and let \(\hat{w} = \dfrac{w}{\|w\|}\).
    Take any point \(X\) on the plane, and let \(\vec{v}\) be the vector \(\vec{v} = X_n - X\).
    Now, the distance \(d\) is equal to
    \[\begin{align} d & \ = |\hat{w} \cdot \vec{v}| \\ & \ = |\hat{w} \cdot (X_n - X)| \\ & \ = |\hat{w} \cdot X_n - \hat{w} \cdot X| \\ & \ = \dfrac{1}{\|w\|}|w \cdot X_n + b - w \cdot X - b|, & \text{we add and subtract the bias } b\\ & \ = \dfrac{1}{\|w\|}|(w \cdot X_n + b) - (w \cdot X + b)| \\ & \ = \dfrac{1}{\|w\|}|(w \cdot X_n + b) - (0)|, & \text{from the eq. of the plane, for a point on the plane} \\ & \ = \dfrac{1}{\|w\|}|(1) - (0)|, & \text{from the constraint on the distance of the closest point} \\ & \ = \dfrac{1}{\|w\|} \end{align}\]
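    A quick numeric check of this result (numbers made up for illustration): take \(w = (3, 4)\) and \(b = 0\), and suppose the closest point is \(X_n = (0.6, -0.2)\), which indeed satisfies the normalization \(w \cdot X_n + b = 1.8 - 0.8 = 1\). Then

    \[d = \frac{|w \cdot X_n + b|}{\|w\|} = \frac{1}{\sqrt{3^2 + 4^2}} = \frac{1}{5} = 0.2 = \frac{1}{\|w\|}.\]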
  8. Slab Existence:
    The analysis done above allows us to conclude that there exists a slab of width \(\dfrac{2}{\|w\|}\) containing no sample points, through whose center the hyperplane runs (i.e. the hyperplane bisects the slab).
  9. Maximizing the Margin:
    To maximize the margin we need to maximize the width of the slab, i.e. maximize \(\dfrac{2}{\|w\|}\),
    or equivalently,
    \[\max_w \dfrac{2}{\|w\|} \;\equiv\; \min_w \dfrac{\|w\|}{2} \;\equiv\; \min_w \dfrac{1}{2}\|w\| \;\equiv\; \min_w \dfrac{1}{2}\|w\|^2\]
    subject to the constraint mentioned earlier, \(\min_i |wX_i + b| = 1\), or equivalently
    \[y_i(wX_i + b) \geq 1, \forall i \in [1,n]\]
    since the expression \(y_i(wX_i + b)\) enforces the absolute-value condition, as in our earlier analysis of linear classifiers.
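    The equivalence (the “proof by contradiction” referred to in the setup) can be argued as follows: suppose that at the optimum all the constraints hold but none is tight, i.e. \(\min_i y_i(wX_i + b) = c > 1\). Then \((w/c,\, b/c)\) still satisfies every constraint while having a strictly smaller norm, contradicting the optimality of \(w\). Hence, at the optimum, the closest point attains \(y_i(wX_i + b) = 1\), which is exactly the normalization \(\min_i |wX_i + b| = 1\).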
  10. The Optimization Problem for Maximum Margin Classifiers:
    \[\min_w \dfrac{1}{2}w^Tw \:\:\: : \:\: y_i(wX_i + b) \geq 1, \forall i \in [1,n]\]

    The above problem is a Quadratic Program, in \(d + 1\) dimensions with \(n\) constraints, in standard form.

    Notice that we use the quadratic \(w^Tw\) as the objective instead of the norm \(\|w\|\) because the quadratic is smooth at zero, whereas the norm is not, which hinders the optimization.

  11. Notes:
    • The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
      Since if you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane, and create the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\) parallel to the plane by subtracting the two points, then the following equations must hold:

      $$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$

    • In a problem of minimizing a function:
      • Unconstrained problem:
        You set the gradient of the function to Zero and solve.
      • Constrained (regularization?):
        The gradient becomes related to the constraint; the gradient \(\nabla E_{\mathrm{in}}\) is normal to the constraint.
    • Conceptual Dichotomy between Regularization and SVM:
      (figure omitted)