Introduction and Set up
The Margin:
The margin of a linear classifier is the distance from the decision boundary to the nearest sample point.
- The current Problem:
- All the classifiers discussed thus far (i.e. Centroid, Perceptron) will converge to a correct classifier on linearly separable data; however, the classifier they converge to is neither unique nor the best.
But what does it mean to be the “best” classifier?
- We assume that if we can maximize the distance between the data points to be classified and the hyperplane that classifies them, then we have reached a boundary that allows for the “best-fit”, i.e. allows for the most room for error.
- The Solution:
- We enforce a constraint that yields a maximum-margin classifier.
- The Signed Distance:
The signed distance is the distance from a point to a hyperplane, signed according to which side of the hyperplane the point lies on. Solving for it yields the following formula: \(d = \dfrac{| w \cdot \mathbf{x}_ n + b |}{\|w\|},\)
where we have an n-dimensional hyperplane \(w \cdot x + b = 0\) and a point \(\mathbf{x}_ n\).
- Proof.
- Suppose we have an affine hyperplane defined by \(w \cdot x + b = 0\) and a point \(\mathbf{x}_ n\).
- Suppose that \(\mathbf{x} \in \mathbf{R}^n\) is a point satisfying \(w \cdot \mathbf{x} + b = 0\), i.e. it is a point on the plane.
We construct the vector \(\mathbf{x}_ n - \mathbf{x}\), which points from \(\mathbf{x}\) to \(\mathbf{x}_ n\), and then project it (the scalar projection is the signed distance) onto the vector perpendicular to the plane, i.e. \(w\),
\[d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \left| \frac{(\mathbf{x}_ n-\mathbf{x})\cdot w}{\|w\|} \right| = \frac{|\mathbf{x}_ n \cdot w - \mathbf{x} \cdot w|}{\|w\|}.\]
Since \(\mathbf{x}\) is a point on the plane, it must satisfy \(w\cdot \mathbf{x}=-b\), so we get
\[d=| \text{comp}_{w} (\mathbf{x}_ n-\mathbf{x})| = \frac{|\mathbf{x}_ n \cdot w +b|}{\|w\|}\]
Thus, we conclude that the distance from a datapoint \(X_i\) to the hyperplane is \(\dfrac{|w \cdot X_i + b|}{\|w\|}\); in particular, if \(\|w\| = 1\), it is simply \(|w \cdot X_i + b|\).
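As a quick sanity check of this formula, here is a minimal numpy sketch; the hyperplane \(w, b\) and the points below are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 in R^2 (illustrative values: ||w|| = 5).
w = np.array([3.0, 4.0])
b = -5.0

# A few sample points: one far from the plane, one on it, one on the other side.
X = np.array([[3.0, 4.0],
              [1.0, 0.5],
              [0.0, 0.0]])

# Signed distance of each point to the hyperplane: (w.x + b) / ||w||.
signed = (X @ w + b) / np.linalg.norm(w)
print(signed)          # [ 4.  0. -1.] -- the sign tells which side of the plane
print(np.abs(signed))  # unsigned distances |w.x + b| / ||w||
```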
(Caltech):
So now we can characterize the size of the margin as the distance, \(\frac{1}{\|\mathbf{w}\|}\), between the hyperplane/boundary and the closest point \(\mathbf{x}_ n\); counting both directions (multiply by 2) gives \(\frac{2}{\|\mathbf{w}\|}\), under the condition we specified earlier, \(\left|\mathbf{w}^{\top} \mathbf{x}_ {n} + b\right|=1\) for the closest point \(\mathbf{x}_ n\). Thus, we formulate the optimization problem of maximizing the margin by maximizing this distance, subject to the condition under which it was derived:
$$\max_{\mathbf{w}} \dfrac{2}{\|\mathbf{w}\|} \:\:\: : \:\: \min _{n=1,2, \ldots, N}\left|\mathbf{w}^{\top} \mathbf{x}_{n}+b\right|=1$$
Which we can reformulate by (1) flipping and minimizing, (2) squaring, since the square is monotonic and gives a convex objective, and (3) noticing that \(\left|\mathbf{w}^{T} \mathbf{x}_ {n}+b\right|=y_{n}\left(\mathbf{w}^{T} \mathbf{x}_ {n}+b\right)\) (since the signal and the label must agree, their product is always positive), so the \(\min\) operator can be replaced by requiring that the condition \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1\) holds for all points (provable by contradiction):
$$\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w} \:\:\: : \:\: y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) \geq 1 \:\: \forall \: n = 1, \ldots, N$$
Now, when we solve the “friendly” problem above, we get the separating plane with the best possible (i.e. biggest) margin.
To solve the above problem, we need something that can handle inequality constraints; thus, we use the KKT conditions for solving a Lagrangian under inequality constraints.
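To make the reformulated problem concrete, consider a tiny made-up instance with two 1-D points: \(x_1 = +1\) with \(y_1 = +1\) and \(x_2 = -1\) with \(y_2 = -1\). The constraints read \(w + b \geq 1\) and \(w - b \geq 1\); adding them gives \(w \geq 1\), so the minimizer is \(w^* = 1\), \(b^* = 0\), and the margin is exactly the empty slab \((-1, 1)\) between the two points:

$$\min_{w, b} \dfrac{1}{2} w^2 \:\:\: : \:\: w + b \geq 1, \:\: w - b \geq 1 \:\:\implies\:\: w^* = 1, \:\: b^* = 0, \:\: \text{margin} = \dfrac{2}{|w^*|} = 2$$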
The Lagrange Formulation:
- Formulate the Lagrangian:
- Take each inequality constraint and put it in zero-form, i.e. \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1 \geq 0\)
- Multiply each inequality by a Lagrange Multiplier \(\alpha_n\)
- Add them to the objective function \(\min_w \dfrac{1}{2} \mathbf{w}^T\mathbf{w}\)
The sign will be \(-\) (negative) simply because the inequality is \(\geq 0\)
$$\min_{w, b} \max_{\alpha_n} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \mathbf{w}^T\mathbf{w} -\sum_{n=1}^{N} \alpha_{n}\left(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right)-1\right) \:\:\: : \:\: \alpha_n \geq 0$$
- Optimize the objective independently, for each of the unconstrained variables:
- Gradient w.r.t. \(\mathbf{w}\):
$$\nabla_{\mathrm{w}} \mathcal{L}=\mathrm{w}-\sum_{n=1}^{N} \alpha_{n} y_{n} \mathrm{x}_ {n}=0 \\ \implies \\ \mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}$$
- Derivative w.r.t. \(b\):
$$\frac{\partial \mathcal{L}}{\partial b}=-\sum_{n=1}^{N} \alpha_{n} y_{n}=0 \\ \implies \\ \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$
- Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
- Substitute the above conditions into the original Lagrangian (such that the optimization w.r.t. \(\alpha_n\) becomes free of \(\mathbf{w}\) and \(b\)):
$$\mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}$$
- Notice that the first condition, \(\mathbf{w}=\sum_{n=1}^{N} \alpha_{n} y_{n} \mathbf{x}_ {n}\), does not constrain \(\alpha_n\), so it is a vacuous constraint in the dual. The second condition, \(\sum_{n=1}^{N} \alpha_{n} y_{n}=0\), however, does constrain \(\alpha_n\) and must be kept.
- Set the optimization objective and the constraints, a quadratic function in \(\alpha_n\):
$$\max_{\alpha} \mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m} \\ \:\:\:\:\:\:\:\:\:\: : \:\: \alpha_n \geq 0 \:\: \forall \: n= 1, \ldots, N \:\: \wedge \:\: \sum_{n=1}^{N} \alpha_{n} y_{n}=0$$
- Set the problem as a Quadratic Programming problem:
- Change the maximization to minimization by flipping the signs:
$$\min _{\alpha} \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_{n} y_{m} \alpha_{n} \alpha_{m} \mathbf{x}_{n}^{\mathrm{T}} \mathbf{x}_{m}-\sum_{n=1}^{N} \alpha_{n}$$
- Isolate the coefficients from the \(\alpha_n\)s and write them in matrix form:
$$\min _{\alpha} \frac{1}{2} \alpha^{\top} \underbrace{\begin{bmatrix} y_{1} y_{1} \mathbf{x}_{1}^{\top} \mathbf{x}_{1} & y_{1} y_{2} \mathbf{x}_{1}^{\top} \mathbf{x}_{2} & \ldots & y_{1} y_{N} \mathbf{x}_{1}^{\top} \mathbf{x}_{N} \\ y_{2} y_{1} \mathbf{x}_{2}^{\top} \mathbf{x}_{1} & y_{2} y_{2} \mathbf{x}_{2}^{\top} \mathbf{x}_{2} & \ldots & y_{2} y_{N} \mathbf{x}_{2}^{\top} \mathbf{x}_{N} \\ \ldots & \ldots & \ldots & \ldots \\ y_{N} y_{1} \mathbf{x}_{N}^{\top} \mathbf{x}_{1} & y_{N} y_{2} \mathbf{x}_{N}^{\top} \mathbf{x}_{2} & \ldots & y_{N} y_{N} \mathbf{x}_{N}^{\top} \mathbf{x}_{N} \end{bmatrix}}_{\text{quadratic coefficients}} \alpha+\underbrace{\left(-1^{\top}\right)}_ {\text { linear }} \alpha \\ \:\:\:\:\:\:\:\:\:\: : \:\: \underbrace{\mathbf{y}^{\top} \boldsymbol{\alpha}=0}_{\text { linear constraint }} \:\: \wedge \:\: \underbrace{0}_{\text { lower bounds }} \leq \alpha \leq \underbrace{\infty}_{\text { upper bounds }} $$
The quadratic programming package asks you for the quadratic term (a matrix) and the linear term, along with the linear constraint and the range of the \(\alpha_n\)s, and then gives you back the vector \(\boldsymbol{\alpha}\).
Equivalently:
$$\min _{\alpha} \frac{1}{2} \boldsymbol{\alpha}^{\mathrm{T}} \mathrm{Q} \boldsymbol{\alpha}-\mathbf{1}^{\mathrm{T}} \boldsymbol{\alpha} \quad \text { subject to } \quad \mathbf{y}^{\mathrm{T}} \boldsymbol{\alpha}=0 ; \quad \boldsymbol{\alpha} \geq \mathbf{0}$$
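As a sketch of how this hand-off to a QP solver might look, the snippet below assumes the cvxopt package and uses made-up, linearly separable 2-D data (the data, tolerance, and variable names are illustrative). It builds the quadratic and linear terms exactly as above, then recovers \(\mathbf{w}\) from the stationarity condition \(\mathbf{w}=\sum_{n} \alpha_{n} y_{n} \mathbf{x}_ {n}\) and \(b\) from any point with \(\alpha_n > 0\), which must satisfy \(y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_ {n}+b\right) = 1\).

```python
import numpy as np
from cvxopt import matrix, solvers   # assumption: the cvxopt QP package is installed

# Made-up, linearly separable 2-D data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Quadratic coefficients Q[n, m] = y_n y_m x_n^T x_m, and linear term -1.
Q = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(N))

# Lower bounds alpha_n >= 0, expressed as G @ alpha <= h for the solver.
G = matrix(-np.eye(N))
h = matrix(np.zeros(N))

# Linear constraint y^T alpha = 0.
A = matrix(y.reshape(1, -1))
b_eq = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(Q, q, G, h, A, b_eq)['x'])

# Stationarity: w = sum_n alpha_n y_n x_n.  Recover b from any point with
# alpha_n > 0, since such a point sits on the margin: y_n (w^T x_n + b) = 1.
w = (alpha * y) @ X
on_margin = alpha > 1e-6
b = np.mean(y[on_margin] - X[on_margin] @ w)
print("alpha:", alpha.round(4), " w:", w.round(4), " b:", round(float(b), 4))
```

Only the points whose \(\alpha_n\) comes back (numerically) nonzero contribute to \(\mathbf{w}\); the solver drives every other \(\alpha_n\) to zero.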
- Proof.
- Geometric Analysis:
First, we notice that for any given plane \(w^Tx = 0\), the equations \(\gamma \, w^Tx = 0\), where \(\gamma \in \mathbf{R}\) is a nonzero scalar, all characterize the same plane, not many different planes.
This is because \(w^Tx = 0 \iff \gamma \, w^Tx = \gamma \cdot 0 \iff \gamma \, w^Tx = 0\).
The above implies that any model that takes input \(w\) and produces a margin will have to be scale invariant.
To get around this and simplify the analysis, I am going to consider all the representations of the same plane and pick the one where we normalize (re-scale) the weight \(w\) such that, for the point closest to the plane, \(|w^Tx_n|\) is equal to one:$$|w^Tx_n| > 0 \rightarrow |w^Tx_n| = 1$$
where \(x_n\) is the point closest to the plane.
We constrain the hyperplane by normalizing \(w\) so that \(|w^Tx_n| = 1\), or with an added bias, \(|w^Tx_n + b| = 1\).
This implies that there exists a “slab” of width \(\dfrac{2}{\|w\|}\).
- The Margin, mathematically:
Now, we can mathematically characterize the margin.
By substituting the constraints \(\: y_i(w^TX_i+ b) \geq 1, \forall i \in [1,n]\) into the signed distance:$$\min_i \dfrac{1}{\|w\|} |w^TX_i + b| \geq \dfrac{1}{\|w\|}$$
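The rescaling argument can be checked numerically; here is a minimal numpy sketch with a made-up plane and made-up points (illustrative values only): any \(\gamma w, \gamma b\) gives the same classifications, and dividing \(w, b\) by \(|w^Tx_n + b|\) for the closest point \(x_n\) produces the normalized representation in which that closest distance is exactly \(\dfrac{1}{\|w\|}\).

```python
import numpy as np

# Made-up hyperplane and points (illustration only).
w, b = np.array([2.0, 0.0]), -2.0          # plane: 2*x1 - 2 = 0, i.e. x1 = 1
X = np.array([[3.0, 1.0], [4.0, -2.0], [-1.0, 0.0], [-0.5, 5.0]])

# Scale invariance: gamma*w, gamma*b describe the same plane (same signs, same zero set).
gamma = 7.3
assert np.allclose(np.sign(X @ w + b), np.sign(X @ (gamma * w) + gamma * b))

# Normalize so that the closest point satisfies |w^T x_n + b| = 1.
scale = np.min(np.abs(X @ w + b))
w_n, b_n = w / scale, b / scale

# After rescaling, the geometric distance of the closest point is 1/||w_n||,
# and the empty slab around the plane has width 2/||w_n||.
closest_dist = np.min(np.abs(X @ w_n + b_n)) / np.linalg.norm(w_n)
print(closest_dist, 1 / np.linalg.norm(w_n), 2 / np.linalg.norm(w_n))
```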
- The distance of the point closest to the hyperplane:
- We find the distance of the point closest to the hyperplane.
- Let \(X_n\) be the point that is closest to the plane, and let \(\hat{w} = \dfrac{w}{\|w\|}\).
Take any point \(X\) on the plane, and let \(\vec{v}\) be the vector \(\vec{v} = X_n - X\).
- Now, the distance \(d\) is equal to
- \[\begin{align} d & = |\hat{w} \cdot \vec{v}| \\ & = |\hat{w} \cdot (X_n - X)| \\ & = |\hat{w} \cdot X_n - \hat{w} \cdot X| \\ & = \dfrac{1}{\|w\|}|w \cdot X_n + b - w \cdot X - b|, & \text{we add and subtract the bias } b\\ & = \dfrac{1}{\|w\|}|(w \cdot X_n + b) - (w \cdot X + b)| \\ & = \dfrac{1}{\|w\|}|(w \cdot X_n + b) - (0)|, & \text{from the eq. of the plane evaluated at a point on the plane} \\ & = \dfrac{1}{\|w\|}|(1) - (0)|, & \text{from the constraint on the closest point} \\ & = \dfrac{1}{\|w\|} \end{align}\]
- Slab Existence:
- The analysis above proves that there exists a slab of width \(\dfrac{2}{\|w\|}\) containing no sample points, with the hyperplane running through (bisecting) its center.
- Maximizing the Margin:
- To maximize the margin we need to maximize the width of the slab, i.e. maximize \(\dfrac{2}{\|w\|}\),
or equivalently,
- \[\max_w \dfrac{2}{\|w\|} \equiv \min_w \dfrac{\|w\|}{2} \equiv \min_w \dfrac{1}{2}\|w\| \equiv \min_w \dfrac{1}{2}\|w\|^2\]
- subject to the constraint mentioned earlier, \(\min_i |w \cdot X_i + b| = 1\), or equivalently
- \[y_i(wX_i + b) \geq 1, \forall i \in [1,n]\]
- since the quantity \(y_i(wX_i + b)\) is positive for correctly classified points and thus enforces the absolute-value condition, as in our analysis of regular linear classifiers.
- The Optimization Problem for Maximum Margin Classifiers:
- \[\min_w \dfrac{1}{2}w^Tw \:\:\: : \:\: y_i(wX_i + b) \geq 1, \forall i \in [1,n]\]
The above problem is a Quadratic Program, in \(d + 1\) dimensions with \(n\) constraints, in standard form.
Notice that we use the quadratic \(w^Tw\) as the objective instead of the norm \(\|w\|\) because the quadratic is smooth at \(w = 0\), whereas the norm is not, which would hinder the optimization.
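As a sketch, this primal program can be handed directly to a general-purpose constrained optimizer. The snippet below assumes scipy is available and uses its SLSQP method on made-up 2-D data (a dedicated QP solver would be the more standard choice; all data and names here are illustrative).

```python
import numpy as np
from scipy.optimize import minimize   # assumption: scipy is available

# Made-up, linearly separable 2-D data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

# Pack the variables as theta = (w_1, ..., w_d, b).
def objective(theta):
    w = theta[:d]
    return 0.5 * w @ w                       # (1/2) w^T w

def margin_constraints(theta):
    w, b = theta[:d], theta[d]
    return y * (X @ w + b) - 1.0             # every entry must be >= 0

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])

w, b = res.x[:d], res.x[d]
print("w:", w.round(4), " b:", round(float(b), 4),
      " margin:", round(2 / np.linalg.norm(w), 4))
```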
- Notes:
- The weight vector \(\mathbf{w}\) is orthogonal to the separating-plane/decision-boundary, defined by \(\mathbf{w}^T\mathbf{x} + b = 0\), in the \(\mathcal{X}\) space; Reason:
If you take any two points \(\mathbf{x}^\prime\) and \(\mathbf{x}^{\prime \prime}\) on the plane and form the vector \(\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)\), which is parallel to the plane, then the following must hold:$$\mathbf{w}^{\top} \mathbf{x}^{\prime}+b=0 \wedge \mathbf{w}^{\top} \mathbf{x}^{\prime \prime}+b=0 \implies \mathbf{w}^{\top}\left(\mathbf{x}^{\prime}-\mathbf{x}^{\prime \prime}\right)=0$$
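A short numeric check of this orthogonality argument, with made-up values for \(\mathbf{w}\), \(b\), and two points chosen to lie on the plane:

```python
import numpy as np

w, b = np.array([1.0, 2.0, -1.0]), 3.0
# Two points on the plane w^T x + b = 0 (each satisfies the plane equation).
x1 = np.array([1.0, 0.0, 4.0])     # 1 + 0 - 4 + 3 = 0
x2 = np.array([0.0, 1.0, 5.0])     # 0 + 2 - 5 + 3 = 0
print(w @ (x1 - x2))               # ~0: w is orthogonal to any vector lying in the plane
```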
- In a problem of minimizing a function:
- Unconstrained problem:
You set the gradient of the function to zero and solve.
- Constrained (as in regularization):
The gradient becomes related to the constraint; the gradient \(\nabla E_{\mathrm{in}}\) is normal to the constraint.
- Conceptual Dichotomy between Regularization and SVM:
[Figure: conceptual dichotomy between Regularization and SVM]