Notes:
- There’s a “sweet spot” for an RBF SVM (a minimal pipeline sketch follows this list):
- 500–10k samples
- 300–10k features
- all features are expected to carry similar information content and to be on a similar scale
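A minimal scikit-learn sketch of that setup (everything here, including the synthetic dataset sizes, is illustrative): features are standardized first so that they end up on a similar scale, as the last point above assumes.

```python
# Minimal sketch (assumed setup): RBF SVM preceded by feature standardization,
# so that all features end up on a comparable scale.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a dataset in the "sweet spot" above.
X, y = make_classification(n_samples=2000, n_features=300, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```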
Introduction
-
- Why another SVM? (i.e. The Problem)
- The Hard-Margin SVM faces a few issues:
- The Hard-Margin SVM fails if the data is not linearly separable.
- The Hard-Margin SVM is quite sensitive to outliers.
- The Soft-Margin SVM aims to address both of these problems.
-
- The solution:
- Allow some points to violate the margin, by introducing slack variables.
The Soft-Margin SVM
-
Procedure:
(1) Use a linear classifier
(2) But, Maximize the Margin
(3) Do so by Minimizing \(\|w\|\)
(4) But allow some points to penetrate the margin - Modified Constraints:
$$y_i(w^T x_i + b) \geq 1 - \zeta_i, \:\:\: \forall i \in [1,n]$$
where the \(\zeta_i\)s are slack variables.
We also enforce a non-negativity constraint on the slack variables:$$\zeta_i \geq 0, \:\:\: \forall i \in [1, n]$$
Together with the penalty on \(\sum_i \zeta_i\) in the objective (below), the non-negativity constraint forces the slack variables to be zero for every point that already satisfies the original constraint,
i.e. every point that is not inside the slab.
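As a quick numeric illustration (all values below are made up): given a candidate \((w, b)\), the smallest slack each point needs is \(\zeta_i = \max(0,\ 1 - y_i(w^T x_i + b))\), which is zero for points outside the slab and positive for points that violate the margin.

```python
import numpy as np

# Hypothetical candidate hyperplane and a few labelled points (toy values).
w, b = np.array([1.0, -1.0]), 0.0
X = np.array([[ 2.0, -1.0],   # comfortably on the correct side of the margin
              [ 0.4,  0.0],   # inside the slab
              [-1.0,  1.0]])  # on the wrong side of the hyperplane
y = np.array([1, 1, 1])

margins = y * (X @ w + b)              # y_i (w^T x_i + b)
slack = np.maximum(0.0, 1 - margins)   # smallest zeta_i satisfying both constraints
print(slack)                           # -> [0.  0.6 3. ]
```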
-
- Modified Objective (cost) Function:
- \[R(w) = \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i\]
- The Optimization Problem:
Find weights ‘\(w\)’, scalar ‘\(b\)’, and \(\zeta_i\)s that minimize$$ \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i$$
Subject to
$$y_i(w^T x_i + b) \geq 1 - \zeta_i, \:\:\: \zeta_i \geq 0, \:\:\: \forall i \in [1,n]$$
Formally,
$$\min_{w,\, b,\, \zeta} \:\: \dfrac{1}{2}w^Tw + C \sum_{i=1}^n \zeta_i \:\:\: : \:\:\: y_i(w^T x_i + b) \geq 1 - \zeta_i, \:\: \zeta_i \geq 0, \:\: \forall i \in [1,n]$$
-
- Optimization Methods:
- The SVM optimization problem reduces to a Quadratic Program in \(d + n + 1\) dimensions with \(2n\) constraints.
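A sketch of this primal QP written out directly with a generic convex solver (cvxpy here, purely as an illustration; the data is synthetic): the unknowns are \(w \in \mathbb{R}^d\), \(b\), and the \(n\) slacks, i.e. \(d + n + 1\) variables with \(2n\) constraints.

```python
import cvxpy as cp
import numpy as np

# Toy data: X is (n, d), y in {-1, +1}.
rng = np.random.default_rng(0)
n, d, C = 40, 2, 1.0
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))

w = cp.Variable(d)        # d variables
b = cp.Variable()         # 1 variable
zeta = cp.Variable(n)     # n slack variables

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(zeta))
constraints = [cp.multiply(y, X @ w + b) >= 1 - zeta,  # n margin constraints
               zeta >= 0]                              # n non-negativity constraints
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```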
-
- Effects of the Regularization Hyperparameter (\(C\)):
-
| | Small C | Large C |
|---|---|---|
| Desire | maximizing the margin \(= \dfrac{1}{\|w\|}\) | keeping most slack variables zero or small |
| Danger | underfitting (high misclassification) | overfitting (great training, awful test) |
| Outliers | less sensitive | very sensitive |
| Boundary | more “flat” | more sinuous |
The last row only applies to nonlinear decision boundaries.
-
- We choose ‘\(C\)’ with cross-validation.
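A minimal sketch of that cross-validation with scikit-learn (the grid of \(C\) values and the dataset are arbitrary stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 5-fold cross-validation over a (hypothetical) logarithmic grid of C values.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```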
An Equivalent Formulation
-
- Introduction:
- In the current SVM model, we are optimizing the objective \(R(w) = \dfrac{1}{2} w^Tw + C \sum_{i=1}^n \zeta_i\), which looks like an \(l_2\) regularization on the weights and an \(l_1\) regularization on the slack variables.
- However, in function estimation we usually prefer the standard form of the objective, the “loss + penalty” form, which makes the trade-off explicit.
-
- Modified Loss Function:
- We introduce a loss function to moderate the use of the slack variables (i.e. to avoid abusing the slack variables).
- But first we motivate it by comparing it to the traditional \(0-1\) loss function.
Notice that the \(0-1\) loss is non-convex: it jumps discontinuously at \(0\).
The hinge loss, on the other hand, is convex.
- The hinge loss:
- \[\max\left(0,\ 1-y_{i}(w^T x_{i} + b)\right)\]
- This function is zero if the constraint, \(y_{i}(w^T x_{i} + b)\geq 1\), is satisfied, in other words, if \(x_{i}\) lies on the correct side of the margin.
For data on the wrong side of the margin, its value is proportional to the distance from the margin.
- All of the above suggests that this loss function is well suited to binary classification: it does not penalize correctly classified points beyond the margin at all, which is exactly what we want for classification (as opposed to regression).
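A tiny numeric comparison of the two losses as a function of the signed margin \(y_i(w^T x_i + b)\) (the margin values are arbitrary examples):

```python
import numpy as np

margins = np.array([-1.0, -0.2, 0.5, 1.0, 2.0])  # y_i (w^T x_i + b)

zero_one = (margins < 0).astype(float)           # 1 for a misclassified point, else 0
hinge = np.maximum(0.0, 1.0 - margins)           # 0 once the margin constraint holds

print(zero_one)  # -> [1.  1.  0.  0.  0. ]
print(hinge)     # -> [2.  1.2 0.5 0.  0. ]
```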
-
- Modified Objective Function:
- \[R(w) = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n \max \left(0,\ 1-y_{i}(w^T x_{i} + b)\right)\]
-
- Proof of Equivalence:
- Here we show that the two objectives to optimize for the SVM are actually equivalent.
- \[\begin{align} y_if\left(x_i\right) & \ \geq 1-\zeta_i, & \text{from the 1st constraint} \\ \implies \zeta_i & \ \geq 1-y_if\left(x_i\right) \\ \zeta_i & \ \geq 0, & \text{from the 2nd (non-negativity) constraint on } \zeta_i \\ \iff \zeta_i & \ \geq \max \{0,\ 1-y_if\left(x_i\right)\} \\ \zeta_i & \ = \max \{0,\ 1-y_if\left(x_i\right)\}, & \text{minimization drives } \zeta_i \text{ to its lower bound}\\ \implies R(w) & \ = \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n \max \left(0,\ 1-y_{i}(w^T x_{i} + b)\right), & \text{plugging in and scaling by } \lambda = \dfrac{1}{C} \end{align}\]
-
- The Optimization Problem:
- Find weights ‘\(w\)’ and scalar ‘\(b\)’ that minimize
- \[\dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n \max \left(0,\ 1-y_{i}(w^T x_{i} + b)\right)\]
- This formulation is unconstrained: the slack variables and their constraints have been absorbed into the hinge-loss terms.
- Formally,
- \[\min_{w,\, b}\ \dfrac{\lambda}{2} w^Tw + \sum_{i=1}^n \max \left(0,\ 1-y_{i}(w^T x_{i} + b)\right)\]
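Because this form is unconstrained, it can be minimized directly with (sub)gradient descent. The sketch below is a plain batch subgradient method on this objective (the step size, iteration count, and toy data are arbitrary choices, not part of the notes):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.1, lr=0.01, n_iter=500):
    """Batch subgradient descent on (lam/2)||w||^2 + sum_i max(0, 1 - y_i(w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                        # points with a non-zero hinge term
        grad_w = lam * w - X[active].T @ y[active]  # subgradient w.r.t. w
        grad_b = -np.sum(y[active])                 # subgradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on roughly linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
w, b = svm_subgradient_descent(X, y)
print(w, b)
```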
Further Analysis
-
- Generalization:
- We notice that, geometrically, the hyperplane (the maximum margin classifier) is completely characterized by the support vectors (the vectors that lie on the margin).
- A very important conclusion arises.
The maximum margin classifier (SVM) depends only on its support vectors, not on the dimension of the problem.
This implies that the computation does not scale with the dimension, and it also explains why the kernel trick works so well.
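A quick scikit-learn check of this point (dataset and parameters are arbitrary): after fitting, the decision function can be rebuilt from the support vectors and their dual coefficients alone.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(len(clf.support_vectors_), "support vectors out of", len(X), "points")

# The decision function depends only on the support vectors:
#   f(x) = sum_i dual_coef_i * <sv_i, x> + intercept
rebuilt = clf.dual_coef_ @ clf.support_vectors_ @ X.T + clf.intercept_
print(np.allclose(rebuilt.ravel(), clf.decision_function(X)))  # True
```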
-
- Properties:
-
- In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces (a quick illustration follows this list).
- The hyperplane is determined solely by its support vectors.
- The SVM always converges on linearly separable data.
- The Soft-Margin SVM converges even when the data is not linearly separable.
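A quick illustration of the kernel-trick property on data that is not linearly separable (concentric circles; everything below is a toy example):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no single line can separate the two classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

# Typically the linear SVM stays near chance level while the kernelized (RBF)
# SVM separates the two rings almost perfectly.
print("linear:", linear.score(X, y))
print("rbf:   ", rbf.score(X, y))
```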
Extra info to be added:
Requiring the margin to be bigger puts a restriction on the growth function (i.e. the number of possible dichotomies), and therefore lowers the effective VC dimension.
Why \(w\) is Orthogonal to the plane:
For any two points \(x_1, x_2\) on the hyperplane, \(w^T x_1 + b = w^T x_2 + b = 0\), so \(w^T(x_1 - x_2) = 0\), i.e. \(w\) is orthogonal to every vector lying within the plane.
The number of non-zero parameters in a model roughly corresponds to its VC dimension.
Observations:
- Normalizing \(w\):
The hyperplane \(\mathbf{w}^{\top} \mathbf{x}=0\) (which is, incidentally, the signal) is invariant to rescaling \(w\): multiplying \(w\) by any nonzero scalar leaves the equation of the plane unchanged. Thus, we are free to normalize \(w\) by dividing it by a scalar.
So, we choose \(w\) (by normalizing/scaling it) such that the following equation holds for the point \(\mathbf{x}_ n\) nearest to the plane:$$\left|\mathbf{w}^{\top} \mathbf{x}_ {n}\right|=1$$
Effect of the gamma hyperparameter in the RBF kernel: gamma controls how far the influence of a single training example reaches (low values mean “far”, high values mean “close”). With a very low gamma the model is too constrained and cannot capture the shape of the dataset; with a moderately higher gamma it captures the shape well; with a very high gamma each point’s influence becomes so local that the model effectively memorizes the training set and overfits.
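A hedged sketch of that effect, sweeping gamma on held-out data (the dataset, split, and gamma grid are arbitrary; exact scores will vary):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1.0, 100.0, 10000.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    # Very low gamma -> too constrained (underfits); very high gamma -> each
    # point's influence is so local that the model memorizes the training set.
    print(gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```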
Bias = underfitting = low model complexity; Variance = overfitting = high model complexity
Regularization reduces overfitting = reduces variance = simplifies the model = increases bias
Increasing the C hyperparameter of an SVM weakens the regularization = reduces underfitting (at the risk of overfitting)