Introduction - Support Vector Machines

  1. Support Vector Machines:
    Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
    The SVM is a Maximum Margin Classifier that aims to find the “maximum-margin hyperplane” that divides the group of points \(\vec{x}_i\) for which \(y_i = 1\) from the group of points for which \(y_i = -1\).

  2. Support Vectors:
    Support Vectors are the data points that lie exactly on the margin (i.e. on the boundary of the slab).
    They satisfy \(|w^T X' + b| = 1\) for every support vector \(X'\) (a small numerical sketch follows this list).

  3. The Hard-Margin SVM:
    The Hard-Margin SVM is just a maximum-margin classifier with features and kernels (discussed later).
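
To make the margin picture concrete, here is a minimal numerical sketch; the hyperplane \((w, b)\) and the toy points are my own illustrative choices, not from the notes. Points are classified by the sign of \(w \cdot x + b\), the slab is bounded by \(w \cdot x + b = \pm 1\), and its width is \(2/\|w\|\).

```python
# Minimal numerical sketch of the margin picture; (w, b) and the toy
# points below are illustrative assumptions, not from the notes.
import numpy as np

w = np.array([1.0, 1.0])   # assumed weight vector
b = -3.0                   # assumed bias

X = np.array([[2.0, 2.0],   # lies on the positive margin: w.x + b = +1
              [3.0, 3.0],   # safely inside the positive class
              [1.0, 1.0],   # lies on the negative margin: w.x + b = -1
              [0.0, 2.0]])  # also on the negative margin

scores = X @ w + b
print("predicted labels:", np.where(scores > 0, 1, -1))
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))

# Support vectors are exactly the points with |w.x + b| = 1
print("support vectors:\n", X[np.isclose(np.abs(scores), 1.0)])
```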


The Hard-Margin SVM

  1. Goal:
    Find weights ‘\(w\)’ and scalar ‘\(b\)’ that correctly classify the data points and, moreover, do so in the “best” possible way,

    where “best” means the classifier that admits the maximum margin.

  2. Procedure:
    (1) Use a linear classifier. (2) Maximize the margin. (3) Do so by minimizing \(\|w\|\).

  3. Decision Function:
    \[f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases}\]
  4. Constraints:
    \[y_i(w^T X_i + b) \geq 1, \ \forall i \in [1,n]\]
  5. The Optimization Problem:
    Find weights ‘\(w\)’ and scalar ‘\(b\)’ that minimize

    $$ \dfrac{1}{2} w^Tw$$

    Subject to

    $$y_i(w^T X_i + b) \geq 1, \ \forall i \in [1,n]$$

    Formally,

    $$\min_{w,b} \ \dfrac{1}{2} w^T w \quad \text{s.t.} \quad y_i(w^T X_i + b) \geq 1, \ \forall i \in [1,n]$$

  6. Optimization Methods:
    The SVM optimization problem reduces to a Quadratic Program (a quadratic objective subject to linear inequality constraints), so it can be handled by standard QP solvers; a rough sketch follows below.
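
As a rough, purely illustrative sketch of solving this problem, the snippet below uses SciPy's general-purpose SLSQP solver on a made-up linearly separable toy dataset; this is my own choice of solver and data, not the notes' method, and a dedicated QP solver would normally be used instead.

```python
# Hard-margin SVM as a constrained optimization problem, solved with a
# generic solver. The toy data is a made-up, linearly separable example.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])   # labels in {+1, -1}
n, d = X.shape

def objective(z):
    # z = [w, b]; minimize (1/2) w^T w  (b is not penalized)
    w = z[:d]
    return 0.5 * w @ w

def margin_constraints(z):
    # SLSQP 'ineq' constraints must satisfy fun(z) >= 0,
    # i.e. y_i (w^T X_i + b) - 1 >= 0 for every i
    w, b = z[:d], z[d]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)

# Support vectors are the points whose constraint is active,
# i.e. y_i (w^T X_i + b) is (numerically) equal to 1
margins = y * (X @ w + b)
print("support vectors:\n", X[np.isclose(margins, 1.0, atol=1e-4)])
```

The sketch only shows that the problem is a standard convex constrained optimization; solvers that exploit the quadratic structure directly are faster in practice.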

Further Analysis

  1. Generalization:
    We notice that, geometrically, the hyperplane (the maximum-margin classifier) is completely characterized by the support vectors (the vectors that lie on the margin).
    A very important conclusion arises:
    the maximum-margin classifier (SVM) depends only on the support vectors, not on the dimension of the problem.
    This implies that the computation doesn’t scale up with the dimension, and it also implies that the kernel trick works very well.
  2. Properties:
    1. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces (see the sketch after this list).
    2. The hyperplane is determined solely by its support vectors.
    3. The SVM always converges on linearly separable data.
    4. The Hard-Margin SVM fails if the data is not linearly separable.
    5. The Hard-Margin SVM is quite sensitive to outliers.
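
A hedged illustration of properties 1 and 4, using scikit-learn (my choice of library, dataset, and parameters, not part of the notes): on data that is not linearly separable, a (near) hard-margin linear SVM performs poorly, while an RBF-kernel SVM separates the classes by implicitly mapping the inputs into a higher-dimensional feature space.

```python
# Kernel-trick illustration with scikit-learn; the dataset and parameters
# are illustrative assumptions. A very large C approximates a hard margin.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1e6).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1e6).fit(X, y)

print("linear training accuracy:", linear_svm.score(X, y))
print("rbf    training accuracy:", rbf_svm.score(X, y))
print("rbf support vectors per class:", rbf_svm.n_support_)
```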