Introduction
-
- The Perceptron:
- The Perceptron is an algorithm for supervised learning of binary classifiers.
-
- Type:
- It is a type of linear classifier.
-
- The Problem Setup:
- Consider \(n\) sample points \(X_1, X_2, ..., X_n\).
- For each sample point, let \({\displaystyle y_i = {\begin{cases} \:\: 1&{\text{if }}\ X_i \in C \\-1&{\text{if}}\ X_i \notin C\end{cases}}}\)
-
where \(C\) is a given class of interest.
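A one-line NumPy sketch of this labeling (the helper name and the boolean membership array are illustrative, not part of the notes):

```python
import numpy as np

def make_labels(in_class_C):
    """y_i = +1 if X_i belongs to class C, -1 otherwise."""
    # in_class_C: boolean array of length n saying whether each X_i is in C.
    return np.where(in_class_C, 1, -1)
```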
The Perceptron Method
-
- Goal:
- Find weights \(w\) such that: \({\displaystyle {\begin{cases}X_i \cdot w \geq 0&{\text{if }}\ y_i = 1\\X_i \cdot w \leq 0&{\text{if }}\ y_i = -1\end{cases}}}\)
-
where \(X_i \cdot w\) is proportional to the signed distance of \(X_i\) from the decision boundary (it equals the signed distance when \(\|w\| = 1\)).
- Equivalently:
- \[y_iX_i \cdot w \geq 0\]
-
where each inequality \(y_iX_i \cdot w \geq 0\) (one per sample point) is a constraint on the problem.
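A minimal NumPy sketch of this check (the names `X`, `y`, and `w` are illustrative; they are not defined in these notes):

```python
import numpy as np

def classifies_all_correctly(X, y, w):
    """True iff y_i * (X_i . w) >= 0 holds for every sample point."""
    # X: (n, d) array of sample points, y: (n,) array of +/-1 labels, w: (d,) weights.
    return bool(np.all(y * (X @ w) >= 0))
```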
-
- Procedure:
- Repeatedly step in the direction of steepest descent (the negative gradient of the cost function below), updating the weights using “Gradient Descent”, until a minimum is reached.
-
- Decision Function:
- \[{\displaystyle f(x)={\begin{cases}1&{\text{if }}\ w\cdot x+\alpha>0\\0&{\text{otherwise}}\end{cases}}}\]
- where \(\alpha\) is the bias term added by the fictitious dimension trick (described below).
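As an illustrative sketch (the function name and the use of NumPy are assumptions, not part of the notes), the decision function can be written as:

```python
import numpy as np

def decision(x, w, alpha):
    """Return 1 if w . x + alpha > 0, otherwise 0."""
    return 1 if np.dot(w, x) + alpha > 0 else 0
```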
-
- Loss Function:
- \[{\displaystyle L(z, y_i) = {\begin{cases}0&{\text{if }}\ y_i z \geq 0\\-y_i z&{\text{otherwise}}\end{cases}}}\]
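A direct translation of the loss into Python (the name `loss` is illustrative):

```python
def loss(z, y_i):
    """Perceptron loss for a single score z = X_i . w and label y_i in {-1, +1}."""
    return 0.0 if y_i * z >= 0 else -y_i * z
```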
-
- Objective (cost) Function:
- \[R(w) = \sum_{i=1}^n L(X_i \cdot w, y_i) = \sum_{i \in V} -y_iX_i \cdot w\]
-
where \(V\) is the set of indices \(i\) for which \(y_iX_i \cdot w < 0\).
- The risk function is convex but non-smooth (it is piecewise linear).
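A sketch of the risk computation in NumPy (vectorized over all sample points; the function name is illustrative):

```python
import numpy as np

def risk(X, y, w):
    """R(w): the sum of -y_i * (X_i . w) over the violating points."""
    z = X @ w              # z_i = X_i . w for every sample point, shape (n,)
    V = y * z < 0          # boolean mask of indices violating y_i X_i . w >= 0
    return float(np.sum(-y[V] * z[V]))
```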
-
- Constraints:
- \[y_iX_i \cdot w \geq 0\]
-
- The Optimization Problem:
- Find weights \(w\) that minimize \(R(w)\).
-
- Optimization Methods:
- The Perceptron algorithm uses a numerical optimization method.
- Gradient Descent is the most commonly used method.
- Newton's method can also be used to optimize the objective.
-
- The Gradient Descent Step:
- \[\begin{align} \nabla_w R(w) & = \nabla_w \sum_{i=1}^n L(X_i \cdot w, y_i) \\ & = \nabla_w \sum_{i \in V} -y_iX_i \cdot w \\ & = \sum_{i \in V} -y_iX_i \end{align}\]
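The same sum, as a NumPy sketch (assuming the `X`, `y`, `w` conventions used above):

```python
import numpy as np

def risk_gradient(X, y, w):
    """Gradient of R(w): the sum of -y_i * X_i over the violating points."""
    V = y * (X @ w) < 0        # boolean mask of violating indices
    return -(X[V].T @ y[V])    # shape (d,); the zero vector if V is empty
```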
-
- The Algorithm (Frank Rosenblatt, 1957):
- (1) Choose the weights \(\vec{w}\) arbitrarily.
- (2) While \(R(\vec{w}) > 0\):
\(\:\:\:\:\:\:\:\) (3) \(V \leftarrow\) set of indices \(i\) such that: \(y_iX_i \cdot w < 0\)
\(\:\:\:\:\:\:\:\) (4) \(w \leftarrow w + \epsilon \cdot \sum_{i \in V} y_iX_i \;\;\;\;\;\;\) [GD], or
\(\:\:\:\:\:\:\:\:\:\:\:\:\:\:\) \(w \leftarrow w + \epsilon \cdot y_iX_i\) for a single \(i \in V \;\;\;\;\;\;\) [SGD]
\(\:\:\:\:\:\:\:\) (5) Recurse (go back to step (2)).
where \(\epsilon > 0\) is the step size (learning rate).
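A minimal NumPy sketch of both variants (the zero initialization, the iteration caps, and the function names are illustrative choices, not part of Rosenblatt's original formulation):

```python
import numpy as np

def perceptron_gd(X, y, epsilon=0.1, max_iters=10_000):
    """Batch perceptron: w <- w + epsilon * sum over violators of y_i * X_i."""
    w = np.zeros(X.shape[1])                # step (1): weights chosen (here, zeros)
    for _ in range(max_iters):              # cap added in case the data is not separable
        V = y * (X @ w) < 0                 # step (3): violating indices
        if not np.any(V):                   # R(w) = 0: every constraint is satisfied
            break
        w = w + epsilon * (X[V].T @ y[V])   # step (4) [GD]
    return w

def perceptron_sgd(X, y, epsilon=0.1, max_updates=100_000):
    """Stochastic variant: update on one violating point at a time."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        V = np.flatnonzero(y * (X @ w) < 0)
        if V.size == 0:
            break
        i = np.random.choice(V)             # pick a single violator
        w = w + epsilon * y[i] * X[i]       # step (4) [SGD]
    return w
```

If the data is linearly separable, either loop exits with \(R(w) = 0\), i.e., a weight vector that classifies every point correctly.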
-
- Avoiding the restriction that the separating hyperplane pass through the origin:
- In the procedure we have just described, the separating hyperplane the algorithm produces is forced to pass through the origin, because the decision boundary \(\{x : w \cdot x = 0\}\) always contains the origin; there is no bias term to translate it.
- We can get around that by moving our problem to a higher dimension.
- We achieve that by adding a “fictitious” dimension as follows: we re-write,
- \[\vec{w}\cdot X_i \rightarrow \left(\begin{array}{ccccc} w_1 & w_2 & \cdots & w_d & \alpha \end{array} \right) \cdot \left(\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_d\\ 1 \end{array} \right)\]
- Now, we run the perceptron algorithm in (d + 1)-dimensional space.
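A small sketch of the trick in NumPy (the function name is illustrative):

```python
import numpy as np

def add_fictitious_dimension(X):
    """Append a constant-1 coordinate so the bias alpha becomes the last weight."""
    return np.hstack([X, np.ones((X.shape[0], 1))])
```

Running the perceptron on the augmented points returns a \((d+1)\)-vector whose first \(d\) entries are \(w\) and whose last entry is \(\alpha\).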
-
- The Boundary:
- The boundary is a hyperplane:
- \[\{x \in \mathbf{R}^d : f(x) = 0\}\]
-
where \(f(x) = w \cdot x + \alpha\).
Convergence and Complexity
-
- The Perceptron Convergence Theorem I:
- If the data is linearly separable, the perceptron algorithm will always find a linear classifier that classifies all the data points correctly.
-
- The Perceptron Convergence Theorem II:
- If the perceptron is guaranteed to converge on a data set, then it will converge in at most \(\mathcal{O}\!\left(\dfrac{R^2}{\gamma^2}\right)\) iterations.
-
where \(R = \max_i \|X_i\|\) is called the radius of the data, and \(\gamma\) is the maximum possible margin.
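As a rough illustration (not part of the notes): if we happen to know some separating direction `w_star` with positive margin, the margin it achieves lower-bounds \(\gamma\), so the quantity below is a valid, possibly loose, upper bound on the number of updates:

```python
import numpy as np

def update_bound(X, y, w_star):
    """Upper bound R^2 / gamma^2 on the number of perceptron updates,
    computed from the margin achieved by a unit-norm separator w_star."""
    w_star = w_star / np.linalg.norm(w_star)
    R = np.max(np.linalg.norm(X, axis=1))   # radius of the data
    gamma = np.min(y * (X @ w_star))        # margin of w_star (<= the best margin)
    return (R / gamma) ** 2
```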
-
- Complexity (Runtime):
- \(\mathcal{O}(nd)\) per iteration of gradient descent.
- It can be made faster with SGD, which updates using a single point per step.
- Although the step size/learning rate does not appear in that big-O expression, it does affect the running time, though the effect is hard to characterize. The algorithm gets slower if \(\epsilon\) is too small, because it has to take many steps to get down the hill. It also gets slower if \(\epsilon\) is too big, for a different reason: it jumps right over the region with zero risk and oscillates back and forth for a long time.
Further Analysis
- Properties:
- It is an Online Algorithm.
- The algorithm is quite slow.
- There is no way to reliably choose the learning rate.
- It is currently obsolete.
- It will not converge, nor approach an approximate solution, when the data is not linearly separable.