Table of Contents
Visual of PCA, SVD
Derivation - Direction of Maximum Variance
Low-Rank Approximation w/ SVD (code, my github)
PPCA - Probabilistic PCA Slides
PCA
- What?
  It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
- Goal?
  Given points \(\mathbf{x}_i \in \mathbf{R}^d\), find \(k\) directions that capture most of the variation.
- Why?
  - Find a small basis for representing variations in complex things, e.g. faces, genes.
  - Reducing the number of dimensions makes some computations cheaper.
  - Remove irrelevant dimensions to reduce over-fitting in learning algorithms.
    Like “subset selection”, but the features are not axis-aligned; they are linear combinations of the input features.
  - Represent the data with fewer parameters (dimensions).
- Finding Principal Components:
  - Let \(X\) be an \((n \times d)\) design matrix that has been centered, so that its mean is \(\hat{x} = 0\).
  - Let \(w\) be a unit vector.
  - The orthogonal projection of a point \(x\) onto \(w\) is \(\tilde{x} = (x \cdot w)w\),
    or \(\tilde{x} = \dfrac{x \cdot w}{\|w\|_2^2}w\) if \(w\) is not a unit vector.
  - Let \(X^TX\) be the sample covariance matrix (up to a factor of \(n\)), let \(0 \leq \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_d\) be its eigenvalues, and let \(v_1, v_2, \cdots, v_d\) be the corresponding orthogonal unit eigenvectors.
  - Given orthonormal directions (vectors) \(v_1, v_2, \ldots, v_k\), we can write:
    \[\tilde{x} = \sum_{i=1}^k (x \cdot v_i)v_i.\]
  - The principal components (principal directions) are precisely the eigenvectors of the data’s covariance matrix; see the sketch below.
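A minimal numpy sketch of the statements above, assuming a small synthetic, centered data matrix (names like `X` and `V_k` are illustrative, not from the notes):

```python
import numpy as np

# Illustrative only: eigenvectors of X^T X as the principal directions, and
# the orthogonal projection of each point onto the span of the top-k of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                       # centered, mean = 0

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
k = 2
V_k = eigvecs[:, -k:]                        # top-k orthonormal eigenvectors v_i
X_tilde = (X @ V_k) @ V_k.T                  # x_tilde = sum_i (x . v_i) v_i, per row
```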
- Total Variance and Error Measurement:
  - The total variance of the data can be expressed as the sum of all the eigenvalues of the covariance matrix \(\Sigma = U \Lambda U^T\):
    $$ \mathbf{Tr}\, \Sigma = \mathbf{Tr} (U \Lambda U^T) = \mathbf{Tr} (U^T U \Lambda) = \mathbf{Tr}\, \Lambda = \lambda_1 + \ldots + \lambda_d. $$
  - The total variance of the projected data (projecting with \(P\) onto the top \(k\) principal directions, with the eigenvalues here sorted in descending order) is:
    $$ \mathbf{Tr} (P \Sigma P^T ) = \lambda_1 + \lambda_2 + \cdots + \lambda_k. $$
  - The error in the projection can be measured with respect to variance.
  - We define the ratio of variance “explained” by the projected data (equivalently, the ratio of information “retained”) as:
    $$ \dfrac{\lambda_1 + \ldots + \lambda_k}{\lambda_1 + \ldots + \lambda_d}. $$
    If this ratio is high, we can say that much of the variation in the data can be observed on the projected plane; a small computation is sketched below.
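As a rough sketch (synthetic data, illustrative variable names), the explained-variance ratio can be computed directly from the eigenvalues:

```python
import numpy as np

# Ratio of variance "explained" by the top-k components, computed from the
# eigenvalues of X^T X; the data here is synthetic and assumed centered.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # sorted in descending order
k = 2
explained = eigvals[:k].sum() / eigvals.sum()
print(f"variance explained by the top {k} components: {explained:.2%}")
```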
- Mathematical Formulation:
  PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
  Consider a data matrix, \(X\), with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the \(n\) rows represents a different repetition of the experiment, and each of the \(p\) columns gives a particular kind of feature (say, the results from a particular sensor).
Mathematically, the transformation is defined by a set of \(p\)-dimensional vectors of weights or coefficients \({\displaystyle \mathbf {v}_ {(k)}=(v_{1},\dots ,v_{p})_ {(k)}}\) that map each row vector \({\displaystyle \mathbf {x}_ {(i)}}\) of \(X\) to a new vector of principal component scores \({\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_ {(i)}}\), given by:
$${\displaystyle {t_{k}}_{(i)}=\mathbf {x}_ {(i)}\cdot \mathbf {v}_ {(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l}$$
in such a way that the individual variables \({\displaystyle t_{1},\dots ,t_{l}}\) of \(t\) considered over the data set successively inherit the maximum possible variance from \(X\), with each coefficient vector \(v\) constrained to be a unit vector (where \(l\) is usually selected to be less than \({\displaystyle p}\) to reduce dimensionality).
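The score mapping above can be sketched in a few lines of numpy; this is an illustrative example (the data and the choice \(l = 2\) are assumptions, and the weight vectors are taken from the SVD of \(X\)):

```python
import numpy as np

# Scores t_(i),k = x_(i) . v_(k): project each centered row of X onto the
# first l weight vectors (columns of V from the SVD X = U S V^T).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
l = 2                        # number of components kept (l < p)
V_l = Vt[:l].T               # columns are the weight vectors v_(1), ..., v_(l)
T = X @ V_l                  # row i holds the scores t_(i) = (t_1, ..., t_l)
```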
The Procedure and what it does:
- Finds a lower-dimensional subspace (spanned by the PCs) that minimizes the RSS of the projection errors.
- Produces a vector (the 1st PC) with the highest possible variance; each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
- Results in an uncorrelated orthogonal basis set.
- PCA constructs new axes that point in the directions of maximal variance (in the original variable space).
- Intuition:
  - PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information.
  - Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data.
- PCA Algorithm:
  - Data Preprocessing:
    - Training set: \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\)
    - Preprocessing (feature scaling + mean normalization):
      - Mean normalization:
        \(\mu_{j}=\frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}\)
        Replace each \(x_{j}^{(i)}\) with \(x_j^{(i)} - \mu_j\).
      - Feature scaling:
        If different features are on different scales, scale the features to have comparable ranges.
        \(s_j = \operatorname{SD}(X_j)\) (the standard deviation of feature \(j\))
        Replace each \(x_{j}^{(i)}\) with \(\dfrac{x_j^{(i)} - \mu_j}{s_j}\).
  - Computing the Principal Components:
    - Compute the SVD of the matrix: \(X = U S V^T\).
    - Compute the principal components:
      $$T = US = XV$$
      Note: the \(j\)-th principal component is \(Xv_j\).
    - Keep the top \(k\) singular values of \(S\), giving the truncated \(S_k\).
    - Compute the truncated principal components:
      $$T_k = US_k$$
  - Computing the Low-rank Approximation Matrix \(X_k\):
    - Compute the reconstruction matrix:
      $$X_k = T_kV^T = US_kV^T$$
  Results and Definitions:
  - Columns of \(V\) are the principal directions/axes.
  - Columns of \(US\) are the principal components (“scores”).
  - Principal components (“scores”) vs. principal directions/axes: the scores are the coordinates of the data in the new basis, while the directions are the unit basis vectors themselves.
  NOTE: the analysis above is valid only when (1) \(X\) has samples in rows and variables in columns, and (2) \(X\) is centered (mean = 0). A sketch of the full procedure is given below.
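A minimal end-to-end sketch of the algorithm above, assuming synthetic data (variable names mirror the notes; the feature-scaling step is shown commented out since it is optional):

```python
import numpy as np

# Centering, SVD, truncation, and rank-k reconstruction, as listed above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

X = X - X.mean(axis=0)                  # mean normalization (centering)
# X = X / X.std(axis=0)                 # optional feature scaling

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                               # principal components ("scores"), T = US = XV

k = 3
T_k = U[:, :k] * s[:k]                  # truncated scores T_k = U S_k
X_k = T_k @ Vt[:k]                      # rank-k reconstruction X_k = U S_k V^T
```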
- Properties and Limitations:
  - Limitations:
    - PCA is highly sensitive to the (relative) scaling of the data; there is no consensus on the best scaling.
- Optimality:
  PCA is optimal for finding a lower-dimensional subspace (spanned by the PCs) that minimizes the RSS of the projection errors.
- How does PCA relate to CCA:
  CCA defines coordinate systems that optimally describe the cross-covariance between two datasets, while PCA defines a new orthogonal coordinate system that optimally describes the variance in a single dataset.
- How does PCA relate to ICA:
  Independent component analysis (ICA) is directed to similar problems as principal component analysis, but finds additively separable components rather than successive approximations.
- What’s the difference between the PCA estimate and the OLS estimate:
  OLS minimizes the squared residuals measured along the response variable only, whereas PCA minimizes the squared orthogonal (perpendicular) distances from the points to the fitted subspace (cf. Derivation 3 below).
Notes:
- Variance is a measure of spread along a single axis.
- SVD(X) vs Spectral-Decomposition(\(\Sigma = X^TX\)):
  SVD is better \(\iff\) more numerically stable \(\iff\) faster (a quick equivalence check is sketched below).
- When are the PCs independent?
  Assuming that the dataset is Gaussian distributed guarantees that the PCs are independent.
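A quick, assumption-laden check (random synthetic data, sign ambiguity ignored) that the two routes agree on the principal directions:

```python
import numpy as np

# SVD(X) vs eigendecomposition of X^T X: squared singular values match the
# eigenvalues, and the right singular vectors match the eigenvectors up to sign.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)   # right singular vectors (rows of Vt)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)         # eigenvalues in ascending order

print(np.allclose(s**2, eigvals[::-1]))                      # lambda_i = s_i^2
print(np.allclose(np.abs(Vt), np.abs(eigvecs[:, ::-1].T)))   # same directions up to sign
```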
Three Derivations of Principal Components (concise)
Better Derivations (longer)
Derivation 1. Fitting Gaussians to Data with MLE
- What?
  - Fit a Gaussian to the data with MLE.
  - Choose the k Gaussian axes of greatest variance.
    Notice: MLE estimates the covariance matrix as \(\hat{\Sigma} = \dfrac{1}{n}X^TX\).
- Algorithm (a sketch in code follows the list):
  - Center \(X\).
  - Normalize \(X\).
    Optional: should only be done if the units of measurement of the features differ.
  - Compute the eigenvalues and unit eigenvectors of \(X^TX\).
  - Choose ‘\(k\)’ based on the eigenvalue sizes.
    Optional: work from the largest eigenvalue down.
  - For the best k-dimensional subspace, pick the eigenvectors \(v_{d-k+1}, \cdots, v_d\).
  - Compute the coordinates \(x \cdot v_i\) of the training/test data in PC-space.
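A sketch of these steps on synthetic data (the data, the normalization choice, and \(k = 3\) are assumptions for illustration):

```python
import numpy as np

# Derivation-1 style algorithm: center, (optionally) normalize, eigendecompose
# X^T X, keep the k eigenvectors with the largest eigenvalues, and project.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))

X = X - X.mean(axis=0)                      # center
# X = X / X.std(axis=0)                     # normalize (only if units differ)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # ascending: lambda_1 <= ... <= lambda_d
k = 3
V_k = eigvecs[:, -k:]                       # v_{d-k+1}, ..., v_d
coords = X @ V_k                            # coordinates x . v_i in PC-space
```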
Derivation 2. Maximizing Variance
- What?
- Find a direction ‘\(w\)’ that maximizes the variance of the projected data.
- Maximize the variance
- Derivation:
- \[\max_{w : \|w\|_2=1} \: Var(\left\{\tilde{x_1}, \tilde{x_2}, \cdots, \tilde{x_n} \right\})\]
- \[\begin{align} & \ = \max_{w : \|w\|_2=1} \dfrac{1}{n} \sum_{i=1}^{n}(x_i \cdot \dfrac{w}{\|w\|})^2 \\ & \ = \max_{w : \|w\|_2=1} \dfrac{1}{n} \dfrac{\|Xw\|^2}{\|w\|^2} \\ & \ = \max_{w : \|w\|_2=1} \dfrac{1}{n} \dfrac{w^TX^TXw}{w^Tw} \\ \end{align}\]
- where \(\dfrac{w^TX^TXw}{w^Tw}\) is the Rayleigh Quotient of the matrix \(X^TX\).
- For any eigenvector \(v_i\) of \(X^TX\), the Rayleigh Quotient equals \(\lambda_i\).
- \(\implies\) the vector \(v_d\) with the largest eigenvalue \(\lambda_d\) achieves the maximum variance, \(\dfrac{\lambda_d}{n}\).
- Thus, the maximum of the Rayleigh Quotient is achieved at the Eigenvector that has the highest corresponding Eigenvalue.
- We find subsequent vectors by finding the next biggest \(\lambda_i\) and choosing its corresponding Eigenvector.
- Another Derivation from Statistics:
  First, we note that the sample variance along a direction \(u\) can be expressed as a quadratic form in \(u\):
  $$ \sigma^2(u) = \dfrac{1}{n} \sum_{k=1}^n [u^T(x_k-\hat{x})]^2 = u^T \Sigma u.$$
  The data matrix has points \(x_i\); the component of a point \(x\) along a proposed axis \(u\) is \((x \cdot u)\).
  The variance of this component is \(E(x \cdot u - E(x \cdot u))^2\),
  and the optimization problem is
  $$ \begin{align} \max_{u : \|u\|_2=1} \: E(x \cdot u - E(x \cdot u))^2 & \\ & \ = \max_{u : \|u\|_2=1} \: E[(u \cdot (x - Ex))^2] \\ & \ = \max_{u : \|u\|_2=1} \: u^T E[(x - Ex)(x - Ex)^T]u \\ & \ = \max_{u : \|u\|_2=1} \: u^T \Sigma u \end{align} $$
  where the matrix \({\displaystyle \Sigma = \dfrac{1}{n} \sum_{j=1}^n (x_j-\hat{x})(x_j-\hat{x})^T}.\)
  Since \(\Sigma\) is symmetric, the \(u\) that maximizes \(u^T\Sigma u\) (subject to \(\|u\|_2 = 1\)) is the eigenvector of \(\Sigma\) with the largest eigenvalue.
  The second and subsequent principal component axes are the other eigenvectors, sorted by eigenvalue.
  Proof of the variance along a direction:
  $$\boldsymbol{u}^{\top} \operatorname{cov}(\boldsymbol{X}) \boldsymbol{u}=\boldsymbol{u}^{\top} \mathbb{E}\left[(\boldsymbol{X}-\mathbb{E}(\boldsymbol{X}))(\boldsymbol{X}-\mathbb{E}(\boldsymbol{X}))^{\top}\right] \boldsymbol{u}=\mathbb{E}\left[\langle\boldsymbol{u}, \boldsymbol{X}-\mathbb{E}(\boldsymbol{X})\rangle^{2}\right] \geq 0 \\ \implies \\ \operatorname{var}(\langle\boldsymbol{u}, \boldsymbol{X}\rangle)=\mathbb{E}\left[\langle\boldsymbol{u}, \boldsymbol{X}-\mathbb{E} \boldsymbol{X}\rangle^{2}\right]=\boldsymbol{u}^{\top} \operatorname{cov}(\boldsymbol{X}) \boldsymbol{u}$$
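A small numerical check of this derivation on synthetic data (illustrative only): the variance of the data projected onto the top eigenvector of \(\Sigma\) equals the largest eigenvalue.

```python
import numpy as np

# Variance maximization: project onto the top eigenvector of the sample
# covariance and compare the projected variance with the largest eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic synthetic data
X = X - X.mean(axis=0)

Sigma = (X.T @ X) / len(X)                  # sample covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)
u = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue

proj_var = np.var(X @ u)                    # variance of the projected data
print(np.isclose(proj_var, eigvals[-1]))    # True: the maximum variance is lambda_max
```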
Derivation 3. Minimize Projection Error
- What?
- Find direction ‘\(w\)’ that minimizes the Projection Error.
- Derivation:
$$ \begin{align} \min_{w : \|w\|_2 = 1} \; \sum_{i=1}^n \|x_i - \tilde{x_i}\|^2 & \\ & \ = \min_{w : \|w\|_2 = 1} \; \sum_{i=1}^n \|x_i -\dfrac{x_i \cdot w}{\|w\|_2^2}w\|^2 \\ & \ = \min_{w : \|w\|_2 = 1} \; \sum_{i=1}^n \left[\|x_i\|^2 - (x_i \cdot \dfrac{w}{\|w\|_2})^2\right] \\ & \ = \min_{w : \|w\|_2 = 1} \; c - \sum_{i=1}^n(x_i \cdot \dfrac{w}{\|w\|_2})^2 \\ & \ = \min_{w : \|w\|_2 = 1} \; c - n \cdot Var(\left\{\tilde{x_1}, \tilde{x_2}, \cdots, \tilde{x_n} \right\}) \\ & \ = \max_{w : \|w\|_2 = 1} \; Var(\left\{\tilde{x_1}, \tilde{x_2}, \cdots, \tilde{x_n} \right\}) \end{align} $$
Thus, minimizing the projection error is equivalent to maximizing the variance; a quick numerical check follows.
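A quick numerical check of this equivalence, under assumed conditions (random candidate unit directions on synthetic data; purely illustrative):

```python
import numpy as np

# For many random unit directions w, the one minimizing the projection error
# is the same one maximizing the variance of the projected data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)

W = rng.normal(size=(2000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # candidate unit directions w

proj = X @ W.T                                   # projections x_i . w (one column per w)
errors = (X**2).sum() - (proj**2).sum(axis=0)    # sum_i ||x_i - x_tilde_i||^2
variances = (proj**2).mean(axis=0)               # Var of the projected data, per w

print(errors.argmin() == variances.argmax())     # True: the same direction wins both
```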