Special Matrices in ML
Matrix Cookbook

Diagonal Matrices

  1. Definition.
    Diagonal matrices are square matrices \(A\) with \(A_{ij} = 0 \text{, when } i \ne j.\)

  2. Properties:

    1. Diagonal matrices correspond to quadratic functions that are simple sums of squares, of the form:

      $$q(x) = \sum_{i=1}^n \lambda_i x_i^2 = x^T \mathbf{diag}(\lambda) x.$$

    2. They are easy to invert: \(A^{-1}_{ii} = 1/A_{ii}\) for all \(i \in [1, n]\), provided every diagonal entry is non-zero
    3. The pseudo-inverse is easy to compute: invert the non-zero diagonal entries and keep the zero entries as zero
    4. Matrix multiplication and powers are cheap: products of diagonal matrices are diagonal, and \((A^k)_{ii} = A_{ii}^k\) (see the sketch after this list)
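
A minimal NumPy sketch of these properties; the diagonal entries in `d` and the test vector `x` below are made-up numbers.

```python
import numpy as np

# A small sketch: pseudo-inverse and quadratic form of a diagonal matrix.
d = np.array([2.0, 0.0, 5.0])             # diagonal entries, one of them zero
A = np.diag(d)

# Pseudo-inverse: invert the non-zero diagonal entries, keep the zeros as zero.
d_pinv = np.divide(1.0, d, out=np.zeros_like(d), where=(d != 0))
assert np.allclose(np.diag(d_pinv), np.linalg.pinv(A))

# Quadratic form: x^T diag(d) x is just a weighted sum of squares.
x = np.array([1.0, 2.0, 3.0])
assert np.isclose(x @ A @ x, np.sum(d * x**2))
```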


Symmetric Matrices

  1. Definition:
    Symmetric matrices are square matrices that satisfy \(A_{ij} = A_{ji}\) for every pair \((i,j).\)

    The set of symmetric \((n \times n)\) matrices is denoted \(\mathbf{S}^n\). This set is a subspace of \(\mathbf{R}^{n \times n}\).

  2. Properties:
    • Has a full set of orthonormal eigenvectors (even when eigenvalues are repeated)
    • Its inverse, when it exists, is also symmetric
    • All eigenvalues of a symmetric matrix are real
    • For any matrix \(A\), the symmetric matrix \(A^TA\) is invertible iff the columns of \(A\) are linearly independent
    • Every symmetric matrix \(S\) can be diagonalized (factorized), with \(Q\) formed by the orthonormal eigenvectors \(v_i\) of \(S\) and \(\Lambda\) a diagonal matrix holding the eigenvalues:

      $$\begin{aligned} S&=Q \Lambda Q^{T} \\ &= \lambda_{1} v_{1} v_{1}^{T}+\ldots+\lambda_{n} v_{n} v_{n}^{T} = \sum_{i=1}^n \lambda_i v_i v_i^T \end{aligned}$$

    • One can create a symmetric matrix from any matrix \(A\) by taking its symmetric part (see the sketch after this list):

      $$M = \tfrac{1}{2}\left(A+A^{\textsf {T}}\right)$$
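
A minimal NumPy sketch of the symmetrization trick and the spectral decomposition above, using a small random matrix:

```python
import numpy as np

# Symmetrize an arbitrary matrix A, then verify S = Q Λ Q^T = Σ_i λ_i v_i v_i^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = 0.5 * (A + A.T)                        # symmetric part of A

eigvals, Q = np.linalg.eigh(S)             # eigh is meant for symmetric/Hermitian matrices
assert np.allclose(S, Q @ np.diag(eigvals) @ Q.T)

# The same decomposition written as a sum of rank-one terms λ_i v_i v_i^T.
assert np.allclose(S, sum(l * np.outer(v, v) for l, v in zip(eigvals, Q.T)))

# The eigenvectors (columns of Q) are orthonormal and the eigenvalues are real.
assert np.allclose(Q.T @ Q, np.eye(4))
assert np.isrealobj(eigvals)
```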


  3. Examples:
    1. Representation of a weighted, undirected graph
    2. Laplacian matrix of a graph
    3. Hessian of a function
    4. Gram matrix of data points



Skew-Symmetric Matrices

  1. Definition.
    Skew-symmetric matrices are square matrices that satisfy \(A^T = -A\), i.e. \(A_{ij} = -A_{ji}\) for every pair \((i,j)\).

  2. Properties:

  3. Examples:


Covariance Matrices

  1. Definition.
    Standard Form:

    $$\Sigma :=\mathrm {E} \left[\left(\mathbf {X} -\mathrm {E} [\mathbf {X} ]\right)\left(\mathbf {X} -\mathrm {E} [\mathbf {X} ]\right)^{\rm {T}}\right]$$

    $$ \Sigma := \dfrac{1}{n} \sum_{k=1}^n (x_k - \hat{x})(x_k - \hat{x})^T \qquad \text{(sample covariance of } n \text{ data points with sample mean } \hat{x}\text{)}. $$

    Matrix Form:

    $$ \Sigma := \dfrac{X^TX}{n} $$

    valid only when (1) \(X\) has the \(n\) samples in rows and the variables in columns, and (2) \(X\) is centered (each column has mean 0); see the sketch below.
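
A minimal NumPy sketch of the matrix form, on random data; `np.cov` with `bias=True` uses the same \(1/n\) normalization:

```python
import numpy as np

# Once the columns of X are centered, Σ = X^T X / n matches np.cov(..., bias=True).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))          # n = 100 samples in rows, 3 variables in columns

Xc = X - X.mean(axis=0)                    # center each variable (column)
Sigma = Xc.T @ Xc / X.shape[0]

assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```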

  2. Properties:
    1. The sample covariance matrix allows one to find the variance of the data along any direction in data space.

    2. The diagonal elements of \(\Sigma\) give the variances of the individual variables (columns) in the data.

    3. The trace of \(\Sigma\) gives the sum of all the variances.

    4. The matrix \(\Sigma\) is positive semi-definite, since the associated quadratic form \(u \rightarrow u^T \Sigma u\) is non-negative everywhere.

    5. It is Symmetric.

    6. Every symmetric positive semi-definite matrix is a covariance matrix.
      Proof sketch: factor \(\Sigma = LL^T\) (e.g. via the eigendecomposition \(\Sigma = Q\Lambda Q^T\), take \(L = Q\Lambda^{1/2}\)); then the random vector \(\mathbf{X} = L\mathbf{z}\), where \(\mathbf{z}\) has uncorrelated components with unit variance, has covariance \(L I L^T = \Sigma\).
    7. The sample variance along a direction \(u\) can be expressed as a quadratic form in \(u\) (see the sketch after this list):
      \(\sigma^2(u) = \dfrac{1}{n} \sum_{k=1}^n [u^T(x_k-\hat{x})]^2 = u^T \Sigma u.\)
    8. The dimension of the matrix is \((n \times n)\), where \(n\) here denotes the number of variables/features/columns.

    9. The inverse of this matrix, \({\displaystyle \Sigma ^{-1},}\) if it exists, is the inverse covariance matrix, also known as the concentration matrix or precision matrix.

    10. If a vector of \(n\) possibly correlated random variables is jointly normally distributed, or more generally elliptically distributed, then its probability density function can be expressed in terms of the covariance matrix.

    11. \(\Sigma =\mathrm {E} (\mathbf {XX^{\rm {T}}} )-{\boldsymbol {\mu }}{\boldsymbol {\mu }}^{\rm {T}}\).

    12. \({\displaystyle \operatorname {var} (\mathbf {AX} +\mathbf {a} )=\mathbf {A} \,\operatorname {var} (\mathbf {X} )\,\mathbf {A^{\rm {T}}} }\).

    13. \(\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )=\operatorname {cov} (\mathbf {Y} ,\mathbf {X} )^{\rm {T}}\).

    14. \(\operatorname {cov} (\mathbf {X}_ {1}+\mathbf {X}_ {2},\mathbf {Y} )=\operatorname {cov} (\mathbf {X}_ {1},\mathbf {Y} )+\operatorname {cov} (\mathbf {X}_ {2},\mathbf {Y} )\).

    15. If \((p = q)\), then \(\operatorname {var} (\mathbf {X} +\mathbf {Y} )=\operatorname {var} (\mathbf {X} )+\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )+\operatorname {cov} (\mathbf {Y} ,\mathbf {X} )+\operatorname {var} (\mathbf {Y} )\).

    16. \(\operatorname {cov} (\mathbf {AX} +\mathbf {a} ,\mathbf {B} ^{\rm {T}}\mathbf {Y} +\mathbf {b} )=\mathbf {A} \,\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )\,\mathbf {B}\).

    17. If \({\displaystyle \mathbf {X} }\) and \({\displaystyle \mathbf {Y} }\) are independent (or somewhat less restrictedly, if every random variable in \({\displaystyle \mathbf {X} }\) is uncorrelated with every random variable in \({\displaystyle \mathbf {Y} }\)), then \({\displaystyle \operatorname {cov} (\mathbf {X} ,\mathbf {Y} )=\mathbf {0} }\).

    18. \(\operatorname {var} (\mathbf {b} ^{\rm {T}}\mathbf {X} )= \mathbf {b} ^{\rm {T}}\operatorname {cov} (\mathbf {X} )\mathbf {b} = \operatorname {cov} (\mathbf{b}^T\mathbf {X}, \mathbf{b}^T\mathbf {X} ) \geq 0,\,\).

      This quantity is non-negative because it is a variance.
      Since \(\operatorname{var}(\mathbf{X})\) of a random vector is just its covariance matrix, it can equivalently be written as \(\mathbf {b} ^{\rm {T}}\operatorname {var} (\mathbf {X} )\mathbf {b}\).

    19. An identity covariance matrix, \(\Sigma = I\) has variance \(= 1\) for all variables.

    20. A covariance matrix of the form, \(\Sigma=\sigma^2I\) has variance \(= \sigma^2\) for all variables.

    21. A diagonal covariance matrix has variance \(\sigma_i^2\) for the \(i\)-th variable.

    22. When the mean \(\hat{x}\) is estimated from the data rather than known, the denominator of the sample covariance matrix should be \((n-1)\) instead of \(n\) (Bessel's correction), which makes the estimator unbiased.

    where, \({\displaystyle \mathbf {X} ,\mathbf {X} _{1}},\) and \({\displaystyle \mathbf {X} _{2}}\) are random \((p\times 1)\) vectors, \({\displaystyle \mathbf {Y} }\) is a random \((q\times 1)\) vector, \({\displaystyle \mathbf {a} }\) is a \((q\times 1)\) vector, \({\displaystyle \mathbf {b} }\) is a \((p\times 1)\) vector, and \({\displaystyle \mathbf {A} }\) and \({\displaystyle \mathbf {B} }\) are \((q\times p)\) matrices of constants.
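
A minimal NumPy sketch, on random data, illustrating properties 2, 3, 4 and 7 above:

```python
import numpy as np

# Diagonal of Σ = per-variable variances, trace = their sum, Σ is PSD, and the
# variance along an arbitrary direction u is the quadratic form u^T Σ u.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / X.shape[0]

assert np.allclose(np.diag(Sigma), Xc.var(axis=0))        # property 2
assert np.isclose(np.trace(Sigma), Xc.var(axis=0).sum())  # property 3
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-12)        # property 4 (PSD)

u = np.array([1.0, -2.0, 0.5, 3.0])                       # an arbitrary direction
var_along_u = np.mean((Xc @ u) ** 2)                      # (1/n) Σ_k [u^T (x_k - x̂)]^2
assert np.isclose(var_along_u, u @ Sigma @ u)             # property 7
```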

  3. \(\Sigma\) as a Linear Operator:
    • Applied to a vector of coefficients \(c\), the covariance matrix gives the covariances of the linear combination \(c^T\mathbf{X}\) with each of the random variables in \(\mathbf{X}\):
    \[{\displaystyle \mathbf {c} ^{\rm {T}}\Sigma =\operatorname {cov} (\mathbf {c} ^{\rm {T}}\mathbf {X} ,\mathbf {X} )}\]
    • Treated as a bilinear form, it yields the covariance between the two linear combinations:
    \[{\displaystyle \mathbf {d} ^{\rm {T}}\Sigma \mathbf {c} =\operatorname {cov} (\mathbf {d} ^{\rm {T}}\mathbf {X} ,\mathbf {c} ^{\rm {T}}\mathbf {X} )}\]
    • The variance of a linear combination is then (its covariance with itself)
    \[{\displaystyle \mathbf {c} ^{\rm {T}}\Sigma \mathbf {c} }\]
    • The (pseudo-)inverse covariance matrix provides an inner product \({\displaystyle \langle c-\mu \mid \Sigma ^{+}\mid c-\mu \rangle }\), which induces the Mahalanobis distance, a measure of the “unlikelihood” of \(c\) (see the sketch after this list).
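
A minimal NumPy sketch of the bilinear-form view and the Mahalanobis distance, on random data with arbitrary coefficient vectors `c` and `d`:

```python
import numpy as np

# Σ as a bilinear form, and the Mahalanobis distance from the sample mean.
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 3))
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)

c = np.array([1.0, 0.0, 2.0])
d = np.array([0.5, -1.0, 1.0])

# cov(d^T X, c^T X) equals the bilinear form d^T Σ c.
assert np.isclose(np.cov(X @ d, X @ c, bias=True)[0, 1], d @ Sigma @ c)

# Squared Mahalanobis distance of the point c from the sample mean.
maha_sq = (c - mu) @ np.linalg.pinv(Sigma) @ (c - mu)
print(maha_sq)
```
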
  4. Applications [Examples]:
    1. The Whitening Transformation: allows one to completely decorrelate the data or, equivalently, to find an optimal basis for representing the data in a compact way (see the sketch after this list).

    2. Rayleigh Quotient:

    3. Principal Component Analysis [PCA]

    4. The Karhunen-Loève transform (KL-transform)

    5. Mutual fund separation theorem

    6. Capital asset pricing model

    7. Portfolio Theory: The matrix of covariances among various assets’ returns is used to determine, under certain assumptions, the relative amounts of different assets that investors should (in a normative analysis) or are predicted to (in a positive analysis) choose to hold in a context of diversification.
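
A minimal NumPy sketch of the whitening transformation mentioned above, on synthetic correlated data; the whitening matrix \(W = \Lambda^{-1/2}Q^T\) is one common choice (PCA whitening), built from the eigendecomposition of \(\Sigma\):

```python
import numpy as np

# Whitening: map centered data to decorrelated, unit-variance coordinates.
rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[4.0, 1.5], [1.5, 1.0]], size=1000)
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / X.shape[0]

eigvals, Q = np.linalg.eigh(Sigma)          # Sigma is positive definite for this data
W = np.diag(1.0 / np.sqrt(eigvals)) @ Q.T   # whitening matrix W = Λ^{-1/2} Q^T
Z = Xc @ W.T                                # whitened samples, one per row

# The whitened data has (numerically) the identity covariance matrix.
assert np.allclose(Z.T @ Z / Z.shape[0], np.eye(2), atol=1e-8)
```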


Positive Semi-Definite Matrices

  1. Definition:
    A symmetric \(n\times n\) real matrix \(M\) is said to be positive semi-definite if the scalar \(z^{\textsf {T}}Mz\) is non-negative for every column vector \(z\) of \(n\) real numbers.

    Mathematically:

    $$M \text { positive semi-definite } \Longleftrightarrow \quad x^{\top} M x \geq 0 \text { for all } x \in \mathbb{R}^{n}$$


  2. Properties:

    • \(AA^T\) and \(A^TA\) are PSD for any matrix \(A\) (see the sketch after this list)
    • \(M\) is positive definite if all of its pivots are > 0
    • \(M\) is positive semi-definite if and only if all of its eigenvalues are non-negative
    • Covariance Matrices \(\Sigma\) are PSD
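
A minimal NumPy sketch checking that \(A^TA\) is PSD for a random (rectangular) matrix \(A\):

```python
import numpy as np

# A^T A is symmetric positive semi-definite: non-negative eigenvalues and
# x^T (A^T A) x = ||A x||^2 >= 0 for any x.
rng = np.random.default_rng(5)
A = rng.standard_normal((3, 5))
M = A.T @ A                                 # 5 x 5, symmetric, rank at most 3

assert np.allclose(M, M.T)
assert np.all(np.linalg.eigvalsh(M) >= -1e-10)

x = rng.standard_normal(5)
assert x @ M @ x >= 0
```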


Positive Definite Matrices

  1. Definition:
    A symmetric \({\displaystyle n\times n}\) real matrix \({\displaystyle M}\) is said to be positive definite if the scalar \({\displaystyle z^{\textsf {T}}Mz}\) is strictly positive for every non-zero column vector \({\displaystyle z}\) of \(n\) real numbers.

    Mathematically:

    $$M \text { positive definite } \Longleftrightarrow x^{\top} M x>0 \text { for all } x \in \mathbb{R}^{n} \setminus \{\mathbf{0}\}$$

  2. Properties:

    • \(M\) is positive definite if and only if all of its eigenvalues are positive
    • The matrix \({\displaystyle M}\) is positive definite if and only if the bilinear form \({\displaystyle \langle z,w\rangle =z^{\textsf {T}}Mw}\) is positive definite
    • A symmetric matrix \({\displaystyle M}\) is positive definite if and only if its quadratic form is a strictly convex function
    • Every positive definite matrix is invertible and its inverse is also positive definite.
    • \(M\) is positive definite if All pivots > 0
    • Covariance Matrices \(\Sigma\) are positive definite unless one variable is an exact linear function of the others. Conversely, every positive semi-definite matrix is the covariance matrix of some multivariate distribution
    • Any quadratic function from \(\mathbb {R} ^{n}\) to \(\mathbb {R}\) can be written as \(x^{\textsf {T}}Mx+x^{\textsf {T}}b+c\), where \(M\) is a symmetric \(n\times n\) matrix, \(b\) is a real \(n\)-vector, and \(c\) a real constant. This quadratic function is strictly convex, and hence has a unique finite global minimum, if and only if \(M\) is positive definite (see the sketch after this list).
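
A minimal NumPy sketch, with a made-up \(2\times 2\) matrix, of two of the properties above: Cholesky factorization succeeds exactly when \(M\) is positive definite, and the quadratic \(x^TMx + b^Tx + c\) then has the unique minimizer \(x^* = -\tfrac{1}{2}M^{-1}b\) (from setting the gradient \(2Mx + b\) to zero):

```python
import numpy as np

M = np.array([[2.0, 0.5],
              [0.5, 1.0]])                  # symmetric with positive eigenvalues
b = np.array([1.0, -1.0])
c = 3.0

np.linalg.cholesky(M)                       # would raise LinAlgError if M were not PD

x_star = -0.5 * np.linalg.solve(M, b)       # stationary point of the quadratic
f = lambda x: x @ M @ x + b @ x + c

# Moving away from x* in any (sampled) direction increases the value.
for delta in (np.array([0.1, 0.0]), np.array([-0.05, 0.2])):
    assert f(x_star + delta) > f(x_star)
```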


Orthogonal Matrices

  1. Definition.
    Orthogonal matrices (or, in the complex case, unitary matrices) are square matrices whose columns form an orthonormal basis.

  2. Properties:
    1. If \(U = [u_1, \ldots, u_n]\) is an orthogonal matrix, then

      $$u_i^Tu_j = \left\{ \begin{array}{ll} 1 & \mbox{if } i=j, \\ 0 & \mbox{otherwise.} \end{array} \right. $$

    2. \(U^TU = UU^T = I_n\).

    3. Easy inverse:

      $$U^{-1} = U^T$$

    4. Geometrically, orthogonal matrices correspond to rotations (about the origin) or reflections (about a line or hyperplane through the origin).

      i.e. they preserve lengths and angles.
      Proof: see properties 5 and 6 below.

    5. For all vectors \(\vec{x}\),
      \(\|Ux\|_2^2 = (Ux)^T(Ux) = x^TU^TUx = x^Tx = \|x\|_2^2 .\)

      This is known as the rotational invariance of the Euclidean norm.
      Thus, if we multiply \(x\) by an orthogonal matrix, the errors present in \(x\) will not be magnified. This behavior is very desirable for maintaining numerical stability.

    6. If \(x, y\) are two vectors with unit norm, then the angle \(\theta\) between them satisfies \(\cos \theta = x^Ty\),
      while the angle \(\theta'\) between the rotated vectors \(x' = Ux, y' = Uy\) satisfies \(\cos \theta' = (x')^Ty'\).
      Since \((Ux)^T(Uy) = x^T U^TU y = x^Ty,\) the two angles are the same (see the sketch after this list).
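
A minimal NumPy sketch of properties 5 and 6, using the orthogonal Q factor of a QR decomposition of a random matrix:

```python
import numpy as np

# The Q factor of a QR decomposition is orthogonal; it preserves norms and
# inner products (hence angles).
rng = np.random.default_rng(6)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))

assert np.allclose(U.T @ U, np.eye(4))                        # U^T U = I

x = rng.standard_normal(4)
y = rng.standard_normal(4)
assert np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))   # lengths preserved
assert np.isclose((U @ x) @ (U @ y), x @ y)                   # angles preserved
```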

  3. Examples:

Dyads

  1. Definition.
    A matrix \(A \in \mathbf{R}^{m \times n}\) is a dyad if it is of the form \(A = uv^T\) for some vectors \(u \in \mathbf{R}^m, v \in \mathbf{R}^n\).

    The dyad acts on an input vector \(x \in \mathbf{R}^n\) as follows:

    $$ Ax = (uv^T) x = (v^Tx) u.$$
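
A minimal NumPy sketch of the dyad action \(Ax = (v^Tx)u\), with made-up vectors \(u, v, x\):

```python
import numpy as np

# A dyad A = u v^T sends every input x to the direction of u,
# scaled by the linear function v^T x.
u = np.array([1.0, 2.0])                    # output direction, in R^2
v = np.array([3.0, 0.0, -1.0])              # defines the scaling, in R^3
A = np.outer(u, v)                          # the 2 x 3 dyad u v^T

x = np.array([1.0, 5.0, 2.0])
assert np.allclose(A @ x, (v @ x) * u)      # A x = (v^T x) u
assert np.linalg.matrix_rank(A) == 1        # a non-zero dyad has rank one
```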


  2. Properties:
    1. The output always points in the same direction \(u\) in output space (\(\mathbf{R}^m\)), no matter what the input \(x\) is.

    2. The output is always a simple scaled version of \(u\).

    3. The amount of scaling depends on the vector \(v\), via the linear function \(x \rightarrow v^Tx\).

  3. Examples:


  4. Normalized dyads:
    We can always normalize a (non-zero) dyad by rescaling \(u, v\) to unit (Euclidean) norm and using a scalar factor to capture their scale.
    That is, any dyad can be written in normalized form:
    \[A = uv^T = (\|u\|_2 \cdot \|v\|_2 ) \cdot \left(\dfrac{u}{\|u\|_2}\right) \left( \dfrac{v}{\|v\|_2}\right) ^T = \sigma \tilde{u}\tilde{v}^T,\]
    where \(\sigma > 0\), and \(\|\tilde{u}\|_2 = \|\tilde{v}\|_2 = 1.\)
  5. Symmetric dyads:
    Another important class of symmetric matrices consists of matrices of the form \(uu^T\), where \(u \in \mathbf{R}^n\). Such a matrix has entries \((uu^T)_{ij} = u_iu_j\), and is symmetric.

    If \(\|u\|_2 = 1\), then the dyad is said to be normalized.

    \[uu^T = \left(\begin{array}{ccc} u_1^2 & u_1u_2 & u_1u_3 \\ u_1u_2 & u_2^2 & u_2u_3 \\ u_1u_3 & u_2u_3 & u_3^2 \end{array} \right)\]
    • Properties:
      1. Symmetric dyads correspond to quadratic functions that are simply squared linear forms:
        \(q(x) = (u^Tx)^2\)
      2. When the vector \(u\) is normalized (unit), then:
        \(\mathbf{Tr}(uu^T) = \|u\|_2^2 = 1^2 = 1\)

        This follows from the fact that the diagonal entries of a symmetric dyad are \(u_i^2\) for \(i \in [1, n]\), which sum to \(\|u\|_2^2\) (see the sketch after this list).
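
A minimal NumPy sketch of these two properties, with a made-up vector \(u\):

```python
import numpy as np

# A normalized symmetric dyad u u^T is symmetric, has trace 1, and its
# quadratic form is the squared linear form (u^T x)^2.
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)                   # normalize so that ||u||_2 = 1
P = np.outer(u, u)

assert np.allclose(P, P.T)
assert np.isclose(np.trace(P), 1.0)

x = np.array([2.0, -1.0])
assert np.isclose(x @ P @ x, (u @ x) ** 2)  # q(x) = (u^T x)^2
```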


Correlation matrix

  1. Definition.
    \[{\text{corr}}(\mathbf {X} )=\left({\text{diag}}(\Sigma )\right)^{-{\frac {1}{2}}}\,\Sigma \,\left({\text{diag}}(\Sigma )\right)^{-{\frac {1}{2}}}\]
  2. Properties:
    1. It is the matrix of “Pearson product-moment correlation coefficients” between each of the random variables in the random vector \({\displaystyle \mathbf {X} }\).

    2. The correlation matrix can be seen as the covariance matrix of the standardized random variables \({\displaystyle X_{i}/\sigma (X_{i})}\) for \({\displaystyle i=1,\dots ,n}\).

    3. Each element on the principal diagonal of a correlation matrix is the correlation of a random variable with itself, which always equals 1.

    4. Each off-diagonal element is between \(-1\) and \(1\) inclusive (see the sketch below).
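
A minimal NumPy sketch, on random correlated data, of the rescaling in the definition above, checked against `np.corrcoef`:

```python
import numpy as np

# corr(X) = diag(Σ)^{-1/2} Σ diag(Σ)^{-1/2}, with ones on its diagonal.
rng = np.random.default_rng(7)
X = rng.standard_normal((300, 3)) @ np.array([[1.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 1.0]])
Sigma = np.cov(X, rowvar=False)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
Corr = D_inv_sqrt @ Sigma @ D_inv_sqrt

assert np.allclose(Corr, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(Corr), 1.0)
```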