Table of Contents



Clustering - Introduction

  1. Clustering:
  1. Notes:

    EM Algorithm Video

    Clustering Types:

    • Hard Clustering: clusters do not overlap
      • Elements either belong to a cluster or it doesn’t
    • Soft Clustering: cluster may overlap
      • Computes a strength of association between clusters and instances

    Mixture Models:

    • probabilistically-grounded way of doing soft clustering
    • each cluster: a generative model (Gaussian or multinomial)
    • parameters (e.g. mean/covariance are unknown)

    Mixture Models as Latent Variable Models:
    A mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which each data point belongs.

    Expectation Maximization (EM):

    • Chicken and egg problem:
      • need \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\) to guess source of points
      • need to know source to estimate \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\)
    • EM algorithm
      • start with two randomly placed Gaussians \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\)
      • for each point: \(P(b \vert x_i)\) does it look like it came from \(b\)?
      • adjust \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\) to fit points assigned to them