Ahmad Badary

Clustering - Introduction

Notes:

EM Algorithm Video

Clustering Types:
- Hard Clustering: clusters do not overlap
  - Elements either belong to a cluster or it doesn’t
- Soft Clustering: cluster may overlap
  - Computes a strength of association between clusters and instances
Mixture Models:
- probabilistically-grounded way of doing soft clustering
- each cluster: a generative model (Gaussian or multinomial)
- parameters (e.g. mean/covariance are unknown)
Mixture Models as Latent Variable Models:
A mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which each data point belongs.

Expectation Maximization (EM):
- Chicken and egg problem:
  - need \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\) to guess source of points
  - need to know source to estimate \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\)
- EM algorithm
  - start with two randomly placed Gaussians \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\)
  - for each point: \(P(b \vert x_i)\) does it look like it came from \(b\)?
  - adjust \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\) to fit points assigned to them