Table of Contents
Clustering - Introduction
- Clustering:
-
Notes:
Clustering Types:
- Hard Clustering: clusters do not overlap
- Elements either belong to a cluster or it doesn’t
- Soft Clustering: cluster may overlap
- Computes a strength of association between clusters and instances
Mixture Models:
- probabilistically-grounded way of doing soft clustering
- each cluster: a generative model (Gaussian or multinomial)
- parameters (e.g. mean/covariance are unknown)
Mixture Models as Latent Variable Models:
A mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which each data point belongs.
Expectation Maximization (EM):
- Chicken and egg problem:
- need \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\) to guess source of points
- need to know source to estimate \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\)
- EM algorithm
- start with two randomly placed Gaussians \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\)
- for each point: \(P(b \vert x_i)\) does it look like it came from \(b\)?
- adjust \(\left(\mu_{a}, \sigma_{a}^{2}\right),\left(\mu_{b}, \sigma_{b}^{2}\right)\) to fit points assigned to them
- Hard Clustering: clusters do not overlap