- Latent Variable Model Intuition (slides!)
- Radford Neal’s Research: Latent Variable Models (publications)
- Basics of Statistical Machine Learning: models, estimation, MLE, inference (paper/note!)
Statistical Models
- Statistical Models:
A Statistical Model is a non-deterministic mathematical model that embodies a set of statistical assumptions concerning the generation of sample data.
It is specified as a mathematical relationship between one or more random variables and other non-random variables.
Formal Definition:
A Statistical Model consists of a pair \((S, \mathcal{P})\), where \(S\) is the set of possible observations (the sample space) and \(\mathcal{P}\) is a set of probability distributions on \(S\).
The set \(\mathcal{P}\) can be (and usually is) parametrized:
$$\mathcal{P}=\left\{P_{\theta} : \theta \in \Theta\right\}$$
The set \(\Theta\) defines the parameters of the model.
Notes:
- It is important that a statistical model consists of a set of probability distributions,
while a probability model is just one known distribution.
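For a concrete (standard, illustrative) example: the family of all Bernoulli distributions is a statistical model,
$$\mathcal{P}=\left\{\operatorname{Bernoulli}(\theta) : \theta \in [0,1]\right\}, \qquad S=\{0,1\},$$
while a single fixed distribution such as \(\operatorname{Bernoulli}(0.5)\) is a probability model.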
- Parametric Model:
A parametric model is a set of probability distributions indexed by a parameter \(\theta \in \Theta\). We denote this as:
$$\{p(y ; \theta) \vert \theta \in \Theta\},$$
where \(\theta\) is the parameter and \(\Theta\) is the parameter space.
Notes:
- The parametric way to classify would be to decide a model (Gaussian, Bernoulli, etc.) for the features of \(\boldsymbol{x}\), and typically the models are different for different classes \(y\).
- In machine learning we are often interested in a function of the distribution, \(T(F)\), for example the mean. We call \(T\) the statistical functional, viewing the distribution \(F\) itself as a function of \(x\). However, we will also abuse the notation and say \(\theta=T(F)\) is a “parameter” even for nonparametric models.
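As a minimal sketch (the Gaussian family and data below are illustrative assumptions, not from these notes): choosing a point \(\theta \in \Theta\) selects one member \(p(y;\theta)\) of the family, and fitting the model means choosing that point from data.

```python
import numpy as np

# Illustrative parametric family: Gaussians indexed by theta = (mu, sigma).
def gaussian_pdf(y, mu, sigma):
    """Density p(y; theta) of one member P_theta of the family."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

y = np.array([0.2, -1.3, 0.7])                 # made-up observations
print(gaussian_pdf(y, mu=0.0, sigma=1.0))      # evaluate one P_theta at the data

# For this family the MLE has a closed form: theta_hat = (sample mean, sample std).
mu_hat, sigma_hat = y.mean(), y.std()
```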
- Non-Parametric Model:
A non-parametric model is one which cannot be parametrized by a fixed number of parameters.
Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters, but rather that the number and nature of the parameters are flexible and not fixed in advance.
Examples:
- A histogram is a simple nonparametric estimate of a probability distribution.
- Kernel density estimation provides better estimates of the density than histograms.
- Nonparametric regression and semiparametric regression methods have been developed based on kernels, splines, and wavelets.
- Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis without any distributional assumption.
- KNNs classify the unseen instance based on the K points in the training set which are nearest to it.
- A support vector machine (SVM) (with a Gaussian kernel) is a nonparametric large-margin classifier.
- Method of moments with polynomial probability distributions.
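As a hedged sketch of the kernel-density-estimation example above (the bandwidth and data are illustrative): the estimate is built directly from all \(n\) sample points, so the effective number of “parameters” grows with the data rather than being fixed in advance.

```python
import numpy as np

def kde(x_query, data, bandwidth=0.3):
    """Gaussian kernel density estimate: (1 / (n * h)) * sum_i K((x - x_i) / h)."""
    u = (x_query[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

data = np.random.randn(200)            # sample from an "unknown" density (made-up)
grid = np.linspace(-3, 3, 50)
density = kde(grid, data)              # the estimate uses every data point
```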
- Other classes of Statistical Models:
Given \(\mathcal{P}=\left\{P_{\theta} : \theta \in \Theta\right\}\), the set of probability distributions on \(S\):
- A model is “parametric” if all the parameters are in finite-dimensional parameter spaces; i.e. \(\Theta\) has finite dimension
- A model is “non-parametric” if all the parameters are in infinite-dimensional parameter spaces
- A “semi-parametric” model contains finite-dimensional parameters of interest and infinite-dimensional nuisance parameters
- A “semi-nonparametric” model has both finite-dimensional and infinite-dimensional unknown parameters of interest
- Types of Statistical Models:
- Linear Model
- GLM - General Linear Model
- GLiM - Generalized Linear Model
- Latent Variable Model
- The Statistical Model for Linear Regression:
Given a (random) sample \(\left(Y_{i}, X_{i 1}, \ldots, X_{i p}\right), i=1, \ldots, n\), the relation between the observations \(Y_i\) and the independent variables \(X_{ij}\) is formulated as:
$$Y_{i}=\beta_{0}+\beta_{1} \phi_{1}\left(X_{i 1}\right)+\cdots+\beta_{p} \phi_{p}\left(X_{i p}\right)+\varepsilon_{i} \qquad i=1, \ldots, n$$
where \({\displaystyle \phi_{1},\ldots ,\phi_{p}}\) may be nonlinear functions. In the above, the quantities \(\varepsilon_i\) are random variables representing errors in the relationship.
The Linearity of the Model:
The “linear” part of the designation relates to the appearance of the regression coefficients, \(\beta_j\) in a linear way in the above relationship.
Alternatively, one may say that the predicted values corresponding to the above model, namely:
$$\hat{Y}_{i}=\beta_{0}+\beta_{1} \phi_{1}\left(X_{i 1}\right)+\cdots+\beta_{p} \phi_{p}\left(X_{i p}\right) \qquad(i=1, \ldots, n)$$
are linear functions of the coefficients \(\beta_j\).
Estimating the Parameters \(\beta_j\):
Assuming estimation on the basis of a least-squares analysis, estimates of the unknown parameters \(\beta_j\) are determined by minimizing the sum-of-squares function:
$$S=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} \phi_{1}\left(X_{i 1}\right)-\cdots-\beta_{p} \phi_{p}\left(X_{i p}\right)\right)^{2}$$
Effects of Linearity:
- The function to be minimized is a quadratic function of the \(\beta_j\) for which minimization is a relatively simple problem
- The derivatives of the function are linear functions of the \(\beta_j\) making it easy to find the minimizing values
- The minimizing values \(\beta_j\) are linear functions of the observations \(Y_i\)
- The minimizing values \(\beta_j\) are linear functions of the random errors \(\varepsilon_i\) which makes it relatively easy to determine the statistical properties of the estimated values of \(\beta_j\).
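A minimal numpy sketch of the least-squares estimation above (the basis functions \(\phi_j\) and the data are illustrative assumptions): because \(S\) is quadratic in the \(\beta_j\), minimizing it reduces to solving a linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data with basis functions phi_1(x) = x and phi_2(x) = sin(x).
X = rng.uniform(-1, 1, size=(100, 2))
Y = 1.0 + 2.0 * X[:, 0] + 0.5 * np.sin(X[:, 1]) + 0.1 * rng.standard_normal(100)

# Design matrix with columns [1, phi_1(X_i1), phi_2(X_i2)].
Phi = np.column_stack([np.ones(len(X)), X[:, 0], np.sin(X[:, 1])])

# S(beta) = ||Y - Phi @ beta||^2 is quadratic in beta, so the minimizer solves
# the linear normal equations (Phi^T Phi) beta = Phi^T Y.
beta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
Y_hat = Phi @ beta_hat                 # fitted values, linear in the estimated beta
```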
- Latent Variable Models:
Latent Variable Models are statistical models that relate a set of observable variables (so-called manifest variables) to a set of latent variables.
Core Assumption - Local Independence:
The observed items are conditionally independent of each other given an individual’s score on the latent variable(s). This means that the latent variable explains why the observed items are related to one another.
In other words, the targets/labels on the observations are the result of an individual’s position on the latent variable(s), and the observations have nothing in common after controlling for the latent variable.
$$p(A,B\vert z) = p(A\vert z) \times p(B\vert z)$$
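A small numeric sketch of local independence (all probabilities are made-up illustrative values): \(A\) and \(B\) are independent given \(z\), yet marginally dependent because they share the latent variable.

```python
import numpy as np

# Binary latent z and two binary observed items A, B (illustrative numbers).
p_z = np.array([0.6, 0.4])              # p(z)
p_A_given_z = np.array([0.9, 0.2])      # p(A = 1 | z)
p_B_given_z = np.array([0.8, 0.3])      # p(B = 1 | z)

# Local independence: p(A, B | z) = p(A | z) * p(B | z), for each value of z.
p_AB_given_z = p_A_given_z * p_B_given_z

# Marginally, A and B are related only through z:
p_AB = np.sum(p_z * p_AB_given_z)       # p(A = 1, B = 1)
p_A = np.sum(p_z * p_A_given_z)         # p(A = 1)
p_B = np.sum(p_z * p_B_given_z)         # p(B = 1)
print(p_AB, p_A * p_B)                  # unequal => marginally dependent
```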
Methods for inferring Latent Variables:
- Hidden Markov models (HMMs)
- Factor analysis
- Principal component analysis (PCA)
- Partial least squares regression
- Latent semantic analysis and probabilistic latent semantic analysis
- EM algorithms
- Pseudo-Marginal Metropolis-Hastings algorithm
- Bayesian Methods: LDA (Latent Dirichlet Allocation)
Notes:
- Latent Variables encode information about the data
e.g. in compression, a 1-bit latent variable can encode whether a face is male or female.
- Data Projection:
You hypothesize how the data might have been generated (by LVs).
Then, the LVs generate the data/observations.
- Latent Variable Models/Gaussian Mixture Models
- Expectation-Maximization/EM-Algorithm for Latent Variable Models
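A hedged sketch of EM for a two-component 1-D Gaussian mixture (the data and initialization are illustrative assumptions): the component assignment plays the role of the latent variable; the E-step infers it (soft responsibilities) and the M-step re-fits the parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])   # toy data

# Initial guesses (made up for the sketch).
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities p(z = k | x_i) under the current parameters.
    lik = pi * normal_pdf(x[:, None], mu, sigma)        # shape (n, 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means, and standard deviations.
    n_k = resp.sum(axis=0)
    pi = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
```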
- Three ways to build classifiers:
- Generative models (e.g. LDA) [We’ll learn about LDA next lecture.]
- Assume sample points come from probability distributions, different for each class.
- Guess form of distributions
- For each class \(C\), fit distribution parameters to class \(C\) points, giving \(P(X\vert Y = C)\)
- For each \(C\), estimate \(P(Y = C)\)
- Bayes’ Theorem gives \(P(Y\vert X)\)
- If \(0-1\) loss, pick the class \(C\) that maximizes \(P(Y = C\vert X = x)\) [posterior probability]; equivalently, maximizes \(P(X = x\vert Y = C)\, P(Y = C)\)
- Discriminative models (e.g. logistic regression) [We’ll learn about logistic regression in a few weeks.]
- Model \(P(Y\vert X)\) directly
- Find decision boundary (e.g. SVM)
- Model \(r(x)\) directly (no posterior)
Advantage of (1 & 2): \(P(Y\vert X)\) tells you probability your guess is wrong
[This is something SVMs don’t do.]
Advantage of (1): you can diagnose outliers: \(P(X)\) is very small
Disadvantages of (1): often hard to estimate distributions accurately;
real distributions rarely match standard ones.
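A minimal sketch of recipe (1), under the illustrative assumption of 1-D Gaussian class-conditionals: fit \(P(X \vert Y=C)\) and \(P(Y=C)\) for each class, then pick the class maximizing \(P(X=x \vert Y=C)\,P(Y=C)\).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_generative(X, y):
    """For each class C: fit a Gaussian to the class-C points and estimate P(Y = C)."""
    return {c: (X[y == c].mean(), X[y == c].std(), np.mean(y == c)) for c in np.unique(y)}

def predict(x, params):
    """0-1 loss: pick the class maximizing P(X = x | Y = C) * P(Y = C)."""
    scores = {c: normal_pdf(x, mu, sigma) * prior for c, (mu, sigma, prior) in params.items()}
    return max(scores, key=scores.get)

# Toy 1-D data (made up for the sketch).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
y = np.array([0] * 100 + [1] * 100)
print(predict(3.5, fit_generative(X, y)))   # likely predicts class 1
```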
Regression Models
- Linear Models:
A Linear Model takes an input \(x\), computes a signal \(s = \sum_{i=0}^d w_ix_i\) that is a linear combination of the input with weights, and then applies a scoring function to the signal \(s\).
- Linear Classifier as a Parametric Model:
Linear classifiers \(f(x, W)=W x+b\) are an example of a parametric model that sums up the knowledge of the training data in its parameters: the weight matrix \(W\) (and bias \(b\)).
- Scoring Function:
- Linear Classification: \(h(x) = \operatorname{sign}(s)\)
- Linear Regression: \(h(x) = s\)
- Logistic Regression: \(h(x) = \sigma(s)\)
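A small sketch applying the three scoring functions to the same linear signal (the weights and input below are placeholders):

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])       # weights, with w_0 acting as the bias
x = np.array([1.0, 0.3, -0.7])       # input, with x_0 = 1 for the bias term

s = w @ x                             # the signal: a linear combination of the input

h_classification = np.sign(s)         # linear classification: h(x) = sign(s)
h_regression = s                      # linear regression:      h(x) = s
h_logistic = 1 / (1 + np.exp(-s))     # logistic regression:    h(x) = sigma(s)
```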