Maximum Likelihood Estimation (MLE)
- Maximum Likelihood Estimation:
Likelihood in Parametric Models: Suppose we have a parametric model \(\{p(y ; \theta) \vert \theta \in \Theta\}\) and a sample \(\mathcal{D}=\left\{y_{1}, \ldots, y_{n}\right\}\):
- The likelihood of parameter estimate \(\hat{\theta} \in \Theta\) for sample \(\mathcal{D}\) is:
$$p(\mathcal{D} ; \hat{\theta})=\prod_{i=1}^{n} p\left(y_{i} ; \hat{\theta}\right)$$
- In practice, we prefer to work with the log-likelihood. It has the same maximizer, but
$$\log p(\mathcal{D} ; \hat{\theta})=\sum_{i=1}^{n} \log p\left(y_{i} ; \hat{\theta}\right)$$
and sums are easier to work with than products (see the sketch below).
The likelihood is the probability of the data given the parameters of the model.
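To make this concrete, here is a small sketch (assuming a Bernoulli(\(\theta\)) model, which is only an illustrative choice and not part of the notes) comparing the raw likelihood with the log-likelihood: the product of many per-point probabilities underflows to zero in floating point, while the sum of logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=2000)  # sample from an assumed Bernoulli(0.3) model

def likelihood(theta, y):
    # p(D; theta) = prod_i theta^{y_i} * (1 - theta)^{1 - y_i}
    return np.prod(theta ** y * (1 - theta) ** (1 - y))

def log_likelihood(theta, y):
    # log p(D; theta) = sum_i [y_i log(theta) + (1 - y_i) log(1 - theta)]
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

print(likelihood(0.3, y))      # underflows to 0.0 for n = 2000
print(log_likelihood(0.3, y))  # a finite value, roughly -1200
```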
MLE for Parametric Models:
The maximum likelihood estimator (MLE) for \(\theta\) in the (parametric) model \(\{p(y ; \theta) \vert \theta \in \Theta\}\) is:
$$\begin{aligned} \hat{\theta} &=\underset{\theta \in \Theta}{\arg \max } \log p(\mathcal{D} ; \theta) \\ &=\underset{\theta \in \Theta}{\arg \max } \sum_{i=1}^{n} \log p\left(y_{i} ; \theta\right) \end{aligned}$$
You are finding the value of the parameter \(\theta\) that, if used (in the model) to generate the probability of the data, would make the data most “likely” to occur.
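As a worked example (for a Bernoulli model with \(p(y ; \theta)=\theta^{y}(1-\theta)^{1-y}\), \(y \in\{0,1\}\), an assumption made here for illustration), setting the derivative of the log-likelihood to zero gives a closed-form MLE:
$$\log p(\mathcal{D} ; \theta)=\sum_{i=1}^{n}\left[y_{i} \log \theta+\left(1-y_{i}\right) \log (1-\theta)\right], \quad \frac{d}{d \theta} \log p(\mathcal{D} ; \theta)=\frac{\sum_{i} y_{i}}{\theta}-\frac{n-\sum_{i} y_{i}}{1-\theta}=0 \;\Rightarrow\; \hat{\theta}=\frac{1}{n} \sum_{i=1}^{n} y_{i}$$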
- MLE Intuition:
If I choose a hypothesis \(h\) under which the observed data is very plausible, then the hypothesis is very likely.
- Maximum Likelihood as Empirical Risk Minimization
- Finding the MLE is an optimization problem.
- For some model families, calculus gives a closed form for the MLE
- Can also use numerical methods we know (e.g. SGD); see the sketch after this list.
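A minimal sketch of the numerical route (assuming a Gaussian model with unknown mean and unit variance, and plain minibatch SGD on the negative log-likelihood; none of these choices come from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=5000)  # data drawn from an assumed N(2, 1) model

def neg_log_lik_grad(mu, batch):
    # gradient of -sum_i log N(y_i; mu, 1) over the minibatch: sum_i (mu - y_i)
    return np.sum(mu - batch)

mu, lr = 0.0, 1e-3
for step in range(200):
    batch = rng.choice(y, size=64)          # sample a minibatch for SGD
    mu -= lr * neg_log_lik_grad(mu, batch)  # descend the negative log-likelihood

print(mu, y.mean())  # SGD estimate vs. the closed-form MLE (the sample mean)
```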
Notes:
- Why maximize the natural log of the likelihood?
- Numerical Stability: change products to sums
- The logarithm of a member of the exponential family of probability distributions (which includes the ubiquitous normal) is polynomial in the parameters (i.e. maximum likelihood reduces to least squares for normal distributions); see the worked example after these notes. For example, \(\log\left(\exp\left(-\frac{1}{2}x^2\right)\right) = -\frac{1}{2}x^2\), and the latter form is both more numerically stable and symbolically easier to differentiate than the former. Working in log space also increases the dynamic range of the optimization algorithm (allowing it to handle extremely large or small values in the same way).
- The logarithm is a monotonic transformation that preserves the locations of the extrema (in particular, the estimated parameters in max-likelihood are identical for the original and the log-transformed formulation)
- Gradient methods generally work better when optimizing \(\log p(x)\) than \(p(x)\) because the gradient of \(\log p(x)\) is generally better scaled. Justification: the gradient of the original term includes a multiplicative \(e^{\vec{x}}\) factor that scales very quickly one way or the other, requiring the step size to scale/stretch in the opposite direction to compensate.
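To make the least-squares remark above concrete (assuming a normal model with known variance \(\sigma^{2}\) and unknown mean \(\mu\)):
$$\log p(\mathcal{D} ; \mu)=\sum_{i=1}^{n} \log \left[\frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{\left(y_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right)\right]=-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}-n \log (\sqrt{2 \pi}\, \sigma)$$
so maximizing the log-likelihood over \(\mu\) is exactly minimizing the sum of squared errors \(\sum_{i}\left(y_{i}-\mu\right)^{2}\).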