Generalized Linear Models and Exponential Family Distributions (Blog!)
Logistic regression as a neural network (Blog!)
- Least-Squares Linear Regression: MLE + Noise Normally Distributed + Conditional Probability Normally Distributed
- Logistic Regression: MLE + Noise \(\sim\) Logistic Distribution (latent) + Conditional Probability \(\sim\) Bernoulli Distributed
- Ridge Regression: MAP + Noise Normally Distributed + Conditional Probability Normally Distributed + Weight Prior Normally Distributed
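To see why the last correspondence holds, here is a sketch under the usual assumptions (Gaussian noise with variance \(\sigma^{2}\) and a Gaussian weight prior \(w \sim \mathcal{N}(0, \tau^{2} I)\); the prior variance \(\tau^{2}\) is notation introduced here, not from the list above). The MAP objective is

$$\begin{aligned} \hat{w}_{\text{MAP}} &= \underset{w}{\arg\max}\ \log p(Y \vert X, w) + \log p(w) \\ &= \underset{w}{\arg\min}\ \frac{1}{2\sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-w^{\top} x_{i}\right)^{2} + \frac{1}{2\tau^{2}} \|w\|_{2}^{2} \end{aligned}$$

which is exactly ridge regression with penalty \(\lambda = \sigma^{2} / \tau^{2}\).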
Regression
Linear Regression
Assume that the target distribution is a sum of a deterministic function \(f(x; \theta)\) and a normally distributed error \(\epsilon \sim \mathcal{N}\left(0, \sigma^{2}\right)\):
$$y = f(x; \theta) + \epsilon$$
Thus, conditioned on \(x\), \(y \sim \mathcal{N}\left(f(x; \theta), \sigma^{2}\right)\); i.e. we assume a conditional distribution \(p(y \vert x) = \mathcal{N}\left(y;\, f(x; \theta), \sigma^{2}\right)\).
- Notice that, \(\epsilon = y - \hat{y} \implies\)
$$\begin{align} \epsilon &\sim \mathcal{N}\left(0, \sigma^{2}\right) \\ p(\epsilon) &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{\epsilon^{2}}{2 \sigma^{2}}} \\ &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{\left(y-\hat{y}\right)^{2}}{2 \sigma^{2}}} \end{align}$$
In linear regression, the equivalent assumption is:
We assume that we are given data \(x_{1}, \ldots, x_{n}\) and outputs \(y_{1}, \ldots, y_{n}\) where \(x_{i} \in \mathbb{R}^{d}\) and \(y_{i} \in \mathbb{R}\) and that there is a distribution \(p(y \vert x)\) where \(y \sim \mathcal{N}\left(w^{\top} x, \sigma^{2}\right)\).
- In other words, we assume that the conditional distribution of \(Y_i \vert \theta\) is a Gaussian (Each individual term \(p\left(y_{i} \vert \mathbf{x}_ {i}, \boldsymbol{\theta}\right)\) comes from a Gaussian):
$$Y_{i} \vert \boldsymbol{\theta} \sim \mathcal{N}\left(h_{\boldsymbol{\theta}}\left(\mathbf{x}_ {i}\right), \sigma^{2}\right)$$
In other words, we assume that there is a true linear model weighted by some true \(w\) and the values generated are scattered around it with some error \(\epsilon \sim \mathcal{N}\left(0, \sigma^{2}\right)\).
Then we just want to obtain the maximum likelihood estimate:
$$\begin{aligned} p(Y \vert X, w) &=\prod_{i=1}^{n} p\left(y_{i} \vert x_{i}, w\right) \\ \log p(Y \vert X, w) &=\sum_{i}-\frac{1}{2}\log \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}}\left(y_{i}-w^{\top} x_{i}\right)^{2} \end{aligned}$$
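Since \(\sigma\) does not depend on \(w\), maximizing this log-likelihood is the same as minimizing the sum of squared residuals \(\sum_i \left(y_i - w^{\top} x_i\right)^2\), i.e. ordinary least squares. A minimal sketch of this equivalence (the synthetic data and the `numpy`/`scipy` routines are my own illustrative choices, not from the original notes):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: y = w_true . x + Gaussian noise with known sigma
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_lik(w):
    # negative of the log-likelihood above (sign flipped for minimize)
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (resid @ resid) / (2 * sigma**2)

w_mle = minimize(neg_log_lik, x0=np.zeros(d)).x    # MLE by numerical search
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)       # closed-form least squares

print(w_mle)  # the two agree (up to optimizer tolerance),
print(w_ls)   # because the NLL is squared error plus constants in w
```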
Logistic Regression
The errors are not directly observable: we only ever observe the binary outcomes, never the underlying probabilities.
Latent Variable Interpretation:
Logistic regression can be understood simply as finding the \(\beta\) parameters that best fit:
$$y=\left\{\begin{array}{ll}{1} & {\beta_{0}+\beta_{1} x+\varepsilon>0} \\ {0} & {\text { else }}\end{array}\right.$$
where \(\varepsilon\) is an error term that follows the standard logistic distribution.
The associated latent variable is \({\displaystyle y'=\beta _{0}+\beta _{1}x+\varepsilon }\). The error term \(\varepsilon\) is not observed, so \(y'\) is also unobservable, hence the term “latent” (the observed data are values of \(y\) and \(x\)). Unlike ordinary regression, however, the \(\beta\) parameters cannot be expressed by any direct formula of the \(y\) and \(x\) values in the observed data; instead they must be found by an iterative search process, as sketched below.
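To make the “iterative search” concrete, here is a minimal sketch (the simulated data, the parameter values, and the use of plain gradient ascent rather than the Newton/IRLS updates most packages use are all assumptions of this illustration): draw \((x, y)\) from the latent-variable model above, then climb the Bernoulli log-likelihood until \(\beta\) is recovered.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from the latent-variable model: y = 1 iff b0 + b1*x + eps > 0,
# with eps ~ standard logistic, which implies P(y=1 | x) = sigmoid(b0 + b1*x)
b0_true, b1_true = -1.0, 2.0
x = rng.normal(size=1000)
eps = rng.logistic(size=1000)
y = (b0_true + b1_true * x + eps > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Iterative search: plain gradient ascent on the Bernoulli log-likelihood
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(b0 + b1 * x)
    b0 += lr * np.mean(y - p)        # d(mean log-lik)/d(b0)
    b1 += lr * np.mean((y - p) * x)  # d(mean log-lik)/d(b1)

print(b0, b1)  # should land near (-1.0, 2.0), up to sampling noise
```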
Notes:
- Can be used with a polynomial kernel.
- Convex cost function
- No closed-form solution
LR, MINIMIZING the ERROR FUNCTION (DERIVATION):
Linear Classification and Regression, and Non-Linear Transformations:
A Third Linear Model - Logistic Regression:
Logistic Regression Algorithm:
Summary of Linear Models: