Generalized Linear Models and Exponential Family Distributions (Blog!)
Logistic regression as a neural network (Blog!)
- Least-Squares Linear Regression: MLE + Noise Normally Distributed + Conditional Probability Normally Distributed
- Logistic Regression: MLE + Noise \(\sim\) Logistic Distribution (latent) + Conditional Probability \(\sim\) Bernoulli Distributed
- Ridge Regression: MAP + Noise Normally Distributed + Conditional Probability Normally Distributed + Weight Prior Normally Distributed
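To see why the last correspondence holds, here is a sketch under the usual assumptions (Gaussian noise with variance \(\sigma^{2}\) and a Gaussian weight prior \(w \sim \mathcal{N}(0, \tau^{2} I)\); the prior variance \(\tau^{2}\) is notation introduced here, not from the list above). The MAP objective is

$$\begin{aligned} \hat{w}_{\text{MAP}} &= \underset{w}{\arg\max}\ \log p(Y \vert X, w) + \log p(w) \\ &= \underset{w}{\arg\min}\ \frac{1}{2\sigma^{2}} \sum_{i=1}^{n}\left(y_{i}-w^{\top} x_{i}\right)^{2} + \frac{1}{2\tau^{2}} \|w\|_{2}^{2} \end{aligned}$$

which is exactly ridge regression with penalty \(\lambda = \sigma^{2} / \tau^{2}\).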
Regression
Linear Regression
Assume that the target distribution is a sum of a deterministic function \(f(x; \theta)\) and a normally distributed error \(\epsilon \sim \mathcal{N}\left(0, \sigma^{2}\right)\):
$$y = f(x; \theta) + \epsilon$$
Thus, conditioned on \(x\), \(y \sim \mathcal{N}\left(f(x; \theta), \sigma^{2}\right)\); i.e. we assume a conditional distribution \(p(y \vert x) = \mathcal{N}\left(y;\, f(x; \theta), \sigma^{2}\right)\).
- Notice that, \(\epsilon = y - \hat{y} \implies\)
$$\begin{align} \epsilon &\sim \mathcal{N}\left(0, \sigma^{2}\right) \\ p(\epsilon) &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{\epsilon^{2}}{2 \sigma^{2}}} \\ &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{\left(y-\hat{y}\right)^{2}}{2 \sigma^{2}}} \end{align}$$
In linear regression, the equivalent assumption is:
We assume that we are given data \(x_{1}, \ldots, x_{n}\) and outputs \(y_{1}, \ldots, y_{n}\) where \(x_{i} \in \mathbb{R}^{d}\) and \(y_{i} \in \mathbb{R}\) and that there is a distribution \(p(y \vert x)\) where \(y \sim \mathcal{N}\left(w^{\top} x, \sigma^{2}\right)\).
- In other words, we assume that the conditional distribution of \(Y_i \vert \theta\) is a Gaussian (Each individual term \(p\left(y_{i} \vert \mathbf{x}_ {i}, \boldsymbol{\theta}\right)\) comes from a Gaussian):
$$Y_{i} \vert \boldsymbol{\theta} \sim \mathcal{N}\left(h_{\boldsymbol{\theta}}\left(\mathbf{x}_ {i}\right), \sigma^{2}\right)$$
In other words, we assume that there is a true linear model weighted by some true \(w\) and the values generated are scattered around it with some error \(\epsilon \sim \mathcal{N}\left(0, \sigma^{2}\right)\).
Then we just want to obtain the maximum likelihood estimate:
$$\begin{aligned} p(Y \vert X, w) &=\prod_{i=1}^{n} p\left(y_{i} \vert x_{i}, w\right) \\ \log p(Y \vert X, w) &=\sum_{i}-\frac{1}{2}\log \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}}\left(y_{i}-w^{\top} x_{i}\right)^{2} \end{aligned}$$
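Since \(\sigma\) does not depend on \(w\), maximizing this log-likelihood is the same as minimizing the sum of squared residuals \(\sum_i \left(y_i - w^{\top} x_i\right)^2\), i.e. ordinary least squares. A minimal sketch of this equivalence (the synthetic data and the `numpy`/`scipy` routines are my own illustrative choices, not from the original notes):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: y = w_true . x + Gaussian noise with known sigma
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_lik(w):
    # negative of the log-likelihood above (sign flipped for minimize)
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (resid @ resid) / (2 * sigma**2)

w_mle = minimize(neg_log_lik, x0=np.zeros(d)).x    # MLE by numerical search
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)       # closed-form least squares

print(w_mle)  # the two agree (up to optimizer tolerance),
print(w_ls)   # because the NLL is squared error plus constants in w
```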
Logistic Regression
The errors are not directly observable: we only ever observe the binary outcomes, never the underlying probabilities.
Latent Variable Interpretation:
Logistic regression can be understood simply as finding the \(\beta\) parameters that best fit:
$$y=\left\{\begin{array}{ll}{1} & {\beta_{0}+\beta_{1} x+\varepsilon>0} \\ {0} & {\text { else }}\end{array}\right.$$
where \(\varepsilon\) is an error term that follows the standard logistic distribution.
The associated latent variable is \({\displaystyle y'=\beta _{0}+\beta _{1}x+\varepsilon }\). The error term \(\varepsilon\) is not observed, so \(y'\) is also unobservable, hence the term “latent” (the observed data are values of \(y\) and \(x\)). Unlike ordinary regression, however, the \(\beta\) parameters cannot be expressed by any direct formula of the \(y\) and \(x\) values in the observed data; instead they must be found by an iterative search process, as sketched below.
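To make the “iterative search” concrete, here is a minimal sketch (the simulated data, the parameter values, and the use of plain gradient ascent rather than the Newton/IRLS updates most packages use are all assumptions of this illustration): draw \((x, y)\) from the latent-variable model above, then climb the Bernoulli log-likelihood until \(\beta\) is recovered.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from the latent-variable model: y = 1 iff b0 + b1*x + eps > 0,
# with eps ~ standard logistic, which implies P(y=1 | x) = sigmoid(b0 + b1*x)
b0_true, b1_true = -1.0, 2.0
x = rng.normal(size=1000)
eps = rng.logistic(size=1000)
y = (b0_true + b1_true * x + eps > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Iterative search: plain gradient ascent on the Bernoulli log-likelihood
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(b0 + b1 * x)
    b0 += lr * np.mean(y - p)        # d(mean log-lik)/d(b0)
    b1 += lr * np.mean((y - p) * x)  # d(mean log-lik)/d(b1)

print(b0, b1)  # should land near (-1.0, 2.0), up to sampling noise
```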
Notes:
- Can be used with a polynomial kernel.
- Convex cost function
- No closed-form solution
LR, MINIMIZING the ERROR FUNCTION (DERIVATION):
Linear Classification and Regression, and Non-Linear Transformations:
A Third Linear Model - Logistic Regression:
Logistic Regression Algorithm:
Summary of Linear Models: