Loss Functions
Abstractly, a loss function or cost function is a function that maps an event, or the values of one or more variables, onto a real number that intuitively represents some “cost” associated with the event.
Formally, a loss function is a function \(L :(\hat{y}, y) \in \mathbb{R} \times Y \longmapsto L(\hat{y}, y) \in \mathbb{R}\) that takes as inputs the predicted value \(\hat{y}\) corresponding to the real data value \(y\) and outputs how different they are.
Loss Functions for Regression
Introduction
Regression Losses usually only depend on the residual \(r = y - \hat{y}\) (i.e. what you have to add to your prediction to match the target)
Distance-Based Loss Functions:
A Loss function \(L(\hat{y}, y)\) is called distance-based if it:
- Only depends on the residual:
$$L(\hat{y}, y) = \psi(y-\hat{y}) \:\: \text{for some } \psi : \mathbb{R} \longmapsto \mathbb{R}$$
- Loss is \(0\) when residual is \(0\):
$$\psi(0) = 0$$
Translation Invariance:
Distance-based losses are translation-invariant:
$$L(\hat{y}+a, y+a) = L(\hat{y}, y)$$
Sometimes Relative-Error \(\dfrac{\hat{y}-y}{y}\) is a more natural loss but it is NOT translation-invariant
MSE
The MSE minimizes the sum of squared differences between the predicted values and the target values.
$$L(\hat{y}, y) = \dfrac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_ {i}\right)^{2}$$
MAE
The MAE minimizes the sum of absolute differences between the predicted values and the target values.
$$L(\hat{y}, y) = \dfrac{1}{n} \sum_{i=1}^{n}\vert y_{i}-\hat{y}_ {i}\vert$$
Properties:
- Solution may be Non-unique
- Robustness to outliers
- Unstable Solutions:
  The instability property of the method of least absolute deviations means that, for a small horizontal adjustment of a datum, the regression line may jump a large amount. The method has continuous solutions for some data configurations; however, by moving a datum a small amount, one could “jump past” a configuration that has multiple solutions spanning a region. After passing this region of solutions, the least absolute deviations line has a slope that may differ greatly from that of the previous line. In contrast, the least-squares solution is stable: for any small adjustment of a data point, the regression line will always move only slightly, i.e. the regression parameters are continuous functions of the data.
- Data-points “Latching”[^1]:
  - Unique Solution:
    If there are \(k\) features (including the constant), then at least one optimal regression surface will pass through \(k\) of the data points, unless there are multiple solutions.
  - Multiple Solutions:
    The region of valid least absolute deviations solutions will be bounded by at least \(k\) lines, each of which passes through at least \(k\) data points.
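A minimal NumPy sketch of the MSE and MAE formulas above (the function names `mse` and `mae` are just illustrative, not from any particular library); the toy data with a single outlier shows why MAE is considered the more robust of the two:

```python
import numpy as np

def mse(yHat, y):
    # Mean of squared residuals.
    return np.mean((y - yHat) ** 2)

def mae(yHat, y):
    # Mean of absolute residuals; grows only linearly with large errors.
    return np.mean(np.abs(y - yHat))

y    = np.array([1.0, 2.0, 3.0, 100.0])   # the last target is an outlier
yHat = np.array([1.1, 1.9, 3.2, 4.0])

print(mse(yHat, y))   # ~2304 -- dominated by the single outlier
print(mae(yHat, y))   # ~24.1 -- affected, but far less dramatically
```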
Huber Loss
AKA: Smooth Mean Absolute Error
$$L(\hat{y}, y) = \left\{\begin{array}{cc}{\frac{1}{2}(y-\hat{y})^{2}} & {\text { if }\vert y-\hat{y}\vert<\delta} \\ {\delta\left(\vert y-\hat{y}\vert-\frac{1}{2} \delta\right)} & {\text { otherwise }}\end{array}\right.$$
Properties:
- It’s less sensitive to outliers than the MSE because it treats the error quadratically only inside the interval \(\vert y-\hat{y}\vert < \delta\) and linearly outside it.
Code:
```python
import numpy as np

def Huber(yHat, y, delta=1.):
    # Quadratic penalty for residuals smaller than delta, linear penalty beyond it.
    return np.where(np.abs(y - yHat) < delta,
                    .5 * (y - yHat) ** 2,
                    delta * (np.abs(y - yHat) - 0.5 * delta))
```
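A quick usage check of the two regimes (illustrative values only):

```python
print(Huber(yHat=np.array([0.2, 3.0]), y=np.array([0.0, 0.0])))
# [0.02 2.5] -> quadratic (0.5 * 0.2**2) below delta, linear (1.0 * (3.0 - 0.5)) above
```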
KL-Divergence
$$L(\hat{y}, y) = D_{\mathrm{KL}}(y \,\Vert\, \hat{y}) = \sum_{i} y_{i} \log \left(\frac{y_{i}}{\hat{y}_ {i}}\right)$$
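A minimal NumPy sketch of this formula, assuming \(y\) and \(\hat{y}\) are discrete probability vectors (the \(\epsilon\)-clipping to avoid \(\log 0\) is an implementation detail, not part of the definition):

```python
import numpy as np

def kl_divergence(yHat, y, eps=1e-12):
    # D_KL(y || yHat) = sum_i y_i * log(y_i / yHat_i)
    y    = np.clip(y, eps, 1.0)
    yHat = np.clip(yHat, eps, 1.0)
    return np.sum(y * np.log(y / yHat))

y    = np.array([0.1, 0.6, 0.3])   # target distribution
yHat = np.array([0.2, 0.5, 0.3])   # predicted distribution
print(kl_divergence(yHat, y))      # ~0.04; it is 0 iff yHat == y
```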
Analysis and Discussion
MSE vs MAE:
| MSE | MAE |
| --- | --- |
| Sensitive to outliers | Robust to outliers |
| Differentiable Everywhere | Non-Differentiable at \(0\) |
| Stable Solutions | Unstable Solutions |
| Unique Solution | Possibly multiple solutions |
- Statistical Efficiency:
- “For normal observations MSE is about \(12\%\) more efficient than MAE” - Fisher
- As little as \(1\%\) erroneous observations is enough to make MAE more efficient
- Even \(2\) bad observations in \(1000\) are enough to make the median more efficient than the mean (see the simulation sketch after this list)
- Subgradient methods are slower than gradient descent
- You get much better convergence-rate guarantees for MSE
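The efficiency claims above can be checked with a small Monte-Carlo simulation; this is only a sketch (sample size, contamination fraction, and outlier scale are arbitrary choices, not values from the quotes above):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator_variance(contamination=0.0, n=1000, trials=2000):
    # Compare the mean (the MSE-optimal estimator) and the median (the MAE-optimal
    # estimator) of the center of a standard normal, with a fraction of wild outliers.
    means, medians = [], []
    for _ in range(trials):
        x = rng.normal(0.0, 1.0, size=n)
        k = int(contamination * n)
        if k > 0:
            x[:k] = rng.normal(0.0, 10.0, size=k)   # gross-error observations
        means.append(x.mean())
        medians.append(np.median(x))
    return np.var(means), np.var(medians)

print(estimator_variance(0.00))   # clean data: the mean's variance is ~1.57x smaller
print(estimator_variance(0.01))   # ~1% bad observations: the median is now more efficient
```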
Loss Functions for Classification
\(0-1\) Loss
$$L(\hat{y}, y) = I(\hat{y} \neq y) = \left\{\begin{array}{ll}{0} & {\hat{y}=y} \\ {1} & {\hat{y} \neq y}\end{array}\right.$$
MSE
Since \(y \in \{-1, 1\} \implies y^{2} = 1\), we can write the loss in terms of the margin \(m = y\hat{y}\):
$$L(\hat{y}, y) = (y-\hat{y})^{2} = \left(y(1-y \hat{y})\right)^{2} = y^{2}(1-y \hat{y})^{2} = (1-y \hat{y})^{2} = (1-m)^{2}$$
Hinge Loss
$$L(\hat{y}, y) = \max (0,1-y \hat{y})=|1-y \hat{y}|_ {+}$$
Properties:
- Continuous, Convex, Non-Differentiable (at \(y\hat{y} = 1\))
Logistic Loss
AKA: Log-Loss, Logarithmic Loss
$$L(\hat{y}, y) = \log{\left(1+e^{-y \hat{y}}\right)}$$
Properties:
- The logistic loss function does not assign zero penalty to any points. Instead, functions that correctly classify points with high confidence (i.e., with high values of \({\displaystyle \vert f({\vec {x}})\vert}\)) are penalized less. This structure leads the logistic loss function to be sensitive to outliers in the data.
Cross-Entropy (Log Loss)
$$L(\hat{y}, y) = -\sum_{i} y_i \log \left(\hat{y}_ {i}\right)$$
Binary Cross-Entropy:
$$L(\hat{y}, y) = -\left[y \log \hat{y}+\left(1-y\right) \log \left(1-\hat{y}\right)\right]$$
Cross-Entropy and Negative-Log-Probability:
The Cross-Entropy is equal to the Negative-Log-Probability (of predicting the true class) when the true distribution that we are trying to match is peaked at a single point and identically zero everywhere else; this is usually the case in ML, where the target is a one-hot encoded vector \(y = [0 \: 0 \: \ldots \: 0 \: 1 \: 0 \: \ldots \: 0]\) peaked at the \(j\)-th position
\(\implies\)
$$L(\hat{y}, y) = -\sum_{i} y_i \log \left(\hat{y}_ {i}\right) = - \log (\hat{y}_ {j})$$
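A minimal sketch of this reduction, assuming \(\hat{y}\) is a valid probability vector (e.g. the output of a softmax):

```python
import numpy as np

yHat = np.array([0.1, 0.7, 0.2])   # predicted class probabilities
y    = np.array([0.0, 1.0, 0.0])   # one-hot target, peaked at j = 1

full_ce   = -np.sum(y * np.log(yHat))   # -sum_i y_i * log(yHat_i)
neg_log_p = -np.log(yHat[1])            # -log(yHat_j)
print(full_ce, neg_log_p)               # both equal ~0.357
```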
Cross-Entropy and Log-Loss:
The Cross-Entropy is equal to the Log-Loss in the case of \(0, 1\) classification.
Equivalence of Binary Cross-Entropy and Logistic-Loss:
Given \(p \in\{y, 1-y\}\) and \(q \in\{\hat{y}, 1-\hat{y}\}\):
$$H(p,q)=-\sum_{x }p(x)\,\log q(x) = -y \log \hat{y}-(1-y) \log (1-\hat{y}) = L(\hat{y}, y)$$
- Notice the following property of the logistic function \(\sigma\) (used in derivation below):
\(\sigma(-x) = 1-\sigma(x)\)
Given:
- \(\hat{y} = \sigma(yf(x))\),[^4]
- \(y \in \{-1, 1\}\),
- \(\hat{y}' = \sigma(f(x))\),
- \(y' = (1+y)/2 = \left\{\begin{array}{ll}{1} & {\text { for }} y = 1 \\ {0} & {\text { for }} y = -1\end{array}\right. \in \{0, 1\}\)[^5]
- We start with the modified binary cross-entropy
\(\begin{aligned} -y' \log \hat{y}'-(1-y') \log (1-\hat{y}') &= \left\{\begin{array}{ll}{-\log\hat{y}'} & {\text { for }} y' = 1 \\ {-\log(1-\hat{y}')} & {\text { for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\sigma(f(x))} & {\text { for }} y' = 1 \\ {-\log(1-\sigma(f(x)))} & {\text { for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\sigma(1\times f(x))} & {\text { for }} y' = 1 \\ {-\log(\sigma((-1)\times f(x)))} & {\text { for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\sigma(yf(x))} & {\text { for }} y' = 1 \\ {-\log(\sigma(yf(x)))} & {\text { for }} y' = 0\end{array}\right. \\ \\ &= \left\{\begin{array}{ll}{-\log\hat{y}} & {\text { for }} y' = 1 \\ {-\log\hat{y}} & {\text { for }} y' = 0\end{array}\right. \\ \\ &= -\log\hat{y} \\ \\ &= \log\left[\dfrac{1}{\hat{y}}\right] \\ \\ &= \log\left[\hat{y}^{-1}\right] \\ \\ &= \log\left[\sigma(yf(x))^{-1}\right] \\ \\ &= \log\left[ \left(\dfrac{1}{1+e^{-yf(x)}}\right)^{-1}\right] \\ \\ &= \log \left(1+e^{-yf(x)}\right)\end{aligned}\)
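The equivalence can also be checked numerically; in this sketch, \(f\) is just an arbitrary score value:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

f = 1.3                                    # arbitrary score f(x)
for y in (-1, 1):                          # original label in {-1, 1}
    y_prime    = (1 + y) / 2               # relabelled target in {0, 1}
    yHat_prime = sigmoid(f)                # predicted P(y = 1)
    bce      = -(y_prime * np.log(yHat_prime) + (1 - y_prime) * np.log(1 - yHat_prime))
    logistic = np.log(1 + np.exp(-y * f))
    print(y, bce, logistic)                # the two values coincide for both labels
```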
Cross-Entropy as Negative-Log-Likelihood (w/ equal probability outcomes):
Minimizing the cross-entropy between the empirical distribution of the training data (which assigns equal probability \(1/N\) to each of the \(N\) observed examples) and the model distribution is equivalent to Maximum-Likelihood Estimation: the cross-entropy then reduces to the average negative log-likelihood of the observed outcomes.
Cross-Entropy and KL-Div:
When comparing a distribution \(q\) against a fixed reference distribution \(p\), cross-entropy and KL divergence are identical up to an additive constant (since \(p\) is fixed): both take on their minimal values when \(p=q\), which is \(0\) for KL divergence and \(\mathrm{H}(p)\) for cross-entropy.
Basically, minimizing either will result in the same solution.
Cross-Entropy VS MSE (& Classification Loss):
Basically, CE > MSE because the gradient of the MSE contains the factor \(z(1-z)\) (the sigmoid derivative), which saturates when the output \(z\) of a neuron is near \(0\) or \(1\), making the gradient very small and, thus, slowing down training.
CE > Class-Loss because Class-Loss is binary and does not take into account “how well” we are actually approximating the probabilities, as opposed to just having the target class be slightly higher than the rest (e.g. \([c_1=0.3, c_2=0.3, c_3=0.4]\)).
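The saturation argument can be illustrated by comparing the gradients with respect to the pre-activation \(u\) of a sigmoid output neuron with output \(z = \sigma(u)\) (a sketch of the standard calculation; the symbol \(u\) is introduced here, not above):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

y = 1.0                                  # target
for u in (-10.0, 0.0, 10.0):             # pre-activation of the output unit
    z = sigmoid(u)                       # output of the neuron (predicted probability)
    grad_mse = (z - y) * z * (1 - z)     # d/du of 0.5*(z - y)**2 -- contains the z*(1-z) factor
    grad_ce  = z - y                     # d/du of the cross-entropy loss
    print(u, grad_mse, grad_ce)
# At u = -10 the prediction is badly wrong, yet grad_mse is ~ -4.5e-5 (saturated),
# while grad_ce is ~ -1 and still provides a strong learning signal.
```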
Exponential Loss
$$L(\hat{y}, y) = e^{-\beta y \hat{y}}$$
Perceptron Loss
$$L(\hat{y}, y) = \left\{\begin{array}{ll}{0} & {\text { if } \: y \hat{y} \geq 0} \\ {-y \hat{y}} & {\text { otherwise }}\end{array}\right.$$
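The margin-based losses above can be summarized in one sketch as functions of the margin \(m = y\hat{y}\) (an illustrative implementation, not from any library; \(\beta = 1\) is assumed for the exponential loss):

```python
import numpy as np

# Each loss written as a function of the margin m = y * yHat.
def zero_one(m):    return np.where(m > 0, 0.0, 1.0)   # counts m = 0 as an error
def squared(m):     return (1 - m) ** 2
def hinge(m):       return np.maximum(0.0, 1 - m)
def logistic(m):    return np.log(1 + np.exp(-m))
def exponential(m): return np.exp(-m)                   # beta = 1
def perceptron(m):  return np.where(m >= 0, 0.0, -m)

m = np.linspace(-2, 2, 9)
for loss in (zero_one, squared, hinge, logistic, exponential, perceptron):
    print(f"{loss.__name__:>11}: {np.round(loss(m), 3)}")
```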
Notes
- Logistic loss is often said to diverge faster than hinge loss and, hence, to be more sensitive to outliers; note, however, that for large negative margins both losses grow only linearly, so this claim should be taken with care.
[^1]: The reason is that the errors are equally weighted; so, tilting the regression line (within a region) will decrease the distance to a particular point and increase the distance to other points by the same amount.
[^4]: \(f(x) = w^Tx\) in logistic regression.
[^5]: We have to redefine the indicator/target variable to establish the equality.