Table of Contents



Loss Functions (blog)
Information Theory (Cross-Entropy and MLE, MSE, Nash, etc.)

Loss Functions

Abstractly, a loss function or cost function is a function that maps an event, or the values of one or more variables, onto a real number that intuitively represents some “cost” associated with the event.

Formally, a loss function is a function \(L :(\hat{y}, y) \in \mathbb{R} \times Y \longmapsto L(\hat{y}, y) \in \mathbb{R}\) that takes as inputs the predicted value \(\hat{y}\) and the corresponding true value \(y\), and outputs a real number quantifying how different they are.


Loss Functions for Regression


Introduction

Regression losses usually depend only on the residual \(r = y - \hat{y}\) (i.e. what you have to add to your prediction to match the target).

Distance-Based Loss Functions:
A loss function \(L(\hat{y}, y)\) is called distance-based if it depends only on the residual, i.e. \(L(\hat{y}, y) = \rho(y - \hat{y})\) for some function \(\rho\), and it is zero when the residual is zero: \(\rho(0) = 0\).

Translation Invariance:
Distance-based losses are translation-invariant:

$$L(\hat{y}+a, y+a) = L(\hat{y}, y)$$

Sometimes the relative error \(\dfrac{\hat{y}-y}{y}\) is a more natural loss, but it is NOT translation-invariant.


MSE

The MSE minimizes the sum of squared differences between the predicted values and the target values.

$$L(\hat{y}, y) = \dfrac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_ {i}\right)^{2}$$
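A minimal NumPy sketch (the function name MSE and the array arguments yHat, y are just illustrative, following the conventions of the Huber snippet below):

import numpy as np

def MSE(yHat, y):
    # mean of the squared residuals (y - yHat)
    return np.mean((y - yHat)**2)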




MAE

The MAE minimizes the sum of absolute differences between the predicted values and the target values.

$$L(\hat{y}, y) = \dfrac{1}{n} \sum_{i=1}^{n}\vert y_{i}-\hat{y}_ {i}\vert$$

Properties:
Robust to outliers; not differentiable at \(0\); the optimal solution may not be unique (see the MSE vs MAE comparison below).
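A minimal NumPy sketch (same illustrative conventions as the MSE sketch above):

import numpy as np

def MAE(yHat, y):
    # mean of the absolute residuals |y - yHat|
    return np.mean(np.abs(y - yHat))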

Huber Loss

AKA: Smooth Mean Absolute Error

$$L(\hat{y}, y) = \left\{\begin{array}{cc}{\frac{1}{2}(y-\hat{y})^{2}} & {\text { if }|y-\hat{y}|<\delta} \\ {\delta\left(|y-\hat{y}|-\frac{1}{2} \delta\right)} & {\text { otherwise }}\end{array}\right.$$

Properties:
Quadratic for small residuals and linear for large ones, combining the sensitivity of MSE near the optimum with the outlier-robustness of MAE; differentiable everywhere; requires choosing the threshold \(\delta\).

Code:

import numpy as np

def Huber(yHat, y, delta=1.):
    # quadratic for |residual| < delta, linear (slope delta) beyond it
    return np.where(np.abs(y-yHat) < delta, .5*(y-yHat)**2, delta*(np.abs(y-yHat)-0.5*delta))


KL-Divergence

$$L(\hat{y}, y) = D_{\mathrm{KL}}(y \,\|\, \hat{y}) = \sum_{i} y_{i} \log{\dfrac{y_{i}}{\hat{y}_ {i}}}$$
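A minimal NumPy sketch, assuming y and yHat are discrete probability distributions (non-negative and summing to \(1\)); the eps guard is an added assumption to avoid \(\log 0\):

import numpy as np

def KL(yHat, y, eps=1e-12):
    # sum_i y_i * log(y_i / yHat_i)
    y, yHat = np.asarray(y, dtype=float), np.asarray(yHat, dtype=float)
    return np.sum(y * np.log((y + eps) / (yHat + eps)))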


Analysis and Discussion

MSE vs MAE:

| MSE | MAE |
| --- | --- |
| Sensitive to outliers | Robust to outliers |
| Differentiable everywhere | Non-differentiable at \(0\) |
| Stable² solutions | Unstable solutions |
| Unique solution | Possibly multiple³ solutions |


Notes


Loss Functions for Classification


\(0-1\) Loss

$$L(\hat{y}, y) = I(\hat{y} \neq y) = \left\{\begin{array}{ll}{0} & {\hat{y}=y} \\ {1} & {\hat{y} \neq y}\end{array}\right.$$
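As a sketch over a batch of hard label predictions (averaging the indicator gives the misclassification rate):

import numpy as np

def zero_one(yHat, y):
    # 1 where the predicted label differs from the target, 0 otherwise (averaged)
    return np.mean(yHat != y)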


MSE

We can write the loss in terms of the margin \(m = y\hat{y}\). Since \(y \in \{-1,1\} \implies y^2 = 1\):

\(L(\hat{y}, y)=(y - \hat{y})^{2}=y^{2}(1-y\hat{y})^{2}=(1-y\hat{y})^{2}=(1-m)^{2}\)

$$L(\hat{y}, y) = (1-y \hat{y})^{2}$$
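A quick numerical check of this identity for labels in \(\{-1, 1\}\) and arbitrary real-valued predictions:

import numpy as np

y = np.array([-1., 1., 1., -1.])
yHat = np.array([-0.5, 2.0, 0.3, 0.8])              # arbitrary real-valued scores
assert np.allclose((y - yHat)**2, (1 - y*yHat)**2)  # holds because y**2 == 1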



Hinge Loss

$$L(\hat{y}, y) = \max (0,1-y \hat{y})=|1-y \hat{y}|_ {+}$$

Properties:
A convex upper bound on the \(0-1\) loss; zero once the margin \(y\hat{y} \geq 1\); not differentiable at \(y\hat{y} = 1\); it is the loss used by SVMs.

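A minimal NumPy sketch (assuming \(y \in \{-1,1\}\) and \(\hat{y}\) a real-valued score):

import numpy as np

def hinge(yHat, y):
    # zero once the margin y*yHat reaches 1, linear penalty below that
    return np.mean(np.maximum(0., 1. - y*yHat))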

Logistic Loss

AKA: Log-Loss, Logarithmic Loss

$$L(\hat{y}, y) = \log{\left(1+e^{-y \hat{y}}\right)}$$


Properties:
Convex and differentiable everywhere; never exactly zero, so even correctly classified points keep contributing a small gradient; it is the loss minimized by logistic regression.
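A minimal NumPy sketch (assuming \(y \in \{-1,1\}\) and \(\hat{y}\) a real-valued score; np.logaddexp(0, x) computes \(\log(1+e^{x})\) stably):

import numpy as np

def logistic_loss(yHat, y):
    # log(1 + exp(-y*yHat)), averaged over the batch
    return np.mean(np.logaddexp(0., -y*yHat))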

Cross-Entropy (Log Loss)

$$L(\hat{y}, y) = -\sum_{i} y_i \log \left(\hat{y}_ {i}\right)$$

Binary Cross-Entropy:

$$L(\hat{y}, y) = -\left[y \log \hat{y}+\left(1-y\right) \log \left(1-\hat{y}\right)\right]$$


Cross-Entropy and Negative-Log-Probability:
The Cross-Entropy is equal to the Negative-Log-Probability (of predicting the true class) in the case that the true distribution that we are trying to match is peaked at a single point and is identically zero everywhere else; this is usually the case in ML when we are using a one-hot encoded vector with one class \(y = [0 \: 0 \: \ldots \: 0 \: 1 \: 0 \: \ldots \: 0]\) peaked at the \(j\)-th position
\(\implies\)

$$L(\hat{y}, y) = -\sum_{i} y_i \log \left(\hat{y}_ {i}\right) = - \log (\hat{y}_ {j})$$
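A small numerical check of this reduction (the probabilities and the true class index are arbitrary examples):

import numpy as np

yHat = np.array([0.1, 0.7, 0.2])   # predicted class probabilities
y = np.array([0., 1., 0.])         # one-hot target, true class j = 1
full = -np.sum(y * np.log(yHat))   # -sum_i y_i * log(yHat_i)
assert np.isclose(full, -np.log(yHat[1]))   # equals -log(yHat_j)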

Cross-Entropy and Log-Loss:
The Cross-Entropy is equal to the Log-Loss in the case of \(0, 1\) classification.

Equivalence of Binary Cross-Entropy and Logistic-Loss:
Given \(p \in\{y, 1-y\}\) and \(q \in\{\hat{y}, 1-\hat{y}\}\):

$$H(p,q)=-\sum_{x }p(x)\,\log q(x) = -y \log \hat{y}-(1-y) \log (1-\hat{y}) = L(\hat{y}, y)$$

Reference (Understanding binary-cross-entropy-log-loss)
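A quick numerical check of this equivalence (a sketch; the sigmoid link \(\hat{y}=\sigma(f)\) and the relabeling from \(\{-1,1\}\) to \(\{0,1\}\) are the assumed redefinitions):

import numpy as np

f = np.array([-2.0, 0.5, 3.0])      # real-valued scores
y_pm = np.array([-1., 1., 1.])      # labels in {-1, 1}
y01 = (y_pm + 1) / 2                # the same labels recoded in {0, 1}
p = 1. / (1. + np.exp(-f))          # sigmoid: predicted P(y = 1)
bce = -(y01*np.log(p) + (1 - y01)*np.log(1 - p))   # binary cross-entropy
logistic = np.log(1. + np.exp(-y_pm*f))            # logistic loss
assert np.allclose(bce, logistic)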

Cross-Entropy as Negative-Log-Likelihood (w/ equal probability outcomes):

Cross-Entropy and KL-Div:
When comparing a distribution \({\displaystyle q}\) against a fixed reference distribution \({\displaystyle p}\), cross entropy and KL divergence are identical up to an additive constant (since \({\displaystyle p}\) is fixed): both take on their minimal values when \({\displaystyle p=q}\), which is \({\displaystyle 0}\) for KL divergence, and \({\displaystyle \mathrm {H} (p)}\) for cross entropy.

Basically, minimizing either will result in the same solution.
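A small numerical check of the identity behind this statement, \(H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)\):

import numpy as np

p = np.array([0.7, 0.2, 0.1])     # fixed reference distribution
q = np.array([0.5, 0.3, 0.2])     # model distribution
H_pq = -np.sum(p * np.log(q))     # cross-entropy
H_p = -np.sum(p * np.log(p))      # entropy of p (a constant w.r.t. q)
D_kl = np.sum(p * np.log(p / q))  # KL divergence
assert np.isclose(H_pq, H_p + D_kl)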

Cross-Entropy VS MSE (& Classification Loss):
Basically, CE > MSE because the gradient of the MSE contains a factor \(z(1-z)\) (the derivative of the sigmoid), which saturates when the output \(z\) of a neuron is near \(0\) or \(1\), making the gradient very small and thus slowing down training.
CE > Class-Loss because the classification loss is binary and doesn’t take into account “how well” we are actually approximating the probabilities, as opposed to just having the target class be slightly higher than the rest (e.g. \([c_1=0.3, c_2=0.3, c_3=0.4]\)).
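A rough sketch of the saturation argument for a single sigmoid neuron \(z=\sigma(a)\) with a very wrong prediction (the specific numbers are illustrative; the MSE gradient is taken w.r.t. the pre-activation \(a\)):

import numpy as np

y = 0.0                                # target
a = 6.0                                # pre-activation of a badly wrong prediction
z = 1. / (1. + np.exp(-a))             # sigmoid output, close to 1
grad_mse = 2*(z - y) * z*(1 - z)       # contains the z*(1-z) factor -> tiny (~0.005)
grad_ce = z - y                        # no saturating factor -> stays large (~1)
print(grad_mse, grad_ce)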


Exponential Loss

$$L(\hat{y}, y) = e^{-\beta y \hat{y}}$$
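A minimal sketch (with \(\beta\) exposed as a parameter):

import numpy as np

def exponential_loss(yHat, y, beta=1.):
    # exp(-beta * y * yHat), averaged over the batch
    return np.mean(np.exp(-beta * y*yHat))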


Perceptron Loss

$${\displaystyle L(\hat{y}_i, y_i) = {\begin{cases}0&{\text{if }}\ y_i\cdot \hat{y}_i \geq 0\\-y_i \hat{y}_i&{\text{otherwise}}\end{cases}}}$$
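A minimal sketch (only misclassified points, i.e. those with \(y_i \hat{y}_i < 0\), contribute):

import numpy as np

def perceptron_loss(yHat, y):
    # -y*yHat where the signs disagree, 0 otherwise (averaged)
    return np.mean(np.where(y*yHat >= 0, 0., -y*yHat))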


Notes



  1. Reference 

  2. Stability 

  3. Reason is that the errors are equally weighted; so, tilting the regression line (within a region) will decrease the distance to a particular point and will increase the distance to other points by the same amount. 

  4. \(f(x) = w^Tx\) in logistic regression 

  5. We have to redefine the indicator/target variable to establish the equality.