Modern neural networks are miscalibrated: they tend to be overconfident, so we cannot interpret their softmax probabilities as reflecting the true probability distribution or as a measure of confidence.
Miscalibration: the discrepancy between model confidence and model accuracy.
Calibration implies that if a model predicts with \(80\%\) confidence on 100 images, then about \(80\) of them should be classified correctly and the other \(20\) incorrectly.
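As a sanity check on this claim, you can bin predictions by confidence and compare each bin's average confidence with its accuracy. A minimal sketch with synthetic data (the `bin_accuracy` helper and the simulated predictions are illustrative, not from any library):

```python
import numpy as np

def bin_accuracy(confidences, correct, low, high):
    """Accuracy of the predictions whose confidence falls in [low, high)."""
    mask = (confidences >= low) & (confidences < high)
    return correct[mask].mean() if mask.any() else float("nan")

# Simulate 100 predictions, each made with 0.8 confidence, from a
# perfectly calibrated model: each one is correct with probability 0.8.
rng = np.random.default_rng(0)
conf = np.full(100, 0.8)
correct = rng.random(100) < 0.8

# For a calibrated model this should come out close to 0.8.
print(bin_accuracy(conf, correct, 0.75, 0.85))
```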

Model Confidence: probability of correctness.
Calibrated Confidence (softmax scores) \(\hat{p}\): \(\hat{p}\) represents a true probability.


Probability Calibration:
Predicted scores (model outputs) of many classifiers do not represent “true” probabilities.
They only satisfy the mathematical conditions that define a probability function:
- Each “probability” is between 0 and 1
- The probabilities of an observation belonging to each of the possible classes sum to 1.
-
Calibration Curves: A calibration curve plots the predicted probabilities against the actual rate of occurrence.
i.e., it plots the predicted probabilities against the actual (empirical) probabilities.
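A minimal sketch using sklearn's `calibration_curve` (the synthetic scores below are illustrative; the labels are drawn so that the model is calibrated by construction):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
y_prob = rng.random(1000)                         # predicted probabilities
y_true = (rng.random(1000) < y_prob).astype(int)  # labels consistent with those probabilities

# frac_pos: actual rate of occurrence per bin; mean_pred: average predicted
# probability per bin. A calibrated model gives points near the diagonal.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```

Plotting `mean_pred` against `frac_pos` (e.g. with matplotlib) gives the calibration curve; the diagonal is perfect calibration.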

-
Approach:
Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier (as given by decision_function or predict_proba in sklearn) to a calibrated probability in \([0, 1]\).
Denoting the output of the classifier for a given sample by \(f_i\), the calibrator tries to predict \(p\left(y_i=1 \mid f_i\right)\).
- Methods:
-
Platt Scaling: Platt scaling fits a logistic regression on the original model’s output scores.
The closer the calibration curve is to a sigmoid, the more effective the scaling will be in correcting the model.
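A minimal sketch of Platt scaling done by hand with sklearn (the `LinearSVC` base model and the data split are illustrative assumptions): fit a logistic regression on the uncalibrated model's scores \(f_i\), using data held out from classifier training:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)

svm = LinearSVC().fit(X_train, y_train)               # uncalibrated classifier
scores = svm.decision_function(X_cal).reshape(-1, 1)  # raw scores f_i

# Platt scaling: p(y=1 | f_i) = sigmoid(a * f_i + b), fit by logistic regression
platt = LogisticRegression().fit(scores, y_cal)
calibrated = platt.predict_proba(scores)[:, 1]        # calibrated probabilities
```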

-
Assumptions:
The sigmoid method assumes the calibration curve can be corrected by applying a sigmoid function to the raw predictions.
This assumption has been empirically justified in the case of Support Vector Machines with common kernel functions on various benchmark datasets but does not necessarily hold in general.
-
Limitations:
- The logistic model works best if the calibration error is symmetrical, meaning the classifier output for each binary class is normally distributed with the same variance.
This can be a problem for highly imbalanced classification problems, where outputs do not have equal variance.
-
Isotonic Method: The ‘isotonic’ method fits a non-parametric isotonic regressor, which outputs a step-wise non-decreasing function.

This method is more general when compared to ‘sigmoid’ as the only restriction is that the mapping function is monotonically increasing. It is thus more powerful as it can correct any monotonic distortion of the un-calibrated model. However, it is more prone to overfitting, especially on small datasets.
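A minimal sketch with sklearn's `IsotonicRegression` on toy data (the quadratic distortion below is illustrative): the fitted mapping is stepwise and non-decreasing, so it can undo any monotonic distortion:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.random(200)                       # uncalibrated scores
y = (rng.random(200) < scores**2).astype(int)  # monotonic, non-sigmoid distortion

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrated = iso.fit_transform(scores, y)

# Sorting by score reveals the stepwise non-decreasing calibrated output.
order = np.argsort(scores)
```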
-
Comparison:
- Platt Scaling is most effective when the un-calibrated model is under-confident and has similar calibration errors for both high and low outputs.
- Isotonic Method is more powerful than Platt Scaling: Overall, ‘isotonic’ will perform as well as or better than ‘sigmoid’ when there is enough data (greater than ~ 1000 samples) to avoid overfitting.
-
Limitations of recalibration:
Different calibration methods have different weaknesses depending on the shape of the calibration curve.
E.g. Platt Scaling works better the more the calibration curve resembles a sigmoid.
- Multi-Class Support:
Note: The samples that are used to fit the calibrator should not be the same samples used to fit the classifier, as this would introduce bias. This is because performance of the classifier on its training data would be better than for novel data. Using the classifier output of training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should.
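This is what sklearn's `CalibratedClassifierCV` handles via its `cv` argument (the `LinearSVC` base model below is an illustrative choice): each calibrator is fit on a held-out fold, never on the samples its classifier was trained on.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, random_state=0)

# cv=3: the classifier is fit on two folds and the calibrator (here Platt
# scaling via method="sigmoid") on the remaining held-out fold.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X[:5])   # calibrated probabilities, rows sum to 1
```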
- On Calibration of Modern Neural Networks
Paper that defines the problem and gives multiple effective solutions for calibrating Neural Networks.
- Calibration of Convolutional Neural Networks (Thesis!)
- For calibrating output probabilities in deep nets, temperature scaling outperforms Platt scaling. paper
- Plot and Explanation
- Blog on How to do it
- Interpreting outputs of a logistic classifier (Blog)
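The temperature scaling mentioned above can be sketched in a few lines (the logits and the value of \(T\) here are made up; in practice \(T\) is fit on a validation set by minimizing negative log-likelihood): dividing logits by a scalar \(T > 1\) softens an overconfident softmax without changing the predicted class.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[8.0, 2.0, 1.0]])  # overconfident raw network outputs
T = 2.5                               # temperature > 1 softens the distribution

print(softmax(logits).max())      # close to 1 (overconfident)
print(softmax(logits / T).max())  # noticeably smaller, same argmax
```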