Data Snooping:
The Principle:
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
Analysis:
- Making decisions by examining the dataset makes you part of the learning algorithm.
However, you did not account for your own contribution to the learning algorithm when doing, e.g., the VC analysis for generalization.
- Thus, you are vulnerable to designing the model (or making learning choices) according to the idiosyncrasies of the dataset.
- The real problem is that you are not “charging” for the decision you made by examining the dataset.
What’s allowed?
- You are allowed (even encouraged) to look at all other information related to the target function and input space.
e.g. the number/range/dimension/scale of the inputs, correlations, properties (e.g. monotonicity), etc.
- EXCEPT for the specific realization of the training dataset.
Manifestations of Data Snooping, with Examples (one example per manifestation):
- Changing the Parameters of the model (Tricky):
- Complexity:
Decreasing the order of the fitting polynomial by observing geometric properties of the training set.
- Using statistics of the Entire Dataset (Tricky):
- Normalization:
Normalizing the data with the mean and variance of the entire dataset (training + testing).
- E.g. in financial forecasting, the full-sample average affects the outcome by exposing the trend (see the sketch at the end of this section).
- Reuse of a Dataset:
If you keep trying one model after another on the same dataset, you will eventually ‘succeed’.
“If you torture the data long enough, it will confess”.
This is bad because the final model you selected is effectively the union of all the models you tried, since some of those models were rejected by you (acting as part of the learning algorithm).
- Fixed (deterministic) training set for Model Selection:
Selecting a model by trying many models on the same fixed (deterministic) Training dataset.
- Bias via Snooping:
By looking at data from the future that you are not allowed to have (it wouldn’t have been available at the time), you are creating sampling bias caused by “snooping”.
- E.g. testing a trading algorithm using the companies currently traded (in the S&P 500):
You shouldn’t have been able to know which companies would be currently traded (that is information from the future).
Remedies/Solutions to Data Snooping:
- Avoid Data Snooping:
A strict discipline (very hard to maintain).
- Account for Data Snooping:
By quantifying how much data contamination has occurred.
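To make the normalization manifestation above concrete, here is a minimal sketch (assuming scikit-learn and a synthetic regression dataset; all names are illustrative) of computing normalization statistics from the training split only, so nothing about the test set leaks into learning:

```python
# Minimal sketch: avoid snooping by fitting the scaler on the training split only.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # mean/variance from the training data only
X_train_n = scaler.transform(X_train)
X_test_n = scaler.transform(X_test)      # the test set is transformed, never fitted on

# The snooping version would be StandardScaler().fit(X) on the full dataset,
# which leaks the test set's mean and variance into the learning process.
```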
Mismatched Data:
Mismatched Classes:
Sampling Bias:
Sampling Bias occurs when: \(\exists\) a region with zero probability (\(P=0\)) in training, but with positive probability (\(P>0\)) in testing.
The Principle:
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
Example: the 1948 Presidential Election
- A newspaper conducted a telephone poll for the race between Dewey and Truman.
- Dewey won the poll decisively.
- The result was NOT unlucky:
No matter how many times the poll was re-conducted, and no matter how much the sample size was increased, the outcome would have stayed the same.
- The reason is the telephone:
(1) Telephones were expensive, so only wealthy people had them.
(2) Wealthy people favored Dewey.
Thus, the result accurately reflected the (sub)population that was actually being sampled.
How to sample:
Sample in a way that matches the distributions of the training and test samples (a minimal reweighting sketch is given below).
The solution fails (doesn’t work) if:
\(\exists\) a region with zero probability (\(P=0\)) in training, but with positive probability (\(P>0\)) in testing.
This is exactly when sampling bias exists.
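One standard way to (re)match the training distribution to the test distribution is importance weighting. A minimal sketch, assuming the train and test input densities are known Gaussians purely for illustration (in practice they would have to be estimated):

```python
# Minimal sketch: importance weighting to match a biased training distribution
# to the test distribution (covariate shift). The Gaussian densities are assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training inputs come from N(0, 1); at test time inputs come from N(1, 1).
x_train = rng.normal(0.0, 1.0, size=1000)
y_train = (x_train + rng.normal(0.0, 0.5, size=1000) > 0.5).astype(int)

# Importance weight w(x) = p_test(x) / p_train(x); this requires p_train(x) > 0
# wherever p_test(x) > 0 -- exactly the condition under which reweighting can work.
w = norm.pdf(x_train, loc=1.0, scale=1.0) / norm.pdf(x_train, loc=0.0, scale=1.0)

clf = LogisticRegression()
clf.fit(x_train.reshape(-1, 1), y_train, sample_weight=w)
```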
Notes:
- Medical sources sometimes refer to sampling bias as ascertainment bias.
- Sampling bias could be viewed as a subtype of selection bias.
Model Uncertainty:
Interpreting Softmax Output Probabilities:
Softmax outputs only measure Aleatoric Uncertainty.
This is analogous to regression, where a NN with two outputs, one parameterising the mean and one the variance of a Gaussian, can capture aleatoric uncertainty even though the model itself is deterministic.
Bayesian NNs (dropout included) aim to capture epistemic (aka model) uncertainty.
Dropout for Measuring Model (epistemic) Uncertainty:
Dropout can give us principled uncertainty estimates.
Principled in the sense that the uncertainty estimates approximate those of a Gaussian process (a minimal MC-dropout sketch is given below).
Theoretical Motivation: training a dropout neural network is approximately equivalent to variational inference in a (deep) Gaussian process.
Interpretations of Dropout:
- Dropout is just a diagonal noise matrix with the diagonal elements set to either 0 or 1.
- What My Deep Model Doesn’t Know (Blog! - Yarin Gal)
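A minimal sketch of MC dropout, assuming a PyTorch model that contains nn.Dropout layers (the helper name and number of passes are illustrative): keep dropout active at test time, run several stochastic forward passes, and use the spread across passes as an estimate of epistemic uncertainty.

```python
# Minimal sketch: MC dropout at test time (assumes a model with nn.Dropout layers).
import torch

def mc_dropout_predict(model, x, n_passes=50):
    """Return the mean and variance of the softmax outputs over stochastic passes."""
    model.eval()
    # Re-enable dropout layers only (e.g. batch-norm stays in eval mode).
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_passes)])  # (n_passes, batch, classes)
    return preds.mean(dim=0), preds.var(dim=0)  # mean ~ prediction, var ~ epistemic spread
```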
Probability Calibration:
Modern NNs are miscalibrated (not well-calibrated): they tend to be over-confident. We cannot interpret the softmax probabilities as reflecting the true probability distribution or as a measure of confidence.
Miscalibration: the discrepancy between model confidence and model accuracy.
E.g. if a well-calibrated model assigns \(80\%\) confidence to 100 images, then about \(80\) of them should be classified correctly and the other \(20\) incorrectly.
Model Confidence: probability of correctness.
Calibrated Confidence (softmax scores) \(\hat{p}\): \(\hat{p}\) represents a true probability.
Probability Calibration:
Predicted scores (model outputs) of many classifiers do not represent “true” probabilities.
They only respect the mathematical definition (conditions) of what a probability function is:
- Each “probability” is between 0 and 1.
- The probabilities of an observation belonging to the different classes sum to 1.
- Calibration Curves:
A calibration curve plots the predicted probabilities against the actual rate of occurrence, i.e. the predicted probabilities against the empirical (actual) probabilities.
- Approach:
Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier (as given by decision_function or predict_proba in sklearn) to a calibrated probability in \([0, 1]\).
Denoting the output of the classifier for a given sample by \(f_i\), the calibrator tries to predict \(p\left(y_i=1 \mid f_i\right)\) (a minimal sklearn sketch is given after the methods below).
- Methods:
- Platt Scaling:
Platt scaling basically fits a logistic regression on the original model’s outputs (scores).
The closer the calibration curve is to a sigmoid, the more effective the scaling will be in correcting the model.
- Assumptions:
The sigmoid method assumes the calibration curve can be corrected by applying a sigmoid function to the raw predictions.
This assumption has been empirically justified in the case of Support Vector Machines with common kernel functions on various benchmark datasets but does not necessarily hold in general.
- Limitations:
- The logistic model works best if the calibration error is symmetrical, meaning the classifier output for each binary class is normally distributed with the same variance.
This can be a problem for highly imbalanced classification problems, where outputs do not have equal variance.
- Isotonic Method:
The ‘isotonic’ method fits a non-parametric isotonic regressor, which outputs a step-wise non-decreasing function.
This method is more general than ‘sigmoid’, as the only restriction is that the mapping function is monotonically increasing. It is thus more powerful, as it can correct any monotonic distortion of the un-calibrated model. However, it is more prone to overfitting, especially on small datasets.
- Comparison:
- Platt Scaling is most effective when the un-calibrated model is under-confident and has similar calibration errors for both high and low outputs.
- Isotonic Method is more powerful than Platt Scaling: Overall, ‘isotonic’ will perform as well as or better than ‘sigmoid’ when there is enough data (greater than ~ 1000 samples) to avoid overfitting.
- Limitations of recalibration:
Different calibration methods have different weaknesses depending on the shape of the calibration curve.
E.g. Platt Scaling works better the more the calibration curve resembles a sigmoid.
- Multi-Class Support:
Note: The samples that are used to fit the calibrator should not be the same samples used to fit the classifier, as this would introduce bias. This is because performance of the classifier on its training data would be better than for novel data. Using the classifier output of training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should.
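A minimal sketch of the approach described above, using scikit-learn (the base classifier, split sizes, and method are illustrative assumptions): the calibrator is fit on a held-out split, per the note above, and the effect is inspected with a calibration curve.

```python
# Minimal sketch: calibrate a classifier on held-out data and inspect the result.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# Fit the calibrator on samples NOT used to train the classifier (cv="prefit").
calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
calibrated.fit(X_calib, y_calib)

# Calibration curves: mean predicted probability vs. empirical fraction of positives.
prob_raw = clf.predict_proba(X_test)[:, 1]
prob_cal = calibrated.predict_proba(X_test)[:, 1]
frac_pos_raw, mean_pred_raw = calibration_curve(y_test, prob_raw, n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, prob_cal, n_bins=10)
```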
- On Calibration of Modern Neural Networks
Paper that defines the problem and gives multiple effective solutions for calibrating Neural Networks.
- Calibration of Convolutional Neural Networks (Thesis!)
- For calibrating output probabilities in Deep Nets, Temperature Scaling outperforms Platt scaling (paper; a minimal sketch is given after the links below).
- Plot and Explanation
- Blog on How to do it
- Interpreting outputs of a logistic classifier (Blog)
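A minimal sketch of temperature scaling as mentioned in the links above, assuming PyTorch and pre-computed validation logits/labels (the function and variable names are illustrative): a single scalar temperature T is learned by minimizing the NLL on held-out data, and calibrated probabilities are softmax(logits / T).

```python
# Minimal sketch: temperature scaling (learn one scalar T on validation data).
import torch

def fit_temperature(val_logits, val_labels, max_iter=100):
    """val_logits: (N, C) raw network outputs; val_labels: (N,) class indices."""
    log_t = torch.zeros(1, requires_grad=True)        # optimize log(T) so T stays > 0
    optimizer = torch.optim.LBFGS([log_t], max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels)
#        calibrated_probs = torch.softmax(test_logits / T, dim=-1)
```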
- Debugging Strategies for Deep ML Models:
- Visualize the model in action:
Directly observing qualitative results of a model (e.g. located objects, generated speech) can help avoid evaluation bugs or misleading evaluation results. It can also help guide the expected quantitative performance of the model.
- Visualize the worst mistakes:
By viewing the training-set examples that are hardest to model correctly, using a confidence measure (e.g. softmax probabilities), one can often discover problems with the way the data have been preprocessed or labeled.
- Reason about Software using Training and Test Error:
It is hard to determine whether the underlying software is correctly implemented.
We can use the training/test errors to help guide us:
- If training error is low but test error is high, then:
- it is likely that the training procedure works correctly, and the model is overfitting for fundamental algorithmic reasons;
- or that the test error is measured incorrectly because of a problem with saving the model after training then reloading it for test set evaluation, or because the test data was prepared differently from the training data.
- If both training and test errors are high, then:
it is difficult to determine whether there is a software defect or whether the model is underfitting due to fundamental algorithmic reasons.
This scenario requires further tests, described next.
- Fit a Tiny Dataset:
If you have high error on the training set, determine whether it is due to genuine underfitting or due to a software defect.
Usually even small models can be guaranteed to be able to fit a sufficiently small dataset. For example, a classification dataset with only one example can be fit just by setting the biases of the output layer correctly (a minimal overfit-a-tiny-batch sketch is given at the end of this list).
This test can be extended to a small dataset with a few examples.
- Monitor histograms of Activations and Gradients:
It is often useful to visualize statistics of neural network activations and gradients, collected over a large amount of training iterations (maybe one epoch).
The preactivation value of hidden units can tell us if the units saturate, or how often they do.
For example, for rectifiers, how often are they off? Are there units that are always off?
For tanh units, the average of the absolute value of the preactivations tells us how saturated the unit is.
In a deep network where the propagated gradients quickly grow or quickly vanish, optimization may be hampered.
Finally, it is useful to compare the magnitude of parameter gradients to the magnitude of the parameters themselves. As suggested by Bottou (2015), we would like the magnitude of parameter updates over a minibatch to represent something like 1 percent of the magnitude of the parameter, not 50 percent or 0.001 percent (which would make the parameters move too slowly). It may be that some groups of parameters are moving at a good pace while others are stalled. When the data is sparse (as in natural language), some parameters may be very rarely updated, and this should be kept in mind when monitoring their evolution (a minimal update-to-parameter ratio sketch is given at the end of this list).
- Finally, many deep learning algorithms provide some sort of guarantee about the results produced at each step.
For example, in part III, we will see some approximate inference algorithms that work by using algebraic solutions to optimization problems.
Typically these can be debugged by testing each of their guarantees. Some guarantees that optimization algorithms offer include: the objective function will never increase after one step of the algorithm; the gradient with respect to some subset of variables will be zero after each step of the algorithm; and the gradient with respect to all variables will be zero at convergence. Due to rounding error, these conditions will not hold exactly on a digital computer, so the debugging test should include some tolerance parameter.
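A minimal sketch of the "fit a tiny dataset" check above, assuming a small PyTorch classifier (model, sizes, and iteration count are illustrative): repeatedly train on one tiny fixed batch and verify that the loss can be driven close to zero.

```python
# Minimal sketch: overfit a single tiny batch to rule out software defects.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# One tiny, fixed batch (4 examples, 10 features, 3 classes).
x = torch.randn(4, 10)
y = torch.randint(0, 3, (4,))

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# If the final loss is not near zero, suspect a bug (wrong labels/shapes,
# broken loss, bad learning rate, or gradients that never reach the weights).
print(f"final loss: {loss.item():.4f}")
```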
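And a minimal sketch of the update-to-parameter magnitude check from the "monitor histograms of activations and gradients" item, assuming a PyTorch model trained with plain SGD (the helper name is illustrative); the ~1 percent target is the Bottou heuristic quoted above.

```python
# Minimal sketch: monitor ||update|| / ||parameter|| per tensor for one SGD step.
import torch

def update_ratios(model, lr):
    """For plain SGD the update is -lr * grad, so the ratio is lr*||grad|| / ||param||.
    Values around 1e-2 (~1 percent) are a reasonable target; ~1e-5 means the
    parameter is barely moving, ~0.5 means it is moving too fast."""
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            ratios[name] = (lr * p.grad.norm() / (p.norm() + 1e-12)).item()
    return ratios

# Usage (after loss.backward(), before optimizer.step()):
# print(update_ratios(model, lr=1e-2))
```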
- The Machine Learning Algorithm Recipe:
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe, in both supervised and unsupervised settings:
- A combination of:
- A specification of a dataset
- A cost function
- An optimization procedure
- A model
- Ex: Linear Regression
- A specification of a dataset:
The dataset consists of \(X\) and \(y\).
- A cost function:
\(J(\boldsymbol{w}, b)=-\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text {data }}} \log p_{\text {model }}(y | \boldsymbol{x})\)
- An optimization procedure:
In most cases, the optimization algorithm is defined by solving for where the gradient of the cost is zero, using the normal equations (a minimal sketch is given at the end of this list).
- A model:
The Model Specification is:
\(p_{\text {model}}(y \vert \boldsymbol{x})=\mathcal{N}\left(y ; \boldsymbol{x}^{\top} \boldsymbol{w}+b, 1\right)\)
- Ex: PCA
- A specification of a dataset:
\(X\)
- A cost function:
\(J(\boldsymbol{w})=\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text {data }}}\|\boldsymbol{x}-r(\boldsymbol{x} ; \boldsymbol{w})\|_{2}^{2}\)
- An optimization procedure:
Constrained convex optimization or gradient descent.
- A model:
Defined to have \(\boldsymbol{w}\) with norm \(1\) and reconstruction function \(r(\boldsymbol{x})=\boldsymbol{w}^{\top} \boldsymbol{x} \boldsymbol{w}\).
- Specification of a Dataset:
Could be labeled (supervised) or unlabeled (unsupervised).
- Cost Function:
The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.
- Optimization Procedure:
Could be closed-form or iterative or special-case.
If the cost function does not allow for a closed-form solution (e.g. if the model is specified as non-linear), then we usually need an iterative optimization algorithm, e.g. gradient descent.
If the cost cannot be computed exactly for computational reasons, then we can approximately minimize it with an iterative numerical optimizer, as long as we have some way of approximating its gradients.
- Model:
Could be linear or non-linear.
If a machine learning algorithm seems especially unique or hand designed, it can usually be understood as using a special-case optimizer.
Some models, such as decision trees and k-means, require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers.
Recognizing that most machine learning algorithms can be described using this recipe helps us see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
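A minimal sketch of the linear-regression instance of the recipe above (synthetic data; names are illustrative): the model is a Gaussian with unit variance around a linear predictor, the cost is the negative log-likelihood (equivalently the MSE), and the optimizer solves for where the gradient is zero via the normal equations.

```python
# Minimal sketch: the "recipe" instantiated for linear regression.
import numpy as np

rng = np.random.default_rng(0)

# 1) Dataset specification: X and y (synthetic here).
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 1.0 + rng.normal(scale=0.1, size=100)

# 2) Model: p(y | x) = N(y; x^T w + b, 1)  ->  linear predictor with a bias term.
Xb = np.hstack([X, np.ones((100, 1))])        # append a column of ones for b

# 3) Cost: negative log-likelihood, which for a fixed-variance Gaussian
#    reduces to the mean squared error.
# 4) Optimization: set the gradient to zero -> normal equations.
w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # [w; b]

mse = np.mean((Xb @ w_hat - y) ** 2)
print(w_hat, mse)
```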
Recall is more important where Overlooked Cases (False Negatives) are more costly than False Alarms (False Positives). The focus in these problems is on finding the positive cases.
Precision is more important where False Alarms (False Positives) are more costly than Overlooked Cases (False Negatives). The focus in these problems is on weeding out the negative cases.
ROC Curve and AUC:
Note:
- ROC Curve only cares about the ordering of the scores, not the values.
- Probability Calibration and ROC: Calibration doesn’t change the order of the scores, it only rescales them to better match the true probabilities; since the ROC curve only cares about the ordering of the scores, calibration leaves the ROC curve (and AUC) unchanged.
- AUC: The AUC is also the probability that a randomly selected positive example has a higher score than a randomly selected negative example (a quick numerical check is given below).
- ROC in Radiology (Paper)
Includes discussion for Partial AUC when only a portion of the entire ROC curve needs to be considered.
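A quick numerical check of the probabilistic interpretation of AUC noted above (random synthetic scores; names are illustrative), comparing the pairwise win-rate of positives over negatives with sklearn's roc_auc_score:

```python
# Minimal sketch: AUC equals P(score of random positive > score of random negative),
# with ties counted as 1/2.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = rng.normal(size=500) + y          # positives tend to score higher

pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(pairwise, roc_auc_score(y, scores))  # the two values agree
```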
- Limited Training Data: