Table of Contents



  1. Data Snooping:

    The Principle:
    If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

    Analysis:

    • Making decisions by examining the dataset makes you part of the learning algorithm.
      However, this contribution is not accounted for when you analyze generalization (e.g. in the VC analysis).
    • Thus, you are vulnerable to designing the model (or the learning choices) around the idiosyncrasies of the dataset.
    • The real problem is that you are not “charging” for the decisions you make by examining the dataset.

    What’s allowed?

    • You are allowed (even encouraged) to look at all other information related to the target function and input space.
      e.g. number/range/dimension/scale/etc. of the inputs, correlations, properties (monotonicity), etc.
    • EXCEPT for the specific realization of the training dataset.

    Manifestations of Data Snooping, with one example per manifestation:

    • Changing the Parameters of the model (Tricky):
      • Complexity:
        Decreasing the order of the fitting polynomial by observing geometric properties of the training set.
    • Using statistics of the Entire Dataset (Tricky):
      • Normalization:
        Normalizing the data with the mean and variance of the entire dataset (training + testing).
        • E.g. in financial forecasting, the full-dataset average leaks the overall trend (which includes the test period) into training; see the sketch after this list.
    • Reuse of a Dataset:
      If you keep trying one model after another on the same data set, you will eventually ‘succeed’.
      “If you torture the data long enough, it will confess”.
      This is bad because the final model you select is effectively the union of all the models you tried: the models you rejected along the way were rejected by you, acting as part of the learning algorithm.
      • Fixed (deterministic) training set for Model Selection:
        Selecting a model by trying many models on the same fixed (deterministic) Training dataset.
    • Bias via Snooping:
      By using data that would not have been available at the time (i.e. looking into the future), you create a sampling bias caused by “snooping”.
      • E.g. Testing a Trading algorithm using the currently traded companies (in S&P500).
        At the time, you could not have known which companies would still be traded today (that is future information), so the sample is biased toward survivors.
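
    A minimal sketch of the normalization pitfall above (the data and split are made up; scikit-learn’s StandardScaler is used purely for illustration): the snooped version fits the scaler on the entire dataset, while the correct version fits it on the training split only.

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Illustrative data: 1000 samples, 5 features.
    X = np.random.randn(1000, 5)
    y = (X[:, 0] + 0.5 * np.random.randn(1000) > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Snooped: statistics computed on the ENTIRE dataset leak test-set
    # information (e.g. the overall mean/trend) into the training inputs.
    scaler_snooped = StandardScaler().fit(X)            # fit on train + test
    X_train_snooped = scaler_snooped.transform(X_train)

    # Correct: fit the scaler on the training split only, then apply the
    # frozen statistics to the test split.
    scaler = StandardScaler().fit(X_train)
    X_train_clean = scaler.transform(X_train)
    X_test_clean = scaler.transform(X_test)
    ```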

    Remedies/Solutions to Data Snooping:

    1. Avoid Data Snooping:
      A strict discipline (very hard).
    2. Account for Data Snooping:
      By quantifying how much data contamination has taken place.


  2. Mismatched Data:

  3. Mismatched Classes:

  4. Sampling Bias:
    Sampling Bias occurs when: \(\exists\) a region with zero probability \(P=0\) under the training distribution, but with positive probability \(P>0\) under the testing distribution.

    The Principle:
    If the data is sampled in a biased way, learning will produce a similarly biased outcome.

    Example: 1948 Presidential Elections

    • A newspaper conducted a telephone poll for the election between Dewey and Truman.
    • Dewey won the poll decisively.
    • The result was NOT just bad luck:
      No matter how many times the poll was re-conducted, and no matter how much the sample size was increased, the outcome would stay the same.
    • The reason is the Telephone:
      (1) Telephones were expensive at the time, so mostly affluent people had them.
      (2) Affluent voters favored Dewey.
      Thus, the result faithfully reflected the (mini) population that was actually sampled, not the voting population.

    How to sample:
    Sample in a way that matches the distributions of train and test samples.

    The solution fails (doesn’t work) if:
    \(\exists\) a region with zero probability \(P=0\) under the training distribution, but with positive probability \(P>0\) under the testing distribution.

    This is exactly when sampling bias exists.
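
    When the training distribution does cover the test distribution, one way to “sample in a way that matches the distributions” is to reweight the training points toward the test-input distribution. A toy importance-weighting sketch follows; the Gaussian densities are assumed known here purely for illustration (in practice they would have to be estimated), and the weights are undefined exactly in the failure case above, where a region has \(P=0\) in training but \(P>0\) in testing.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Training inputs drawn from a biased distribution, test inputs from the
    # true population (both 1-D Gaussians here, purely for illustration).
    x_train = rng.normal(loc=-1.0, scale=1.0, size=5000)
    x_test = rng.normal(loc=0.0, scale=1.0, size=5000)

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Importance weights w(x) = p_test(x) / p_train(x). The ratio is undefined
    # wherever p_train(x) = 0 but p_test(x) > 0, which is the failure case.
    w = gaussian_pdf(x_train, 0.0, 1.0) / gaussian_pdf(x_train, -1.0, 1.0)

    # A weighted training average now estimates a test-distribution quantity:
    # the raw training mean is ~ -1, the reweighted mean is ~ 0 (the test mean).
    print(x_train.mean(), np.average(x_train, weights=w))
    ```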

    Notes:

    • Medical sources sometimes refer to sampling bias as ascertainment bias.
    • Sampling bias could be viewed as a subtype of selection bias.

  5. Model Uncertainty:

    Interpreting Softmax Output Probabilities:
    Softmax outputs only measure Aleatoric Uncertainty.
    This is analogous to regression: a NN with two outputs, one parameterising the mean and one the variance of a Gaussian, can capture aleatoric uncertainty even though the model itself is deterministic (see the sketch below).
    Bayesian NNs (dropout included) aim to capture epistemic (aka model) uncertainty.
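
    As a concrete version of the regression analogy, here is a minimal PyTorch-style sketch (the architecture and names are illustrative, not from the source): a deterministic network outputs a Gaussian mean and log-variance, and minimizing the Gaussian negative log-likelihood lets it capture aleatoric uncertainty.

    ```python
    import torch
    import torch.nn as nn

    class MeanVarianceNet(nn.Module):
        """Deterministic regressor that outputs a Gaussian mean and log-variance."""
        def __init__(self, in_dim, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.mean_head = nn.Linear(hidden, 1)
            self.log_var_head = nn.Linear(hidden, 1)  # log-variance for numerical stability

        def forward(self, x):
            h = self.body(x)
            return self.mean_head(h), self.log_var_head(h)

    def gaussian_nll(mean, log_var, y):
        # Negative log-likelihood of y under N(mean, exp(log_var)), constant dropped.
        return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()

    # Illustrative usage on random data.
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    model = MeanVarianceNet(10)
    mean, log_var = model(x)
    loss = gaussian_nll(mean, log_var, y)
    loss.backward()
    ```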

    Dropout for Measuring Model (epistemic) Uncertainty:
    Dropout can give us principled uncertainty estimates.
    Principled in the sense that the uncertainty estimates approximate those of a Gaussian process.

    Theoretical Motivation: training a neural network with dropout can be cast as approximate (variational) inference in a (deep) Gaussian process.
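
    A minimal sketch of Monte Carlo dropout at test time (the model, dropout rate, and number of samples are illustrative): dropout is kept active during inference, and the spread across stochastic forward passes is read as an estimate of epistemic uncertainty.

    ```python
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1)
    )

    def mc_dropout_predict(model, x, n_samples=100):
        model.train()  # keep dropout active at inference time
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(n_samples)])
        # The mean is the prediction; the variance across passes approximates
        # epistemic (model) uncertainty.
        return preds.mean(dim=0), preds.var(dim=0)

    x = torch.randn(5, 10)
    mean, var = mc_dropout_predict(model, x)
    ```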
    Interpretations of Dropout:

  6. Probability Calibration:
    Modern NNs are miscalibrated, i.e. not well-calibrated: they tend to be over-confident. We cannot interpret the softmax probabilities as reflecting the true probability distribution or as a measure of confidence.

    Miscalibration: the discrepancy between model confidence and model accuracy.
    For a calibrated model, if it gives \(80\%\) confidence on 100 images, then about \(80\) of them should be classified correctly and the other \(20\) incorrectly (checked numerically in the sketch below).

    Model Confidence: probability of correctness.
    Calibrated Confidence (softmax scores) \(\hat{p}\): \(\hat{p}\) represents a true probability.
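
    The “80 out of 100” intuition can be checked numerically by binning predictions by confidence and comparing average confidence to accuracy in each bin; a small expected-calibration-error (ECE) sketch with made-up numbers:

    ```python
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """confidences: predicted max-probabilities; correct: 0/1 correctness."""
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(confidences[mask].mean() - correct[mask].mean())
                ece += mask.mean() * gap  # weight the gap by the bin's share of samples
        return ece

    # Illustrative: an over-confident model claims ~0.9 but is right ~70% of the time.
    conf = np.full(1000, 0.9)
    correct = (np.random.rand(1000) < 0.7).astype(float)
    print(expected_calibration_error(conf, correct))  # roughly 0.2
    ```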

    Probability Calibration:
    Predicted scores (model outputs) of many classifiers do not represent “true” probabilities.
    They only respect the mathematical definition (conditions) of what a probability function is:

    1. Each “probability” is between 0 and 1
    2. When you sum the probabilities of an observation being in any particular class, they sum to 1.
    • Calibration Curves: A calibration curve plots the predicted probabilities against the actual rate of occurrence,
      i.e. the predicted probabilities against the empirical (actual) probabilities (see the sketch at the end of this section).

    • Approach:
      Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier (as given by decision_function or predict_proba - sklearn) to a calibrated probability in \([0, 1]\).
      Denoting the output of the classifier for a given sample by \(f_i\), the calibrator tries to predict \(p\left(y_i=1 \mid f_i\right)\).

    • Methods:
      • Platt Scaling: Platt scaling fits a logistic regression on the original model’s output scores.
        The closer the calibration curve is to a sigmoid, the more effective the scaling will be in correcting the model.

        • Assumptions:
          The sigmoid method assumes the calibration curve can be corrected by applying a sigmoid function to the raw predictions.
          This assumption has been empirically justified in the case of Support Vector Machines with common kernel functions on various benchmark datasets but does not necessarily hold in general.

        • Limitations:

          • The logistic model works best if the calibration error is symmetrical, meaning the classifier output for each binary class is normally distributed with the same variance.
            This can be a problem for highly imbalanced classification problems, where outputs do not have equal variance.
      • Isotonic Method: The ‘isotonic’ method fits a non-parametric isotonic regressor, which outputs a step-wise non-decreasing function.

        This method is more general when compared to ‘sigmoid’ as the only restriction is that the mapping function is monotonically increasing. It is thus more powerful as it can correct any monotonic distortion of the un-calibrated model. However, it is more prone to overfitting, especially on small datasets.

      • Comparison:

        • Platt Scaling is most effective when the un-calibrated model is under-confident and has similar calibration errors for both high and low outputs.
        • Isotonic Method is more powerful than Platt Scaling: Overall, ‘isotonic’ will perform as well as or better than ‘sigmoid’ when there is enough data (greater than ~ 1000 samples) to avoid overfitting.
    • Limitations of recalibration:
      Different calibration methods have different weaknesses depending on the shape of the calibration curve.
      E.g. Platt Scaling works better the more the calibration curve resembles a sigmoid.

    • Multi-Class Support:

    Note: The samples that are used to fit the calibrator should not be the same samples used to fit the classifier, as this would introduce bias. This is because performance of the classifier on its training data would be better than for novel data. Using the classifier output of training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should.
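
    A minimal scikit-learn sketch tying the above together (the dataset and classifier are illustrative): calibration curves for a raw classifier and for ‘sigmoid’ (Platt) and ‘isotonic’ calibrated versions; the internal cross-validation keeps the calibrator’s fitting data separate from the classifier’s, in line with the note above.

    ```python
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Un-calibrated classifier (Gaussian Naive Bayes is typically over-confident).
    raw = GaussianNB().fit(X_train, y_train)
    frac_pos, mean_pred = calibration_curve(y_test, raw.predict_proba(X_test)[:, 1], n_bins=10)

    # Calibrated versions. cv=5 fits the classifier and its calibrator on
    # disjoint folds, so the calibrator never sees the classifier's training samples.
    platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_train, y_train)
    iso = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)

    frac_platt, mean_platt = calibration_curve(y_test, platt.predict_proba(X_test)[:, 1], n_bins=10)
    frac_iso, mean_iso = calibration_curve(y_test, iso.predict_proba(X_test)[:, 1], n_bins=10)
    ```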

  7. Debugging Strategies for Deep ML Models:


  8. The Machine Learning Algorithm Recipe:
    Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe in both Supervised and Unsupervised settings:
    • A combination of:
      • A specification of a dataset
      • A cost function
      • An optimization procedure
      • A model
    • Ex: Linear Regression
    • Ex: PCA
    • Specification of a Dataset:
      Could be labeled (supervised) or unlabeled (unsupervised).
    • Cost Function:
      The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.
    • Optimization Procedure:
      Could be closed-form or iterative or special-case.
      If the cost function does not allow for a closed-form solution (e.g. if the model is non-linear), then we usually need an iterative optimization algorithm, e.g. gradient descent.
      If the cost function cannot be evaluated exactly for computational reasons, we can still approximately minimize it with iterative numerical optimization, as long as we have some way of approximating its gradients.
    • Model:
      Could be linear or non-linear.
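
    To make the recipe concrete, here is a minimal sketch (purely illustrative) of linear regression decomposed into the four ingredients: a labeled dataset, a negative log-likelihood cost (which under Gaussian noise reduces to mean squared error), a linear model, and gradient descent as the optimization procedure, with the closed-form solution shown for comparison.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # 1) Dataset: labeled pairs (x, y), i.e. supervised.
    X = rng.normal(size=(200, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=200)

    # 2) Model: linear, y_hat = X w.
    w = np.zeros(3)

    # 3) Cost: negative log-likelihood under Gaussian noise, which is (up to
    #    additive/multiplicative constants) the mean squared error.
    def cost(w):
        return np.mean((X @ w - y) ** 2)

    # 4) Optimization procedure: gradient descent (iterative). This particular
    #    problem also admits a closed-form solution (the normal equations).
    lr = 0.1
    for _ in range(500):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad

    print(cost(w), w)                            # w is close to true_w
    print(np.linalg.lstsq(X, y, rcond=None)[0])  # closed-form, for comparison
    ```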

    If a machine learning algorithm seems especially unique or hand designed, it can usually be understood as using a special-case optimizer.
    Some models, such as decision trees and k-means, require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers.

    Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.


Recall is more important where Overlooked Cases (False Negatives) are more costly than False Alarms (False Positives). The focus in these problems is on finding the positive cases.

Precision is more important where False Alarms (False Positives) are more costly than Overlooked Cases (False Negatives). The focus in these problems is on weeding out the negative cases.
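
A small worked example with made-up counts: recall = TP / (TP + FN) penalizes overlooked cases, while precision = TP / (TP + FP) penalizes false alarms.

```python
# Hypothetical counts for a binary screening problem.
tp, fp, fn, tn = 80, 40, 20, 860

recall = tp / (tp + fn)      # 0.80: share of actual positives that were found
precision = tp / (tp + fp)   # ~0.67: share of positive calls that were correct

# Disease screening (false negatives costly): prioritize recall.
# Spam filtering (false alarms costly): prioritize precision.
print(recall, precision)
```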

ROC Curve and AUC:

Note:

  1. Limited Training Data:
