Table of Contents



Lecture on Statistical Learning Theory from Risk perspective & Bayes Decision Rule
Learning Theory, Andrew Ng (CS229 Stanford)
Empirical Risk Minimization (Cornell)

Fundamental Problem of Machine Learning: It is Ill-Posed:
Learning appears impossible: Learning a truly “unknown” function is impossible.

Two Views of Learning and their corresponding Strategies:

  1. Learning is the removal of our remaining uncertainty
    – Suppose we knew that the unknown function was an m-of-n boolean function. Then we could use the training examples to deduce which function it is.
    – Our prior “knowledge” might be wrong
    • Strategy: Develop Languages for Expressing Prior Knowledge
      Rule grammars, stochastic models, Bayesian networks
  2. Learning requires guessing a good, small hypothesis class.
    – We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.
    – Our guess of the hypothesis class could be wrong: The smaller the class, the more likely we are wrong.
    • Strategy: Develop Flexible Hypothesis Spaces
      Nested collections of hypotheses: decision trees, neural networks, cases, SVMs

Key Issues in Machine Learning:

A Framework for Hypothesis Spaces:

A Framework for Learning Algorithms:

The Learning Problem

  1. When to use ML:
    When:
    1. A pattern Exists
    2. We cannot pin the pattern down mathematically
    3. We have Data

    We usually can do without the first two. But the third condition we CANNOT do without.
    The Theory of Learning only depends on the data.

    “We have to have data. We are learning from data. So if someone knocks on my door with an interesting machine learning application, and they tell me how exciting it is, and how great the application would be, and how much money they would make, the first question I ask, ‘what data do you have?’. If you have data, we are in business. If you don’t, you are out of luck.” - Prof. Ng

  2. The ML Approach to Problem Solving:
    Consider the Netflix problem: Predicting how a viewer will rate a movie.
    • Direct Approach:
      img
      • Ask each user to give a rank/rate for the different “factors/features” (E.g. Action, Comedy, etc.)
      • Watch each movie and assign a rank/rate for the same factors
      • Match the factors and produce a rating
    • ML Approach:
      img
      Essentially it is a Reversed approach
      • Start with the Ratings (dataset) that the users assigned to each movie
      • Then deduce the “factors/features” that are consistent with those Ratings
        Note: we usually start with random initial numbers for the factors
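
      A minimal sketch of this reversed approach as matrix factorization (purely illustrative: the variable names, the squared-error objective, and the gradient-step details below are assumptions, not the lecture's exact algorithm):

      ```python
      import numpy as np

      # Learn K latent "factors" per user and per movie so that their dot
      # product reproduces the observed ratings (the reversed approach above).
      rng = np.random.default_rng(0)
      n_users, n_movies, K = 100, 50, 5

      # Start with random initial numbers for the factors, as noted above.
      U = rng.normal(scale=0.1, size=(n_users, K))    # user factors
      M = rng.normal(scale=0.1, size=(n_movies, K))   # movie factors

      # Synthetic stand-in for the dataset: (user, movie, rating) triples.
      observed = [(u, m, int(rng.integers(1, 6)))
                  for u in range(n_users)
                  for m in rng.choice(n_movies, size=10, replace=False)]

      lr = 0.01
      for epoch in range(20):
          for u, m, r in observed:
              err = r - U[u] @ M[m]      # rating residual for this pair
              u_old = U[u].copy()
              U[u] += lr * err * M[m]    # nudge user factors toward the rating
              M[m] += lr * err * u_old   # nudge movie factors toward the rating
      ```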

  3. Components of Learning:
    • Input: \(\vec{x}\)
    • Output: \(y\)
    • Data: \(\{(\vec{x}_ 1, y_ 1), (\vec{x}_ 2, y_ 2), \ldots, (\vec{x}_ N, y_ N)\}\)
    • Target Function: \(f : \mathcal{X} \rightarrow \mathcal{Y}\) (Unknown/Unobserved)
    • Hypothesis: \(g : \mathcal{X} \rightarrow \mathcal{Y}\)

  4. Components of the Solution:
    • The Learning Model:
      • The Hypothesis Set: \(\mathcal{H}=\{h\}, g \in \mathcal{H}\)

        E.g. Perceptron, SVM, FNNs, etc.

      • The Learning Algorithm: picks \(g \approx f\) from a hypothesis set \(\mathcal{H}\)

        E.g. Backprop, Quadratic Programming, etc.

    Motivating the inclusion of a Hypothesis Set:

    • No Downsides: There is no loss of generality in introducing a hypothesis set, since placing no restrictions on it simply gives the set of all possible hypotheses.
      Basically, there is no downside because, from a practical POV, that's what you do anyway: by choosing an initial approach, e.g. SVM, Linear Regression, Neural Network, etc., we are already dictating a hypothesis set. If we don't choose one, then the hypothesis set is unrestricted and is the set of all possible hypotheses, so nothing is lost.
    • Upside: The hypothesis set plays a pivotal role in the theory of learning, by dictating whether we can learn or not.

  5. The Basic Premise/Goal of Learning:
    “Using a set of observations to uncover an underlying process”
    Rephrased mathematically, the Goal of Learning is:
    Use the Data to find a hypothesis \(g \in \mathcal{H}\), from the hypothesis set \(\mathcal{H}=\{h\}\), that approximates \(f\) well.

  6. Types of Learning:
    • Supervised Learning: the task of learning a function that maps an input to an output based on example input-output pairs.
      img
    • Unsupervised Learning: the task of making inferences, by learning a better representation, from some datapoints that do not have any labels associated with them.
      img

      Unsupervised Learning is another name for Hebbian Learning

    • Reinforcement Learning: the task of learning how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
      img

  7. The Learning Diagram:
    img

The Feasibility of Learning

The Goal of this Section is to answer the question: Can we make any statements/inferences outside of the sample data that we have?

  1. The Problem of Learning:
    Learning a truly Unknown function is Impossible, since outside of the observed values, the function could assume any value it wants.

  2. The Bin Analogy - A Related Experiment:
  3. The Learning Analogy:
    For an exam, the practice problems are the training set. You look at a question, you answer it, you compare with the real answer, and then you adjust your hypothesis (your understanding of the material) in order to do better, and you go through the problems, perhaps more than once, until you get them right or mostly right, i.e. until you figure out the material (this makes you better at taking the exam). We don't give out the actual exam questions because acing the final is NOT the goal; the goal is to learn the material (to have a small \(E_{\text{out}}\)). The final exam is only a way of gauging how well you actually learned, and in order for it to gauge that, it has to be given at a point where you have already fixed your hypothesis. You prepared, you studied, you discussed with people; you now sit down to take the final exam with one hypothesis, and your performance on it will, hopefully, reflect what your understanding is outside.

    The exam measures \(E_{\text{in}}\), and we know (by Hoeffding) that it tracks \(E_{\text{out}}\), so it is a good gauge of how well you understand the material proper.

  1. Notes:
    • Learning Feasibility:
      When learning we only deal with In-Sample Errors \([E_{\text{in}}(\mathbf{w})]\); we never handle the out-sample error explicitly; we take the theoretical guarantee that when you do well in-sample \(\implies\) you do well out-sample (Generalization).
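
      For reference, the Hoeffding guarantee appealed to here, in its standard form for a single fixed hypothesis and (via the union bound) for a finite hypothesis set of size \(M\), is:

      $$\mathbb{P}\left[\left|E_{\text{in}}(h)-E_{\text{out}}(h)\right|>\epsilon\right] \leq 2 e^{-2 \epsilon^{2} N} \qquad \mathbb{P}\left[\left|E_{\text{in}}(g)-E_{\text{out}}(g)\right|>\epsilon\right] \leq 2 M e^{-2 \epsilon^{2} N}$$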

Error and Noise

The Current Learning Diagram:
img

  1. Error Measures:
    Error Measures aim to answer the question:
    “What does it mean for \(h\) to approximate \(f\) (\(h \approx f\))?”
    The Error Measure: \(E(h, f)\)
    It is almost always defined point-wise: \(\mathrm{e}(h(\mathbf{x}), f(\mathbf{x}))\).
    Examples:
    • Square Error: \(\:\:\:\mathrm{e}(h(\mathbf{x}), f(\mathbf{x}))=(h(\mathbf{x})-f(\mathbf{x}))^{2}\)
    • Binary Error: \(\:\:\:\mathrm{e}(h(\mathbf{x}), f(\mathbf{x}))=[h(\mathbf{x}) \neq f(\mathbf{x})]\) (1 if true else 0)

    The overall error \(E(h,f) =\) average of pointwise errors \(\mathrm{e}(h(\mathbf{x}), f(\mathbf{x}))\):

    • In-Sample Error:

      $$E_{\mathrm{in}}(h)=\frac{1}{N} \sum_{n=1}^{N} \mathrm{e}\left(h\left(\mathbf{x}_{n}\right), f\left(\mathbf{x}_{n}\right)\right)$$

    • Out-Sample Error:

      $$E_{\text {out }}(h)=\mathbb{E}_ {\mathbf{x}}[\mathrm{e}(h(\mathbf{x}), f(\mathbf{x}))]$$
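
    A small NumPy sketch of the in-sample error for the two pointwise measures above (the data and the hypothesis are placeholders chosen only for illustration):

    ```python
    import numpy as np

    def e_square(h_x, f_x):
        """Pointwise squared error e(h(x), f(x)) = (h(x) - f(x))^2."""
        return (h_x - f_x) ** 2

    def e_binary(h_x, f_x):
        """Pointwise binary error: 1 if h(x) != f(x), else 0."""
        return (h_x != f_x).astype(float)

    def E_in(h, X, y, e):
        """In-sample error: average of the pointwise errors over the N examples."""
        return float(np.mean(e(h(X), y)))

    # Placeholder data: rows of X are the x_n (with x_0 = 1), y holds f(x_n).
    X = np.array([[1.0, 0.5], [1.0, -0.2], [1.0, 1.3]])
    y = np.array([1.0, -1.0, 1.0])
    h = lambda X: np.sign(X @ np.array([0.1, 1.0]))   # some linear hypothesis
    print(E_in(h, X, y, e_binary), E_in(h, X, y, e_square))
    ```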

  2. The Learning Diagram - with pointwise error:
    img
    There are two additions to the diagram:
    • The first is to realize that we are defining the error measure on a point.
    • Another is that in deciding whether \(g\) is close to \(f\) , which is the goal of learning, we test this with a point \(x\). And the criterion for deciding whether \(g(x)\) is approximately the same as \(f(x)\) is our pointwise error measure.
  3. Defining the Error Measure:

    Types:

    • False Positive
    • False Negative

    There is no inherent merit to choosing one error function over another. It’s not an analytic question. It’s an application-domain question.

    The error measure should be specified by the user. Since that is not always possible, the alternatives are:

    • Plausible Measures: measures that have an analytic argument for their merit, based on certain assumptions.
      E.g. Squared Error comes from the Gaussian Noise Assumption.
    • Friendly Measures: An easy-to-use error measure, without much justification.
      E.g. Linear Regression error leads to the easy closed-form solution, Convex Error measures are easy to optimize, etc.
  4. The Learning Diagram - with the Error Measure:
    img
    The Error Measure provides a quantitative assessment of the statement \(g(\mathbf{x}) \approx f(\mathbf{x})\).

  5. Noisy Targets:
    The ‘Target Function’ is not always a function because two ‘identical’ input points can be mapped to two different outputs (i.e. they have different labels).

    The solution: Replacing the target function with a Target Distribution.
    Instead of \(y = f(x)\) we use the conditional target distribution: \(P(y | \mathbf{x})\). What changes now is that, instead of \(y\) being a deterministic function of \(\mathbf{x}\), once you generate \(\mathbf{x}\), \(y\) is also probabilistic, generated by \(P(y | \mathbf{x})\).
    \((\mathbf{x}, y)\) is now generated by the joint distribution:

    $$P(\mathbf{x}) P(y | \mathbf{x})$$

    Equivalently, we can define a Noisy Target as a deterministic (target) function \(\:f(\mathbf{x})=\mathbb{E}(y | \mathbf{x})\:\) PLUS Noise \(\: y-f(\mathbf{x})\).
    This can be done WLOG, since a deterministic target is a special kind of noisy target:
    Define \(P(y \vert \mathbf{x})\) to be zero for every \(y\) except \(y = f(\mathbf{x})\).
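
    A minimal sketch of sampling data from a noisy target (the particular \(P(\mathbf{x})\) and conditional probability used below are assumptions for illustration, not part of the lecture):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def p_plus(x):
        """Assumed P(y = +1 | x); any function into [0, 1] would do here."""
        return 1.0 / (1.0 + np.exp(-(x[0] + x[1])))

    def sample_point():
        x = rng.normal(size=2)                     # x ~ P(x), an assumed input distribution
        y = 1 if rng.random() < p_plus(x) else -1  # y ~ P(y | x): the target is noisy
        return x, y

    # Each (x_n, y_n) tuple is generated independently from P(x) P(y | x).
    D = [sample_point() for _ in range(5)]
    ```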

  6. The Learning Diagram - with the Noisy Target:
    img

    Now, \(E_{\text {out}}(h) = \mathbb{E}_ {x, y}[e(h(x), y)]\) instead of \(\mathbb{E}_ {\mathbf{x}}[\mathrm{e}(h(\mathbf{x}), f(\mathbf{x}))]\), and
    \(\left(\mathbf{x}_{1}, y_{1}\right), \cdots,\left(\mathbf{x}_{N}, y_{N}\right)\) are generated independently of each other (each tuple \((\mathbf{x}_ n, y_ n)\) independently).

  7. Distinction between \(P(y | \mathbf{x})\) and \(P(\mathbf{x})\):
    The Target Distribution \(P(y \vert \mathbf{x})\) is what we are trying to learn.
    The Input Distribution \(P(\mathbf{x})\) only quantifies the relative importance of \(\mathbf{x}\); we are NOT trying to learn this distribution.

    Rephrasing: Supervised learning only learns \(P(y \vert \mathbf{x})\) and not \(P(\mathbf{x})\); \(P(\mathbf{x}, y)\) is NOT a target distribution for Supervised Learning.

    Merging \(P(\mathbf{x})P(y \vert \mathbf{x})\) into \(P(\mathbf{x}, y)\), although it allows us to generate examples \((\mathbf{x}, y)\), mixes two concepts that are inherently different.

  8. Preamble to Learning Theory:
    Generalization VS Learning:
    We know that Learning is Feasible.
    • Generalization:
      It is likely that the following condition holds:

      $$\: E_{\text {out }}(g) \approx E_{\text {in }}(g) \tag{3.1}$$

      This is equivalent to “good” Generalization.

    • Learning:
      Learning corresponds to the condition that \(g \approx f\), which in-turn corresponds to the condition:

      $$E_{\text {out }}(g) \approx 0 \tag{3.2}$$

    How to achieve Learning:
    We achieve \(E_{\text {out }}(g) \approx 0\) through:

    1. \(E_{\mathrm{out}}(g) \approx E_{\mathrm{in}}(g)\)
      A theoretical result, achieved through the Hoeffding inequality (Probability Theory).
    2. \(E_{\mathrm{in}}(g) \approx 0\)
      A practical result of minimizing the in-sample error function, i.e. Empirical Risk Minimization (ERM), an Optimization problem.

    Learning is, thus, reduced to the 2 following questions:

    1. Can we make sure that \(E_{\text {out }}(g)\) is close enough to \(E_{\text {in }}(g)\)? (theoretical)
    2. Can we make \(E_{\text {in}}(g)\) small enough? (practical)

    What the Learning Theory will achieve:

    • Characterizing the feasibility of learning for an infinite number of hypotheses \(M\).
      We are going to measure the model not by the number of hypotheses, but by a single parameter which tells us the sophistication of the model. And that sophistication will reflect the out-of-sample performance as it relates to the in-sample performance (through the Hoeffding (then VC) inequalities).
    • Characterizing the tradeoff:
      img
      In words:
      We realized that we would like our model, the hypothesis set, to be elaborate, in order to be able to fit the data. The more parameters you have, the more likely you are to fit the data, so \(E_{\text{in}}\) goes down if you use more complex models. However, if you make the model more complex, the discrepancy between \(E_{\text{out}}\) and \(E_{\text{in}}\) gets worse and worse: \(E_{\text{in}}\) tracks \(E_{\text{out}}\) much more loosely than it used to.

Linear Models I

img

img

The Linear Model II

  1. Linear Models - Logistic Regression:
    img
    Logistic Regression applies a non-linear transform to the signal; it is a softer approximation to the hard-threshold non-linearity applied by Linear Classification.

  2. The Logistic Function \(\theta\):

    $$\theta(s)=\frac{e^{s}}{1+e^{s}}=\frac{1}{1+e^{-s}}$$

    img

    • Soft Threshold: corresponds to uncertainty; interpreted as probabilities.
    • Sigmoid: looks like a flattened out ‘S’.
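
    A quick numerical check of \(\theta\) (both algebraic forms above agree, and \(\theta(-s) = 1 - \theta(s)\), which is used in the likelihood derivation below):

    ```python
    import numpy as np

    def theta(s):
        """Logistic (sigmoid) function: theta(s) = 1 / (1 + exp(-s))."""
        return 1.0 / (1.0 + np.exp(-s))

    s = np.linspace(-5.0, 5.0, 11)
    assert np.allclose(theta(s), np.exp(s) / (1 + np.exp(s)))  # e^s / (1 + e^s) form
    assert np.allclose(theta(-s), 1 - theta(s))                # theta(-s) = 1 - theta(s)
    ```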
  3. The Probability Interpretation:
    \(h(\mathbf{x})=\theta(s)\) is interpreted as a probability.
    It is in fact a genuine probability: the output of logistic regression is treated genuinely as a probability even during learning.
    Justification:
    Data \((\mathbf{x}, y)\) with binary \(y\) (we don't have direct access to the probability, but the binary \(y\) is affected by it), generated by a noisy target:

    $$P(y | \mathbf{x})=\left\{\begin{array}{ll}{f(\mathbf{x})} & {\text {for } y=+1} \\ {1-f(\mathbf{x})} & {\text {for } y=-1}\end{array}\right.$$

    The target \(f : \mathbb{R}^{d} \rightarrow[0,1]\) is the probability.
    We learn \(\:\:\:\: g(\mathbf{x})=\theta\left(\mathbf{w}^{\top} \mathbf{x}\right) \approx f(\mathbf{x})\).

    In words: So I’m going to call the probability the target function itself. The probability that someone gets a heart attack is \(f(\mathbf{x})\). And I’m trying to learn \(f\), notwithstanding the fact that the examples I am getting give me just sample values of \(y\), which happen to be generated by \(f\).


  4. Deriving the Error Measure (Cross-Entropy) from Likelihood:
    The error measure for logistic regression is based on likelihood; it is both plausible and friendly/well-behaved (for optimization).
    For each \((\mathbf{x}, y)\), \(y\) is generated with probability \(f(\mathbf{x})\).

    Likelihood: We are maximizing the likelihood of this hypothesis, under the data set that we were given, with respect to the weights. I.E. Given the data set, how likely is this hypothesis? Which means, what is the probability of that data set under the assumption that this hypothesis is indeed the target?

    • Deriving the Likelihood:
      1. We start with:

        $$P(y | \mathbf{x})=\left\{\begin{array}{ll}{h(\mathbf{x})} & {\text {for } y=+1} \\ {1-h(\mathbf{x})} & {\text {for } y=-1}\end{array}\right.$$

      2. Substitute \(h(\mathbf{x})=\theta \left(\mathbf{w}^{\top} \mathbf{x}\right)\):

        $$P(y | \mathbf{x})=\left\{\begin{array}{ll}{\theta(\mathbf{w}^T\mathbf{x})} & {\text {for } y=+1} \\ {1-\theta(\mathbf{w}^T\mathbf{x})} & {\text {for } y=-1}\end{array}\right.$$

      3. Since \(1-\theta(s)=1-\frac{1}{1+e^{-s}}=\frac{e^{-s}}{1+e^{-s}}=\frac{1}{1+e^{s}}=\theta(-s)\), we can simplify the piece-wise function:

        $$P(y | \mathbf{x})=\theta\left(y \mathbf{w}^{\top} \mathbf{x}\right)$$

      4. To get the likelihood of the dataset \(\mathcal{D}=\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{N}, y_{N}\right)\):

        $$\prod_{n=1}^{N} P\left(y_{n} | \mathbf{x}_{n}\right) =\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\mathrm{T}} \mathbf{x}_ {n}\right)$$

    • Maximizing the Likelihood (Deriving the Cross-Entropy Error):
      1. Maximize:

        $$\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)$$

      2. Take the natural log to avoid products:

        $$\ln \left(\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)\right)$$

        Motivation:

        • The inner quantity is strictly positive (each \(\theta(\cdot) \in (0,1)\)), so the log is well-defined.
        • The natural log is monotonically increasing (its maximizer is the maximizer of its argument)
      3. Take the average (still monotonic):

        $$\frac{1}{N} \ln \left(\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)\right)$$

      4. Take the negative and Minimize:

        $$-\frac{1}{N} \ln \left(\prod_{n=1}^{N} \theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)\right)$$

      5. Simplify:

        $$=\frac{1}{N} \sum_{n=1}^{N} \ln \left(\frac{1}{\theta\left(y_{n} \mathbf{w}^{\top} \mathbf{x}_ {n}\right)}\right)$$

      6. Substitute \(\left[\theta(s)=\frac{1}{1+e^{-s}}\right]\):

        $$\frac{1}{N} \sum_{n=1}^{N} \underbrace{\ln \left(1+e^{-y_{n} \mathbf{w}^{\top} \mathbf{x}_{n}}\right)}_{e\left(h\left(\mathbf{x}_{n}\right), y_{n}\right)}$$

      7. Use this as the Cross-Entropy Error Measure:

        $$E_{\mathrm{in}}(\mathbf{w})=\frac{1}{N} \sum_{n=1}^{N} \underbrace{\ln \left(1+e^{-y_{n} \mathbf{w}^{\top} \mathbf{x}_{n}}\right)}_{\mathrm{e}\left(h\left(\mathbf{x}_{n}\right), y_{n}\right)}$$
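
    A minimal sketch computing this in-sample error with NumPy; `np.logaddexp(0, z)` evaluates \(\ln(1 + e^{z})\) in a numerically stable way (the data and weights below are placeholders):

    ```python
    import numpy as np

    def cross_entropy_error(w, X, y):
        """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
        margins = y * (X @ w)                    # y_n * w^T x_n for each example
        return float(np.mean(np.logaddexp(0.0, -margins)))

    # Placeholder data: rows of X are the x_n (with x_0 = 1), y_n in {+1, -1}.
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([1.0, -1.0, 1.0])
    print(cross_entropy_error(np.zeros(2), X, y))   # ln(2) at w = 0
    ```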

  5. The Decision Boundary of Logistic Regression:
    Decision Boundary: It is the set of \(\mathbf{x}\) such that:

    $$\theta\left(\mathbf{w}^{\top} \mathbf{x}\right)=\frac{1}{1+e^{-\mathbf{w}^{\top} \mathbf{x}}}=0.5 \implies \mathbf{w}^{\top} \mathbf{x}=\sum_{i=0}^{d} w_{i} x_{i}=0$$
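
    Spelling out the thresholding step: predicting \(+1\) whenever the estimated probability is at least \(0.5\) is the same as thresholding the signal at zero,

    $$\theta\left(\mathbf{w}^{\top} \mathbf{x}\right)=\frac{1}{1+e^{-\mathbf{w}^{\top} \mathbf{x}}} \geq \frac{1}{2} \iff e^{-\mathbf{w}^{\top} \mathbf{x}} \leq 1 \iff \mathbf{w}^{\top} \mathbf{x} \geq 0$$

    so the decision boundary is the same hyperplane \(\mathbf{w}^{\top}\mathbf{x}=0\) as in linear classification; only the interpretation away from the boundary changes (graded probabilities instead of a hard \(\pm 1\)).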

  6. The Logistic Regression Algorithm:
    img
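
    The slide above gives the algorithm; the following is a minimal batch gradient-descent sketch on the cross-entropy error (the gradient comes from differentiating \(E_{\text{in}}\); the learning rate and iteration count are arbitrary illustrative choices):

    ```python
    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def gradient(w, X, y):
        """grad E_in(w) = -(1/N) * sum_n y_n x_n theta(-y_n w^T x_n)."""
        margins = y * (X @ w)
        return -(X * (y * theta(-margins))[:, None]).mean(axis=0)

    def logistic_regression(X, y, eta=0.1, n_steps=1000):
        """Batch gradient descent on the cross-entropy in-sample error."""
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            w -= eta * gradient(w, X, y)   # step against the gradient
        return w

    # Usage with the placeholder data from the previous sketch:
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([1.0, -1.0, 1.0])
    w = logistic_regression(X, y)
    print(theta(X @ w))   # estimated P(y = +1 | x) for each training point
    ```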

  7. Summary of Linear Models:
    img

  8. Nonlinear Transforms:

    $$\mathbf{x}=\left(x_{0}, x_{1}, \cdots, x_{d}\right) \stackrel{\Phi}{\longrightarrow} \mathbf{z}=\left(z_{0}, z_{1}, \cdots, z_{\tilde{d}}\right)$$

    $$\text {Each } z_{i}=\phi_{i}(\mathbf{x}) \:\:\:\:\: \mathbf{z}=\Phi(\mathbf{x})$$

    Example: \(\mathbf{z}=\left(1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right)\)
    The Final Hypothesis \(g(\mathbf{x})\) in \(\mathcal{X}\) space:

    • Classification: \(\operatorname{sign}\left(\tilde{\mathbf{w}}^{\top} \Phi(\mathbf{x})\right)\)
    • Regression: \(\tilde{\mathbf{w}}^{\top} \Phi(\mathbf{x})\)
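
    A small sketch of the transform and the resulting hypothesis in \(\mathcal{X}\) space, using the example \(\mathbf{z}\) above (the weights \(\tilde{\mathbf{w}}\) are made up for illustration, as if obtained by running a linear algorithm on the transformed points \(\mathbf{z}_ n = \Phi(\mathbf{x}_ n)\)):

    ```python
    import numpy as np

    def Phi(x):
        """x = (1, x1, x2)  ->  z = (1, x1, x2, x1*x2, x1^2, x2^2)."""
        _, x1, x2 = x
        return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

    def classify(w_tilde, x):
        """Final hypothesis in X space: g(x) = sign(w~^T Phi(x)); w~ lives in Z space."""
        return np.sign(w_tilde @ Phi(x))

    # Illustrative weights in Z space: the boundary x1^2 + x2^2 = 1 (a circle in X space).
    w_tilde = np.array([-1.0, 0.0, 0.0, 0.0, 1.0, 1.0])
    print(classify(w_tilde, np.array([1.0, 0.2, 0.3])))   # inside the circle  -> -1.0
    print(classify(w_tilde, np.array([1.0, 1.5, 1.0])))   # outside the circle -> +1.0
    ```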

    Two Non-Separable Cases:

    • Almost separable with some outliers:
      img
      1. Accept that \(E_{\mathrm{in}}>0\); use a linear model in \(\mathcal{X}\).
      2. Insist on \(E_{\mathrm{in}}=0\); go to a high-dimensional \(\mathcal{Z}\).
        This has a worse chance of generalizing.
    • Completely Non-Linear:
      img
      Data-snooping example: it is hard to choose the right transformations; the biggest pitfall is to look at the data in order to choose them, since doing so invalidates the VC inequality guarantee.

      Think of the VC inequality as providing you with a warranty.

The Bias Variance Decomposition

img

img

  1. Reference: Equivalence of Generalized-LS and MLE in Exponential Family