
  1. Bayesian Learning:
    Main Idea:
    Instead of looking for the single most likely setting of a model's parameters, we should consider all possible settings of the parameters and estimate, for each setting, how probable it is given the data we observed.

    The Bayesian Framework:

    • Prior-Belief Assumption:
      The Bayesian framework assumes that we always have a prior distribution for everything.
      • The prior may be very vague
      • When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
      • The likelihood term takes into account how probable the observed data is given the parameters of the model:
        • It favors parameter settings that make the data likely
        • It fights the prior
        • With enough data, the likelihood term always wins
      • See Hinton's lecture for a great worked example of these ideas
    • Bayes Theorem:

      $$p(\mathcal{D}) p(\mathbf{\theta} \vert \mathcal{D})=\underbrace{p(\mathcal{D}, \mathbf{\theta})}_ {\text{joint probability}}=p(\mathbf{\theta}) p(\mathcal{D} \vert \mathbf{\theta})$$

      $$\implies \\ p(\mathbf{\theta} \vert \mathcal{D}) = \dfrac{p(\mathbf{\theta}) p(\mathcal{D} \vert \mathbf{\theta})}{p(\mathcal{D})} = \dfrac{p(\mathbf{\theta}) p(\mathcal{D} \vert \mathbf{\theta})}{\int_{\mathbf{\theta}} p(\mathbf{\theta}) p(\mathcal{D} \vert \mathbf{\theta}) \, d\mathbf{\theta}}$$
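
      To make the mechanics concrete, here is a minimal sketch (not from the source notes) that applies Bayes' theorem numerically on a grid for a coin-flip model; the prior and data are made up for illustration, and the grid sum stands in for the integral in the denominator:

      ```python
      # Grid approximation of Bayes' theorem for a coin with unknown
      # head-probability theta. All numbers are illustrative.
      import numpy as np

      theta = np.linspace(0.001, 0.999, 999)   # grid over parameter space
      prior = np.ones_like(theta)              # vague (uniform) prior
      prior /= prior.sum()

      heads, flips = 7, 10                     # observed data D
      likelihood = theta**heads * (1 - theta)**(flips - heads)  # p(D | theta)

      # posterior ∝ prior × likelihood; the grid sum plays the role of the
      # evidence p(D) = ∫ p(theta) p(D | theta) dtheta.
      posterior = prior * likelihood
      posterior /= posterior.sum()

      print("posterior mean:", (theta * posterior).sum())  # ≈ 0.67 for 7/10 heads
      ```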

    Bayesian Probability:

    • Interpreting the Prior:
      The prior probability of an event \(q\), \(p(q)\), quantifies our current state of knowledge (uncertainty) about \(q\), regardless of whether \(q\) is deterministic or random.
    • Modeling Randomness:
      If randomness is being modeled, it is modeled as a stochastic process with fixed parameters.
      For example, random noise is often modeled as being generated from a normal distribution with some fixed (but possibly unknown) mean and covariance.
    • Interpreting Parameters:
      Bayesians do not view parameters as being stochastic; a posterior distribution over a parameter expresses uncertainty about a fixed but unknown value.
      So, for instance, if the posterior gives \(p(0.1 < p_1 < 0.2) = 0.10\), that is interpreted as "there is a 10% chance that \(p_1\) is between 0.1 and 0.2", not "\(p_1\) is between 0.1 and 0.2 10% of the time".
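
      As a small sketch of this interpretation (assuming, purely for illustration, that the posterior over \(p_1\) is a Beta distribution):

      ```python
      # Hypothetical Beta posterior over the parameter p_1 (shape
      # parameters made up for illustration).
      from scipy.stats import beta

      posterior = beta(a=3, b=17)

      # A credibility statement about a fixed but unknown quantity:
      # "there is an X% chance that p_1 lies between 0.1 and 0.2".
      prob = posterior.cdf(0.2) - posterior.cdf(0.1)
      print(f"p(0.1 < p_1 < 0.2) = {prob:.2f}")
      ```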

    Notes:

    • A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.

  2. Bayesian vs Frequentist Learning:

    Differences:

    • Translating Events into the Theory - Assigning a Probability Distribution:
      • Bayesian: no need for Random Variables.
        A probability distribution is assigned to a quantity because it is unknown - which means that it cannot be deduced logically from the information we have.
      • Frequentist: needs a Random Variable.
        A quantity/event that is stochastic/random can be modeled as a random variable.
    • Unknown vs Random
      • Bayesian: assumes quantities can be unknown.
        Subjective View: “being unknown” depends on which person you are asking about that quantity - hence it is a property of the statistician doing the analysis.
      • Frequentist: assumes quantities can be random/stochastic.
        Objective View: “randomness”/”stochasticity” is described as a property of the actual quantity.
        This view generally does not hold: "randomness" cannot be an intrinsic property of a quantity in standard examples; ask two frequentists who are given different information about the same quantity whether it is "random" (e.g. the Bernoulli urn), and they may disagree.


    |  | Bayesian | Frequentist |
    | --- | --- | --- |
    | Probability interpretation | Subjective: degree of belief (logic) | Objective: relative frequency of events |
    | Parameters of the model | Random variables: parameters cannot be determined exactly, so uncertainty about them is expressed through probability statements and distributions | Fixed, unknown constants: no probability statements can be made about the parameters |
    | Estimation/inference | Use data to update belief: all inference follows from the posterior, often via simulation (generate samples from the posterior and use them to estimate the quantities of interest) | Use data to best estimate the unknown parameters: pinpoint a value in parameter space as well as possible |
    | Interval estimate | Credible interval: a claim that the true parameter is inside the region with measurable probability | Confidence interval: a claim that the region covers the true parameter, reflecting uncertainty in the sampling procedure |
    | Main problem | Uncertainty of knowledge | Variability of data |
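
    As a small sketch of the "simulation" entry above (assuming a conjugate Beta posterior; the numbers are illustrative), one can draw samples from the posterior and use them to estimate quantities of interest:

    ```python
    # Monte Carlo estimates from posterior samples (posterior assumed Beta(8, 4)).
    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.beta(8, 4, size=100_000)   # draws of theta from the posterior

    print("posterior mean        ~", samples.mean())
    print("P(theta > 0.5)        ~", (samples > 0.5).mean())
    print("95% credible interval ~", np.percentile(samples, [2.5, 97.5]))
    ```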

    Probability Interpretation:

    • Bayesian:
      A Bayesian defines a “probability” in exactly the same way that most non-statisticians do - namely an indication of the plausibility of a proposition or a situation. If you ask him a question, he will give you a direct answer assigning probabilities describing the plausibilities of the possible outcomes for the particular situation (and state his prior assumptions).
    • Frequentist:
      A Frequentist is someone that believes probabilities represent long run frequencies with which events occur; if needs be, he will invent a fictitious population from which your particular situation could be considered a random sample so that he can meaningfully talk about long run frequencies. If you ask him a question about a particular situation, he will not give a direct answer, but instead make a statement about this (possibly imaginary) population.

    Statistical Methods:

    • Bayesian:
      • Probability refers to degree of belief
      • Inference about a parameter \(\theta\) is done by producing a probability distribution over it. Typically, one starts with a prior distribution \(p(\theta)\). One also chooses a likelihood function \(p(x \mid \theta)\) - note that, as a likelihood, this is viewed as a function of \(\theta\), not \(x\). After observing data \(x\), one applies Bayes' theorem to obtain the posterior distribution \(p(\theta \mid x)\).

        $$p(\theta \mid x)=\frac{p(\theta) p(x \mid \theta)}{\int p\left(\theta^{\prime}\right) p\left(x \mid \theta^{\prime}\right) d \theta^{\prime}} \propto p(\theta) p(x \mid \theta)$$

        where the denominator \(Z \equiv \int p\left(\theta^{\prime}\right) p\left(x \mid \theta^{\prime}\right) d \theta^{\prime}\) is known as the normalizing constant. The posterior distribution is a complete characterization of the parameter.
        Sometimes, one uses the mode of the posterior as a simple point estimate, known as the maximum a posteriori (MAP) estimate of the parameter:
        \(\theta^{\text {MAP }}=\operatorname{argmax}_ {\theta} p(\theta \mid x)\)

        Note that MAP is a point estimate and therefore not a proper Bayesian approach (see the first sketch after this list).

      • Prediction under an unknown parameter is done by integrating it out:
        \(p(x \mid \text {Data})=\int p(x \mid \theta) p(\theta \mid \text{Data}) d \theta\)
    • Frequentist:
      • Probability refers to limiting relative frequency
      • Data are random
      • Estimators are random because they are functions of data
      • Parameters are fixed, unknown constants not subject to probabilistic statements
      • Procedures are subject to probabilistic statements; for example, a 95% confidence interval traps the true parameter value 95% of the time over repeated sampling (see the second sketch below)
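
    As a sketch of the Bayesian recipe above, here is a conjugate Beta-Bernoulli example (the model and numbers are assumptions of this sketch, not from the source); it computes the posterior, its MAP mode, and a prediction with the parameter integrated out:

    ```python
    # Conjugate Beta-Bernoulli sketch: posterior, MAP, and posterior predictive.
    # Prior and data are made up for illustration.
    a0, b0 = 2.0, 2.0          # Beta(a0, b0) prior on theta
    heads, tails = 7, 3        # observed data

    # Conjugacy: the posterior is Beta(a0 + heads, b0 + tails).
    a, b = a0 + heads, b0 + tails

    # MAP estimate = mode of the Beta posterior (a point estimate only).
    theta_map = (a - 1) / (a + b - 2)

    # Posterior predictive p(next = heads | Data) = ∫ theta p(theta | Data) dtheta,
    # which for a Beta posterior is its mean -- note it differs from the MAP.
    p_heads_next = a / (a + b)

    print("MAP estimate:       ", theta_map)     # 8/12 ≈ 0.667
    print("predictive p(heads):", p_heads_next)  # 9/14 ≈ 0.643
    ```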
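
    And a sketch of the frequentist coverage claim (a made-up normal-mean example with known variance): the parameter is fixed while the interval is random, and over repeated sampling the interval traps the true value about 95% of the time:

    ```python
    # Repeated-sampling coverage of a 95% z-interval for a normal mean
    # (sigma known). Purely illustrative numbers.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mu, sigma, n, trials = 3.0, 2.0, 25, 10_000

    hits = 0
    for _ in range(trials):
        x = rng.normal(true_mu, sigma, size=n)   # data are random
        half = 1.96 * sigma / np.sqrt(n)
        hits += (x.mean() - half <= true_mu <= x.mean() + half)

    print("coverage ~", hits / trials)           # ≈ 0.95
    ```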