Bayesian Learning:
Main Idea:
Instead of looking for the single most likely setting of the parameters of a model, we should consider all possible settings of the parameters and try to estimate, for each of those settings, how probable it is given the data we observed.
The Bayesian Framework:
- Prior-Belief Assumption:
The Bayesian framework assumes that we always have a prior distribution for everything.
- The prior may be very vague.
- When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
- The likelihood term takes into account how probable the observed data is given the parameters of the model:
- It favors parameter settings that make the data likely
- It fights the prior
- With enough data, the likelihood term always wins
- Bayes' Theorem:
$$p(\mathcal{D}) p(\mathbf{\theta} \vert \mathcal{D})=\underbrace{p(\mathcal{D}, \mathbf{\theta})}_ {\text{joint probability}}=p(\mathbf{\theta}) p(\mathcal{D} \vert \mathbf{\theta})$$
$$\implies p(\mathbf{\theta} \vert \mathcal{D}) = \dfrac{p(\mathbf{\theta})\, p(\mathcal{D} \vert \mathbf{\theta})}{p(\mathcal{D})} = \dfrac{p(\mathbf{\theta})\, p(\mathcal{D} \vert \mathbf{\theta})}{\int p(\mathbf{\theta}')\, p(\mathcal{D} \vert \mathbf{\theta}')\, d\mathbf{\theta}'}$$
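As a concrete illustration, here is a minimal sketch of Bayes' theorem evaluated over a grid of candidate parameter settings (assuming Python with NumPy; the coin-flip data and the uniform prior are made up for the example). It also shows the point above: with enough data, the likelihood term overwhelms the prior.

```python
import numpy as np

# Grid of candidate settings for theta = P(heads) of a (hypothetical) coin.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / theta.size          # vague (uniform) prior p(theta)

def posterior(n_heads, n_flips):
    """p(theta | D) ∝ p(theta) p(D | theta) for i.i.d. Bernoulli flips."""
    likelihood = theta**n_heads * (1.0 - theta)**(n_flips - n_heads)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()      # divide by the evidence p(D)

# With little data the prior still matters; with lots of data the likelihood dominates.
post_small = posterior(n_heads=3, n_flips=4)
post_large = posterior(n_heads=300, n_flips=400)
print("posterior mean after   4 flips:", (theta * post_small).sum())
print("posterior mean after 400 flips:", (theta * post_large).sum())
```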
Bayesian Probability:
- Interpreting the Prior:
The prior probability of an event/quantity \(q\), \(p(q)\), quantifies our current state of knowledge (uncertainty) about \(q\), regardless of whether \(q\) is deterministic or random.
- Modeling Randomness:
If randomness is being modeled, it is modeled as a stochastic process with fixed parameters.
For example, random noise is often modeled as being generated from a normal distribution with some fixed (but possibly unknown) mean and covariance.
- Interpreting Parameters:
Bayesians do not view parameters as being stochastic.
So, for instance, if we find that according to the posterior \(p(0.1 < p_1 < 0.2) = 0.10\), that would be interpreted as "There is a 10% chance that \(p_1\) is between 0.1 and 0.2", not "\(p_1\) is between 0.1 and 0.2 10% of the time".
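For example, a minimal sketch of such a posterior probability statement (assuming Python with SciPy and a hypothetical Beta posterior for \(p_1\); the shape parameters are illustrative):

```python
from scipy.stats import beta

# Hypothetical Beta posterior over the parameter p_1 (shape values are illustrative).
posterior = beta(a=2, b=8)

# Posterior probability that p_1 lies in (0.1, 0.2): a statement about our knowledge
# of a fixed but unknown quantity, not about how often p_1 "happens" to be there.
prob = posterior.cdf(0.2) - posterior.cdf(0.1)
print(f"P(0.1 < p_1 < 0.2 | D) = {prob:.2f}")
```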
Notes:
- A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.
Bayesian vs Frequentist Learning:
Differences:
- Translating Events into the Theory - Assigning a Probability Distribution:
- Bayesian: no need for Random Variables.
A probability distribution is assigned to a quantity because it is unknown, i.e. it cannot be deduced logically from the information we have.
- Frequentist: needs a Random Variable.
A quantity/event that is stochastic/random can be modeled as a random variable.
- Unknown vs Random
- Bayesian: assumes quantities can be unknown.
Subjective View: "being unknown" depends on which person you ask about the quantity; hence it is a property of the statistician doing the analysis.
- Frequentist: assumes quantities can be random/stochastic.
Objective View: "randomness"/"stochasticity" is described as a property of the actual quantity.
This generally does not hold: "randomness" cannot be a property of the quantity itself in some standard examples; two frequentists given different information about the same quantity will disagree about whether it is "random" (e.g. the Bernoulli Urn).
Summary:

| | Bayesian | Frequentist |
| --- | --- | --- |
| Probability Interp. | Subjective: Degree of Belief (Logic) | Objective: Relative Frequency of Events |
| Main Problem | Uncertainty of Knowledge | Variability of Data |
| Parameters of the Model | Random variables: parameters cannot be determined exactly; uncertainty is expressed in probability statements or distributions. Can make probability statements about the parameters. | Fixed, unknown constants. Can NOT make probabilistic statements about the parameters. |
| Estimation/Inference | Use data to update belief; all inference follows the posterior; use simulation methods: generate samples from the posterior and use them to estimate the quantities of interest. | Use data to best estimate the unknown parameters; pinpoint a value of the parameter space as well as possible. |
| Uncertainty | Credible interval | Confidence interval |
| Interval Estimate | Credible Interval: a claim that the true parameter is inside the region with measurable probability. | Confidence Interval: a claim that the region covers the true parameter, reflecting uncertainty in the sampling procedure. |

Probability Interpretation:
- Bayesian:
A Bayesian defines a “probability” in exactly the same way that most non-statisticians do - namely an indication of the plausibility of a proposition or a situation. If you ask him a question, he will give you a direct answer assigning probabilities describing the plausibilities of the possible outcomes for the particular situation (and state his prior assumptions).
- Probability is Logic
My "non-plain-English" reason for this is that the calculus of propositions is a special case of the calculus of probabilities, if we represent truth by 1 and falsehood by 0; conversely, the calculus of probabilities can be derived from the calculus of propositions. This conforms with "Bayesian" reasoning most closely, although it also extends Bayesian reasoning in applications by providing principles to assign probabilities, in addition to principles to manipulate them. Of course, this leads to the follow-up question "what is logic?" The closest answer I can give is "logic is the common-sense judgement of a rational person, given a set of assumptions" (what is a rational person? etc.). Logic has all the same features that Bayesian reasoning has. For example, logic does not tell you what to assume or what is "absolutely true"; it only tells you how the truth of one proposition is related to the truth of another. You always have to supply a logical system with "axioms" for it to get started on its conclusions, and it has the same limitation that you can get arbitrary results from contradictory axioms. But "axioms" are nothing but prior probabilities which have been set to 1. For me, to reject Bayesian reasoning is to reject logic: if you accept logic, then because Bayesian reasoning logically flows from logic, you must also accept Bayesian reasoning.
- Frequentist:
A Frequentist is someone that believes probabilities represent long run frequencies with which events occur; if needs be, he will invent a fictitious population from which your particular situation could be considered a random sample so that he can meaningfully talk about long run frequencies. If you ask him a question about a particular situation, he will not give a direct answer, but instead make a statement about this (possibly imaginary) population.
- Probability is Frequency
Although I'm not sure "frequency" is a plain-English term in the way it is used here; perhaps "proportion" is a better word. I wanted to add to the frequentist answer that the probability of an event is thought to be a real, measurable (observable?) quantity which exists independently of the person calculating it, but I couldn't do this in a "plain English" way.
So perhaps a "plain English" version of the difference is that frequentist reasoning is an attempt at reasoning from "absolute" probabilities, whereas Bayesian reasoning is an attempt at reasoning from "relative" probabilities (the long-run-frequency reading is illustrated with a small simulation below).
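To make the long-run-frequency reading concrete, a small simulation sketch (assuming Python with NumPy; the coin and its bias are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Long-run relative frequency: the proportion of heads in repeated flips of a coin
# with true P(heads) = 0.3 settles toward 0.3 as the run gets longer.
p_true = 0.3
flips = rng.random(100_000) < p_true
for n in (10, 100, 1_000, 100_000):
    print(f"after {n:>6} flips: relative frequency = {flips[:n].mean():.3f}")
```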
Statistical Methods:
- Bayesian:
- Probability refers to degree of belief
- Inference about a parameter \(\theta\) is done by producing a probability distribution over it.
Typically, one starts with a prior distribution \(p(\theta)\). One also chooses a likelihood function \(p(x \mid \theta)\), viewed as a function of \(\theta\), not of \(x\). After observing data \(x\), one applies Bayes' theorem to obtain the posterior distribution \(p(\theta \mid x)\):
$$p(\theta \mid x)=\frac{p(\theta) p(x \mid \theta)}{\int p\left(\theta^{\prime}\right) p\left(x \mid \theta^{\prime}\right) d \theta^{\prime}} \propto p(\theta) p(x \mid \theta)$$
where the denominator \(Z \equiv \int p\left(\theta^{\prime}\right) p\left(x \mid \theta^{\prime}\right) d \theta^{\prime}\) is known as the normalizing constant (or evidence). The posterior distribution is a complete characterization of the parameter.
Sometimes, one uses the mode of the posterior as a simple point estimate, known as the maximum a posteriori (MAP) estimate of the parameter:
\(\theta^{\text{MAP}}=\operatorname{argmax}_ {\theta}\, p(\theta \mid x)\)
Note: MAP is not a proper Bayesian approach, since it summarizes the posterior with a single point rather than keeping the full distribution.
- Prediction under an unknown parameter is done by integrating it out:
\(p(x \mid \text {Data})=\int p(x \mid \theta) p(\theta \mid \text{Data}) d \theta\)
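A minimal sketch of both steps (assuming Python with NumPy/SciPy and a conjugate Beta-Bernoulli model; all numbers are illustrative):

```python
import numpy as np
from scipy.stats import beta

# Beta(a0, b0) prior on theta, Bernoulli likelihood; after observing h heads and
# t tails, the posterior is Beta(a0 + h, b0 + t) by conjugacy.
a0, b0, h, t = 2.0, 2.0, 7, 3
posterior = beta(a0 + h, b0 + t)

# MAP: mode of the posterior (a point estimate, not a full Bayesian answer).
theta_map = (a0 + h - 1) / (a0 + h + b0 + t - 2)

# Posterior predictive p(x = heads | Data) = ∫ p(x | theta) p(theta | Data) dtheta,
# approximated by Monte Carlo (for Beta-Bernoulli it equals the posterior mean).
samples = posterior.rvs(size=100_000, random_state=0)
p_heads = samples.mean()

print(f"MAP estimate:          {theta_map:.3f}")
print(f"p(next flip is heads): {p_heads:.3f}  (exact: {posterior.mean():.3f})")
```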
- Frequentist:
- Probability refers to limiting relative frequency
- Data are random
- Estimators are random because they are functions of data
- Parameters are fixed, unknown constants not subject to probabilistic statements
- Procedures are subject to probabilistic statements; for example, a 95% confidence interval procedure traps the true parameter value 95% of the time over repeated sampling.
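A minimal sketch of that repeated-sampling claim (assuming Python with NumPy; the true mean, standard deviation, and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeated-sampling reading of a 95% confidence interval: the *procedure* traps the
# fixed, true mean in about 95% of repeated experiments.
mu, sigma, n, trials = 5.0, 2.0, 30, 10_000
covered = 0
for _ in range(trials):
    x = rng.normal(mu, sigma, size=n)
    half_width = 1.96 * sigma / np.sqrt(n)        # known-sigma z-interval
    lo, hi = x.mean() - half_width, x.mean() + half_width
    covered += (lo <= mu <= hi)
print(f"fraction of intervals covering mu: {covered / trials:.3f}")   # ≈ 0.95
```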