Generative vs Discriminative Models
Given some data \(\{(d,c)\}\) of paired observations \(d\) and hidden classes \(c\):
-
- Generative (Joint) Models:
- Generative Models are Joint Models.
- Joint Models place probabilities \(\left(P(c,d)\right)\) over both the observed data and the “target” (hidden) variables, which can only be inferred from what is observed.
- Generative models are typically probabilistic, specifying a joint probability distribution (\(P(d,c)\)) over observation and target (label) values,
and they try to Maximize this joint Likelihood. Choosing the weights turns out to be trivial: they are chosen as the relative frequencies observed in the data (see the counting sketch at the end of this section).
- Examples:
- n-gram Models
- Naive Bayes Classifiers
- Hidden Markov Models (HMMs)
- Probabilistic Context-Free Grammars (PCFGs)
- IBM Machine Translation Alignment Models
-
- Discriminative (Conditional) Models:
- Discriminative Models are Conditional Models.
- Conditional Models provide a model only for the “target” (hidden) variables.
They take the data as given, and put a probability \(\left(P(c \| d)\right)\) over the “target” (hidden) structures given the data.
- Conditional Models seek to Maximize the Conditional Likelihood.
This (maximization) task is usually harder to do.
- Examples:
- Logistic Regression
- Conditional LogLinear/Maximum Entropy Models
- Conditional Random Fields (CRFs)
- Support Vector Machines (SVMs)
- Perceptrons
-
- Generative VS Discriminative Models:
- Basically, Discriminative Models infer outputs based on inputs,
while Generative Models generate both inputs and outputs (typically given some hidden parameters).
- However, notice that the two kinds of models are usually viewed as complementary procedures.
One does not necessarily outperform the other in either classification or regression tasks.
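As a concrete illustration of the "weights are relative frequencies" remark above, here is a minimal sketch of fitting a joint model purely by counting; the toy dataset and variable names are made up for illustration:

```python
from collections import Counter

# Hypothetical toy data: (d, c) pairs of an observation and its class.
data = [("rain", "umbrella"), ("rain", "umbrella"), ("sun", "no-umbrella"),
        ("rain", "no-umbrella"), ("sun", "no-umbrella")]

counts = Counter(data)
total = sum(counts.values())

# Maximizing the joint likelihood reduces to counting:
# each P(d, c) is just a relative frequency.
P_joint = {pair: n / total for pair, n in counts.items()}

print(P_joint[("rain", "umbrella")])  # 0.4
```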
Feature Extraction for Discriminative Models in NLP
-
- Features (Intuitively):
- Features (\(f\)) are elementary pieces of evidence that link aspects of what we observe (\(d\)) with a category (\(c\)) that we want to predict.
-
- Features (Mathematically):
- A Feature \(f\) is a function of a class and a datum, with a bounded real value:
- \[f : \: C \times D \rightarrow \mathbf{R}\]
-
- Models and Features:
- Models will assign a weight to each Feature:
- A Positive Weight votes that this configuration is likely Correct.
- A Negative Weight votes that this configuration is likely Incorrect.
-
- Feature Expectations:
-
- Empirical Expectation (count):
- \[E_{\text{emp}}(f_i) = \sum_{(c,d)\in\text{observed}(C,D)} f_i(c,d)\]
-
- Model Expectation:
- \[E(f_i) = \sum_{(c,d)\in(C,D)} P(c,d)f_i(c,d)\]
-
The two Expectations represent the Actual and the Predicted Counts of a feature firing, respectively.
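A minimal sketch of the two expectations, using a made-up feature and a stand-in model probability (both purely illustrative):

```python
# Toy setup: observed (c, d) pairs, plus the full class and data spaces.
observed = [("c1", "d1"), ("c1", "d2"), ("c2", "d1")]
classes, data = ["c1", "c2"], ["d1", "d2"]

def f_i(c, d):
    # Illustrative feature: fires when the class is c1 and the datum is d1.
    return 1.0 if (c == "c1" and d == "d1") else 0.0

def P(c, d):
    # Stand-in joint model P(c, d); a real model would supply these values.
    return 0.25

E_emp = sum(f_i(c, d) for (c, d) in observed)                        # actual count
E_model = sum(P(c, d) * f_i(c, d) for c in classes for d in data)    # predicted count
print(E_emp, E_model)  # 1.0 0.25
```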
- Features in NLP:
In NLP, features have a particular form.
They consist of:
- An Indicator Function: a boolean matching function of properties of the input
- A Particular Class: specifies some class \(c_j\)
$$f_i(c,d) \cong [\Phi(d) \wedge c=c_j] \in \{0, 1\}$$
where \(\Phi(d)\) is a given predicate on the data \(d\), and \(c_j\) is a particular class.
Basically, each feature picks out a data subset and suggests a label for it.
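For instance, a part-of-speech-style indicator feature might look like the following sketch; the predicate and class name are made up for illustration:

```python
def phi(d):
    # Predicate on the datum: "the word ends in -ing".
    return d.endswith("ing")

def f_i(c, d, c_j="VERB"):
    # f_i(c, d) = [phi(d) AND c == c_j], i.e. either 0 or 1.
    return 1.0 if (phi(d) and c == c_j) else 0.0

print(f_i("VERB", "running"))  # 1.0
print(f_i("NOUN", "running"))  # 0.0
```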
Feature-Based Linear Classifiers
-
- Linear Classifiers (Classification):
-
- We have a Linear Function from the feature sets \(\{f_i\}\) to the classes \(\{c\}\)
- Assign Weights \(\lambda_i\) to each feature \(f_i\)
- Consider each class for an observed datum \(d\)
-
- Features Vote with their weights:
- \[\text{vote}(c) = \sum \lambda_i f_i(c,d)\]
- Classification:
choose the class \(c\) which Maximizes the vote \(\sum \lambda_i f_i(c,d)\)
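A minimal sketch of voting-based classification, assuming the features and weights are supplied as plain Python lists (all names here are illustrative):

```python
def vote(c, d, features, weights):
    # vote(c) = sum_i lambda_i * f_i(c, d)
    return sum(lam * f(c, d) for lam, f in zip(weights, features))

def classify(d, classes, features, weights):
    # Choose the class c that maximizes the vote.
    return max(classes, key=lambda c: vote(c, d, features, weights))
```

This classifier only produces an argmax; the exponential models below turn the same linear scores into probabilities.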
-
- Exponential Models:
- Exponential Models make a probabilistic model from the linear combination \(\sum\lambda_if_i(c,d)\)
-
- Making the Value Positive:
- \[\sum\lambda_if_i(c,d) \rightarrow e^{\sum\lambda_if_i(c,d)}\]
-
- Normalizing the Value (Making a Probability):
- \[e^{\sum\lambda_if_i(c,d)} \rightarrow \dfrac{e^{\sum\lambda_if_i(c,d)}}{\sum_{c' \in C} e^{\sum\lambda_if_i(c',d)}}\]
- \[\implies\]
- \[P(c \| d, \vec{\lambda}) = \dfrac{e^{\sum\lambda_if_i(c,d)}}{\sum_{c' \in C} e^{\sum\lambda_if_i(c',d)}}\]
- The function \(P(c \| d,\vec{\lambda})\) is referred to as the Soft-Max function.
- Here, the Weights are the Parameters of the probability model, combined via a Soft-Max function (a small sketch follows the examples below).
- Learning:
- Given this model form, we want to choose parameters \(\{\lambda_i\}\) that Maximize the Conditional Likelihood of the data according to this model (i.e. the soft-max function).
- Exponential Models construct not only classifications but also Probability Distributions over the classifications.
- Examples:
- Log-Linear Model
- Max Entropy (MaxEnt) Model
- Logistic Regression
- Gibbs Model
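A minimal sketch of the soft-max step described above: compute each class's linear score (vote), exponentiate to make it positive, and normalize over all classes. Function and argument names are illustrative, not part of the notes:

```python
import math

def softmax_probs(d, classes, features, weights):
    # Linear score (vote) for each class: sum_i lambda_i * f_i(c, d).
    votes = {c: sum(w * f(c, d) for w, f in zip(weights, features)) for c in classes}
    # Exponentiate to make the values positive, then normalize to probabilities.
    z = sum(math.exp(v) for v in votes.values())
    return {c: math.exp(v) / z for c, v in votes.items()}
```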
- Exponential Models (Training) | Maximizing the Likelihood:
The Likelihood Value:
- The (log) conditional likelihood of a MaxEnt model is a function of the i.i.d. data \((C,D)\) and the parameters (\(\lambda\)):
$$\log P(C \| D,\lambda) = \log \prod_{(c,d) \in (C,D)} P(c \| d,\lambda) = \sum_{(c,d) \in (C,D)} \log P(c \| d,\lambda)$$
- If there aren’t many values of \(c\), it’s easy to calculate:
$$\log P(C \| D,\lambda) = \sum_{(c,d) \in (C,D)} \log \dfrac{e^{\sum_i \lambda_if_i(c,d)}}{\sum_{c'} e^{\sum_i \lambda_if_i(c',d)}}$$
- We can separate this into two components:
$$\log P(C \| D,\lambda) = \sum_{(c,d) \in (C,D)} \log e^{\sum_i \lambda_if_i(c,d)} - \sum_{(c,d) \in (C,D)} \log \sum_{c'} e^{\sum_i \lambda_if_i(c',d)}$$
$$\implies$$
$$\log P(C \| D, \lambda) = N(\lambda) - M(\lambda)$$
- The Derivative of the Numerator is easy to calculate:
$$\dfrac{\partial N(\lambda)}{\partial \lambda_i} = \dfrac{\partial \sum_{(c,d) \in (C,D)} \log e^{\sum_i \lambda_if_i(c,d)}}{\partial \lambda_i} \\= \dfrac{\partial \sum_{(c,d) \in (C,D)} \sum_i \lambda_if_i(c,d)}{\partial \lambda_i} \\\\= \sum_{(c,d) \in (C,D)} \dfrac{\partial \sum_i \lambda_if_i(c,d)}{\partial \lambda_i} \\\\= \sum_{(c,d) \in (C,D)} f_i(c,d)$$
- The derivative of the Numerator is the Empirical Expectation, \(E_{\text{emp}}(f_i)\)
- The Derivative of the Denominator:
$$\dfrac{\partial M(\lambda)}{\partial \lambda_i} = \dfrac{\partial \sum_{(c,d) \in (C,D)} \log \sum_{c'} e^{\sum_i \lambda_if_i(c',d)}}{\partial \lambda_i} \\\\= \sum_{(c,d) \in (C,D)} \sum_{c'} P(c' \| d, \lambda)f_i(c', d)$$
- The derivative of the Denominator is equal to the Predicted Expectation (count), \(E(f_i, \lambda)\)
- Thus, the derivative of the log likelihood is:
$$\dfrac{\partial \log P(C \| D, \vec{\lambda})}{\partial \lambda_i} = \text{Actual Count}(f_i, C) - \text{Predicted Count}(f_i, \vec{\lambda})$$
- Thus, the optimum parameters are those for which each feature’s predicted expectation equals its empirical expectation.
The Optimum Distribution is always:
- Unique (though the parameters that achieve it need not be unique)
- Guaranteed to Exist (as long as the feature counts come from actual data)
These models are called Maximum Entropy (Maxent) Models because, among all distributions satisfying the constraints below, we pick the one with the maximum entropy:
$$E_p(f_j) = E_{\hat{p}}(f_j), \:\:\: \forall j$$
Finally, to find the optimal parameters \(\lambda_1, \dots, \lambda_d\) one needs to maximize the log likelihood, or equivalently, minimize the negative log likelihood.
One can do that in a variety of ways using standard optimization methods (a small gradient-ascent sketch follows the list below). Common Optimization Methods:
- (Stochastic) Gradient Descent
- Iterative Proportional Fitting Methods:
- Generalized Iterative Scaling (GIS)
- Improved Iterative Scaling (IIS)
- Conjugate Gradient (CG) (+ Preconditioning)
- Quasi-Newton Methods / Limited-Memory Variable Metric (LMVM):
- L-BFGS (the most commonly used)
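To tie the derivation together, here is a minimal, deliberately naive gradient-ascent sketch that updates each weight by exactly the Actual Count minus the Predicted Count derived above. It is illustrative only (names and hyperparameters are made up); in practice one would use L-BFGS or another of the methods listed:

```python
import math

def train_maxent(observed, classes, features, steps=200, lr=0.1):
    # observed: list of (c, d) pairs; features: list of functions f_i(c, d).
    weights = [0.0] * len(features)

    def p(c, d):
        # Soft-max over the linear scores sum_i lambda_i * f_i(c, d).
        scores = {c2: math.exp(sum(w * f(c2, d) for w, f in zip(weights, features)))
                  for c2 in classes}
        return scores[c] / sum(scores.values())

    for _ in range(steps):
        grad = []
        for f in features:
            actual = sum(f(c, d) for (c, d) in observed)          # empirical count
            predicted = sum(p(c2, d) * f(c2, d)                   # model's expected count
                            for (_, d) in observed for c2 in classes)
            grad.append(actual - predicted)
        # Gradient ascent on the log conditional likelihood.
        weights = [w + lr * g for w, g in zip(weights, grad)]
    return weights
```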