Gradient-Based Optimization
- Define Gradient Methods:
- Give examples of Gradient-Based Algorithms:
- What is Gradient Descent:
- Explain it intuitively:
- Give its derivation:
- What is the learning rate?
- Where does it come from?
- How does it relate to the step-size?
- We go from having a fixed step-size to [blank]:
- What is its range?
- How do we choose the learning rate?
- Compare Line Search vs Trust Region:
- Learning Rate Schedule:
- Define:
- List Types:
- Describe the convergence of the algorithm:
- How does GD relate to Euler?
- List the variants of GD:
- How do they differ? Why?:
- BGD:
- SGD:
- How should we handle the lr in this case? Why?
- M-BGD:
- What advantages does it have?
- Explain the different kinds of gradient-descent optimization procedures:
- State the difference between SGD and GD?
- When would you use GD over SGD, and vice-versa?
- What is the problem of vanilla approaches to GD?
- List the challenges that account for the problem above:
- List the different strategies for optimizing GD:
- List the different variants for optimizing GD:
- Momentum:
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Nesterov Accelerated Gradient (Momentum):
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Successful Applications:
- Adagrad
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Successful Application:
- Properties:
- Adadelta
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Properties:
- RMSprop
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Properties:
- Adam
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Properties:
- Which methods have trouble with saddle points?
- How should you choose your optimizer?
- Summarize the different variants listed above. How do they compare to each other?
- What’s a common choice in many research papers?
- List additional strategies for optimizing SGD:
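A minimal NumPy sketch (illustrative only, not tied to any framework) contrasting the batch, stochastic, and mini-batch update rules asked about above; the least-squares objective, the fixed learning rate `lr`, and the function names are assumptions made for this example.

```python
import numpy as np

def gradient(X, y, w):
    """Gradient of the least-squares objective 0.5 * ||Xw - y||^2 w.r.t. w."""
    return X.T @ (X @ w - y)

def gradient_descent(X, y, lr=0.01, epochs=100, batch_size=None, seed=0):
    """batch_size=None -> Batch GD; batch_size=1 -> SGD; otherwise Mini-Batch GD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        if batch_size is None:
            # BGD: one exact (full-data) gradient step per epoch.
            w -= lr * gradient(X, y, w) / n
        else:
            # SGD / M-BGD: noisy gradient estimates from sampled examples.
            idx = rng.permutation(n)
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                w -= lr * gradient(X[batch], y[batch], w) / len(batch)
    return w
```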
Maximum Margin Classifiers
- Define Margin Classifiers:
- What is a Margin for a linear classifier?
- Give the motivation for margin classifiers:
- Define the notion of the “best” possible classifier
- How can we achieve the “best” classifier?
- What unique vector is orthogonal to the hp? Prove it:
- What do we mean by “signed distance”? Derive its formula:
- Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
- Use geometric properties of the hp to simplify the expression for the distance of the closest point to the hp, above:
- Characterize the margin, mathematically:
- Characterize the “Slab Existence”:
- Formulate the optimization problem of maximizing the margin wrt analysis above:
- Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
- Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:
- What kind of formulation is it (wrt optimization)? What are the parameters?
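For reference, a standard way to write the final optimization problem this section builds toward, assuming labels \(y_n \in \{-1,+1\}\) and a linear decision function \(\mathbf{w}^{\top}\mathbf{x} + b\):

$$
\begin{aligned}
\min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^{2} \\
\text{s.t.} \quad & y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_{n}+b\right) \geq 1, \quad n=1, \ldots, N
\end{aligned}
$$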
Hard-Margin SVMs
- Define:
- SVMs:
- Support Vectors:
- Hard-Margin SVM:
- Define the following wrt hard-margin SVM:
- Goal:
- Procedure:
- Decision Function:
- Constraints:
- The Optimization Problem:
- The Optimization Method:
- Elaborate on the generalization analysis:
- List the properties:
- Give the solution to the optimization problem for H-M SVM:
- What method does it require to be solved:
- Formulate the Lagrangian:
- Optimize the objective for each variable:
- Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
- Set the problem as a Quadratic Programming problem:
- What are the inputs and outputs to the Quadratic Program Package?
- Give the final form of the optimization problem in standard form:
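As a reference for the Lagrangian/dual/QP questions above, the standard dual obtained by optimizing the Lagrangian over \(\mathbf{w}\) and \(b\) and keeping the multipliers \(\alpha_n\):

$$
\begin{aligned}
\max_{\boldsymbol{\alpha}} \quad & \sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_{n} \alpha_{m} y_{n} y_{m} \mathbf{x}_{n}^{\top} \mathbf{x}_{m} \\
\text{s.t.} \quad & \alpha_{n} \geq 0, \quad \sum_{n=1}^{N} \alpha_{n} y_{n}=0
\end{aligned}
$$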
Soft-Margin SVM
- Motivate the soft-margin SVM:
- What is the main idea behind it?
- Define the following wrt soft-margin SVM:
- Goal:
- Procedure:
- Decision Function:
- Constraints:
- Why is there a non-negativity constraint?
- Objective/Cost Function:
- The Optimization Problem:
- The Optimization Method:
- Properties:
- Specify the effects of the regularization hyperparameter \(C\):
- Describe the effect wrt over/under fitting:
- How do we choose \(C\)?
- Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)
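For reference while answering the constraint/objective questions above, the usual soft-margin primal with slack variables \(\xi_n\) and regularization hyperparameter \(C\):

$$
\begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n} \\
\text{s.t.} \quad & y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_{n}+b\right) \geq 1-\xi_{n}, \quad \xi_{n} \geq 0
\end{aligned}
$$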
Loss Functions
- Define:
- Loss Functions - Abstractly and Mathematically:
- Distance-Based Loss Functions:
- What are they used for?
- Describe an important property of dist-based losses:
- Translation Invariance:
- Relative Error - What does it lack?
- List 3 Regression Loss Functions
- MSE
- What does it minimize:
- Formula:
- Graph:
- Derivation:
- MAE
- What does it minimize:
- Formula:
- Graph:
- Derivation:
- List properties:
- Huber Loss
- AKA:
- What does it minimize:
- Formula:
- Graph:
- List properties:
- Analyze MSE vs MAE ref:
- List 7 Classification Loss Functions
- \(0-1\) loss
- What does it minimize:
- Formula:
- Graph:
- MSE
- Formula:
- Graph:
- Derivation (for classification) - give assumptions:
- Properties:
- Hinge Loss
- What does it minimize:
- Formula:
- Graph:
- Properties:
- Describe the properties of the Hinge loss and why it is used?
- Logistic Loss
- AKA:
- What does it minimize:
- Formula:
- Graph:
- Derivation:
- Properties:
- Cross-Entropy
- What does it minimize:
- Formula:
- Binary Cross-Entropy:
- Graph:
- CE and Negative-Log-Probability:
- CE and Log-Loss:
- Derivation:
- CE and KL-Div:
- Exponential Loss
- Formula:
- Properties:
- Perceptron Loss
- Formula:
- Analysis
- Logistic vs Hinge Loss:
- Cross-Entropy vs MSE:
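Reference implementations of the losses listed above, as a sketch; the margin convention \(y \in \{-1,+1\}\) with raw score \(s = f(x)\), and `delta=1.0` for the Huber loss, are assumptions made for illustration.

```python
import numpy as np

# Margin-based classification losses for label y in {-1, +1} and score s = f(x).
def zero_one(y, s):    return float(np.sign(s) != y)
def hinge(y, s):       return max(0.0, 1.0 - y * s)
def logistic(y, s):    return float(np.log1p(np.exp(-y * s)))   # log-loss on the margin
def exponential(y, s): return float(np.exp(-y * s))
def perceptron(y, s):  return max(0.0, -y * s)

# Regression losses for a target y and prediction yhat.
def mse(y, yhat):   return (y - yhat) ** 2
def mae(y, yhat):   return abs(y - yhat)
def huber(y, yhat, delta=1.0):
    r = abs(y - yhat)
    return 0.5 * r ** 2 if r <= delta else delta * (r - 0.5 * delta)
```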
Information Theory
- What is Information Theory? In the context of ML?
- Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
- Measuring Information - Definitions and Formulas:
- In Shannon's theory, how do we quantify “transmitting 1 bit of information”?
- What is the amount of information transmitted?
- What is the uncertainty reduction factor?
- What is the amount of information in an event \(x\)?
- Define the Self-Information - Give the formula:
- What is it defined with respect to?
- Define Shannon Entropy - What is it used for?
- Describe how Shannon Entropy relates to distributions, with a graph:
- Define Differential Entropy:
- How does entropy characterize distributions?
- Define Relative Entropy - Give its formula:
- Give an interpretation:
- List the properties:
- Describe it as a distance:
- List the applications of relative entropy:
- How does the direction of minimization affect the optimization:
- Define Cross Entropy - Give its formula:
- What does it measure?
- How does it relate to relative entropy?
- When are they equivalent?
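A small NumPy sketch of the three quantities above (natural log, so results are in nats; `log2` would give bits). It assumes \(q > 0\) wherever \(p > 0\), and the helper names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p log p (convention: 0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) = sum p log(p / q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # assumes q > 0 wherever p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q):
    """H(p, q) = H(p) + D_KL(p || q) = -sum p log q."""
    return entropy(p) + kl_divergence(p, q)
```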
Recommendation Systems
- Describe the different algorithms for recommendation systems:
Ensemble Learning
- What are the two paradigms of ensemble methods?
- Random Forest VS GBM?
Data Processing and Analysis
- What are 3 data preprocessing techniques to handle outliers?
- Describe the strategies to dimensionality reduction?
- What are 3 ways of reducing dimensionality?
- List methods for Feature Selection
- List methods for Feature Extraction
- How to detect correlation of “categorical variables”?
- Feature Importance
- Can we capture the correlation between a continuous and a categorical variable? If yes, how?
- What cross validation technique would you use on time series data set?
- How to deal with missing features? (Imputation?)
- Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
- What are collinearity and multicollinearity?
ML/Statistical Models
- What are parametric models?
- What is a classifier?
K-NN
PCA
- What is PCA?
- What is the Goal of PCA?
- List the applications of PCA:
- Give formulas for the following:
- Assumptions on \(X\):
- SVD of \(X\):
- Principal Directions/Axes:
- Principal Components (scores):
- The \(j\)-th principal component:
- Define the transformation, mathematically:
- What does PCA produce/result in?
- Finds a lower dimensional subspace spanned by what?:
- Finds a lower dimensional subspace that minimizes what?:
- What does each PC have (properties)?
- What does the procedure find in terms of a “basis”?
- What does the procedure find in terms of axes? (where do they point?):
- Describe the PCA algorithm:
- What Data Processing needs to be done?
- How to compute the Principal Components?
- How do you compute the Low-Rank Approximation Matrix \(X_k\)?
- Describe the Optimality of PCA:
- List limitations of PCA:
- Intuition:
- What property of the internal structure of the data does PCA reveal/explain?
- What object does it fit to the data?:
- Should you remove correlated features before PCA?
- How can we measure the “Total Variance” of the data?
- How can we measure the “Total Variance” of the projected data?
- How can we measure the “Error in the Projection”?
- What does it mean when this ratio is high?
- How does PCA relate to CCA?
- How does PCA relate to ICA?
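A minimal sketch of PCA via the SVD, covering the directions, scores, and rank-\(k\) reconstruction asked about above; row-major samples and the function signature are assumptions made for this example.

```python
import numpy as np

def pca(X, k):
    """PCA of an (n x d) data matrix X (rows are samples) via the SVD."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centering is required
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:k]                             # principal directions/axes
    scores = Xc @ directions.T                      # principal components (scores)
    explained_var = S[:k] ** 2 / (X.shape[0] - 1)   # variance along each PC
    X_k = scores @ directions + mean                # rank-k reconstruction of X
    return directions, scores, explained_var, X_k
```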
The Centroid Method
- Define “The Centroid”:
- Describe the Procedure:
- What is the Decision Function:
- Describe the Decision Boundary:
K-Means
- What is K-Means?
- What is the idea behind K-Means?
- What does K-Means find?
- Formal Description of the Model:
- What is the Objective?
- Description of the Algorithm:
- What is the Optimization method used? What class does it belong to?
- How does the optimization method relate to EM?
- What is the Complexity of the algorithm?
- Describe the convergence and prove it:
- Describe the Optimality of the Algorithm:
- Derive the estimated parameters of the algorithm:
- Objective Function:
- Optimization Objective:
- Derivation:
- When does K-Means fail to give good results?
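A compact sketch of Lloyd's algorithm (the alternating optimization referred to above); the initialization scheme, iteration cap, and convergence test are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                      # objective can no longer decrease
        centers = new_centers
    return centers, labels
```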
Naive Bayes
- Define:
- Naive Bayes:
- Naive Bayes Classifiers:
- Bayes Theorem:
- List the assumptions of Naive Bayes:
- List some properties of Naive Bayes:
- Is it a Bayesian Method or a Frequentist Method?
- Is it a Bayes Classifier? What does that mean?:
- Define the Probabilistic Model for the method:
- What kind of model is it?
- What is a conditional probability model?
- Decompose the conditional probability w/ Bayes Theorem:
- How does the new expression incorporate the joint probability model?
- Use the chain rule to re-write the joint probability model:
- Use the Naive Conditional Independence assumption to rewrite the joint model:
- What is the conditional distribution over the class variable \(C_k\):
- Construct the classifier. What are its components? Formally define it.
- What’s the decision rule used?
- List the difference between the Naive Bayes Estimate and the MAP Estimate:
- What are the parameters to be estimated for the classifier?:
- What method do we use to estimate the parameters?:
- What are the estimates for each of the following parameters?:
- The prior probability of each class:
- The conditional probability of each feature (word) given a class:
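A sketch of the parameter estimates asked about above for a multinomial (word-count) naive Bayes text classifier; the count-matrix input format and the Laplace smoothing constant `alpha=1.0` are assumptions, not requirements of plain MLE.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log-priors and per-class log word probabilities.
    X: (n_docs x vocab) word-count matrix; y: class labels."""
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    log_cond = []
    for c in classes:
        counts = X[y == c].sum(axis=0) + alpha        # smoothed word counts in class c
        log_cond.append(np.log(counts / counts.sum()))
    return classes, log_prior, np.array(log_cond)

def predict(X, classes, log_prior, log_cond):
    """MAP decision rule: argmax_c [ log p(c) + sum_i x_i log p(word_i | c) ]."""
    return classes[(X @ log_cond.T + log_prior).argmax(axis=1)]
```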
CNNs
- What is a CNN?
- What kind of data does it work on? What is the mathematical property?
- What are the layers of a CNN?
- What are the four important ideas and their benefits that the convolution affords CNNs:
- Benefits:
- Benefits:
- Benefits:
- What is the inspirational model for CNNs:
- Describe the connectivity pattern of the neurons in a layer of a CNN:
- Describe the process of a ConvNet:
- Convolution Operation:
- Define:
- Formula (continuous):
- Formula (discrete):
- Define the following:
- Feature Map:
- Does the operation commute?
- Cross Correlation:
- Define:
- Formulae:
- What are the differences/similarities between convolution and cross-correlation:
- Write down the Convolution operation and the cross-correlation over two axes and:
- Convolution:
- Convolution (commutative):
- Cross-Correlation:
- The Convolutional Layer:
- What are the parameters and how do we choose them?
- Describe what happens in the forward pass:
- What is the output of the forward pass:
- How is the output configured?
- Spatial Arrangements:
- List the Three Hyperparameters that control the output volume:
- How to compute the spatial size of the output volume?
- How can you ensure that the input & output volume are the same?
- In the output volume, how do you compute the \(d\)-th depth slice:
- Calculate the number of parameters for the following config:
- Definitions:
- Receptive Field:
- Suppose the input volume has size \([ 32 × 32 × 3 ]\) and the receptive field (or the filter size) is \(5 × 5\) , then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
- How can we achieve the greatest reduction in the spatial dims of the network (for classification):
- Pooling Layer:
- Define:
- List key ideas/properties and benefits:
- List the different types of Pooling:
- List variations of pooling and their definitions:
- What is “Learned Pooling”:
- What is “Dynamical Pooling”:
- List the hyperparams of Pooling Layer:
- How to calculate the size of the output volume:
- How many parameters does the pooling layer have:
- What are other ways to perform downsampling:
- Weight Priors:
- Define “Prior Prob Distribution on the parameters”:
- Define “Weight Prior” and its types/classes:
- Weak Prior:
- Strong Prior:
- Infinitely Strong Prior:
- Describe the Conv Layer as a FC Layer using priors:
- What are the key insights of using this view:
- When is the prior imposed by convolution INAPPROPRIATE:
- What happens when the priors imposed by convolution and pooling are not suitable for the task?
- What kind of other models should Convolutional models be compared to? Why?:
- When do multi-channel convolutions commute?
- Why do we use several different kernels in a given conv-layer?
- Strided Convolutions
- Define:
- What are they used for?
- What are they equivalent to?
- Formula:
- Zero-Padding:
- Definition/Usage:
- List the types of padding:
- Locally Connected Layers/Unshared Convolutions:
- Bias Parameter:
- How many bias terms are used per output channel in the traditional convolution:
- Dilated Convolutions
- Define:
- What are they used for?
- Stacked Convolutions
- Define:
- What are they used for?
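A small sketch of the output-volume and parameter-count arithmetic asked about in this section; the helper names are illustrative, and the receptive-field numbers echo the 32×32×3 / 5×5 example above.

```python
def conv_output_size(W_in, F, P, S):
    """Spatial size of a conv/pool output volume: (W - F + 2P) / S + 1."""
    assert (W_in - F + 2 * P) % S == 0, "hyperparameters do not tile the input"
    return (W_in - F + 2 * P) // S + 1

def conv_num_params(F, D_in, K):
    """Parameters in a conv layer with K filters of size F x F x D_in,
    using one shared bias per output channel (the traditional convolution)."""
    return K * (F * F * D_in + 1)

# Receptive-field example from above: 32x32x3 input, 5x5 filters.
assert conv_num_params(F=5, D_in=3, K=1) == 76          # 5*5*3 weights + 1 bias
assert conv_output_size(W_in=32, F=5, P=2, S=1) == 32   # "same" padding keeps the size
```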
Theory
RNNs
- What is an RNN?
- Definition:
- What type of machine is the standard RNN:
- What is the big idea behind RNNs?
- Dynamical Systems:
- Standard Form:
- RNN as a Dynamical System:
- Unfolding Computational Graphs
- Definition:
- List the Advantages introduced by unfolding and the benefits:
- Graph and write the equations of Unfolding hidden recurrence:
- Describe the State of the RNN, its usage, and extreme cases of the usage:
- RNN Architectures:
- List the three standard architectures of RNNs:
- Graph:
- Architecture:
- Equations:
- Total Loss:
- Complexity:
- Properties:
- Teacher Forcing:
- Definition:
- Application:
- Disadvantages:
- Possible Solutions for Mitigation:
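A minimal sketch of the unfolded recurrence discussed above for a vanilla RNN; the parameter names, shapes, and tanh nonlinearity are assumptions for illustration.

```python
import numpy as np

def rnn_forward(x_seq, h0, Wxh, Whh, Why, bh, by):
    """Unfolded vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh), y_t = Why h_t + by."""
    h, hs, ys = h0, [], []
    for x in x_seq:                    # one step of the unrolled graph per input
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        hs.append(h)
        ys.append(Why @ h + by)
    return hs, ys
```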
Optimization
- Define the sigmoid function and some of its properties:
- Backpropagation:
- Definition:
- Derive Gradient Descent Update:
- Explain the different kinds of gradient-descent optimization procedures:
- List the different optimizers and their properties:
- Error-Measures:
- Define what an error measure is:
- List the 5 most common error measures and where they are used:
- Specific Questions:
- Derive MSE carefully:
- Derive the Binary Cross-Entropy Loss function:
- Explain the difference between Cross-Entropy and MSE and which is better (for what task)?
- Describe the properties of the Hinge loss and why it is used?
- Show that the weight vector of a linear signal is orthogonal to the decision boundary?
- What does it mean for a function to be well-behaved from an optimization pov?
- Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation
- Compute:
- \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
- State the difference between SGD and GD?
- When would you use GD over SGD, and vice-versa?
- What is a convex hull?
- OLS vs MLE
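A quick numerical check of the "write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation" identity above; the random shapes are arbitrary assumptions for the example.

```python
import numpy as np

# ||Xw - y||^2 equals the sum over examples of (x_n^T w - y_n)^2.
rng = np.random.default_rng(0)
X, w, y = rng.normal(size=(5, 3)), rng.normal(size=3), rng.normal(size=5)
lhs = np.linalg.norm(X @ w - y) ** 2
rhs = sum((X[n] @ w - y[n]) ** 2 for n in range(len(y)))
assert np.isclose(lhs, rhs)
```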
ML Theory
- Explain intuitively why Deep Learning works?
- List the different types of Learning Tasks and their definitions:
- Describe the relationship between supervised and unsupervised learning?
- Describe the differences between Discriminative and Generative Models?
- Describe the curse of dimensionality and its effects on problem solving:
- How to deal with curse of dimensionality?
- Describe how to initialize a NN and any concerns w/ reasons:
- Describe the difference between Learning and Optimization in ML:
- List the 12 Standard Tasks in ML:
- What is the difference between inductive and deductive learning?
Statistical Learning Theory
- Define Statistical Learning Theory:
- How can we affect performance on the test set when we can only observe the training set?
- What assumptions are made by the theory?
- Define the i.i.d assumptions?
- Why assume a joint probability distribution \(p(x,y)\)?
- Why do we need to model \(y\) as a target-distribution and not a target-function?
- Give the Formal Definition of SLT:
- The Definitions:
- The Assumptions:
- The Inference Problem:
- The Expected Risk:
- The Target Function:
- The Empirical Risk:
- Define Empirical Risk Minimization:
- What is the Complexity of ERM?
- How do you Cope with the Complexity?
- Definitions:
- Generalization:
- Generalization Error:
- Generalization Gap:
- Computing the Generalization Gap:
- What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
- Achieving (“good”) Generalization:
- Empirical Distribution:
- Describe the difference between Learning and Optimization in ML:
- Describe the difference between Generalization and Learning in ML:
- How to achieve Learning?
- What does the (VC) Learning Theory Achieve?
- Why do we need the probabilistic framework?
- What is the Approximation-Generalization Tradeoff:
- What are the factors determining how well an ML-algo will perform?
- Define the following and their usage/application & how they relate to each other:
- Underfitting:
- Overfitting:
- Capacity:
- Models with Low-Capacity:
- Models with High-Capacity:
- Hypothesis Space:
- VC-Dimension:
- What does it measure?
- Graph the relation between Error, and Capacity in the ctxt of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
- What is the most important result in SLT that show that learning is feasible?
Bias-Variance Decomposition Theory
- What is the Bias-Variance Decomposition Theory:
- What are the Assumptions made by the theory?
- What is the question that the theory tries to answer? What assumption is important? How do you achieve the answer/goal?
- What is the Bias-Variance Decomposition:
- Define each term w.r.t. source of the error:
- What does each of the following measure? Describe it in Words? Give their AKA in statistics?
- Bias:
- Variance:
- Give the Formal Definition of the Decomposition (Formula):
- What is the Expectation over?
- Define the Bias-Variance Tradeoff:
- Effects of Bias:
- Effects of Variance:
- Draw the Graph of the Tradeoff (wrt model capacity):
- Derive the Bias-Variance Decomposition with explanations:
- What are the key Takeaways from the Tradeoff?
- What are the most common ways to negotiate the Tradeoff?
- How does the decomposition relate to Classification?
- Increasing/Decreasing Bias&Variance:
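For reference while working through the derivation questions above, the decomposition at a fixed query point \(x\), assuming \(y = f(x) + \varepsilon\) with \(\mathbb{E}[\varepsilon]=0\), \(\operatorname{Var}(\varepsilon)=\sigma^{2}\), and the expectation taken over training sets \(D\) (and the noise):

$$
\mathbb{E}_{D, \varepsilon}\left[\left(y-\hat{f}_{D}(x)\right)^{2}\right]
=\underbrace{\left(f(x)-\mathbb{E}_{D}\left[\hat{f}_{D}(x)\right]\right)^{2}}_{\text{Bias}^{2}}
+\underbrace{\mathbb{E}_{D}\left[\left(\hat{f}_{D}(x)-\mathbb{E}_{D}\left[\hat{f}_{D}(x)\right]\right)^{2}\right]}_{\text{Variance}}
+\underbrace{\sigma^{2}}_{\text{Irreducible Error}}
$$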
Activation Functions
- Describe the Desirable Properties for activation functions:
- Non-Linearity:
- Range:
- Continuously Differentiable:
- Monotonicity:
- Smoothness with Monotonic Derivatives:
- Approximating Identity near Origin:
- Zero-Centered Range:
- Describe the NON-Desirable Properties for activation functions:
- Saturation:
- Vanishing Gradients:
- Range Not Zero-Centered:
- List the different activation functions used in ML. Give their names, definitions, properties (pros & cons), derivatives, and applications:
- Fill in the following table:
- Tanh VS Sigmoid for activation?
- ReLU:
- What makes it superior/advantageous?
- What problems does it have?
- What solution do we have to mitigate the problem?
- Compute the derivatives of all activation functions:
- Graph all activation functions and their derivatives:
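A reference sketch answering the "compute the derivatives" prompt above for the common activations; the leaky-ReLU slope `a=0.01` and the convention of taking the ReLU derivative to be 0 at the origin are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2       # tanh'(z) = 1 - tanh(z)^2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)       # undefined at 0; conventionally 0

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

def d_leaky_relu(z, a=0.01):
    return np.where(z > 0, 1.0, a)
```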
Kernels
- Define “Local Kernel” and give an analogy to describe it:
- Write the following kernels:
- Polynomial Kernel of degree, up to, \(d\):
- Gaussian Kernel:
- Sigmoid Kernel:
- Polynomial Kernel of degree, exactly, \(d\):
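The kernels listed above written out for vectors `x`, `z`; the hyperparameter values (`c`, `d`, `gamma`) are arbitrary assumptions for the sketch.

```python
import numpy as np

def polynomial_up_to_d(x, z, c=1.0, d=3):    # degree up to d (requires c > 0)
    return (x @ z + c) ** d

def polynomial_exactly_d(x, z, d=3):         # degree exactly d
    return (x @ z) ** d

def gaussian(x, z, gamma=0.5):               # RBF: exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=0.1, c=-1.0): # tanh(gamma * x.z + c)
    return np.tanh(gamma * (x @ z) + c)
```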
Math
- What is a metric?
- Describe Binary Relations and their Properties?
- Formulas:
- Set theory:
- Number of subsets of a set of \(N\) elements:
- Number of pairs \((a,b)\) of a set of N elements:
- Binomial Theorem:
- Binomial Coefficient:
- Expansion of \(x^n - y^n =\)
- Number of ways to partition \(N\) data points into \(k\) clusters:
- \(\log_x(y) =\)
- The length of a vector \(\mathbf{x}\) along a direction (projection):
- \(\sum_{i=1}^{n} 2^{i}=\)
- List 6 proof methods:
- Important Formulas
- Projection \(\tilde{\mathbf{x}}\) of a vector \(\mathbf{x}\) onto another vector \(\mathbf{u}\):
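For reference, the two projection formulas asked for above (assuming \(\mathbf{u} \neq \mathbf{0}\)):

$$
\tilde{\mathbf{x}}=\frac{\mathbf{x}^{\top} \mathbf{u}}{\|\mathbf{u}\|^{2}}\, \mathbf{u},
\qquad
\text{length of } \mathbf{x} \text{ along } \mathbf{u}: \;\; \frac{\mathbf{x}^{\top} \mathbf{u}}{\|\mathbf{u}\|}
$$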
Statistics
- ROC curve:
- Definition:
- Purpose:
- How do you create the plot?
- How to identify a good classifier:
- How to identify a bad classifier:
- What is its application in tuning the model?
- AUC - AUROC:
- Definition:
- Range:
- What does it measure:
- Usage in ML:
- Define Statistical Efficiency (of an estimator)?
- What's the difference between Errors and Residuals:
- Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
- What is a biased estimator?
- Why would we prefer biased estimators in some cases?
- What is the difference between “Probability” and “Likelihood”:
- Estimators:
- Define:
- Formula:
- What's a good estimator?
- What are the Assumptions made regarding the estimated parameter:
- What is Function Estimation:
- What's the relation between the Function Estimator \(\hat{f}\) and Point Estimator:
- Define “marginal likelihood” (wrt naive bayes):
(Statistics) - MLE
- Clearly Define MLE and derive the final formula:
- Write MLE as an expectation wrt the Empirical Distribution:
- Describe formally the relationship between MLE and the KL-divergence:
- Extend the argument to show the link between MLE and Cross-Entropy. Give an example:
- How does MLE relate to the model distribution and the empirical distribution?
- What is the intuition behind using MLE?
- What does MLE find/result in?
- What kind of problem is MLE and how to solve for it?
- How does it relate to SLT:
- Explain clearly why we maximize the natural log of the likelihood
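A compact reference for the “final formula” and “expectation wrt the empirical distribution” questions above, assuming i.i.d. samples \(x^{(1)}, \ldots, x^{(m)}\) drawn from the data-generating distribution (empirical distribution \(\hat{p}_{\text{data}}\)):

$$
\theta_{\mathrm{ML}}
=\arg \max _{\theta} \prod_{i=1}^{m} p_{\text{model}}\left(x^{(i)} ; \theta\right)
=\arg \max _{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left(x^{(i)} ; \theta\right)
=\arg \max _{\theta} \; \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x ; \theta)\right]
$$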
Text-Classification | Classical
- List some Classification Methods:
- List some Applications of Txt Classification:
NLP
- List some problems in NLP:
- List the Solved Problems in NLP:
- List the “within reach” problems in NLP:
- List the Open Problems in NLP:
- Why is NLP hard? List Issues:
- Define:
- Morphology:
- Morphemes:
- Stems:
- Affixes:
- Stemming:
- Lemmatization:
Language Modeling
- What is a Language Model?
- List some Applications of LMs:
- Traditional LMs:
- How are they setup?
- What do they depend on?
- What is the Goal of the LM task? (in the ctxt of the problem setup)
- What assumptions are made by the problem setup? Why?
- What are the MLE Estimates for probabilities of the following:
- Bi-Grams:
- Tri-Grams:
- What are the issues w/ Traditional Approaches?
- What+How can we setup some NLP tasks as LM tasks:
- How does the LM task relate to Reasoning/AGI:
- Evaluating LM models:
- List the Loss Functions (+formula) used to evaluate LM models? Motivate each:
- Which application of LM modeling does each loss work best for?
- Why Cross-Entropy:
- Which setting it used for?
- Why Perplexity:
- Which setting used for?
- If no surprise, what is the perplexity?
- How does having a good LM relate to Information Theory?
- LM DATA:
- How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
- How should we choose a subset of articles for testing:
- List three approaches to Parametrizing LMs:
- Describe “Count-Based N-gram Models”:
- What distributions do they capture?:
- Describe “Neural N-gram Models”:
- What do they replace the captured distribution with?
- What are they better at capturing:
- Describe “RNNs”:
- What do they replace/capture?
- How do they capture it?
- What are they best at capturing:
- What's the main issue in LM modeling?
- How do N-gram models capture/approximate the history?:
- How do RNNs models capture/approximate the history?:
- The Bias-Variance Tradeoff of the following:
- N-Gram Models:
- RNNs:
- Give an estimate that predicts the probability of a sentence by how many times it has been seen before:
- What happens in the limit of infinite data?
- What are the advantages of sub-word level LMs:
- What are the disadvantages of sub-word level LMs:
- What is a “Conditional LM”?
- Write the decomposition of the probability for the Conditional LM:
- Describe the Computational Bottleneck for Language Models:
- Describe/List some solutions to the Bottleneck:
- Complexity Comparison of the different solutions:
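A small sketch of the perplexity metric discussed in the evaluation questions above; the input format (one model probability per held-out token) is an assumption for the example.

```python
import numpy as np

def perplexity(probs):
    """Perplexity over a held-out sequence, given the model's probability
    p(w_t | history) for each of the T tokens: the exponential of the
    average negative log-probability (i.e., exp of the per-token cross-entropy)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

# If the model is never surprised (probability 1 for every token),
# the perplexity is exactly 1 -- the "no surprise" question above.
assert perplexity([1.0, 1.0, 1.0]) == 1.0
```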
Regularization
- Define Regularization both intuitively and formally:
- Define “well-posedness”:
- Give four aspects of justification for regularization (theoretical):
- From a philosophical pov:
- From a probabilistic pov:
- From an SLT pov:
- From a practical pov (relating to the real-world):
- Describe an overview of regularization in DL. How does it usually work?
- Intuitively, how can a regularizer be effective?
- Describe the relationship between regularization and capacity:
- Describe the different approaches to regularization:
- List 9 regularization techniques:
- Describe Parameter Norm Penalties (PNPs):
- Define the regularized objective:
- Describe the parameter \(\alpha\):
- How does it influence the regularization:
- What is the effect of minimizing the regularized objective?
- How do we deal with the Bias parameter in PNPs? Explain.
- Describe the tuning of the \(\alpha\) HP in NNs for different hidden layers:
- Formally describe the \(L^2\) parameter regularization:
- AKA:
- Describe the regularization contribution to the gradient in a single step.
- Describe the regularization contribution to the gradient. How does it scale?
- How does weight decay relate to shrinking the individual weights wrt their size? What is the measure/comparison used?
- Draw a graph describing the effects of \(L^2\) regularization on the weights:
- Describe the effects of applying weight decay to linear regression
- Derivation:
- What is \(L^2\) regularization equivalent to?
- What are we maximizing?
- Derive the MAP Estimate:
- What kind of prior do we place on the weights? What are its parameters?
- List the properties of \(L^2\) regularization:
- Formally describe the \(L^1\) parameter regularization:
- AKA:
- Whats the regularized objective function?
- What is its gradient?
- Describe the regularization contribution to the gradient compared to L2. How does it scale?
- List the properties and applications of \(L^1\) regularization:
- How is it used as a feature selection mechanism?
- Derivation:
- What is \(L^1\) regularization equivalent to?
- What kind of prior do we place on the weights? What are its parameters?
- Analyze \(L^1\) vs \(L^2\) regularization:
- For Sparsity:
- For correlated features:
- For optimization:
- Give an example that shows the difference wrt sparsity:
- For sensitivity:
- Describe Elastic Net Regularization. Why was it devised? Any properties?
- Motivate Regularization for ill-posed problems:
- What is the property that needs attention?
- What would the regularized solution correspond to in this case?
- Are there any guarantees for the solution to be well-posed? How/Why?
- What is the Linear Algebraic property that needs attention?
- What models are affected by this?
- What would the solution correspond to in terms of inverting \(X^{\top}X\):
- When would \(X^{\top}X\) be singular?
- Describe the Linear Algebraic Perspective. What does it correspond to? [LAP]
- Can models with no closed-form solution be underdetermined? Explain. [CFS]
- What models are affected by this? [CFS]
- Define the Moore-Penrose Pseudoinverse:
- What can it solve? How?
- What does it correspond to in terms of regularization?
- What is the limit wrt?
- How can we interpret the pseudoinverse wrt regularization?
- Explain the problem with Logistic Regression:
- What are the possible solutions?
- Are there any guarantees that we achieve with regularization? How?
- Describe dataset augmentation and its techniques:
- When is it applicable?
- When is it not?
- Motivate the Noise Robustness property:
- How can Noise Robustness motivate a regularization technique?
- How can we enhance noise robustness in NNs?
- Give a motivation for Noise Injection:
- Where can noise be injected?
- Give Motivation, Interpretation and Applications of injecting noise in the different components (from above):
- Injecting Noise in the Input Layer:
- Injecting Noise in the Hidden Layers:
- Injecting Noise in the Weight Matrices:
- Injecting Noise in the Output Layer:
- Give an interpretation for injecting noise in the Input layer:
- Give an interpretation for injecting noise in the Hidden layers:
- What is the most successful application of this technique:
- Describe the Bayesian View of learning:
- How does it motivate injecting noise in the weight matrices?
- Describe a different, more traditional, interpretation of injecting noise to matrices. What are its effects on the function to be learned?
- Whats the biggest application for this kind of regularization?
- Motivate injecting noise in the Output layer:
- What is the biggest application of this technique?
- How does it compare to weight-decay when applied to MLE problems?
- Define “Semi-Supervised Learning”:
- What does it refer to in the context of DL:
- What is its goal?
- Give an example in classical ML:
- Describe an approach to applying semi-supervised learning:
- How can we interpret dropout wrt data augmentation?
- Add Answers from link below for L2 applied to linear regression and how it reduces variance:
- When is Ridge regression favorable over Lasso regression? What about for correlated features?
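A minimal sketch of how the \(L^2\) and \(L^1\) penalties from this section change a single gradient step; the learning rate, the strength \(\alpha\), and the function names are assumptions for illustration.

```python
import numpy as np

def step_l2(w, grad_J, lr=0.1, alpha=0.01):
    # L2 ("weight decay"): the penalty's gradient alpha * w scales linearly with w,
    # so each step multiplicatively shrinks the weights before the usual update.
    return w - lr * (grad_J + alpha * w)

def step_l1(w, grad_J, lr=0.1, alpha=0.01):
    # L1: the penalty's (sub)gradient alpha * sign(w) has constant magnitude,
    # pushing small weights all the way to zero (sparsity / feature selection).
    return w - lr * (grad_J + alpha * np.sign(w))
```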
Misc.
- Explain Latent Dirichlet Allocation (LDA)
- How to deal with curse of dimensionality
- How to detect correlation of “categorical variables”?
- Define “marginal likelihood” (wrt naive bayes):
- KNN VS K-Means
- When is Ridge regression favorable over Lasso regression for correlated features?
- What is a convex hull?
- Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
- OLS vs MLE
- What are collinearity and multicollinearity?
- Describe ways to overcome scaling (scalability) issues: