Gradient-Based Optimization
- Define Gradient Methods:
- Give examples of Gradient-Based Algorithms:
- What is Gradient Descent:
- Explain it intuitively:
- Give its derivation:
- What is the learning rate?
- Where does it come from?
- How does it relate to the step-size?
- We go from having a fixed step-size to [blank]:
- What is its range?
- How do we choose the learning rate?
- Compare Line Search vs Trust Region:
- Learning Rate Schedule:
- Define:
- List Types:
- Describe the convergence of the algorithm:
- How does GD relate to Euler?
- List the variants of GD:
- How do they differ? Why?:
- BGD:
- SGD:
- How should we handle the lr in this case? Why?
- M-BGD:
- What advantages does it have?
- Explain the different kinds of gradient-descent optimization procedures:
- State the difference between SGD and GD?
- When would you use GD over SGD, and vice-versa?
- What is the problem of vanilla approaches to GD?
- List the challenges that account for the problem above:
- List the different strategies for optimizing GD:
- List the different variants for optimizing GD:
- Momentum:
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Nesterov Accelerated Gradient (Momentum):
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Successful Applications:
- Adagrad
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Successful Application:
- Properties:
- Adadelta
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Properties:
- RMSprop
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Properties:
- Adam
- Motivation:
- Definitions/Algorithm:
- Intuition:
- Parameter Settings:
- Properties:
- Which methods have trouble with saddle points?
- How should you choose your optimizer?
- Summarize the different variants listed above. How do they compare to each other?
- What’s a common choice in many research papers?
- List additional strategies for optimizing SGD:
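A minimal NumPy sketch (illustrative only, not tied to any framework) contrasting the batch, stochastic, and mini-batch update rules asked about above; the least-squares objective, the fixed learning rate `lr`, and the function names are assumptions made for this example.

```python
import numpy as np

def gradient(X, y, w):
    """Gradient of the least-squares objective 0.5 * ||Xw - y||^2 w.r.t. w."""
    return X.T @ (X @ w - y)

def gradient_descent(X, y, lr=0.01, epochs=100, batch_size=None, seed=0):
    """batch_size=None -> Batch GD; batch_size=1 -> SGD; otherwise Mini-Batch GD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        if batch_size is None:
            # BGD: one exact (full-data) gradient step per epoch.
            w -= lr * gradient(X, y, w) / n
        else:
            # SGD / M-BGD: noisy gradient estimates from sampled examples.
            idx = rng.permutation(n)
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                w -= lr * gradient(X[batch], y[batch], w) / len(batch)
    return w
```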
Maximum Margin Classifiers
- Define Margin Classifiers:
- What is a Margin for a linear classifier?
- Give the motivation for margin classifiers:
- Define the notion of the “best” possible classifier
- How can we achieve the “best” classifier?
- What unique vector is orthogonal to the hp? Prove it:
- What do we mean by “signed distance”? Derive its formula:
- Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
- Use geometric properties of the hp to simplify the expression for the distance of the closest point to the hp, above:
- Characterize the margin, mathematically:
- Characterize the “Slab Existence”:
- Formulate the optimization problem of maximizing the margin wrt analysis above:
- Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
- Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:
- What kind of formulation is it (wrt optimization)? What are the parameters?
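For reference, a standard way to write the final optimization problem this section builds toward, assuming labels \(y_n \in \{-1,+1\}\) and a linear decision function \(\mathbf{w}^{\top}\mathbf{x} + b\):

$$
\begin{aligned}
\min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^{2} \\
\text{s.t.} \quad & y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_{n}+b\right) \geq 1, \quad n=1, \ldots, N
\end{aligned}
$$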
Hard-Margin SVMs
- Define:
- SVMs:
- Support Vectors:
- Hard-Margin SVM:
- Define the following wrt hard-margin SVM:
- Goal:
- Procedure:
- Decision Function:
- Constraints:
- The Optimization Problem:
- The Optimization Method:
- Elaborate on the generalization analysis:
- List the properties:
- Give the solution to the optimization problem for H-M SVM:
- What method does it require to be solved:
- Formulate the Lagrangian:
- Optimize the objective for each variable:
- Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
- Set the problem as a Quadratic Programming problem:
- What are the inputs and outputs to the Quadratic Program Package?
- Give the final form of the optimization problem in standard form:
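As a reference for the Lagrangian/dual/QP questions above, the standard dual obtained by optimizing the Lagrangian over \(\mathbf{w}\) and \(b\) and keeping the multipliers \(\alpha_n\):

$$
\begin{aligned}
\max_{\boldsymbol{\alpha}} \quad & \sum_{n=1}^{N} \alpha_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_{n} \alpha_{m} y_{n} y_{m} \mathbf{x}_{n}^{\top} \mathbf{x}_{m} \\
\text{s.t.} \quad & \alpha_{n} \geq 0, \quad \sum_{n=1}^{N} \alpha_{n} y_{n}=0
\end{aligned}
$$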
Soft-Margin SVM
- Motivate the soft-margin SVM:
- What is the main idea behind it?
- Define the following wrt soft-margin SVM:
- Goal:
- Procedure:
- Decision Function:
- Constraints:
- Why is there a non-negativity constraint?
- Objective/Cost Function:
- The Optimization Problem:
- The Optimization Method:
- Properties:
- Specify the effects of the regularization hyperparameter \(C\):
- Describe the effect wrt over/under fitting:
- How do we choose \(C\)?
- Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)
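For reference while answering the constraint/objective questions above, the usual soft-margin primal with slack variables \(\xi_n\) and regularization hyperparameter \(C\):

$$
\begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n} \\
\text{s.t.} \quad & y_{n}\left(\mathbf{w}^{\top} \mathbf{x}_{n}+b\right) \geq 1-\xi_{n}, \quad \xi_{n} \geq 0
\end{aligned}
$$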
Loss Functions
- Define:
- Loss Functions - Abstractly and Mathematically:
- Distance-Based Loss Functions:
- What are they used for?
- Describe an important property of dist-based losses:
- Translation Invariance:
- Relative Error - What does it lack?
- List 3 Regression Loss Functions
- MSE
- What does it minimize:
- Formula:
- Graph:
- Derivation:
- MAE
- What does it minimize:
- Formula:
- Graph:
- Derivation:
- List properties:
- Huber Loss
- AKA:
- What does it minimize:
- Formula:
- Graph:
- List properties:
- Analyze MSE vs MAE ref:
- List 7 Classification Loss Functions
- \(0-1\) loss
- What does it minimize:
- Formula:
- Graph:
- MSE
- Formula:
- Graph:
- Derivation (for classification) - give assumptions:
- Properties:
- Hinge Loss
- What does it minimize:
- Formula:
- Graph:
- Properties:
- Describe the properties of the Hinge loss and why it is used?
- Logistic Loss
- AKA:
- What does it minimize:
- Formula:
- Graph:
- Derivation:
- Properties:
- Cross-Entropy
- What does it minimize:
- Formula:
- Binary Cross-Entropy:
- Graph:
- CE and Negative-Log-Probability:
- CE and Log-Loss:
- Derivation:
- CE and KL-Div:
- Exponential Loss
- Formula:
- Properties:
- Perceptron Loss
- Formula:
- Analysis
- Logistic vs Hinge Loss:
- Cross-Entropy vs MSE:
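Reference implementations of the losses listed above, as a sketch; the margin convention \(y \in \{-1,+1\}\) with raw score \(s = f(x)\), and `delta=1.0` for the Huber loss, are assumptions made for illustration.

```python
import numpy as np

# Margin-based classification losses for label y in {-1, +1} and score s = f(x).
def zero_one(y, s):    return float(np.sign(s) != y)
def hinge(y, s):       return max(0.0, 1.0 - y * s)
def logistic(y, s):    return float(np.log1p(np.exp(-y * s)))   # log-loss on the margin
def exponential(y, s): return float(np.exp(-y * s))
def perceptron(y, s):  return max(0.0, -y * s)

# Regression losses for a target y and prediction yhat.
def mse(y, yhat):   return (y - yhat) ** 2
def mae(y, yhat):   return abs(y - yhat)
def huber(y, yhat, delta=1.0):
    r = abs(y - yhat)
    return 0.5 * r ** 2 if r <= delta else delta * (r - 0.5 * delta)
```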
Information Theory
- What is Information Theory? In the context of ML?
- Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
- Measuring Information - Definitions and Formulas:
- In Shannon's theory, how do we quantify “transmitting 1 bit of information”?
- What is the amount of information transmitted?
- What is the uncertainty reduction factor?
- What is the amount of information in an event \(x\)?
- Define the Self-Information - Give the formula:
- What is it defined with respect to?
- Define Shannon Entropy - What is it used for?
- Describe how Shannon Entropy relates to distributions, with a graph:
- Define Differential Entropy:
- How does entropy characterize distributions?
- Define Relative Entropy - Give its formula:
- Give an interpretation:
- List the properties:
- Describe it as a distance:
- List the applications of relative entropy:
- How does the direction of minimization affect the optimization:
- Define Cross Entropy - Give its formula:
- What does it measure?
- How does it relate to relative entropy?
- When are they equivalent?
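A small NumPy sketch of the three quantities above (natural log, so results are in nats; `log2` would give bits). It assumes \(q > 0\) wherever \(p > 0\), and the helper names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p log p (convention: 0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) = sum p log(p / q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # assumes q > 0 wherever p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q):
    """H(p, q) = H(p) + D_KL(p || q) = -sum p log q."""
    return entropy(p) + kl_divergence(p, q)
```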
Recommendation Systems
- Describe the different algorithms for recommendation systems:
Ensemble Learning
- What are the two paradigms of ensemble methods?
- Random Forest VS GBM?
Data Processing and Analysis
- What are 3 data preprocessing techniques to handle outliers?
- Describe the strategies to dimensionality reduction?
- What are 3 ways of reducing dimensionality?
- List methods for Feature Selection
- List methods for Feature Extraction
- How to detect correlation of “categorical variables”?
- Feature Importance
- Can we capture the correlation between a continuous and a categorical variable? If yes, how?
- What cross validation technique would you use on time series data set?
- How to deal with missing features? (Imputation?)
- Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
- What are collinearity and multicollinearity?
ML/Statistical Models
- What are parametric models?
- What is a classifier?
K-NN
PCA
- What is PCA?
- What is the Goal of PCA?
- List the applications of PCA:
- Give formulas for the following:
- Assumptions on \(X\):
- SVD of \(X\):
- Principal Directions/Axes:
- Principal Components (scores):
- The \(j\)-th principal component:
- Define the transformation, mathematically:
- What does PCA produce/result in?
- Finds a lower dimensional subspace spanned by what?:
- Finds a lower dimensional subspace that minimizes what?:
- What does each PC have (properties)?
- What does the procedure find in terms of a “basis”?
- What does the procedure find in terms of axes? (where do they point?):
- Describe the PCA algorithm:
- What Data Processing needs to be done?
- How to compute the Principal Components?
- How do you compute the Low-Rank Approximation Matrix \(X_k\)?
- Describe the Optimality of PCA:
- List limitations of PCA:
- Intuition:
- What property of the internal structure of the data does PCA reveal/explain?
- What object does it fit to the data?:
- Should you remove correlated features before PCA?
- How can we measure the “Total Variance” of the data?
- How can we measure the “Total Variance” of the projected data?
- How can we measure the “Error in the Projection”?
- What does it mean when this ratio is high?
- How does PCA relate to CCA?
- How does PCA relate to ICA?
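A minimal sketch of PCA via the SVD, covering the directions, scores, and rank-\(k\) reconstruction asked about above; row-major samples and the function signature are assumptions made for this example.

```python
import numpy as np

def pca(X, k):
    """PCA of an (n x d) data matrix X (rows are samples) via the SVD."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centering is required
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:k]                             # principal directions/axes
    scores = Xc @ directions.T                      # principal components (scores)
    explained_var = S[:k] ** 2 / (X.shape[0] - 1)   # variance along each PC
    X_k = scores @ directions + mean                # rank-k reconstruction of X
    return directions, scores, explained_var, X_k
```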
The Centroid Method
- Define “The Centroid”:
- Describe the Procedure:
- What is the Decision Function:
- Describe the Decision Boundary:
K-Means
- What is K-Means?
- What is the idea behind K-Means?
- What does K-Means find?
- Formal Description of the Model:
- What is the Objective?
- Description of the Algorithm:
- What is the Optimization method used? What class does it belong to?
- How does the optimization method relate to EM?
- What is the Complexity of the algorithm?
- Describe the convergence and prove it:
- Describe the Optimality of the Algorithm:
- Derive the estimated parameters of the algorithm:
- Objective Function:
- Optimization Objective:
- Derivation:
- When does K-Means fail to give good results?
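A compact sketch of Lloyd's algorithm (the alternating optimization referred to above); the initialization scheme, iteration cap, and convergence test are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                      # objective can no longer decrease
        centers = new_centers
    return centers, labels
```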
Naive Bayes
- Define:
- Naive Bayes:
- Naive Bayes Classifiers:
- Bayes Theorem:
- List the assumptions of Naive Bayes:
- List some properties of Naive Bayes:
- Is it a Bayesian Method or a Frequentist Method?
- Is it a Bayes Classifier? What does that mean?:
- Define the Probabilistic Model for the method:
- What kind of model is it?
- What is a conditional probability model?
- Decompose the conditional probability w/ Bayes Theorem:
- How does the new expression incorporate the joint probability model?
- Use the chain rule to re-write the joint probability model:
- Use the Naive Conditional Independence assumption to rewrite the joint model:
- What is the conditional distribution over the class variable \(C_k\):
- Construct the classifier. What are its components? Formally define it.
- What’s the decision rule used?
- List the difference between the Naive Bayes Estimate and the MAP Estimate:
- What are the parameters to be estimated for the classifier?:
- What method do we use to estimate the parameters?:
- What are the estimates for each of the following parameters?:
- The prior probability of each class:
- The conditional probability of each feature (word) given a class:
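A sketch of the parameter estimates asked about above for a multinomial (word-count) naive Bayes text classifier; the count-matrix input format and the Laplace smoothing constant `alpha=1.0` are assumptions, not requirements of plain MLE.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log-priors and per-class log word probabilities.
    X: (n_docs x vocab) word-count matrix; y: class labels."""
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    log_cond = []
    for c in classes:
        counts = X[y == c].sum(axis=0) + alpha        # smoothed word counts in class c
        log_cond.append(np.log(counts / counts.sum()))
    return classes, log_prior, np.array(log_cond)

def predict(X, classes, log_prior, log_cond):
    """MAP decision rule: argmax_c [ log p(c) + sum_i x_i log p(word_i | c) ]."""
    return classes[(X @ log_cond.T + log_prior).argmax(axis=1)]
```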
CNNs
- What is a CNN?
- What kind of data does it work on? What is the mathematical property?
- What are the layers of a CNN?
- What are the four important ideas and their benefits that the convolution affords CNNs:
- Benefits:
- Benefits:
- Benefits:
- What is the inspirational model for CNNs:
- Describe the connectivity pattern of the neurons in a layer of a CNN:
- Describe the process of a ConvNet:
- Convolution Operation:
- Define:
- Formula (continuous):
- Formula (discrete):
- Define the following:
- Feature Map:
- Does the operation commute?
- Cross Correlation:
- Define:
- Formulae:
- What are the differences/similarities between convolution and cross-correlation:
- Write down the Convolution operation and the cross-correlation over two axes and:
- Convolution:
- Convolution (commutative):
- Cross-Correlation:
- The Convolutional Layer:
- What are the parameters and how do we choose them?
- Describe what happens in the forward pass:
- What is the output of the forward pass:
- How is the output configured?
- Spatial Arrangements:
- List the Three Hyperparameters that control the output volume:
- How to compute the spatial size of the output volume?
- How can you ensure that the input & output volume are the same?
- In the output volume, how do you compute the \(d\)-th depth slice:
- Calculate the number of parameters for the following config:
- Definitions:
- Receptive Field:
- Suppose the input volume has size \([ 32 × 32 × 3 ]\) and the receptive field (or the filter size) is \(5 × 5\) , then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
- How can we achieve the greatest reduction in the spatial dims of the network (for classification):
- Pooling Layer:
- Define:
- List key ideas/properties and benefits:
- List the different types of Pooling:
- List variations of pooling and their definitions:
- What is “Learned Pooling”:
- What is “Dynamical Pooling”:
- List the hyperparams of Pooling Layer:
- How to calculate the size of the output volume:
- How many parameters does the pooling layer have:
- What are other ways to perform downsampling:
- Weight Priors:
- Define “Prior Prob Distribution on the parameters”:
- Define “Weight Prior” and its types/classes:
- Weak Prior:
- Strong Prior:
- Infinitely Strong Prior:
- Describe the Conv Layer as a FC Layer using priors:
- What are the key insights of using this view:
- When is the prior imposed by convolution INAPPROPRIATE:
- What happens when the priors imposed by convolution and pooling are not suitable for the task?
- What kind of other models should Convolutional models be compared to? Why?:
- When do multi-channel convolutions commute?
- Why do we use several different kernels in a given conv-layer?
- Strided Convolutions
- Define:
- What are they used for?
- What are they equivalent to?
- Formula:
- Zero-Padding:
- Definition/Usage:
- List the types of padding:
- Locally Connected Layers/Unshared Convolutions:
- Bias Parameter:
- How many bias terms are used per output channel in the traditional convolution:
- Dilated Convolutions
- Define:
- What are they used for?
- Stacked Convolutions
- Define:
- What are they used for?
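A small sketch of the output-volume and parameter-count arithmetic asked about in this section; the helper names are illustrative, and the receptive-field numbers echo the 32×32×3 / 5×5 example above.

```python
def conv_output_size(W_in, F, P, S):
    """Spatial size of a conv/pool output volume: (W - F + 2P) / S + 1."""
    assert (W_in - F + 2 * P) % S == 0, "hyperparameters do not tile the input"
    return (W_in - F + 2 * P) // S + 1

def conv_num_params(F, D_in, K):
    """Parameters in a conv layer with K filters of size F x F x D_in,
    using one shared bias per output channel (the traditional convolution)."""
    return K * (F * F * D_in + 1)

# Receptive-field example from above: 32x32x3 input, 5x5 filters.
assert conv_num_params(F=5, D_in=3, K=1) == 76          # 5*5*3 weights + 1 bias
assert conv_output_size(W_in=32, F=5, P=2, S=1) == 32   # "same" padding keeps the size
```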
Theory
RNNs
- What is an RNN?
- Definition:
- What type of machine is the standard RNN:
- What is the big idea behind RNNs?
- Dynamical Systems:
- Standard Form:
- RNN as a Dynamical System:
- Unfolding Computational Graphs
- Definition:
- List the Advantages introduced by unfolding and the benefits:
- Graph and write the equations of Unfolding hidden recurrence:
- Describe the State of the RNN, its usage, and extreme cases of the usage:
- RNN Architectures:
- List the three standard architectures of RNNs:
- Graph:
- Architecture:
- Equations:
- Total Loss:
- Complexity:
- Properties:
- Teacher Forcing:
- Definition:
- Application:
- Disadvantages:
- Possible Solutions for Mitigation:
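A minimal sketch of the unfolded recurrence discussed above for a vanilla RNN; the parameter names, shapes, and tanh nonlinearity are assumptions for illustration.

```python
import numpy as np

def rnn_forward(x_seq, h0, Wxh, Whh, Why, bh, by):
    """Unfolded vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh), y_t = Why h_t + by."""
    h, hs, ys = h0, [], []
    for x in x_seq:                    # one step of the unrolled graph per input
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        hs.append(h)
        ys.append(Why @ h + by)
    return hs, ys
```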
Optimization
- Define the sigmoid function and some of its properties:
- Backpropagation:
- Definition:
- Derive Gradient Descent Update:
- Explain the different kinds of gradient-descent optimization procedures:
- List the different optimizers and their properties:
- Error-Measures:
- Define what an error measure is:
- List the 5 most common error measures and where they are used:
- Specific Questions:
- Derive MSE carefully:
- Derive the Binary Cross-Entropy Loss function:
- Explain the difference between Cross-Entropy and MSE and which is better (for what task)?
- Describe the properties of the Hinge loss and why it is used?
- Show that the weight vector of a linear signal is orthogonal to the decision boundary?
- What does it mean for a function to be well-behaved from an optimization pov?
- Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation
- Compute:
- \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
- State the difference between SGD and GD?
- When would you use GD over SGD, and vice-versa?
- What is a convex hull?
- OLS vs MLE
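A quick numerical check of the "write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation" identity above; the random shapes are arbitrary assumptions for the example.

```python
import numpy as np

# ||Xw - y||^2 equals the sum over examples of (x_n^T w - y_n)^2.
rng = np.random.default_rng(0)
X, w, y = rng.normal(size=(5, 3)), rng.normal(size=3), rng.normal(size=5)
lhs = np.linalg.norm(X @ w - y) ** 2
rhs = sum((X[n] @ w - y[n]) ** 2 for n in range(len(y)))
assert np.isclose(lhs, rhs)
```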
ML Theory
- Explain intuitively why Deep Learning works?
- List the different types of Learning Tasks and their definitions:
- Describe the relationship between supervised and unsupervised learning?
- Describe the differences between Discriminative and Generative Models?
- Describe the curse of dimensionality and its effects on problem solving:
- How to deal with curse of dimensionality?
- Describe how to initialize a NN and any concerns w/ reasons:
- Describe the difference between Learning and Optimization in ML:
- List the 12 Standard Tasks in ML:
- What is the difference between inductive and deductive learning?
Statistical Learning Theory
- Define Statistical Learning Theory:
- How can we affect performance on the test set when we can only observe the training set?
- What assumptions are made by the theory?
- Define the i.i.d assumptions?
- Why assume a joint probability distribution \(p(x,y)\)?
- Why do we need to model \(y\) as a target-distribution and not a target-function?
- Give the Formal Definition of SLT:
- The Definitions:
- The Assumptions:
- The Inference Problem:
- The Expected Risk:
- The Target Function:
- The Empirical Risk:
- Define Empirical Risk Minimization:
- What is the Complexity of ERM?
- How do you Cope with the Complexity?
- Definitions:
- Generalization:
- Generalization Error:
- Generalization Gap:
- Computing the Generalization Gap:
- What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
- Achieving (“good”) Generalization:
- Empirical Distribution:
- Describe the difference between Learning and Optimization in ML:
- Describe the difference between Generalization and Learning in ML:
- How to achieve Learning?
- What does the (VC) Learning Theory Achieve?
- Why do we need the probabilistic framework?
- What is the Approximation-Generalization Tradeoff:
- What are the factors determining how well an ML-algo will perform?
- Define the following and their usage/application & how they relate to each other:
- Underfitting:
- Overfitting:
- Capacity:
- Models with Low-Capacity:
- Models with High-Capacity:
- Hypothesis Space:
- VC-Dimension:
- What does it measure?
- Graph the relation between Error, and Capacity in the ctxt of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
- What is the most important result in SLT that show that learning is feasible?
Bias-Variance Decomposition Theory
- What is the Bias-Variance Decomposition Theory:
- What are the Assumptions made by the theory?
- What is the question that the theory tries to answer? What assumption is important? How do you achieve the answer/goal?
- What is the Bias-Variance Decomposition:
- Define each term w.r.t. source of the error:
- What does each of the following measure? Describe it in Words? Give their AKA in statistics?
- Bias:
- Variance:
- Give the Formal Definition of the Decomposition (Formula):
- What is the Expectation over?
- Define the Bias-Variance Tradeoff:
- Effects of Bias:
- Effects of Variance:
- Draw the Graph of the Tradeoff (wrt model capacity):
- Derive the Bias-Variance Decomposition with explanations:
- What are the key Takeaways from the Tradeoff?
- What are the most common ways to negotiate the Tradeoff?
- How does the decomposition relate to Classification?
- Increasing/Decreasing Bias&Variance:
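For reference while working through the derivation questions above, the decomposition at a fixed query point \(x\), assuming \(y = f(x) + \varepsilon\) with \(\mathbb{E}[\varepsilon]=0\), \(\operatorname{Var}(\varepsilon)=\sigma^{2}\), and the expectation taken over training sets \(D\) (and the noise):

$$
\mathbb{E}_{D, \varepsilon}\left[\left(y-\hat{f}_{D}(x)\right)^{2}\right]
=\underbrace{\left(f(x)-\mathbb{E}_{D}\left[\hat{f}_{D}(x)\right]\right)^{2}}_{\text{Bias}^{2}}
+\underbrace{\mathbb{E}_{D}\left[\left(\hat{f}_{D}(x)-\mathbb{E}_{D}\left[\hat{f}_{D}(x)\right]\right)^{2}\right]}_{\text{Variance}}
+\underbrace{\sigma^{2}}_{\text{Irreducible Error}}
$$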
Activation Functions
- Describe the Desirable Properties for activation functions:
- Non-Linearity:
- Range:
- Continuously Differentiable:
- Monotonicity:
- Smoothness with Monotonic Derivatives:
- Approximating Identity near Origin:
- Zero-Centered Range:
- Describe the NON-Desirable Properties for activation functions:
- Saturation:
- Vanishing Gradients:
- Range Not Zero-Centered:
- List the different activation functions used in ML. Give their names, definitions, properties (pros & cons), derivatives, and applications:
- Fill in the following table:
- Tanh VS Sigmoid for activation?
- ReLU:
- What makes it superior/advantageous?
- What problems does it have?
- What solution do we have to mitigate the problem?
- Compute the derivatives of all activation functions:
- Graph all activation functions and their derivatives:
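A reference sketch answering the "compute the derivatives" prompt above for the common activations; the leaky-ReLU slope `a=0.01` and the convention of taking the ReLU derivative to be 0 at the origin are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2       # tanh'(z) = 1 - tanh(z)^2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)       # undefined at 0; conventionally 0

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

def d_leaky_relu(z, a=0.01):
    return np.where(z > 0, 1.0, a)
```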
Kernels
- Define “Local Kernel” and give an analogy to describe it:
- Write the following kernels:
- Polynomial Kernel of degree, up to, \(d\):
- Gaussian Kernel:
- Sigmoid Kernel:
- Polynomial Kernel of degree, exactly, \(d\):
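The kernels listed above written out for vectors `x`, `z`; the hyperparameter values (`c`, `d`, `gamma`) are arbitrary assumptions for the sketch.

```python
import numpy as np

def polynomial_up_to_d(x, z, c=1.0, d=3):    # degree up to d (requires c > 0)
    return (x @ z + c) ** d

def polynomial_exactly_d(x, z, d=3):         # degree exactly d
    return (x @ z) ** d

def gaussian(x, z, gamma=0.5):               # RBF: exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=0.1, c=-1.0): # tanh(gamma * x.z + c)
    return np.tanh(gamma * (x @ z) + c)
```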
Math
- What is a metric?
- Describe Binary Relations and their Properties?
- Formulas:
- Set theory:
- Number of subsets of a set of \(N\) elements:
- Number of pairs \((a,b)\) of a set of N elements:
- Binomial Theorem:
- Binomial Coefficient:
- Expansion of \(x^n - y^n =\)
- Number of ways to partition \(N\) data points into \(k\) clusters:
- \(\log_x(y) =\)
- The length of a vector \(\mathbf{x}\) along a direction (projection):
- \(\sum_{i=1}^{n} 2^{i}=\)
- List 6 proof methods:
- Important Formulas
- Projection \(\tilde{\mathbf{x}}\) of a vector \(\mathbf{x}\) onto another vector \(\mathbf{u}\):
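For reference, the two projection formulas asked for above (assuming \(\mathbf{u} \neq \mathbf{0}\)):

$$
\tilde{\mathbf{x}}=\frac{\mathbf{x}^{\top} \mathbf{u}}{\|\mathbf{u}\|^{2}}\, \mathbf{u},
\qquad
\text{length of } \mathbf{x} \text{ along } \mathbf{u}: \;\; \frac{\mathbf{x}^{\top} \mathbf{u}}{\|\mathbf{u}\|}
$$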
Statistics
- ROC curve:
- Definition:
- Purpose:
- How do you create the plot?
- How to identify a good classifier:
- How to identify a bad classifier:
- What is its application in tuning the model?
- AUC - AUROC:
- Definition:
- Range:
- What does it measure:
- Usage in ML:
- Define Statistical Efficiency (of an estimator)?
- What's the difference between Errors and Residuals:
- Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
- What is a biased estimator?
- Why would we prefer biased estimators in some cases?
- What is the difference between “Probability” and “Likelihood”:
- Estimators:
- Define:
- Formula:
- What's a good estimator?
- What are the Assumptions made regarding the estimated parameter:
- What is Function Estimation:
- What's the relation between the Function Estimator \(\hat{f}\) and Point Estimator:
- Define “marginal likelihood” (wrt naive bayes):
(Statistics) - MLE
- Clearly Define MLE and derive the final formula:
- Write MLE as an expectation wrt the Empirical Distribution:
- Describe formally the relationship between MLE and the KL-divergence:
- Extend the argument to show the link between MLE and Cross-Entropy. Give an example:
- How does MLE relate to the model distribution and the empirical distribution?
- What is the intuition behind using MLE?
- What does MLE find/result in?
- What kind of problem is MLE and how to solve for it?
- How does it relate to SLT:
- Explain clearly why we maximize the natural log of the likelihood
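A compact reference for the “final formula” and “expectation wrt the empirical distribution” questions above, assuming i.i.d. samples \(x^{(1)}, \ldots, x^{(m)}\) drawn from the data-generating distribution (empirical distribution \(\hat{p}_{\text{data}}\)):

$$
\theta_{\mathrm{ML}}
=\arg \max _{\theta} \prod_{i=1}^{m} p_{\text{model}}\left(x^{(i)} ; \theta\right)
=\arg \max _{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left(x^{(i)} ; \theta\right)
=\arg \max _{\theta} \; \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x ; \theta)\right]
$$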
Text-Classification | Classical
- List some Classification Methods:
- List some Applications of Txt Classification:
NLP
- List some problems in NLP:
- List the Solved Problems in NLP:
- List the “within reach” problems in NLP:
- List the Open Problems in NLP:
- Why is NLP hard? List Issues:
- Define:
- Morphology:
- Morphemes:
- Stems:
- Affixes:
- Stemming:
- Lemmatization:
Language Modeling
- What is a Language Model?
- List some Applications of LMs:
- Traditional LMs:
- How are they setup?
- What do they depend on?
- What is the Goal of the LM task? (in the ctxt of the problem setup)
- What assumptions are made by the problem setup? Why?
- What are the MLE Estimates for probabilities of the following:
- Bi-Grams:
- Tri-Grams:
- What are the issues w/ Traditional Approaches?
- What+How can we setup some NLP tasks as LM tasks:
- How does the LM task relate to Reasoning/AGI:
- Evaluating LM models:
- List the Loss Functions (+formula) used to evaluate LM models? Motivate each:
- Which application of LM modeling does each loss work best for?
- Why Cross-Entropy:
- Which setting it used for?
- Why Perplexity:
- Which setting used for?
- If no surprise, what is the perplexity?
- How does having a good LM relate to Information Theory?
- LM DATA:
- How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
- How should we choose a subset of articles for testing:
- List three approaches to Parametrizing LMs:
- Describe “Count-Based N-gram Models”:
- What distributions do they capture?:
- Describe “Neural N-gram Models”:
- What do they replace the captured distribution with?
- What are they better at capturing:
- Describe “RNNs”:
- What do they replace/capture?
- How do they capture it?
- What are they best at capturing:
- What's the main issue in LM modeling?
- How do N-gram models capture/approximate the history?:
- How do RNNs models capture/approximate the history?:
- The Bias-Variance Tradeoff of the following:
- N-Gram Models:
- RNNs:
- Give an estimate that predicts the probability of a sentence by how many times it has been seen before:
- What happens in the limit of infinite data?
- What are the advantages of sub-word level LMs:
- What are the disadvantages of sub-word level LMs:
- What is a “Conditional LM”?
- Write the decomposition of the probability for the Conditional LM:
- Describe the Computational Bottleneck for Language Models:
- Describe/List some solutions to the Bottleneck:
- Complexity Comparison of the different solutions:
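A small sketch of the perplexity metric discussed in the evaluation questions above; the input format (one model probability per held-out token) is an assumption for the example.

```python
import numpy as np

def perplexity(probs):
    """Perplexity over a held-out sequence, given the model's probability
    p(w_t | history) for each of the T tokens: the exponential of the
    average negative log-probability (i.e., exp of the per-token cross-entropy)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

# If the model is never surprised (probability 1 for every token),
# the perplexity is exactly 1 -- the "no surprise" question above.
assert perplexity([1.0, 1.0, 1.0]) == 1.0
```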
Regularization
- Define Regularization both intuitively and formally:
- Define “well-posedness”:
- Give four aspects of justification for regularization (theoretical):
- From a philosophical pov:
- From a probabilistic pov:
- From an SLT pov:
- From a practical pov (relating to the real-world):
- Describe an overview of regularization in DL. How does it usually work?
- Intuitively, how can a regularizer be effective?
- Describe the relationship between regularization and capacity:
- Describe the different approaches to regularization:
- List 9 regularization techniques:
- Describe Parameter Norm Penalties (PNPs):
- Define the regularized objective:
- Describe the parameter \(\alpha\):
- How does it influence the regularization:
- What is the effect of minimizing the regularized objective?
- How do we deal with the Bias parameter in PNPs? Explain.
- Describe the tuning of the \(\alpha\) HP in NNs for different hidden layers:
- Formally describe the \(L^2\) parameter regularization:
- AKA:
- Describe the regularization contribution to the gradient in a single step.
- Describe the regularization contribution to the gradient. How does it scale?
- How does weight decay relate to shrinking the individual weights wrt their size? What is the measure/comparison used?
- Draw a graph describing the effects of \(L^2\) regularization on the weights:
- Describe the effects of applying weight decay to linear regression
- Derivation:
- What is \(L^2\) regularization equivalent to?
- What are we maximizing?
- Derive the MAP Estimate:
- What kind of prior do we place on the weights? What are its parameters?
- List the properties of \(L^2\) regularization:
- Formally describe the \(L^1\) parameter regularization:
- AKA:
- Whats the regularized objective function?
- What is its gradient?
- Describe the regularization contribution to the gradient compared to L2. How does it scale?
- List the properties and applications of \(L^1\) regularization:
- How is it used as a feature selection mechanism?
- Derivation:
- What is \(L^1\) regularization equivalent to?
- What kind of prior do we place on the weights? What are its parameters?
- Analyze \(L^1\) vs \(L^2\) regularization:
- For Sparsity:
- For correlated features:
- For optimization:
- Give an example that shows the difference wrt sparsity:
- For sensitivity:
- Describe Elastic Net Regularization. Why was it devised? Any properties?
- Motivate Regularization for ill-posed problems:
- What is the property that needs attention?
- What would the regularized solution correspond to in this case?
- Are there any guarantees for the solution to be well-posed? How/Why?
- What is the Linear Algebraic property that needs attention?
- What models are affected by this?
- What would the solution correspond to in terms of inverting \(X^{\top}X\):
- When would \(X^{\top}X\) be singular?
- Describe the Linear Algebraic Perspective. What does it correspond to? [LAP]
- Can models with no closed-form solution be underdetermined? Explain. [CFS]
- What models are affected by this? [CFS]
- Define the Moore-Penrose Pseudoinverse:
- What can it solve? How?
- What does it correspond to in terms of regularization?
- What is the limit wrt?
- How can we interpret the pseudoinverse wrt regularization?
- Explain the problem with Logistic Regression:
- What are the possible solutions?
- Are there any guarantees that we achieve with regularization? How?
- Describe dataset augmentation and its techniques:
- When is it applicable?
- When is it not?
- Motivate the Noise Robustness property:
- How can Noise Robustness motivate a regularization technique?
- How can we enhance noise robustness in NNs?
- Give a motivation for Noise Injection:
- Where can noise be injected?
- Give Motivation, Interpretation and Applications of injecting noise in the different components (from above):
- Injecting Noise in the Input Layer:
- Injecting Noise in the Hidden Layers:
- Injecting Noise in the Weight Matrices:
- Injecting Noise in the Output Layer:
- Give an interpretation for injecting noise in the Input layer:
- Give an interpretation for injecting noise in the Hidden layers:
- What is the most successful application of this technique:
- Describe the Bayesian View of learning:
- How does it motivate injecting noise in the weight matrices?
- Describe a different, more traditional, interpretation of injecting noise to matrices. What are its effects on the function to be learned?
- Whats the biggest application for this kind of regularization?
- Motivate injecting noise in the Output layer:
- What is the biggest application of this technique?
- How does it compare to weight-decay when applied to MLE problems?
- Define “Semi-Supervised Learning”:
- What does it refer to in the context of DL:
- What is its goal?
- Give an example in classical ML:
- Describe an approach to applying semi-supervised learning:
- How can we interpret dropout wrt data augmentation?
- Add Answers from link below for L2 applied to linear regression and how it reduces variance:
- When is Ridge regression favorable over Lasso regression? What about for correlated features?
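A minimal sketch of how the \(L^2\) and \(L^1\) penalties from this section change a single gradient step; the learning rate, the strength \(\alpha\), and the function names are assumptions for illustration.

```python
import numpy as np

def step_l2(w, grad_J, lr=0.1, alpha=0.01):
    # L2 ("weight decay"): the penalty's gradient alpha * w scales linearly with w,
    # so each step multiplicatively shrinks the weights before the usual update.
    return w - lr * (grad_J + alpha * w)

def step_l1(w, grad_J, lr=0.1, alpha=0.01):
    # L1: the penalty's (sub)gradient alpha * sign(w) has constant magnitude,
    # pushing small weights all the way to zero (sparsity / feature selection).
    return w - lr * (grad_J + alpha * np.sign(w))
```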
Misc.
- Explain Latent Dirichlet Allocation (LDA)
- How to deal with curse of dimensionality
- How to detect correlation of “categorical variables”?
- Define “marginal likelihood” (wrt naive bayes):
- KNN VS K-Means
- When is Ridge regression favorable over Lasso regression for correlated features?
- What is a convex hull?
- Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
- OLS vs MLE
- What are collinearity and multicollinearity?
- Describe ways to overcome scaling (scalability) issues: