Gradient-Based Optimization

  1. Define Gradient Methods:
  2. Give examples of Gradient-Based Algorithms:
  3. What is Gradient Descent:
  4. Explain it intuitively:
  5. Give its derivation:
  6. What is the learning rate?
    1. Where does it come from?
    2. What is its range?
    3. How do we choose the learning rate?

  7. Describe the convergence of the algorithm:
  8. How does GD relate to Euler?
  9. List the variants of GD:
    1. How do they differ? Why?

  10. What is the problem of vanilla approaches to GD?

  11. List the different strategies for optimizing GD:
  12. List the different variants for optimizing GD:


Maximum Margin Classifiers

  1. Define Margin Classifiers:
  2. What is a Margin for a linear classifier?
  3. Give the motivation for margin classifiers:
  4. Define the notion of the “best” possible classifier
  5. How can we achieve the “best” classifier?
  6. What unique vector is orthogonal to the hp? Prove it:
  7. What do we mean by “signed distance”? Derive its formula:
  8. Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
  9. Use geometric properties of the hp to simplify the expression above for the distance of the closest point to the hp:
  10. Characterize the margin, mathematically:
  11. Characterize the “Slab Existence”:
  12. Formulate the optimization problem of maximizing the margin wrt analysis above:
  13. Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
    1. Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:
    2. What kind of formulation is it (wrt optimization)? What are the parameters?

Hard-Margin SVMs

  1. Define:
    1. SVMs:
    2. Support Vectors:
    3. Hard-Margin SVM:
  2. Define the following wrt hard-margin SVM:
    1. Goal:
    2. Procedure:
    3. Decision Function:
    4. Constraints:
    5. The Optimization Problem:
    6. The Optimization Method:
  3. Elaborate on the generalization analysis:
  4. List the properties:
  5. Give the solution to the optimization problem for H-M SVM:
    1. What method does it require to be solved:
    2. Formulate the Lagrangian:
    3. Optimize the objective for each variable:
    4. Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
    5. Set the problem as a Quadratic Programming problem:
    6. What are the inputs and outputs to the Quadratic Program Package?
    7. Give the final form of the optimization problem in standard form:

Soft-Margin SVM

  1. Motivate the soft-margin SVM:
  2. What is the main idea behind it?
  3. Define the following wrt soft-margin SVM:
    1. Goal:
    2. Procedure:
    3. Decision Function:
    4. Constraints:
      1. Why is there a non-negativity constraint?
    5. Objective/Cost Function:
    6. The Optimization Problem:
    7. The Optimization Method:
    8. Properties:
  4. Specify the effects of the regularization hyperparameter \(C\):
    1. Describe the effect wrt over/under fitting:
  5. How do we choose \(C\)?
  6. Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)

Loss Functions

  1. Define:
    1. Loss Functions - Abstractly and Mathematically:
    2. Distance-Based Loss Functions:
      1. What are they used for?
      2. Describe an important property of dist-based losses:
        Translation Invariance:
    3. Relative Error - What does it lack?
  2. List 3 Regression Loss Functions

  3. List 7 Classification Loss Functions


Information Theory

  1. What is Information Theory? In the context of ML?
  2. Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
  3. Measuring Information - Definitions and Formulas:
    1. In Shannon’s theory, how do we quantify “transmitting 1 bit of information”?
    2. What is the amount of information transmitted?
    3. What is the uncertainty reduction factor?
    4. What is the amount of information in an event \(x\)?
  4. Define the Self-Information - Give the formula:
    1. What is it defined with respect to?
  5. Define Shannon Entropy - What is it used for?
    1. Describe, with a graph, how Shannon Entropy relates to distributions:
  6. Define Differential Entropy:
  7. How does entropy characterize distributions?
  8. Define Relative Entropy - Give its formula:
    1. Give an interpretation:
    2. List the properties:
    3. Describe it as a distance:
    4. List the applications of relative entropy:
    5. How does the direction of minimization affect the optimization:
  9. Define Cross Entropy - Give its formula:
    1. What does it measure?
    2. How does it relate to relative entropy?
    3. When are they equivalent?

Recommendation Systems

  1. Describe the different algorithms for recommendation systems:

Ensemble Learning

  1. What are the two paradigms of ensemble methods?
  2. Random Forest VS GBM?

Data Processing and Analysis

  1. What are 3 data preprocessing techniques to handle outliers?
  2. Describe the strategies for dimensionality reduction:
  3. What are 3 ways of reducing dimensionality?
  4. List methods for Feature Selection
  5. List methods for Feature Extraction
  6. How to detect correlation of “categorical variables”?
  7. Feature Importance
  8. Can we capture the correlation between a continuous and a categorical variable? If yes, how?
  9. What cross-validation technique would you use on a time-series dataset?
  10. How to deal with missing features? (Imputation?)
  11. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
  12. What are collinearity and multicollinearity?

ML/Statistical Models

  1. What are parametric models?
  2. What is a classifier?

K-NN


PCA

  1. What is PCA?
  2. What is the Goal of PCA?
  3. List the applications of PCA:
  4. Give formulas for the following:
    1. Assumptions on \(X\):
    2. SVD of \(X\):
    3. Principal Directions/Axes:
    4. Principal Components (scores):
    5. The \(j\)-th principal component:
  5. Define the transformation, mathematically:
  6. What does PCA produce/result in?

  7. Describe the PCA algorithm:

  8. Describe the Optimality of PCA:
  9. List limitations of PCA:
  10. Intuition:

  11. Should you remove correlated features before PCA?
  12. How can we measure the “Total Variance” of the data?
  13. How can we measure the “Total Variance” of the projected data?
  14. How can we measure the “Error in the Projection”?
    1. What does it mean when this ratio is high?
  15. How does PCA relate to CCA?
  16. How does PCA relate to ICA?

The Centroid Method


K-Means

  1. What is K-Means?
  2. What is the idea behind K-Means?
  3. What does K-Means find?
  4. Formal Description of the Model:
    1. What is the Objective?
  5. Description of the Algorithm:
  6. What is the Optimization method used? What class does it belong to?

  7. What is the Complexity of the algorithm?
  8. Describe the convergence and prove it:

  9. Describe the Optimality of the Algorithm:
  10. Derive the estimated parameters of the algorithm:
    1. Objective Function:
    2. Optimization Objective:
    3. Derivation:

  11. When does K-Means fail to give good results?

Naive Bayes

  1. Define:
    1. Naive Bayes:
    2. Naive Bayes Classifiers:
    3. Bayes Theorem:
  2. List the assumptions of Naive Bayes:
  3. List some properties of Naive Bayes:

  4. Define the Probabilistic Model for the method:

  5. Construct the classifier. What are its components? Formally define it.

  6. What are the parameters to be estimated for the classifier?
  7. What method do we use to estimate the parameters?
  8. What are the estimates for each of the following parameters?


CNNs

  1. What is a CNN?
    1. What kind of data does it work on? What is the mathematical property?
  2. What are the layers of a CNN?
  3. What are the four important ideas and their benefits that the convolution affords CNNs:
    Benefits:
    Benefits:
    Benefits:
  4. What is the inspirational model for CNNs:
  5. Describe the connectivity pattern of the neurons in a layer of a CNN:
  6. Describe the process of a ConvNet:
  7. Convolution Operation:
    1. Define:
    2. Formula (continuous):
    3. Formula (discrete):
    4. Define the following:
      1. Feature Map:
    5. Does the operation commute?
  8. Cross Correlation:
    1. Define:
    2. Formulae:
    3. What are the differences/similarities between convolution and cross-correlation:
  9. Write down the convolution operation and the cross-correlation over two axes:
    1. Convolution:
    2. Convolution (commutative):
    3. Cross-Correlation:
  10. The Convolutional Layer:
    1. What are the parameters and how do we choose them?
    2. Describe what happens in the forward pass:
    3. What is the output of the forward pass:
    4. How is the output configured?
  11. Spatial Arrangements:
    1. List the Three Hyperparameters that control the output volume:
    2. How to compute the spatial size of the output volume?
    3. How can you ensure that the input & output volume are the same?
    4. In the output volume, how do you compute the \(d\)-th depth slice:
  12. Calculate the number of parameters for the following config:
  13. Definitions:
    1. Receptive Field:
  14. Suppose the input volume has size \([32 \times 32 \times 3]\) and the receptive field (or the filter size) is \(5 \times 5\), then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
  15. How can we achieve the greatest reduction in the spatial dims of the network (for classification):
  16. Pooling Layer:
    1. Define:
    2. List key ideas/properties and benefits:
    3. List the different types of Pooling:
    4. List variations of pooling and their definitions:

    5. List the hyperparams of Pooling Layer:
    6. How to calculate the size of the output volume:
    7. How many parameters does the pooling layer have:
    8. What are other ways to perform downsampling:
  17. Weight Priors:
    1. Define “Prior Prob Distribution on the parameters”:
    2. Define “Weight Prior” and its types/classes:

    3. Describe the Conv Layer as a FC Layer using priors:
    4. What are the key insights of using this view:

  18. When do multi-channel convolutions commute?
  19. Why do we use several different kernels in a given conv-layer?
  20. Strided Convolutions
    1. Define:
    2. What are they used for?
    3. What are they equivalent to?
    4. Formula:
  21. Zero-Padding:
    1. Definition/Usage:
    2. List the types of padding:
  22. Locally Connected Layers/Unshared Convolutions:
  23. Bias Parameter:
    1. How many bias terms are used per output channel in the traditional convolution?
  24. Dilated Convolutions
    1. Define:
    2. What are they used for?
  25. Stacked Convolutions
    1. Define:
    2. What are they used for?

Theory


RNNs

  1. What is an RNN?
    1. Definition:
    2. What machine-type is the standard RNN:
  2. What is the big idea behind RNNs?
  3. Dynamical Systems:
    1. Standard Form:
    2. RNN as a Dynamical System:
  4. Unfolding Computational Graphs
    1. Definition:
    2. List the Advantages introduced by unfolding and the benefits:
    3. Graph and write the equations of Unfolding hidden recurrence:
  5. Describe the State of the RNN, its usage, and extreme cases of the usage:
  6. RNN Architectures:
    1. List the three standard architectures of RNNs:
      1. Graph:
      2. Architecture:
      3. Equations:
      4. Total Loss:
      5. Complexity:
      6. Properties:
  7. Teacher Forcing:
    1. Definition:
    2. Application:
    3. Disadvantages:
    4. Possible Solutions for Mitigation:

Optimization

  1. Define the sigmoid function and some of its properties:
  2. Backpropagation:
    1. Definition:
    2. Derive Gradient Descent Update:
    3. Explain the different kinds of gradient-descent optimization procedures:
    4. List the different optimizers and their properties:
  3. Error-Measures:
    1. Define what an error measure is:
    2. List the 5 most common error measures and where they are used:
    3. Specific Questions:

  4. Show that the weight vector of a linear signal is orthogonal to the decision boundary:
  5. What does it mean for a function to be well-behaved from an optimization pov?
  6. Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation
  7. Compute:
    1. \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
  8. State the difference between SGD and GD?
  9. When would you use GD over SGD, and vice-versa?
  10. What is a convex hull?
  11. OLS vs MLE

ML Theory

  1. Explain intuitively why Deep Learning works?
  2. List the different types of Learning Tasks and their definitions:
  3. Describe the relationship between supervised and unsupervised learning?
  4. Describe the differences between Discriminative and Generative Models?
  5. Describe the curse of dimensionality and its effects on problem solving:
  6. How to deal with curse of dimensionality?
  7. Describe how to initialize a NN and any concerns w/ reasons:
  8. Describe the difference between Learning and Optimization in ML:
  9. List the 12 Standard Tasks in ML:
  10. What is the difference between inductive and deductive learning?

Statistical Learning Theory

  1. Define Statistical Learning Theory:

    How can we affect performance on the test set when we can only observe the training set?

  2. What assumptions are made by the theory?

  3. Give the Formal Definition of SLT:

  4. Define Empirical Risk Minimization:
  5. What is the Complexity of ERM?

  6. Definitions:
    1. Generalization:
    2. Generalization Error:
    3. Generalization Gap:
      1. Computing the Generalization Gap:
      2. What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
    4. Achieving (“good”) Generalization:
    5. Empirical Distribution:
  7. Describe the difference between Learning and Optimization in ML:
  8. Describe the difference between Generalization and Learning in ML:
  9. How to achieve Learning?
  10. What does the (VC) Learning Theory Achieve?
  11. Why do we need the probabilistic framework?
  12. What is the Approximation-Generalization Tradeoff:
  13. What are the factors determining how well an ML-algo will perform?
  14. Define the following and their usage/application & how they relate to each other:
    1. Underfitting:
    2. Overfitting:
    3. Capacity:
      1. Models with Low-Capacity:
      2. Models with High-Capacity:
    4. Hypothesis Space:
    5. VC-Dimension:
      1. What does it measure?
    6. Graph the relation between Error, and Capacity in the ctxt of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
  15. What is the most important result in SLT that shows that learning is feasible?

Bias-Variance Decomposition Theory

  1. What is the Bias-Variance Decomposition Theory:
  2. What are the Assumptions made by the theory?
  3. What is the question that the theory tries to answer? What assumption is important? How do you achieve the answer/goal?
  4. What is the Bias-Variance Decomposition:
  5. Define each term w.r.t. source of the error:
  6. What does each of the following measure? Describe it in words and give its other name in statistics:
    1. Bias:
    2. Variance:
  7. Give the Formal Definition of the Decomposition (Formula):
    1. What is the Expectation over?
  8. Define the Bias-Variance Tradeoff:
    1. Effects of Bias:
    2. Effects of Variance:
    3. Draw the Graph of the Tradeoff (wrt model capacity):
  9. Derive the Bias-Variance Decomposition with explanations:
  10. What are the key Takeaways from the Tradeoff?
  11. What are the most common ways to negotiate the Tradeoff?
  12. How does the decomposition relate to Classification?
  13. Increasing/Decreasing Bias&Variance:

Activation Functions

  1. Describe the Desirable Properties for activation functions:

  2. Describe the NON-Desirable Properties for activation functions:

  3. List the different activation functions used in ML?
    Names, Definitions, Properties (pros & cons), Derivatives, Applications:


Kernels

  1. Define “Local Kernel” and give an analogy to describe it:
  2. Write the following kernels:
    1. Polynomial Kernel of degree, up to, \(d\):
    2. Gaussian Kernel:
    3. Sigmoid Kernel:
    4. Polynomial Kernel of degree, exactly, \(d\):

Math

  1. What is a metric?
  2. Describe Binary Relations and their Properties?
  3. Formulas:
    1. Set theory:
      1. Number of subsets of a set of \(N\) elements:
      2. Number of pairs \((a,b)\) of a set of N elements:
    2. Binomial Theorem:
    3. Binomial Coefficient:
    4. Expansion of \(x^n - y^n =\)
    5. Number of ways to partition \(N\) data points into \(k\) clusters:
    6. \(\log_x(y) =\)
    7. The length of a vector \(\mathbf{x}\) along a direction (projection):
    8. \(\sum_{i=1}^{n} 2^{i}=\)
  4. List 6 proof methods:
  5. Important Formulas
    1. Projection \(\tilde{\mathbf{x}}\) of a vector \(\mathbf{x}\) onto another vector \(\mathbf{u}\):

Statistics

  1. ROC curve:
    1. Definition:
    2. Purpose:
    3. How do you create the plot?
    4. How to identify a good classifier:
    5. How to identify a bad classifier:
    6. What is its application in tuning the model?
  2. AUC - AUROC:
    1. Definition:
    2. Range:
    3. What does it measure:
    4. Usage in ML:
  3. Define Statistical Efficiency (of an estimator)?
  4. What’s the difference between Errors and Residuals?
    1. Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
  5. What is a biased estimator?
    1. Why would we prefer biased estimators in some cases?
  6. What is the difference between “Probability” and “Likelihood”:
  7. Estimators:
    1. Define:
    2. Formula:
    3. What’s a good estimator?
    4. What are the Assumptions made regarding the estimated parameter:
  8. What is Function Estimation:
    1. What’s the relation between the Function Estimator \(\hat{f}\) and the Point Estimator?
  9. Define “marginal likelihood” (wrt naive bayes):

(Statistics) - MLE

  1. Clearly Define MLE and derive the final formula:
    1. Write MLE as an expectation wrt the Empirical Distribution:
    2. Describe formally the relationship between MLE and the KL-divergence:
    3. Extend the argument to show the link between MLE and Cross-Entropy. Give an example:
    4. How does MLE relate to the model distribution and the empirical distribution?
    5. What is the intuition behind using MLE?
    6. What does MLE find/result in?
    7. What kind of problem is MLE and how to solve for it?
    8. How does it relate to SLT:
    9. Explain clearly why we maximize the natural log of the likelihood

Text-Classification | Classical

  1. List some Classification Methods:
  2. List some Applications of Txt Classification:

NLP

  1. List some problems in NLP:
  2. List the Solved Problems in NLP:
  3. List the “within reach” problems in NLP:
  4. List the Open Problems in NLP:
  5. Why is NLP hard? List Issues:
  6. Define:
    1. Morphology:
    2. Morphemes:
    3. Stems:
    4. Affixes:
    5. Stemming:
    6. Lemmatization:

Language Modeling

  1. What is a Language Model?
  2. List some Applications of LMs:
  3. Traditional LMs:
    1. How are they set up?
    2. What do they depend on?
    3. What is the Goal of the LM task? (in the ctxt of the problem setup)
    4. What assumptions are made by the problem setup? Why?
    5. What are the MLE Estimates for probabilities of the following:
      1. Bi-Grams:
      2. Tri-Grams:
    6. What are the issues w/ Traditional Approaches?
  4. Which NLP tasks can we set up as LM tasks, and how?
  5. How does the LM task relate to Reasoning/AGI:
  6. Evaluating LM models:
    1. List the Loss Functions (+formula) used to evaluate LM models? Motivate each:
    2. Which application of LM modeling does each loss work best for?

  7. LM DATA:
    1. How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
    2. How should we choose a subset of articles for testing:
  8. List three approaches to Parametrizing LMs:

  9. What’s the main issue in LM modeling?

    1. The Bias-Variance Tradeoff of the following:
      1. N-Gram Models:
      2. RNNs:
      3. An Estimate s.t. it predicts the probability of a sentence by how many times it has seen it before:
        1. What happens in the limit of infinite data?
  10. What are the advantages of sub-word level LMs:
  11. What are the disadvantages of sub-word level LMs:
  12. What is a “Conditional LM”?
  13. Write the decomposition of the probability for the Conditional LM:
  14. Describe the Computational Bottleneck for Language Models:
  15. Describe/List some solutions to the Bottleneck:
  16. Complexity Comparison of the different solutions:



Regularization

  1. Define Regularization both intuitively and formally:
  2. Define “well-posedness”:
  3. Give four aspects of justification for regularization (theoretical):

  4. Describe an overview of regularization in DL. How does it usually work?
    1. Intuitively, how can a regularizer be effective?
  5. Describe the relationship between regularization and capacity:
  6. Describe the different approaches to regularization:
  7. List 9 regularization techniques:

  8. Add Answers from link below for L2 applied to linear regression and how it reduces variance:
  9. When is Ridge regression favorable over Lasso regression for correlated features?

Misc.

  1. Explain Latent Dirichlet Allocation (LDA)
  2. How to deal with curse of dimensionality
  3. How to detect correlation of “categorical variables”?
  4. Define “marginal likelihood” (wrt naive bayes):
  5. KNN VS K-Means
  6. When is Ridge regression favorable over Lasso regression for correlated features?
  7. What is a convex hull?
  8. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
  9. OLS vs MLE
  10. What are collinearity and multicollinearity?
  11. Describe ways to overcome scaling (scalability) issues: