Gradient-Based Optimization

  1. Define Gradient Methods:

  2. Give examples of Gradient-Based Algorithms:
  3. What is Gradient Descent:
  4. Explain it intuitively:
  5. Give its derivation:
  6. What is the learning rate?
    1. Where does it come from?

    2. How do we choose the learning rate?

  7. Describe the convergence of the algorithm:
  8. How does GD relate to Euler's method?

  9. List the variants of GD:
    1. How do they differ?

  10. What is the problem of vanilla approaches to GD?
  11. List the different strategies for optimizing GD:
  12. List the different variants for optimizing GD:


Maximum Margin Classifiers

  1. Define Margin Classifiers:
  2. What is a Margin for a linear classifier?
  3. Give the motivation for margin classifiers:
  4. Define the notion of the “best” possible classifier
  5. How can we achieve the “best” classifier?
  6. What unique vector is orthogonal to the hyperplane? Prove it:
  7. What do we mean by “signed distance”? Derive its formula:
  8. Given the formula for signed distance, calculate the “distance of the point closest to the hyperplane”:
  9. Use geometric properties of the hyperplane to simplify the above expression for the distance of the closest point to the hyperplane:
  10. Characterize the margin, mathematically:
  11. Characterize the “Slab Existence”:
  12. Formulate the optimization problem of maximizing the margin wrt analysis above:
  13. Reformulate the optimization problem above to a more “friendly” version (wrt optimization -> put in standard form):
    1. Give the final (standard) formulation of the “Optimization problem for maximum margin classifiers”:
    2. What kind of formulation is it (wrt optimization)? What are the parameters?

Hard-Margin SVMs

  1. Define:
    1. SVMs:
    2. Support Vectors:
    3. Hard-Margin SVM:
  2. Define the following wrt hard-margin SVM:
    1. Goal:
    2. Procedure:
    3. Decision Function:
    4. Constraints:
    5. The Optimization Problem:
    6. The Optimization Method:
  3. Elaborate on the generalization analysis:
  4. List the properties:
  5. Give the solution to the optimization problem for H-M SVM:
    1. What method does it require to be solved:
    2. Formulate the Lagrangian:
    3. Optimize the objective for each variable:
    4. Get the Dual Formulation w.r.t. the (tricky) constrained variable \(\alpha_n\):
    5. Set the problem as a Quadratic Programming problem:
    6. What are the inputs and outputs to the Quadratic Program Package?
    7. Give the final form of the optimization problem in standard form:

Soft-Margin SVM

  1. Motivate the soft-margin SVM:
  2. What is the main idea behind it?
  3. Define the following wrt soft-margin SVM:
    1. Goal:
    2. Procedure:
    3. Decision Function:
    4. Constraints:
      1. Why is there a non-negativity constraint?
    5. Objective/Cost Function:
    6. The Optimization Problem:
    7. The Optimization Method:
    8. Properties:
  4. Specify the effects of the regularization hyperparameter \(C\):
    1. Describe the effect wrt over/under fitting:
  5. How do we choose \(C\)?
  6. Give an equivalent formulation in the standard form objective for function estimation (what should it minimize?)

Loss Functions

  1. Define:
    1. Loss Functions:
    2. Distance-Based Loss Functions:
      1. Describe an important property of dist-based losses:
      2. What are they used for?
    3. Relative Error - What does it lack?
  2. List 3 Regression Loss Functions

  3. List 7 Classification Loss Functions


Information Theory

  1. What is Information Theory? In the context of ML?
  2. Describe the Intuition for Information Theory. Intuitively, how does the theory quantify information (list)?
  3. Measuring Information - Definitions and Formulas:
    1. In Shannon's theory, how do we quantify “transmitting 1 bit of information”?
    2. What is the amount of information transmitted?
    3. What is the uncertainty reduction factor?
    4. What is the amount of information in an event \(x\)?
  4. Define the Self-Information:
    1. What is it defined with respect to?
  5. Define Shannon Entropy - what is it used for?
    1. Describe, with a graph, how Shannon Entropy relates to distributions:
  6. Define Differential Entropy:
  7. How does entropy characterize distributions?
  8. Define Relative Entropy:
    1. Give an interpretation:
    2. List the properties:
    3. Describe it as a distance:
    4. List the applications of relative entropy:

  9. Define Cross Entropy:
    1. What does it measure?
    2. How does it relate to relative entropy?
    3. When are they equivalent (wrt. optimization)?
  10. Mutual Information:
    1. Definition:
    2. What does it measure?
    3. Intuitive Definitions:
    4. Interpretations:
    5. Properties:
    6. Applications:

  11. Pointwise Mutual Information (PMI):
    1. Definition:
    2. Relation to MI:

Recommendation Systems

  1. Describe the different algorithms for recommendation systems:

Ensemble Learning

  1. What are the two paradigms of ensemble methods?
  2. Random Forest VS GBM?

Data Processing and Analysis

  1. What are 3 data preprocessing techniques to handle outliers?
  2. Describe the strategies to dimensionality reduction:
  3. What are 3 ways of reducing dimensionality?
  4. List methods for Feature Selection
  5. List methods for Feature Extraction
  6. How to detect correlation of “categorical variables”?
  7. Feature Importance
  8. Can we capture the correlation between a continuous and a categorical variable? If yes, how?
  9. What cross-validation technique would you use on a time-series dataset?
  10. How to deal with missing features? (Imputation?)
  11. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
  12. What are collinearity and multicollinearity?
  13. What is data normalization and why do we need it?

ML/Statistical Models

  1. What are parametric models?
  2. What is a classifier?

K-NN


PCA

  1. What is PCA?
  2. What is the Goal of PCA?
  3. List the applications of PCA:
  4. Give formulas for the following:
    1. Assumptions on \(X\):
    2. SVD of \(X\):
    3. Principal Directions/Axes:
    4. Principal Components (scores):
    5. The \(j\)-th principal component:
  5. Describe how to find the principal components:
  6. Define the transformation, mathematically:
  7. What does PCA produce/result in?
  8. Describe the PCA algorithm:
  9. Describe the Optimality of PCA:
  10. List limitations of PCA:
  11. Intuition:
  12. How does PCA relate to CCA?
  13. How does PCA relate to ICA?
  14. Should you remove correlated features before PCA?
  15. How can we measure the “Total Variance” of the data?
  16. How can we measure the “Total Variance” of the projected data?
  17. How can we measure the “Error in the Projection”?
    1. What does it mean when this ratio is high?

The Centroid Method


K-Means

  1. What is K-Means?
  2. What is the idea behind K-Means?
  3. What does K-Means find?
  4. Formal Description of the Model:
    1. What is the Objective?
  5. Description of the Algorithm:
  6. What is the Optimization method used? What class does it belong to?
  7. What is the Complexity of the algorithm?
  8. Describe the convergence and prove it:
  9. Describe the Optimality of the Algorithm:
  10. Derive the estimated parameters of the algorithm:
    1. Objective Function:
    2. Optimization Objective:
    3. Derivation:
  11. When does K-Means fail to give good results?

Naive Bayes

  1. Define:
    1. Naive Bayes:
    2. Naive Bayes Classifiers:
    3. Bayes Theorem:
  2. List the assumptions of Naive Bayes:
  3. List some properties of Naive Bayes:
  4. Define the Probabilistic Model for the method:
  5. Construct the classifier. What are its components? Formally define it.
  6. What are the parameters to be estimated for the classifier?
  7. What method do we use to estimate the parameters?
  8. What are the estimates for each of the following parameters?

CNNs

  1. What is a CNN?
  2. What are the layers of a CNN?
  3. What are the four important ideas that convolution affords CNNs, and what are their benefits?
  4. What model inspired CNNs?
  5. Describe the connectivity pattern of the neurons in a layer of a CNN:
  6. Describe the process of a ConvNet:
  7. Convolution Operation:
    1. Define:
    2. Formula (continuous):
    3. Formula (discrete):
    4. Define the following:
      1. Feature Map:
    5. Does the operation commute?
  8. Cross Correlation:
    1. Define:
    2. Formulae:
    3. What are the differences/similarities between convolution and cross-correlation:
  9. Write down the convolution operation and the cross-correlation over two axes:
    1. Convolution:
    2. Convolution (commutative):
    3. Cross-Correlation:
  10. The Convolutional Layer:
    1. What are the parameters and how do we choose them?
    2. Describe what happens in the forward pass:
    3. What is the output of the forward pass:
    4. How is the output configured?
  11. Spatial Arrangements:
    1. List the Three Hyperparameters that control the output volume:
    2. How to compute the spatial size of the output volume?
    3. How can you ensure that the input & output volume are the same?
    4. In the output volume, how do you compute the \(d\)-th depth slice:
  12. Calculate the number of parameters for the following config:

    Given:
    1. Input Volume: \(64\times64\times3\)
    2. Filters: \(15\) filters of size \(7\times7\)
    3. Stride: \(2\)
    4. Pad: \(3\)

  13. Definitions:
    1. Receptive Field:
  14. Suppose the input volume has size \([32 \times 32 \times 3]\) and the receptive field (or the filter size) is \(5 \times 5\); then each neuron in the Conv Layer will have weights to a __Blank__ region in the input volume, for a total of __Blank__ weights:
  15. How can we achieve the greatest reduction in the spatial dims of the network (for classification):
  16. Pooling Layer:
    1. Define:
    2. List key ideas/properties and benefits:
    3. List the different types of Pooling:
    4. List variations of pooling and their definitions:
    5. List the hyperparams of Pooling Layer:
    6. How to calculate the size of the output volume:
    7. How many parameters does the pooling layer have:
    8. What are other ways to perform downsampling:
  17. Weight Priors:
    1. Define “Prior Prob Distribution on the parameters”:
    2. Define “Weight Prior” and its types/classes:
    3. Describe the Conv Layer as a FC Layer using priors:
    4. What are the key insights of using this view:
  18. When do multi-channel convolutions commute?
  19. Why do we use several different kernels in a given conv-layer?
  20. Strided Convolutions
    1. Define:
    2. What are they used for?
    3. What are they equivalent to?
    4. Formula:
  21. Zero-Padding:
    1. Definition/Usage:
    2. List the types of padding:
  22. Locally Connected Layers/Unshared Convolutions:
  23. Bias Parameter:
    1. How many bias terms are used per output channel in the traditional convolution:
  24. Dilated Convolutions
    1. Define:
    2. What are they used for?
  25. Stacked Convolutions
    1. Define:
    2. What are they used for?
  26. What is the role of Bias(es) in CNNs:
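
As a way to check answers to the arithmetic questions in this section (items 12 and 14), here is a minimal Python sketch. It assumes the commonly used output-size formula \((W - F + 2P)/S + 1\) with floor division, square inputs and filters, and one bias per filter. The function names `conv_output_size` and `conv_param_count` are hypothetical helpers, not part of any library.

```python
def conv_output_size(input_size, filter_size, stride, pad):
    """Spatial output size along one axis, assuming (W - F + 2P) // S + 1."""
    return (input_size - filter_size + 2 * pad) // stride + 1


def conv_param_count(filter_size, in_depth, num_filters, use_bias=True):
    """Learnable parameters of a conv layer with square filters.

    Assumes each filter spans the full input depth and, if use_bias is True,
    carries a single bias term.
    """
    weights_per_filter = filter_size * filter_size * in_depth
    bias_per_filter = 1 if use_bias else 0
    return num_filters * (weights_per_filter + bias_per_filter)


# Item 12: 64x64x3 input, 15 filters of size 7x7, stride 2, pad 3.
out_size = conv_output_size(64, 7, 2, 3)
params = conv_param_count(7, 3, 15)
print(out_size, params)

# Item 14: weights of one neuron with a 5x5 receptive field on a depth-3 input.
print(conv_param_count(5, 3, 1, use_bias=False))
```

Whether the bias terms are counted is a convention; pass `use_bias=False` to count weights only.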

Theory

RNNs

  1. What is an RNN?
    1. Definition:
    2. What machine-type is the standard RNN:
  2. What is the big idea behind RNNs?
  3. Dynamical Systems:
    1. Standard Form:
    2. RNN as a Dynamical System:
  4. Unfolding Computational Graphs
    1. Definition:
    2. List the Advantages introduced by unfolding and the benefits:
    3. Graph and write the equations of Unfolding hidden recurrence:
  5. Describe the State of the RNN, its usage, and extreme cases of the usage:
  6. RNN Architectures:
    1. List the three standard architectures of RNNs:
      1. Graph:
      2. Architecture:
      3. Equations:
      4. Total Loss:
      5. Complexity:
      6. Properties:
  7. Teacher Forcing:
    1. Definition:
    2. Application:
    3. Disadvantages:
    4. Possible Solutions for Mitigation:

Optimization

  1. Define the sigmoid function and some of its properties:
  2. Backpropagation:
    1. Definition:
    2. Derive Gradient Descent Update:
    3. Explain the different kinds of gradient-descent optimization procedures:
    4. List the different optimizers and their properties:
  3. Error-Measures:
    1. Define what an error measure is:
    2. List the 5 most common error measures and where they are used:
    3. Specific Questions:
  4. Show that the weight vector of a linear signal is orthogonal to the decision boundary:
  5. What does it mean for a function to be well-behaved from an optimization pov?
  6. Write \(\|\mathrm{Xw}-\mathrm{y}\|^{2}\) as a summation
  7. Compute:
    1. \(\dfrac{\partial}{\partial y}\vert{x-y}\vert=\)
  8. State the difference between SGD and GD?
  9. When would you use GD over SGD, and vice versa?
  10. What is a convex hull?
  11. OLS vs MLE

ML Theory

  1. Explain intuitively why Deep Learning works?
  2. List the different types of Learning Tasks and their definitions:
  3. Describe the relationship between supervised and unsupervised learning?
  4. Describe the differences between Discriminative and Generative Models?
  5. Describe the curse of dimensionality and its effects on problem solving:
  6. How to deal with the curse of dimensionality?
  7. Describe how to initialize a NN and any concerns w/ reasons:
  8. Describe the difference between Learning and Optimization in ML:
  9. List the 12 Standard Tasks in ML:
  10. What is the difference between inductive and deductive learning?

Statistical Learning Theory

  1. Define Statistical Learning Theory:
  2. What assumptions are made by the theory?
  3. Give the Formal Definition of SLT:
  4. Define Empirical Risk Minimization:
  5. What is the Complexity of ERM?
  6. Definitions:
    1. Generalization:
    2. Generalization Error:
    3. Generalization Gap:
      1. Computing the Generalization Gap:
      2. What is the goal of SLT in the context of the Generalization Gap given that it can’t be computed?
    4. Achieving (“good”) Generalization:
      An algorithm is said to generalize when
    5. Empirical Distribution:
  7. Describe the difference between Learning and Optimization in ML:
  8. Describe the difference between Generalization and Learning in ML:
  9. How to achieve Learning?
  10. What does the (VC) Learning Theory Achieve?
  11. Why do we need the probabilistic framework?
  12. What is the Approximation-Generalization Tradeoff? How is it characterized?
  13. What are the factors determining how well an ML-algo will perform?
  14. Define the following and their usage/application & how they relate to each other:
    1. Underfitting:
    2. Overfitting:
    3. Capacity:
      • Models with Low-Capacity:
      • Models with High-Capacity:
    4. Hypothesis Space:
    5. VC-Dimension:
      1. What does it measure?
    6. Graph the relation between Error and Capacity in the context of (Underfitting, Overfitting, Training Error, Generalization Err, and Generalization Gap):
  15. What is the most important result in SLT that shows that learning is feasible?

Bias-Variance Decomposition Theory

  1. What is the Bias-Variance Decomposition Theory:
  2. What are the Assumptions made by the theory?
  3. What is the question that the theory tries to answer? How do you achieve the answer to this question? What assumption is important?
  4. What is the Bias-Variance Decomposition:
  5. Define each term w.r.t. source of the error (error from):
    1. Bias:
    2. Variance:
    3. Irreducible Error:
  6. What does each of the following measure (error in)? Describe the measured quantity in words and mathematically. Phrase Bias and Variance each as a question. Give their alternative names in statistics.
    1. Bias:
    2. Variance:
    3. Irreducible Error:
  7. Give the Formal Definition of the Decomposition (Formula):
    1. What is the Expectation over?
  8. Define the Bias-Variance Tradeoff:
    1. Effects of Bias:
      1. High Bias:
      2. Low Bias:
    2. Effects of Variance:
      1. High Variance:
      2. Low Variance:
    3. Draw the Graph of the Tradeoff (wrt model capacity):
  9. Derive the Bias-Variance Decomposition with explanations:
  10. What are the key Takeaways from the Tradeoff?
  11. What are the most common ways to negotiate the Tradeoff? (i.e. selecting/comparing models)
  12. How does the decomposition relate to Classification?
  13. Increasing/Decreasing Bias&Variance:
    1. Adding Good Feature:
    2. Adding Bad Feature:
    3. Adding ANY Feature:
    4. Adding more Data:
    5. Noise in Test Set:
    6. Noise in Training Set:
    7. Dimensionality Reduction:
    8. Feature Selection:
    9. Regularization:
    10. Increasing # of Hidden Units in ANNs:
    11. Increasing # of Hidden Layers in ANNs:
    12. Increasing \(k\) in K-NN:
    13. Increasing Depth in Decision-Trees:
    14. Boosting:
    15. Bagging:

Activation Functions

  1. Describe the Desirable Properties for activation functions:
  2. Describe the NON-Desirable Properties for activation functions:
  3. List the different activation functions used in ML?
    Names, Definitions, Properties (pros & cons), Derivatives, Applications:


Kernels

  1. Define “Local Kernel” and give an analogy to describe it:
  2. Write the following kernels:
    1. Polynomial Kernel of degree, up to, \(d\):
    2. Gaussian Kernel:
    3. Sigmoid Kernel:
    4. Polynomial Kernel of degree, exactly, \(d\):

Math

  1. What is a metric?

  2. Describe Binary Relations and their Properties?

  3. Formulas:
    1. Set theory:
      1. Number of subsets of a set of \(N\) elements:
      2. Number of pairs \((a,b)\) of a set of \(N\) elements:
    2. Binomial Theorem:
    3. Binomial Coefficient:
    4. Expansion of \(x^n - y^n =\)
    5. Number of ways to partition \(N\) data points into \(k\) clusters:
    6. \(\log_x(y) =\)
    7. The length of a vector \(\mathbf{x}\) along a direction (projection):
      1. Along a unit-length vector \(\hat{\mathbf{w}}\):
      2. Along an unnormalized vector \(\mathbf{w}\):
    8. \(\sum_{i=1}^{n} 2^{i}=\)
  4. List 6 proof methods:


Statistics

  1. ROC curve:
    1. Definition:
    2. Purpose:
    3. How do you create the plot?
    4. How to identify a good classifier:
    5. How to identify a bad classifier:
    6. What is its application in tuning the model?
  2. AUC - AUROC:
    1. Definition:
    2. Range:
    3. What does it measure:
    4. Usage in ML:
  3. Define Statistical Efficiency (of an estimator)?
    1. Intuitive Difference:
    2. How do we define Efficiency?
    3. What’s the difference between an efficient and an inefficient estimator?
    4. How’s the use of an inefficient estimator bad compared to an efficient one?
  4. What’s the difference between Errors and Residuals:
    1. Compute the statistical errors and residuals of the univariate, normal distribution defined as \(X_{1}, \ldots, X_{n} \sim N\left(\mu, \sigma^{2}\right)\):
  5. What is a biased estimator?
    1. Why would we prefer biased estimators in some cases?
  6. What is the difference between “Probability” and “Likelihood”:
  7. Estimators:
    1. Define:
    2. Formula:
    3. What’s a good estimator?
    4. What are the Assumptions made regarding the estimated parameter:
  8. What is Function Estimation:
    1. What’s the relation between the Function Estimator \(\hat{f}\) and the Point Estimator:
  9. Define “marginal likelihood” (wrt naive bayes):

(Statistics) - MLE

  1. Clearly Define MLE and derive the final formula:
    1. Write MLE as an expectation wrt the Empirical Distribution:
    2. Describe formally the relationship between MLE and the KL-divergence:
    3. Extend the argument to show the link between MLE and Cross-Entropy. Give an example of a well-known loss function:
    4. How does the form of the model (model family) affect the MLE Estimate?
    5. How does MLE relate to the model distribution and the empirical distribution?
    6. What is the intuition behind using MLE?
    7. What does MLE find/result in?
    8. What kind of problem is MLE and how to solve for it?
    9. How does it relate to SLT:
    10. Explain clearly why we maximize the natural log of the likelihood

Text-Classification | Classical

  1. List some Classification Methods:
  2. List some Applications of Text Classification:

NLP

  1. List some problems in NLP:
  2. List the Solved Problems in NLP:
  3. List the “within reach” problems in NLP:
  4. List the Open Problems in NLP:
  5. Why is NLP hard? List Issues:
  6. Define:
    1. Morphology:
    2. Morphemes:
    3. Stems:
    4. Affixes:
    5. Stemming:
    6. Lemmatization:

Language Modeling

  1. What is a Language Model?
  2. List some Applications of LMs:
  3. Traditional LMs:
    1. How are they set up?
    2. What do they depend on?
    3. What is the Goal of the LM task? (in the context of the problem setup)
    4. What assumptions are made by the problem setup? Why?
    5. What are the MLE Estimates for probabilities of the following:
      1. Bi-Grams:

        $$p(w_2\vert w_1) = $$

      2. Tri-Grams:

        $$p(w_3\vert w_1, w_2) = $$

    6. What are the issues w/ Traditional Approaches?
  4. Which NLP tasks can we set up as LM tasks, and how?
  5. How does the LM task relate to Reasoning/AGI:
  6. Evaluating LM models:
  7. LM DATA:
    1. How does the fact that LM is a time-series prediction problem affect the way we need to train/test:
    2. How should we choose a subset of articles for testing:
  8. List three approaches to Parametrizing LMs:
  9. What’s the main issue in language modeling?
    1. The Bias-Variance Tradeoff of the following:
      1. N-Gram Models:
      2. RNNs:
      3. An Estimate s.t. it predicts the probability of a sentence by how many times it has seen it before:
        1. What happens in the limit of infinite data?
  10. What are the advantages of sub-word level LMs:
  11. What are the disadvantages of sub-word level LMs:
  12. What is a “Conditional LM”?
  13. Write the decomposition of the probability for the Conditional LM:
  14. Describe the Computational Bottleneck for Language Models:
  15. Describe/List some solutions to the Bottleneck:
  16. Complexity Comparison of the different solutions:
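
For the bi-gram MLE question in item 3.5 above, here is a minimal Python sketch under the usual count-based assumption, \(p(w_2 \vert w_1) = \text{count}(w_1, w_2) / \text{count}(w_1)\), with no smoothing. The helper name `bigram_mle` and the toy sentence are illustrative only.

```python
from collections import Counter


def bigram_mle(tokens):
    """Unsmoothed MLE estimates p(w2 | w1) from a single token sequence.

    count(w1) is taken over positions where w1 starts a bi-gram, so the
    estimates for each context w1 sum to one.
    """
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)
for (w1, w2), p in sorted(probs.items()):
    print(f"p({w2} | {w1}) = {p:.2f}")
```

Any bi-gram absent from the training tokens gets no entry at all (an implicit probability of zero), which is one of the issues with traditional approaches that item 3.6 asks about.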

Regularization

  1. Define Regularization both intuitively and formally:
  2. Define “well-posedness”:
  3. Give four aspects of justification for regularization (theoretical):
  4. Describe an overview of regularization in DL. How does it usually work?
    1. Intuitively, how can a regularizer be effective?
  5. Describe the relationship between regularization and capacity. How does regularization work in this case?
  6. Describe the different approaches to regularization:
  7. List 9 regularization techniques:

  8. When is Ridge regression favorable over Lasso regression for correlated features?

Misc.

  1. Explain Latent Dirichlet Allocation (LDA)
  2. How to deal with the curse of dimensionality?
  3. How to detect correlation of “categorical variables”?
  4. Define “marginal likelihood” (wrt naive bayes):
  5. KNN VS K-Means
  6. When is Ridge regression favorable over Lasso regression for correlated features?
  7. Can we capture the correlation between a continuous and a categorical variable? If yes, how?
  8. Random Forest VS GBM?
  9. What is a convex hull?
  10. What cross-validation technique would you use on a time-series dataset?
  11. How to deal with missing features? (Imputation?)
  12. Describe the different algorithms for recommendation systems:
  13. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
  14. OLS vs MLE
  15. What is the difference between inductive and deductive learning?
  16. What are collinearity and multicollinearity?
  17. What are the two paradigms of ensemble methods?
  18. Describe Label Smoothing as a regularization technique:
    1. Give its motivation:
      • What is data normalization and why do we need it?
      • Weight initialization in neural networks:
      • How to improve Generalization:
      • How to prevent Overfitting:
      • How to control the capacity:
      • Why do small weights in a NN lead to lower capacity?

INTERVIEWS


FeedForward Neural Network

  1. What is a “FeedForward” Neural Network:
  2. What is the Architecture of an FFN (components and how they work together):
  3. List two examples of FFNs:

Multilayer Perceptron

  1. What model class does the “Multi-Layer Perceptron” belong to:
  2. What is the Architecture of an MLP:
  3. Describe “Learning” of an MLP (Learning Algorithm and brief description of the procedure and optimization):
  4. List the properties of the MLP:

Deep Feedforward Neural Networks

  1. Describe the Deep Feedforward Neural Networks:

  2. Describe the Motivation for Deep FFNs:
  3. How can we interpret Deep Neural Networks (in SLT):

AutoEncoders

  1. What is an AutoEncoder? What is its goal? (draw a diagram)
  2. What type of NN is the Autoencoder?
  3. Give Motivation for AutoEncoders:
  4. Why Deep AutoEncoders? What do they allow us to do?
  5. List the Advantages of Deep AutoEncoders:
  6. List the Applications of AutoEncoders:
  7. Describe the Training of Deep AutoEncoders:

  8. Describe the Architecture of AutoEncoders:
    1. What is the simplest form of an AE:
    2. What realm of “Learning” is employed for AEs?
  9. Mathematical Description of the Structure of AutoEncoders:

  10. Compare AutoEncoders and PCA (wrt what they learn):
  11. List the different Types of AEs
  12. How can we use AEs for Initialization?
  13. Describe the Representational Power of AEs:
  14. Describe the progression (stages) of AE Architectures in CV:

  15. What are Undercomplete AutoEncoders?
  16. What’s the motivation behind Undercomplete AEs?
  17. List the Challenges of Utilizing Undercomplete AEs:
  18. What is the Main Method/Approach of addressing the Challenges above (Training AEs)?