Table of Contents
- Non-Negative Matrix Factorization (NMF) Tutorial
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction [better than t-sne?] (Library Code!)
Dimensionality Reduction
Dimensionality Reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Dimensionality Reduction Methods:
- PCA
- Heatmaps
- t-SNE
- Multi-Dimensional Scaling (MDS)
t-SNE | t-Distributed Stochastic Neighbor Embedding
Understanding t-SNE Part 1: SNE algorithm and its drawbacks
Understanding t-SNE Part 2: t-SNE improvements over SNE
t-SNE (statwiki)
t-SNE tutorial (video)
SNE - Stochastic Neighbor Embedding:
SNE is a method that aims to match the distributions of distances between points in high- and low-dimensional space via conditional probabilities.
It assumes that distances in both the high- and low-dimensional spaces are Gaussian-distributed.
t-SNE:
t-SNE is a machine learning algorithm for visualization developed by Laurens van der Maaten and Geoffrey Hinton.
It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
It tends to preserve local structure while, at the same time, preserving as much of the global structure as possible.
Stages:
- It constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, while dissimilar points have an extremely small probability of being picked.
- It defines a similar probability distribution over the points in the low-dimensional map, and minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map (see the usage sketch below).
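As referenced above, a minimal usage sketch assuming scikit-learn's TSNE (the dataset and parameter values are illustrative, not prescriptive):

```python
# Minimal t-SNE usage sketch with scikit-learn (illustrative parameters).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)           # 64-dimensional digit images

# perplexity ~ effective number of nearest neighbors (typically 5-50)
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)                     # (n_samples, 2) embedding

print(Y.shape, tsne.kl_divergence_)           # final KL divergence of the fit
```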
Key Ideas:
It solves two big problems that SNE faces:
- The Crowding Problem:
The “crowding problem” that is addressed in the paper is defined as: “the area of the two-dimensional map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints”. This happens when the datapoints are distributed in a region on a high-dimensional manifold around \(i\), and we try to model the pairwise distances from \(i\) to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints will be placed too far away in the two-dimensional map. In SNE, this results in very small attractive forces from datapoint \(i\) to these too-distant map points. The very large number of such forces collapses together the points in the center of the map and prevents gaps from forming between the natural clusters. This phenomenon, the crowding problem, is not specific to SNE and can be observed in other local techniques such as Sammon mapping as well.
- Solution - Student t-distribution for \(q\):
A Student t-distribution (with one degree of freedom) is used to compute the similarities between data points in the low-dimensional space \(q\); its heavy tails allow moderately distant pairs to be modeled by larger distances in the map (see the formulas after this list).
- Optimization Difficulty of KL-div:
The KL divergence over the conditional probabilities is used to measure the error of the low-dimensional representation. The algorithm minimizes this loss by gradient descent, with gradient:
$$\frac{\partial C}{\partial y_{i}}=2 \sum_{j}\left(p_{j | i}-q_{j | i}+p_{i | j}-q_{i | j}\right)\left(y_{i}-y_{j}\right)$$
This gradient involves all the probabilities for points \(i\) and \(j\), and these probabilities are composed of exponentials. The problem is that all of these exponentials in the gradient can explode (or display other unusual behavior) very quickly, so the algorithm takes a long time to converge.
- Solution - Symmetric SNE:
The cost function is a symmetrized version of the one in SNE: it is defined over joint rather than conditional probabilities, with \(p_{ij} = \frac{p_{j\vert i} + p_{i\vert j}}{2n}\), so that \(p_{ij} = p_{ji}\) and \(q_{ij} = q_{ji}\). This yields a simpler, better-behaved gradient.
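For reference, the joint similarities implementing these two fixes, and the resulting gradient, as given in the t-SNE paper:
$$p_{ij} = \frac{p_{j\vert i} + p_{i\vert j}}{2n}, \qquad q_{ij} = \frac{\left(1+\left\|y_{i}-y_{j}\right\|^{2}\right)^{-1}}{\sum_{k \neq l}\left(1+\left\|y_{k}-y_{l}\right\|^{2}\right)^{-1}}$$
$$\frac{\partial C}{\partial y_{i}} = 4 \sum_{j}\left(p_{ij}-q_{ij}\right)\left(y_{i}-y_{j}\right)\left(1+\left\|y_{i}-y_{j}\right\|^{2}\right)^{-1}$$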
Application:
It is often used to visualize high-level representations learned by an artificial neural network.
Motivation:
There are a lot of problems with traditional dimensionality reduction techniques that employ feature projection, e.g. PCA. These techniques attempt to preserve the global structure, and in that process they lose the local structure. Mainly, projecting the data onto one axis or another will most likely not preserve the neighborhood structure of the data, e.g. the clusters in the data:
t-SNE finds a way to project data into a low dimensional space (1-d, in this case) such that the clustering (“local structure”) in the high dimensional space is preserved.
t-SNE Clusters:
While t-SNE plots often seem to display clusters, the visual clusters can be influenced strongly by the chosen parameterization and therefore a good understanding of the parameters for t-SNE is necessary. Such “clusters” can be shown to even appear in non-clustered data, and thus may be false findings.
It has been demonstrated that t-SNE is often able to recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering.
Properties:
- It preserves the neighborhood structure of the data
- Does NOT preserve distances nor density
- Only to some extent preserves nearest-neighbors?
- It learns a non-parametric mapping (discussion), which means that it does NOT learn an explicit function that maps data from the input space to the map
Algorithm:
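To make the stages above concrete, here is a toy NumPy sketch of the core loop; it deliberately simplifies the real algorithm (a fixed Gaussian bandwidth instead of per-point perplexity calibration, and plain gradient descent without momentum or early exaggeration):

```python
# Toy t-SNE sketch in NumPy. Simplifications vs. the real algorithm: a fixed
# Gaussian bandwidth instead of per-point perplexity calibration, and plain
# gradient descent without momentum or early exaggeration.
import numpy as np

def pairwise_sq_dists(Z):
    # squared Euclidean distances between all rows of Z
    s = np.sum(Z ** 2, axis=1)
    return s[:, None] + s[None, :] - 2 * Z @ Z.T

def p_joint(X, sigma=1.0):
    # high-dimensional affinities p_ij (symmetrized conditional Gaussians)
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)            # conditional p_{j|i}
    P = (P + P.T) / (2 * X.shape[0])             # joint p_ij = (p_{j|i} + p_{i|j}) / 2n
    return np.maximum(P, 1e-12)

def tsne(X, n_iter=500, lr=100.0, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(X.shape[0], 2))   # random initial map points
    P = p_joint(X)
    for _ in range(n_iter):
        num = 1.0 / (1.0 + pairwise_sq_dists(Y))       # Student-t kernel
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)         # low-dimensional q_ij
        # gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1
        W = (P - Q) * num
        Y -= lr * 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    return Y
```

Real implementations additionally tune each \(\sigma_i\) by binary search to match a target perplexity, and use early exaggeration plus momentum during optimization.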
Issues/Weaknesses/Drawbacks:
- The paper only focuses on data visualization using t-SNE, that is, embedding high-dimensional data into a two- or three-dimensional space. However, this behavior of t-SNE presented in the paper cannot readily be extrapolated to \(d>3\) dimensions due to the heavy tails of the Student t-distribution.
- It might be less successful when applied to data sets with a high intrinsic dimensionality. This is a result of the local-linearity assumption on the manifold that t-SNE makes by employing Euclidean distance to represent the similarity between the datapoints.
- The cost function is not convex. This leads to the problem that several optimization parameters (hyperparameters) need to be chosen (and tuned), and the constructed solutions, which depend on these parameters, may be different each time t-SNE is run from an initial random configuration of the map points.
- It cannot work “online”: since it learns a non-parametric mapping, it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map; you have to re-run t-SNE on the full dataset.
A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data.
Alternatively, you could also make such a regressor minimize the t-SNE loss directly (parametric t-SNE).
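A minimal sketch of that regression-based workaround, assuming scikit-learn and a k-nearest-neighbors regressor as the (illustrative) mapping model:

```python
# Sketch: approximate out-of-sample t-SNE by regressing map locations onto inputs.
# Assumes scikit-learn; the choice of KNeighborsRegressor is illustrative only.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, _ = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# 1) Embed the training data with t-SNE (non-parametric: no mapping is learned).
Y_train = TSNE(n_components=2, random_state=0).fit_transform(X_train)

# 2) Train a regressor to predict map locations from the input features.
mapper = KNeighborsRegressor(n_neighbors=10).fit(X_train, Y_train)

# 3) "Embed" unseen points by regression instead of re-running t-SNE.
Y_test = mapper.predict(X_test)
print(Y_test.shape)
```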
t-SNE Optimization:
Discussion and Information:
- What is perplexity?
Perplexity is a measure of information defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with \(k\) sides is equal to \(k\). In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors \(k\) that is employed in many manifold learners.
- Choosing the perplexity hyperparameter:
The performance of t-SNE is fairly robust under different settings of the perplexity. The most appropriate value depends on the density of your data. Loosely speaking, one could say that a larger / denser dataset requires a larger perplexity. Typical values for the perplexity range between \(5\) and \(50\).
- Every time I run t-SNE, I get a (slightly) different result?
In contrast to, e.g., PCA, t-SNE has a non-convex objective function. The objective function is minimized using a gradient descent optimization that is initiated randomly. As a result, it is possible that different runs give you different solutions. Notice that it is perfectly fine to run t-SNE a number of times (with the same data and parameters), and to select the visualization with the lowest value of the objective function as your final visualization.
- Assessing the “Quality of Embeddings/Visualizations”:
Preferably, just look at them! Notice that t-SNE does not retain distances but probabilities, so measuring some error between the Euclidean distances in high-D and low-D is useless. However, if you use the same data and perplexity, you can compare the Kullback-Leibler divergences that t-SNE reports. It is perfectly fine to run t-SNE ten times, and select the solution with the lowest KL divergence.
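As a small illustration of that last point, assuming scikit-learn's TSNE (which reports the final KL divergence as kl_divergence_):

```python
# Sketch: run t-SNE several times with the same data/perplexity and keep the
# run with the lowest reported KL divergence (assumes scikit-learn's TSNE).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

runs = []
for seed in range(5):
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
    Y = tsne.fit_transform(X)
    runs.append((tsne.kl_divergence_, Y))

best_kl, best_Y = min(runs, key=lambda r: r[0])
print(f"best KL divergence over 5 runs: {best_kl:.3f}")
```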
Feature Selection
Feature Selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Applications:
- Simplification of models to make them easier to interpret by researchers/users
- Shorter training time
- A way to handle the curse of dimensionality
- Reduction of Variance \(\rightarrow\) Reduce Overfitting \(\rightarrow\) Enhanced Generalization
Strategies/Approaches:
- Wrapper Strategy:
Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model.
e.g. Search Guided by Accuracy, Stepwise Selection
- Filter Strategy:
Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set.
Filter methods produce a feature set which is not tuned to a specific model, usually giving lower prediction performance than a wrapper, but are more general and more useful for exposing the relationships between features.
e.g. Information Gain, pointwise-mutual/mutual information, Pearson Correlation
- Embedded Strategy:
Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process.
e.g. LASSO
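A minimal sketch of one method from each strategy, assuming scikit-learn (the estimators and dataset are illustrative choices):

```python
# Sketch: filter, wrapper, and embedded feature selection with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature by mutual information with the target, keep the top 10.
filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination guided by a predictive model.
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: l1-penalized model; features with non-zero coefficients are selected.
embed = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
).fit(X, y)

for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", embed)]:
    print(name, sel.get_support().sum(), "features kept")
```

For a regression target, LASSO itself can be plugged into SelectFromModel in the same way.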
Correlation Feature Selection
The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis:
“Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other”.
The following equation gives the merit of a feature subset \(S\) consisting of \(k\) features:
$$\mathrm{Merit}_{S_{k}} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$
where \(\overline{r_{cf}}\) is the average value of all feature-classification correlations, and \(\overline{r_{ff}}\) is the average value of all feature-feature correlations.
The CFS criterion is defined as follows:
$$\mathrm{CFS} = \max_{S_{k}} \left[ \frac{r_{cf_{1}} + r_{cf_{2}} + \cdots + r_{cf_{k}}}{\sqrt{k + 2\left(r_{f_{1}f_{2}} + \cdots + r_{f_{i}f_{j}} + \cdots + r_{f_{k}f_{1}}\right)}} \right]$$
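A small sketch of this merit computation in NumPy, using Pearson correlations; the function name and the use of absolute correlations are illustrative assumptions:

```python
# Sketch: CFS merit of a feature subset from average feature-class and
# feature-feature correlations (function name and inputs are illustrative).
import numpy as np

def cfs_merit(X_subset: np.ndarray, y: np.ndarray) -> float:
    """Merit_S = k * mean|r_cf| / sqrt(k + k*(k-1) * mean|r_ff|)."""
    k = X_subset.shape[1]
    # average absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, i], y)[0, 1]) for i in range(k)])
    # average absolute feature-feature correlation (off-diagonal entries only)
    corr = np.abs(np.corrcoef(X_subset, rowvar=False))
    r_ff = (corr.sum() - k) / (k * (k - 1)) if k > 1 else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```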
Feature Selection Embedded in Learning Algorithms
- \(l_{1}\)-regularization techniques, such as sparse regression, LASSO, and \(l_{1}\)-SVM
- Regularized trees, e.g. regularized random forest implemented in the RRF package
- Decision tree
- Memetic algorithm
- Random multinomial logit (RMNL)
- Auto-encoding networks with a bottleneck-layer
- Submodular feature selection
Information Theory Based Feature Selection Mechanisms
There are different Feature Selection mechanisms around that utilize mutual information for scoring the different features.
They all usually use the same algorithm:
1. Calculate the mutual information as the score between all features (\(f_{i}\in F\)) and the target class (\(c\))
2. Select the feature with the largest score (e.g. \(\arg\max_{f_{i}\in F}(I(f_{i},c))\)) and add it to the set of selected features (\(S\))
3. Calculate the score, which might be derived from the mutual information
4. Select the feature with the largest score and add it to the set of selected features (e.g. \({\arg\max}_{f_{i}\in F}(I_{derived}(f_{i},c))\))
5. Repeat 3. and 4. until a certain number of features is selected (e.g. \(\vert S\vert = l\))
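The “derived” score in steps 3-4 is left open here; as an illustrative assumption, the sketch below uses an mRMR-style score (relevance minus average redundancy with already-selected features), computed with scikit-learn's mutual information estimators:

```python
# Sketch of the greedy scheme above. The "derived score" is chosen here, as an
# illustrative assumption, to be relevance minus average redundancy (mRMR-style).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n_features, n_select = X.shape[1], 10                    # |S| = l = 10

relevance = mutual_info_classif(X, y, random_state=0)    # step 1: I(f_i; c) for all f_i
first = int(np.argmax(relevance))                        # step 2: most relevant feature
selected, remaining = [first], set(range(n_features)) - {first}

while len(selected) < n_select:                          # steps 3-5
    scores = {}
    for i in remaining:
        # redundancy: average MI between candidate f_i and already-selected features
        redundancy = np.mean([
            mutual_info_regression(X[:, [j]], X[:, i], random_state=0)[0]
            for j in selected
        ])
        scores[i] = relevance[i] - redundancy            # derived score I_derived(f_i, c)
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```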
Feature Extraction
Feature Extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations.
In dimensionality reduction, feature extraction is also called Feature Projection, which is a method that transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.
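For instance, a minimal linear projection with scikit-learn's PCA (the dataset and number of components are illustrative):

```python
# Sketch: linear feature projection with PCA (number of components is illustrative).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 64 original features

pca = PCA(n_components=10).fit(X)
X_proj = pca.transform(X)                      # data projected onto 10 principal components

print(X_proj.shape, pca.explained_variance_ratio_.sum())
```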
Methods/Algorithms:
- Independent component analysis
- Isomap
- Kernel PCA
- Latent semantic analysis
- Partial least squares
- Principal component analysis
- Autoencoder
- Linear Discriminant Analysis (LDA)
- Non-negative matrix factorization (NMF)
Data Imputation
Resources:
Outliers
Replacing Outliers
Data Transformation - Outliers - Standardization
PreProcessing in DL - Data Normalization
Imputation and Feature Scaling
Missing Data - Imputation
Dim-Red - Random Projections
F-Selection - Relief
Box-Cox Transf - outliers
ANCOVA
Feature Selection Methods