Ensemble Learning

  1. Ensemble Learning:
    In machine learning, ensemble learning uses multiple learning algorithms (an ensemble of models) to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

  2. Ensemble Theory:
    An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.

    Empirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine.
    Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).
    Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb down the models in order to promote diversity.

    • In any network, the bias can be reduced at the cost of increased variance
    • In a group of networks, the variance can be reduced at no cost to bias
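
    A short derivation of the second point (a sketch assuming \(m\) networks with identical bias and uncorrelated errors of equal variance \(\sigma^2\)): averaging the networks' outputs gives

    \[
    \operatorname{Var}\left(\frac{1}{m}\sum_{i=1}^{m} f_i(x)\right) = \frac{1}{m^2}\sum_{i=1}^{m}\operatorname{Var}\big(f_i(x)\big) = \frac{\sigma^2}{m},
    \]

    while the bias of the average equals the (common) bias of the individual networks, so the variance shrinks at no cost to bias.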
  3. Types of Ensembles:
    • Bayes optimal classifier (Theoretical)
    • Bootstrap aggregating (bagging)
    • Boosting
    • Bayesian parameter averaging
    • Bayesian model combination
    • Bucket of models
    • Stacking
  4. Applications:
    • Remote sensing
      • Land cover mapping
      • Change detection
    • Computer security
      • Distributed denial of service
      • Malware detection
      • Intrusion detection
    • Face recognition
    • Emotion recognition
    • Fraud detection
    • Financial decision-making
    • Medicine
  5. Ensemble Size:
    Determining the right number of component models in an ensemble is an important problem that has not been well studied or addressed.

    More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer classifiers than this number deteriorates accuracy; this is called “the law of diminishing returns in ensemble construction”. The framework shows that using the same number of independent component classifiers as there are class labels gives the highest accuracy.

  6. Notes:
    • Averaging models increases capacity.

    Ensemble Averaging:
    Relies on the observation above: each individual network's bias can only be reduced at the cost of increased variance, but averaging a group of networks reduces the variance at no cost to bias.


Bayes optimal classifier (Theoretical)

  1. Bayes Optimal Classifier:
    Bayes Optimal Classifier
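
    A standard formulation (using conventional notation not defined in these notes: \(C\) is the set of classes, \(H\) the hypothesis space, and \(T\) the training data) is

    \[
    y = \underset{c_j \in C}{\operatorname{argmax}} \; \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i).
    \]

    It averages over every hypothesis weighted by its posterior given the training data; on average no other ensemble can outperform it, but summing over the full hypothesis space is rarely tractable in practice.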


Bootstrap Aggregating (Bagging)

  1. Bootstrap Aggregating (Bagging):
    Bootstrap Aggregating (Bagging) is an ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms by reducing variance and helping to avoid overfitting.

    It is applicable to both classification and regression problems.

    Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

  2. Bootstrapping:
    Bootstrapping is a sampling technique. From a set \(D\) of \(n\) sample points, it constructs \(m\) subsets \(D_i\), each of size \(n'\), by sampling from \(D\) uniformly and with replacement.
    • By sampling with replacement, some observations may be repeated in each \({\displaystyle D_{i}}\).
    • If \(n'=n\), then for large \(n\) the set \(D_{i}\) is expected to have the fraction (\(1 - 1/e\)) (\(\approx 63.2\%\)) of the unique examples of \(D\), the rest being duplicates.

    The point of sampling with replacement is to keep the draws independent: if the sampling were done without replacement, each draw would depend on the previous ones, so the resulting subsets would not be independent random samples.
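
    A minimal sketch of one bootstrap draw with NumPy (the dataset size and seed are illustrative). It also checks the \(1 - 1/e\) figure empirically: each point is missed by a single draw with probability \(1 - 1/n\), hence missed by all \(n\) draws with probability \((1 - 1/n)^n \to e^{-1}\).

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    n = 10_000                      # size of the original dataset D
    D = np.arange(n)                # indices stand in for the sample points

    # Draw one bootstrap sample D_i of size n' = n: uniform sampling with replacement.
    D_i = rng.choice(D, size=n, replace=True)

    # Fraction of unique points of D that appear in D_i; expected to approach 1 - 1/e.
    unique_fraction = np.unique(D_i).size / n
    print(f"unique fraction: {unique_fraction:.3f}  (1 - 1/e = {1 - np.exp(-1):.3f})")
    ```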

  3. Aggregating:
    The predictions from the above models are aggregated to make a final combined prediction. The aggregation can be based either on the predicted classes or on the predicted probabilities of the bootstrapped individual models.

  4. The Algorithm:
    Bagging uses multiple weak models and aggregates their predictions to get the final prediction. Ideally, each weak model specializes in a particular part of the feature space, so that the ensemble can make the most of every model's predictions. As suggested by the name, the algorithm consists of two parts: bootstrapping and aggregation.

    • Given a set \(D\) of \(n\) sample points,
    • Bootstrapping: Construct \(m\) bootstrap samples (subsets) \(D_i\).
    • Fit \(m\) models using the \(m\) bootstrap samples
    • Aggregating: Combine the models by:
      • Regression: Averaging
      • Classification: Voting

    Bagging and Random Forests (Shewchuk)
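
    A minimal sketch of the two steps, bootstrapping then aggregating by majority vote, using scikit-learn decision trees as the constituent models; the dataset, the seed, and the choice of \(m\) are illustrative (scikit-learn's `BaggingClassifier` packages the same idea).

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    m = 25                                     # number of bootstrap samples / models
    models = []
    for _ in range(m):
        # Bootstrapping: sample n points from D uniformly with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # Aggregating (classification): majority vote over the m models' predictions.
    votes = np.stack([model.predict(X) for model in models])      # shape (m, n)
    y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    print("training accuracy of the bagged vote:", (y_hat == y).mean())
    ```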

  5. Advantages:
    • Improves “unstable” procedures
    • Reduces variance \(\rightarrow\) helps avoid overfitting
    • Ensemble models can capture both linear and non-linear relationships in the data. This can be accomplished by using two different models and forming an ensemble of the two.

  6. Disadvantages:
    • On the other hand, it can mildly degrade the performance of “stable” methods such as k-NN
    • It reduces the interpretability of the model
    • Prone to high bias if not modeled properly
    • Though it improves accuracy, it is computationally expensive and hard to design:
      It is not well suited for real-time applications.

  7. Examples (bagging algorithms):
    • Random Forests: a bagging algorithm that further reduces variance by also randomizing over the features (a usage sketch follows this list):
      1. Suppose there are N observations and M features. A sample of observations is selected randomly with replacement (bootstrapping).
      2. A random subset of the features is selected, and a model is built using the sampled observations and the feature subset.
      3. At each split, the feature from the subset that gives the best split on the training data is chosen.
      4. This is repeated to create many models, each trained in parallel. The final prediction is obtained by aggregating the predictions from all the models.
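
    A short usage sketch with scikit-learn's `RandomForestClassifier`; the dataset and hyperparameter values are illustrative. `n_estimators` is the number of bootstrapped trees and `max_features` controls the size of the random feature subset considered at each split.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Bootstrapped trees + a random subset of features considered at each split.
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
    rf.fit(X_tr, y_tr)
    print("test accuracy:", rf.score(X_te, y_te))
    ```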

Boosting

  1. Boosting:
    Boosting is an ensemble meta-algorithm for primarily reducing bias, but also variance in supervised learning. It belongs to a family of machine learning algorithms that convert weak learners to strong ones.

    It is an iterative technique which adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa. Thus, future weak learners focus more on the examples that previous weak learners misclassified.

    Boosting generally decreases the bias error and builds strong predictive models; however, boosted models may sometimes overfit the training data.

    Boosting increases the capacity.

    Summary
    Boosting: create different hypotheses \(h_i\) sequentially, making each new hypothesis decorrelated with the previous ones.

    • Assumes that the hypotheses will be combined/ensembled
    • Ensures that each new model/hypothesis gives a different/independent output

  2. Motivation - “The Hypothesis Boosting Problem”:
    Boosting is based on a question posed by Kearns and Valiant (1988):

    “Can a set of weak learners create a single strong learner?”:

    This question was formalized as “the hypothesis boosting problem”.

    The Hypothesis Boosting Problem:
    Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm […] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner].

    • A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing).
    • A strong learner is a classifier that is arbitrarily well-correlated with the true classification.

    Countering Bagging's Limitations:
    Bagging suffers from a limitation: the models can be dependent/correlated, which can cause the vote to get trapped in the wrong hypothesis shared by the weak learners. This motivated the intuition behind boosting:

    • Instead of training models in parallel, train them sequentially, and
    • Make each model focus on where the previous classifier performed poorly

  3. Boosting Theory and Convexity:
    Only algorithms that are provable boosting algorithms in the probably approximately correct (PAC) learning formulation can accurately be called boosting algorithms. Other algorithms that are similar in spirit to boosting algorithms are sometimes called “leveraging algorithms”, although they are also sometimes incorrectly called boosting algorithms.

    Convexity:
    Boosting algorithms can be based on convex or non-convex optimization algorithms:

    • Convex Algorithms:
      such as AdaBoost and LogitBoost, can be “defeated” by random noise such that they can’t learn basic and learnable combinations of weak hypotheses.
      This limitation was pointed out by Long & Servedio in 2008.
    • Non-Convex Algorithms:
      such as BrownBoost, have been shown to be able to learn from noisy datasets and can specifically learn the underlying classifier of the “Long–Servedio dataset”.

  4. The Boosting Meta-Algorithm:
    • Finding (defining) Weak Learners:
      The algorithm defines weak learners as those that have weak rules (rules that are not powerful enough for accurate classification)
    • Identifying Weak Rules:
      • To find a weak rule, we apply a base learning (ML) algorithm with a different distribution over the training data. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process; after many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
    • Choosing different distribution for each round:
      1. The base learner starts by assigning equal weight (attention) to every observation.
      2. If the first base learner makes prediction errors, we pay higher attention to the observations it got wrong, and then apply the next base learning algorithm.
      3. Iterate Step 2 until the limit on the number of base learners is reached or the desired accuracy is achieved.
    • Aggregating Outputs:
      Finally, the algorithm combines the outputs of the weak learners into a single strong learner, which improves the predictive power of the model. Boosting pays more attention to examples that are misclassified or have higher errors under the preceding weak rules. A simplified sketch of this loop follows.
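
    A deliberately simplified sketch of the reweighting loop (not a faithful AdaBoost: the weight-update rule and the final combination are stripped down, and the dataset is illustrative). Each round fits a decision stump under the current distribution via `sample_weight`, then increases the weights of the misclassified observations.

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    n = len(X)

    w = np.full(n, 1.0 / n)          # start with equal weight on every observation
    rounds, learners = 10, []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)      # a weak learner
        stump.fit(X, y, sample_weight=w)                 # fit under the current distribution
        miss = stump.predict(X) != y
        # Pay more attention to the observations the current learner got wrong.
        w[miss] *= 2.0
        w /= w.sum()                                     # renormalise to a distribution
        learners.append(stump)

    # Aggregate: here a plain majority vote over the sequentially trained stumps.
    votes = np.stack([s.predict(X) for s in learners])
    y_hat = (votes.mean(axis=0) >= 0.5).astype(int)
    print("training accuracy:", (y_hat == y).mean())
    ```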

  5. Boosting Algorithms:
    • AdaBoost (Adaptive Boosting)
    • Gradient Tree Boosting
    • XGBoost
  6. The AdaBoost Algorithm - Adaptive Boosting:
    AdaBoost works along the lines of the meta-algorithm discussed above. It fits a sequence of weak learners on differently weighted versions of the training data. It starts by training on the original dataset, giving equal weight to each observation. If an observation is predicted incorrectly by the first learner, it is given a higher weight. Being an iterative process, it continues to add learners until a limit on the number of models or on the accuracy is reached.

    Mostly, we use decision stumps with AdaBoost, but any machine learning algorithm can serve as the base learner as long as it accepts weights on the training data. AdaBoost can be used for both classification and regression problems.
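
    A short usage sketch with scikit-learn's `AdaBoostClassifier`, using decision stumps (`max_depth=1`) as the base learner; the dataset and hyperparameters are illustrative (recent scikit-learn versions take `estimator=`, older ones `base_estimator=`).

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Sequentially fitted stumps; later stumps focus on previously misclassified points.
    ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=100, random_state=0)
    ada.fit(X_tr, y_tr)
    print("test accuracy:", ada.score(X_te, y_te))
    ```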

    Notes:

    • The order of the trees matters in AdaBoost

  7. Advantages:
  8. Disadvantages:
    • Reduced interpretability
    • Harder to tune than other models, because you have so many hyperparameters and you can easily overfit
    • Computationally expensive for training (sequential) and inference

  9. Bagging VS Boosting:
    In short: bagging trains models in parallel on bootstrap samples and combines them by averaging/voting, primarily to reduce variance; boosting trains models sequentially, reweighting misclassified examples, primarily to reduce bias.

Stacking

  1. Asynchronous: