Decision Trees
-
Decision Trees:
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements. The trees have two types of nodes:
- Internal Nodes: test the values of features (usually just \(1\) feature) and branch accordingly
- Leaf Nodes: specify class \(h(\mathbf{x})\)
-
CART (Classification And Regression Trees) Learning:
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). “Nonlinear method for classification and regression.” - Shewchuk
Classification Trees:
Tree models where the target variable can take a discrete set of values are called classification trees. In these tree structures:
- Leaves represent class labels, and
- Branches represent conjunctions of features that lead to those class labels.
Regression Trees:
Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
- Properties:
- Non-Linear
- Used for both classification and regression
- Cuts \(X\) space into rectangular cells
- Works well with both quantitative and categorical features
- Interpretable results (inference)
- Decision boundary can be arbitrarily complicated (increase number of nodes to split on)
- Linear Classifiers VS Decision Trees (\(x\) axis) | Linear VS Non-Linear data (\(y\) axis)
- Classification Learning Algorithm:
Greedy, top-down learning heuristic.
Let \(S \subseteq\{1,2, \ldots, n\}\) be the set of sample point indices.
Top-level call: \(S=\{1,2, \ldots, n\}\).
Algorithm:
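A minimal Python sketch of this greedy, top-down procedure (hypothetical names, not from the lecture; entropy is used as the cost \(J(S)\), which is defined further below):

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    feature: Optional[int] = None      # internal node: index of the feature tested
    threshold: Optional[float] = None  # internal node: split threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None        # leaf node: predicted class h(x)

def entropy(y):
    """H(S) = -sum_C p_C log2 p_C over the labels y of a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def grow_tree(X, y, S):
    """Greedy, top-down CART-style learner on the index set S (a numpy integer array)."""
    if len(np.unique(y[S])) == 1:                    # pure node -> leaf
        return Node(label=int(y[S][0]))
    best = None                                      # (weighted cost, feature, threshold, S_l, S_r)
    for j in range(X.shape[1]):                      # (*) try every feature ...
        for t in np.unique(X[S, j])[:-1]:            # ... and every candidate threshold
            mask = X[S, j] <= t
            Sl, Sr = S[mask], S[~mask]
            cost = (len(Sl) * entropy(y[Sl]) + len(Sr) * entropy(y[Sr])) / len(S)
            if best is None or cost < best[0]:
                best = (cost, j, t, Sl, Sr)
    if best is None:                                 # no usable split -> majority-vote leaf
        return Node(label=int(np.bincount(y[S]).argmax()))
    _, j, t, Sl, Sr = best
    return Node(feature=j, threshold=float(t),
                left=grow_tree(X, y, Sl), right=grow_tree(X, y, Sr))
```

The top-level call mirrors the notes, \(S=\{1,2,\ldots,n\}\): `grow_tree(X, y, np.arange(len(y)))`. The step marked `(*)` is the split-selection question discussed next.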
(*) How to choose the best split?
- Try all splits
- For a set \(S\), let \(J(S)\) be the cost of \(S\)
- Choose the split that minimizes \(J(S_l) + J(S_r)\); or,
- The split that minimizes the weighted average:
$$\dfrac{\left\vert S_{l}\right\vert J\left(S_{l}\right)+\left\vert S_{r}\right\vert J\left(S_{r}\right)}{\left\vert S_{l}\right\vert +\left\vert S_{r}\right\vert }$$
-
Choosing the Split - Further Discussion:
- For binary feature \(x_{i} :\) children are \(x_{i}=0 \:\: \& \:\: x_{i}=1\)
- If \(x_{i}\) has \(3+\) discrete values: split depends on application.
- Sometimes it makes sense to use multiway splits; sometimes binary splits.
- If \(x_{i}\) is quantitative: sort the \(x_{i}\) values in \(S\); try splitting between each pair of unequal consecutive values.
- We can radix sort the points in linear time, and if \(n\) is huge we should.
- Efficient Splitting (clever bit): as you scan the sorted list from left to right, you can update the entropy in \(\mathcal{O}(1)\) time per point (see the sketch after this list)!
- This is important for obtaining a fast tree-building time.
- How to update # \(X\)s and \(O\)s Illustration (189)
- Further Discussion (189)
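A minimal sketch of that scan (hypothetical helper, not from the notes), for a single quantitative feature `x` with labels `y`. Rewriting \(H(S) = \log_2 |S| - \frac{1}{|S|}\sum_{\mathrm{C}} n_{\mathrm{C}} \log_2 n_{\mathrm{C}}\) (with \(n_{\mathrm{C}}\) the class counts) means that moving one point from the right child to the left child touches only one class count per child, so each update is \(\mathcal{O}(1)\):

```python
import numpy as np
from collections import Counter

def log2_or_zero(k):
    return np.log2(k) if k > 0 else 0.0

def best_threshold(x, y):
    """Find the threshold of one quantitative feature minimizing weighted average entropy."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)

    right = Counter(y)                    # class counts in the right child (initially everything)
    left = Counter()                      # class counts in the left child (initially empty)
    t_left = 0.0                          # sum_C n_C log2 n_C for the left child
    t_right = sum(k * log2_or_zero(k) for k in right.values())

    best = (np.inf, None)                 # (weighted average entropy, threshold)
    for i in range(n - 1):                # move point i from the right child to the left child
        c = y[i]
        t_left += (left[c] + 1) * log2_or_zero(left[c] + 1) - left[c] * log2_or_zero(left[c])
        t_right += (right[c] - 1) * log2_or_zero(right[c] - 1) - right[c] * log2_or_zero(right[c])
        left[c] += 1
        right[c] -= 1
        if x[i] == x[i + 1]:              # only split between unequal consecutive values
            continue
        nl, nr = i + 1, n - i - 1
        h_left = log2_or_zero(nl) - t_left / nl
        h_right = log2_or_zero(nr) - t_right / nr
        h_after = (nl * h_left + nr * h_right) / n
        if h_after < best[0]:
            best = (h_after, (x[i] + x[i + 1]) / 2)
    return best
```

Sorting dominates, so searching all \(d\) features this way costs \(\mathcal{O}(dn \log n)\), matching the note at the end of this section.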
How to choose the cost \(J(S)\):
We can accomplish this by measuring the entropy (from information theory); a short numeric sketch of these quantities follows the list below.
Let \(Y\) be a random class variable, and suppose \(P(Y=C) = p_C\).
- The Self-Information (“surprise”) of \(Y\) being class \(C\) (non-negative) is:
$$- \log_2 p_C$$
- Event w/ probability \(= 1\) gives us zero surprise
- Event w/ probability \(= 0\) gives us infinite surprise
- The Entropy (“average surprisal”) of an index set \(S\):
$$H(S)=-\sum_{\mathbf{C}} p_{\mathbf{C}} \log _{2} p_{\mathbf{C}}, \quad \text { where } p_{\mathbf{C}}=\dfrac{\left|\left\{i \in S : y_{i}=\mathbf{C}\right\}\right|}{|S|}$$
The proportion of points in \(S\) that are in class \(C\).
- If all points in \(S\) belong to the same class? \(H(S)=-1 \log_{2} 1=0\)
- Half class \(C\), half class \(D\)? \(H(S)=-0.5 \log_{2} 0.5-0.5 \log_{2} 0.5=1\)
- \(n\) points, all different classes? \(H(S)=-\log_{2} \dfrac{1}{n}=\log_{2} n\)
- Weighted avg entropy after split is:
$$H_{\text {after }}=\dfrac{\left|S_{l}\right| H\left(S_{l}\right)+\left|S_{r}\right| H\left(S_{r}\right)}{\left|S_{l}\right|+\left|S_{r}\right|}$$
- Choose the split that Maximizes Information Gain:
$$\text{Info-Gain} = H(S) - H_{\text{after}}$$
\(\iff\) Minimizing \(H_{\text{after}}\).
- Info gain is always positive, except when one child is empty or when
$$\forall \mathrm{C}, \quad P\left(y_{i}=\mathrm{C} \mid i \in S_{l}\right)=P\left(y_{i}=\mathrm{C} \mid i \in S_{r}\right)$$
(in which case the gain is zero).
- [Choosing Cost - Bad Idea (189)]
- [Choosing Cost - Good Idea (189)]
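A short numeric sketch of these quantities (illustrative labels, not from the notes):

```python
import numpy as np

def entropy(y):
    """H(S) = -sum_C p_C log2 p_C, with p_C the fraction of S in class C."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent node: 6 points of class 0 and 2 points of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_left, y_right = y[:4], y[4:]          # a candidate split: {0,0,0,0} | {0,0,1,1}

h_before = entropy(y)                   # ~0.811
h_after = (len(y_left) * entropy(y_left) + len(y_right) * entropy(y_right)) / len(y)  # = 0.5
info_gain = h_before - h_after          # ~0.311 > 0: the split reduces average surprise

print(h_before, h_after, info_gain)
```

The gain is positive here because the two children have different class proportions.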
-
Algorithms and their Computational Complexity:
Algorithms and Complexity (189)
Test Point:
- Algorithm: walk down the tree until a leaf is reached; return its label (see the sketch after this list).
- Run Time:
- Worst-case Time: is \(\mathcal{O}(\text{(tree) depth})\).
- For binary features, that’s \(\leq d\).
- For quantitative features, the tree may go deeper.
- Usually (not always) \(\leq \mathcal{O}(\log n)\)
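A minimal sketch of this walk, assuming the hypothetical `Node` container from the `grow_tree` sketch above (internal nodes carry `feature`/`threshold`, leaves carry `label`):

```python
def predict(node, x):
    """Walk down the tree until a leaf is reached; return its label h(x).

    Runtime is O(depth): one feature comparison per internal node on the path.
    """
    while node.label is None:              # internal node: test one feature
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label                      # leaf node
```

For example, `predict(tree, X_test[0])` for a tree returned by `grow_tree`.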
Training:
- Algorithm:
- For Binary Features: try \(\mathcal{O}(d)\) splits at each node.
- For Quantitative Features: try \(\mathcal{O}(n'd)\) splits, where \(n' = \#\) of points in the node
- Run Time:
- Splits/Per-Node Time: is \(\mathcal{O}(d)\) for both binary and quantitative.
Quantitative features are asymptotically just as fast as binary features because of our clever way of computing the entropy for each split.
- Points/Per-Node Amount (number): is \(\mathcal{O}(n)\) points per node
- Nodes/per-point Amount (number): is \(\mathcal{O}(\text{depth})\) nodes per point (i.e. each point participates in \(\mathcal{O}(\text{depth})\) nodes)
- \(\implies\) Worst-case Training Time:
$$\mathcal{O}(d) \times \mathcal{O}(n) \times \mathcal{O}(\text{depth}) \leq \mathcal{O}(nd \cdot \text{depth})$$
-
Multivariate Splits:
Multivariate Splits (189)
-
Regression Tree Learning:
Regression Trees (189)
-
Early Stopping:
Early Stopping (189)
- Pruning:
Pruning (189)
Notes:
- Number of candidate splits in a Decision Tree: \(\mathcal{O}(dn)\) (up to \(n\) thresholds for each of \(d\) features)
- Complexity of finding the split:
- Naive: \(\mathcal{O}(dn^2)\)
- Fast (sort): \(\mathcal{O}(dn \log n)\)
Random Forests
-
Ensemble Learning:
Ensemble Learning (189)
-
Bagging - Bootstrap AGGregating:
Bagging (189)
-
Random Forests: