[MLE] Decision Trees
Overview
- Decision Trees
- Linear vs non-linear classifiers
- Entropy
- C4.5
- Random Forests
- Some fundamental concepts
The process of making a prediction for a new input x can be described as a sequential decision-making process corresponding to the traversal of a binary tree (one that splits into two branches at each node). Here we focus on a particular tree-based framework called classification and regression trees (CART).
Basic Decision Trees
Decision trees apply a sequence of linear decisions that often depend on only a single variable at a time. Such trees partition the input space into cuboid regions, gradually refining the level of detail of the decision until a leaf node is reached, which provides the final predicted label.
As the above figure indicates, classification with a tree follows these rules:
- Start at the root node
- Evaluate the relevant attribute of that node
- Follow the branch corresponding to the attribute's value to a descendant node
- The descendant node can then be considered the root node of a subtree
- Repeat until a leaf node is reached, then assign the associated label
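The traversal rules above can be sketched in a few lines of Python. The tree representation here is hypothetical (dicts with `attr`, `thresh`, `left`, `right` for internal nodes and `label` for leaves), chosen purely for illustration:

```python
def predict(node, x):
    """Traverse the tree from the root until a leaf is reached."""
    while "label" not in node:            # internal node: keep descending
        if x[node["attr"]] < node["thresh"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]                  # leaf: assign its class label

# A tiny two-level tree: split on attribute 0, then on attribute 1.
tree = {
    "attr": 0, "thresh": 0.5,
    "left": {"label": "A"},
    "right": {"attr": 1, "thresh": 2.0,
              "left": {"label": "B"},
              "right": {"label": "C"}},
}

print(predict(tree, [0.2, 9.9]))  # attribute 0 < 0.5 -> "A"
print(predict(tree, [0.9, 3.0]))  # right branch, attribute 1 >= 2.0 -> "C"
```

Each step evaluates one attribute and follows one branch, so prediction cost grows with the depth of the tree, not the size of the training set.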
Learning trees
- Given a set of data D = {X,y}, we want to find a tree structure and decision criteria for each node n
- Each node splits D into two (or more) disjoint subsets \(D_{l,i}\)
- If all labels \(y_{l,i}\) in a subset belong to the same class \(c_j \in C\), the subset is said to be pure; this signifies a leaf node and terminates (this part of) the tree generation procedure
- Otherwise the node will be used to split the data further
- It is not always desirable to reach pure subsets, so termination strategies need to be a bit more sophisticated
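A minimal sketch of one such splitting step, assuming single-variable threshold tests and entropy as the impurity measure (introduced below); `best_split` and the tiny dataset are illustrative names, not part of any library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def is_pure(labels):
    """A subset is pure when all labels belong to the same class."""
    return len(set(labels)) == 1

def best_split(X, y, attr):
    """Try midpoints between sorted values of one attribute; return the
    (weighted impurity, threshold) pair minimising impurity of the subsets."""
    values = sorted({x[attr] for x in X})
    best = (float("inf"), None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(X, y) if xi[attr] < t]
        right = [yi for xi, yi in zip(X, y) if xi[attr] >= t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if score < best[0]:
            best = (score, t)
    return best

X = [[1.0], [2.0], [3.0], [4.0]]
y = ["A", "A", "B", "B"]
score, thresh = best_split(X, y, 0)
print(thresh)  # 2.5 -> both subsets are pure, weighted entropy 0
```

In a full learner this step would be repeated recursively on each impure subset until a termination criterion fires.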
CART
- CART = Classification And Regression Trees
- Generic framework for designing decision tree algorithms
- Six general questions guide the design of the algorithm (not listed here)
- We prefer simple, compact trees, following Occam’s Razor
- To do so, we will seek to minimise impurity of data reaching descendent nodes
There are several ways to calculate impurity:
Entropy Impurity
Say node N has 10 examples of class C1, and 30 examples of class C2. Calculate the entropy impurity:
\(i(N) = -\left(\frac{10}{40}\log_2\frac{10}{40} + \frac{30}{40}\log_2\frac{30}{40}\right) = 0.8113\)
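The worked example above can be checked directly:

```python
from math import log2

# Entropy impurity for a node with 10 examples of C1 and 30 of C2.
p = [10 / 40, 30 / 40]
i_N = -sum(pj * log2(pj) for pj in p)
print(round(i_N, 4))  # 0.8113
```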
Gini/Variance Impurity
- For the two-class case the variance impurity is \(i(N) = P(C_1)P(C_2)\); the Gini impurity generalises this to \(i(N) = 1 - \sum_j P(C_j)^2\)
Misclassification impurity
- \(i(N) = 1 - \max_j P(C_j)\), the minimum probability of misclassifying an example at node N
The graph of three impurity measures
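For comparison with the entropy example, the Gini and misclassification impurities of the same node (10 examples of C1, 30 of C2) can be computed as:

```python
# Gini and misclassification impurity for a node with
# P(C1) = 10/40 and P(C2) = 30/40.
p = [10 / 40, 30 / 40]

gini = 1 - sum(pj ** 2 for pj in p)   # 1 - (0.0625 + 0.5625)
misclass = 1 - max(p)                 # 1 - 0.75
print(gini, misclass)  # 0.375 0.25
```

All three measures peak when the classes are equally likely and vanish for a pure node; they differ mainly in how sharply they penalise mixed nodes.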
Twoing Criterion
- Strategy for dealing with large multi-class problems
- Split data for all classes C at node N into groups with subsets of classes \(C_1\) and \(C_2\), where \(C_2 = C - C_1\)
- Then proceed to treat the problem as binary classification with \(\widehat{C}_1 = C_1\) and \(\widehat{C}_2 = C_2\)
- Repeat with different assignments of classes to \(C_1\) until the grouping with the highest information gain is found
- Once a node is determined to be a leaf node, it has to be assigned a category label \(C_l\)
- This is done by simply looking at the majority category of all examples in this leaf node \(N_l\)
- \(C_l = \arg\max_j P(C_j)\)
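The majority-vote rule for labelling a leaf is a one-liner; `leaf_label` is an illustrative name:

```python
from collections import Counter

def leaf_label(labels):
    """Assign the majority class of the examples reaching a leaf."""
    return Counter(labels).most_common(1)[0][0]

print(leaf_label(["C2", "C1", "C2", "C2", "C1"]))  # C2
```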
Overfitting
A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data.
Overfitting presents itself as complicated decision boundaries that depend strongly on individual training examples.
- In the figure, the first decision boundary fits the training data more closely, but it is overfitting
Stopping Criteria
Reaching a node with a pure sample is always possible, but usually not desirable, as it tends to cause over-fitting.
Three common ways to decide to stop splitting:
- Validation set
- Cross-validation
- Hypothesis testing
- This gives us three data splits: a training set, a validation set and a test set
- Keep splitting nodes level by level, or even node by node, using only the training data to learn decisions, until the error on the validation set stops going down
Cross-validation
In n-fold cross-validation, a dataset is split into n roughly equally sized partitions, such that each example is assigned to one and only one fold. At each iteration a hypothesis is learned using n−1 folds as the training set and predictions are made on the n-th fold. This is repeated until a prediction has been made for every example, giving an error rate for the entire dataset.
Cross-validation maximises the amount of data available to train and test on, at the cost of increased time to perform the evaluation.
- Split training data in a number of folds
- For each fold, train on all other folds and make predictions on the held-out test fold
- Combine all predictions and calculate error
- If error has gone down, continue splitting nodes, otherwise, stop
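The loop above can be sketched as follows. The model here is a deliberately trivial stand-in (always predict the majority training label) just to make the fold logic runnable; in the stopping criterion it would be replaced by the tree before and after a candidate split:

```python
from collections import Counter

def cross_val_error(X, y, n_folds, fit, predict):
    """Error rate over the whole dataset from n-fold cross-validation."""
    errors = 0
    for k in range(n_folds):
        # examples are assigned to folds round-robin; fold k is held out
        train = [(xi, yi) for i, (xi, yi) in enumerate(zip(X, y))
                 if i % n_folds != k]
        test = [(xi, yi) for i, (xi, yi) in enumerate(zip(X, y))
                if i % n_folds == k]
        model = fit(train)
        errors += sum(predict(model, xi) != yi for xi, yi in test)
    return errors / len(y)

# Trivial stand-in model: always predict the majority training label.
fit = lambda train: Counter(yi for _, yi in train).most_common(1)[0][0]
predict = lambda model, x: model

X = list(range(6))
y = ["A", "A", "A", "A", "B", "B"]
err = cross_val_error(X, y, n_folds=3, fit=fit, predict=predict)
print(err)
```

Comparing `err` before and after splitting a node gives the stopping test: continue while the cross-validated error keeps going down.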
Pruning
- First, fully train a tree without a stopping criterion
- After training, prune tree by eliminating pairs of leaf nodes for which the impurity penalty is small
- More complex methods are also possible
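A minimal sketch of the leaf-pair elimination step, under the assumption that each leaf stores the raw class counts it received (so impurity can be recomputed) and using Gini impurity; the dict-based tree and `min_gain` threshold are illustrative:

```python
def gini(counts):
    """Gini impurity from a dict of class counts."""
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values())

def prune(node, min_gain=0.01):
    """Bottom-up: collapse sibling leaf pairs whose split barely helps."""
    if "counts" in node:          # already a leaf
        return node
    node["left"] = prune(node["left"], min_gain)
    node["right"] = prune(node["right"], min_gain)
    if "counts" in node["left"] and "counts" in node["right"]:
        l, r = node["left"]["counts"], node["right"]["counts"]
        merged = {k: l.get(k, 0) + r.get(k, 0) for k in set(l) | set(r)}
        n_l, n_r = sum(l.values()), sum(r.values())
        gain = gini(merged) - (n_l * gini(l) + n_r * gini(r)) / (n_l + n_r)
        if gain < min_gain:       # impurity penalty is small: merge leaves
            return {"counts": merged}
    return node

# A split that separates the classes almost not at all.
tree = {"left": {"counts": {"A": 10, "B": 9}},
        "right": {"counts": {"A": 9, "B": 10}}}
pruned = prune(tree)
print("counts" in pruned)  # True: the near-useless split was removed
```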
Missing attributes
- It is common to have examples in your dataset with missing attributes/variables
- One way of training a tree in the presence of missing attributes is to remove all data points with any missing attribute
- A better method is to only remove data points that are missing the attribute required for the test at a given node
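The per-node strategy can be sketched as follows, assuming `None` marks a missing value; `usable_for_split` is an illustrative helper, not a library function:

```python
def usable_for_split(X, y, attr):
    """Keep only the examples that have a value for the tested attribute."""
    pairs = [(xi, yi) for xi, yi in zip(X, y) if xi[attr] is not None]
    return [xi for xi, _ in pairs], [yi for _, yi in pairs]

X = [[1.0, None], [2.0, 5.0], [None, 6.0]]
y = ["A", "B", "B"]

Xa, ya = usable_for_split(X, y, 0)  # only the third example is dropped
print(len(Xa))  # 2
```

Removing rows globally would discard all three examples here, since each is missing some attribute; filtering per node keeps two of them for this particular test.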
Random forests
- Variety of methods to create random forest:
- Use bootstrapping/bagging to initialise each tree with different data
- Use only a subset of variables at each node
- Use a random optimisation criterion at each node
- Project the features onto a different random manifold at each node
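The first two randomisation strategies can be sketched with the standard library alone; the dataset and the idea of feeding each bootstrap sample to a separate tree learner are illustrative:

```python
import random

random.seed(0)  # for reproducibility of the sketch

def bootstrap(X, y):
    """Sample len(X) examples with replacement (bagging) for one tree."""
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_feature_subset(n_features, k):
    """Pick k of the n variables to consider at a given node."""
    return random.sample(range(n_features), k)

X = [[i, i * 2, i * 3] for i in range(10)]
y = [i % 2 for i in range(10)]

forest_data = [bootstrap(X, y) for _ in range(5)]  # one sample per tree
subset = random_feature_subset(n_features=3, k=2)  # variables for one split
print(len(forest_data), len(subset))  # 5 2
```

Each tree then sees a slightly different dataset and considers different variables at each split, which decorrelates the trees and lets the averaged forest generalise better than any single tree.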