[MLE] Decision Trees
Overview
- Decision Trees
- Linear vs non-linear classifiers
- Entropy
- C4.5
- Random Forests
- Some fundamental concepts
The process of making a prediction for a new input x can be described as a sequential decision-making process corresponding to the traversal of a binary tree (one that splits into two branches at each node). Here we focus on a particular tree-based framework called classification and regression trees (CART).
Basic Decision Trees
Decision trees apply a sequence of linear decisions that often depend on only a single variable at a time. Such trees partition the input space into cuboid regions, gradually refining the level of detail of the decision until a leaf node is reached, which provides the final predicted label.
As the above figure indicates, classification with a tree follows these rules:
- Start at the root node
- Evaluate the relevant attribute of that node
- Follow the branch corresponding to the attribute's value to a descendant node
- The descendant node can then be considered the root node of a subtree
- Repeat until a leaf node is reached, then assign the associated label
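The traversal rules above can be sketched in a few lines of Python. The tree representation here is hypothetical (dicts with `attr`, `thresh`, `left`, `right` for internal nodes and `label` for leaves), chosen purely for illustration:

```python
def predict(node, x):
    """Traverse the tree from the root until a leaf is reached."""
    while "label" not in node:            # internal node: keep descending
        if x[node["attr"]] < node["thresh"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]                  # leaf: assign its class label

# A tiny two-level tree: split on attribute 0, then on attribute 1.
tree = {
    "attr": 0, "thresh": 0.5,
    "left": {"label": "A"},
    "right": {"attr": 1, "thresh": 2.0,
              "left": {"label": "B"},
              "right": {"label": "C"}},
}

print(predict(tree, [0.2, 9.9]))  # attribute 0 < 0.5 -> "A"
print(predict(tree, [0.9, 3.0]))  # right branch, attribute 1 >= 2.0 -> "C"
```

Each step evaluates one attribute and follows one branch, so prediction cost grows with the depth of the tree, not the size of the training set.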
Learning trees
- Given a set of data D = {X,y}, we want to find a tree structure and decision criteria for each node n
- Each node splits D into two (or more) disjoint subsets \(D_{l,i}\)
- If all labels \(y_{l,i}\) in a subset belong to the same class \(c_j \in C\), the subset is said to be pure; this signifies a leaf node and terminates (this part of) the tree generation procedure
- Otherwise the node will be used to split the data further
- It is not always desirable to reach pure subsets, so termination strategies need to be a bit more sophisticated
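A minimal sketch of one such splitting step, assuming single-variable threshold tests and entropy as the impurity measure (introduced below); `best_split` and the tiny dataset are illustrative names, not part of any library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def is_pure(labels):
    """A subset is pure when all labels belong to the same class."""
    return len(set(labels)) == 1

def best_split(X, y, attr):
    """Try midpoints between sorted values of one attribute; return the
    (weighted impurity, threshold) pair minimising impurity of the subsets."""
    values = sorted({x[attr] for x in X})
    best = (float("inf"), None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(X, y) if xi[attr] < t]
        right = [yi for xi, yi in zip(X, y) if xi[attr] >= t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if score < best[0]:
            best = (score, t)
    return best

X = [[1.0], [2.0], [3.0], [4.0]]
y = ["A", "A", "B", "B"]
score, thresh = best_split(X, y, 0)
print(thresh)  # 2.5 -> both subsets are pure, weighted entropy 0
```

In a full learner this step would be repeated recursively on each impure subset until a termination criterion fires.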
CART
- CART = Classification And Regression Trees
- Generic framework for designing decision tree algorithms
- Six general questions guide the design of the algorithm (not listed here)
- We prefer simple, compact trees, following Occam’s Razor
- To do so, we will seek to minimise impurity of data reaching descendent nodes
There are several ways to calculate impurity:
Entropy Impurity
Say node N has 10 examples of class C1, and 30 examples of class C2. Calculate the entropy impurity:
\(i(N) = -\left(\frac{10}{40}\log_2\frac{10}{40} + \frac{30}{40}\log_2\frac{30}{40}\right) = 0.8113\)
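The worked example above can be checked directly:

```python
from math import log2

# Entropy impurity for a node with 10 examples of C1 and 30 of C2.
p = [10 / 40, 30 / 40]
i_N = -sum(pj * log2(pj) for pj in p)
print(round(i_N, 4))  # 0.8113
```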
Gini/Variance Impurity
- For the two-class case the variance impurity is \(i(N) = P(C_1)P(C_2)\); the Gini impurity generalises this to \(i(N) = 1 - \sum_j P(C_j)^2\)
Misclassification impurity
- \(i(N) = 1 - \max_j P(C_j)\), the minimum probability of misclassifying an example at node N
The graph of three impurity measures
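For comparison with the entropy example, the Gini and misclassification impurities of the same node (10 examples of C1, 30 of C2) can be computed as:

```python
# Gini and misclassification impurity for a node with
# P(C1) = 10/40 and P(C2) = 30/40.
p = [10 / 40, 30 / 40]

gini = 1 - sum(pj ** 2 for pj in p)   # 1 - (0.0625 + 0.5625)
misclass = 1 - max(p)                 # 1 - 0.75
print(gini, misclass)  # 0.375 0.25
```

All three measures peak when the classes are equally likely and vanish for a pure node; they differ mainly in how sharply they penalise mixed nodes.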
Twoing Criterion
- Strategy for dealing with large multi-class problems
- Split data for all classes C at node N into groups with subsets of classes \(C_1\) and \(C_2\), where \(C_2 = C - C_1\)
- Then proceed to treat the problem as binary classification with \(\widehat{C}_1 = C_1\) and \(\widehat{C}_2 = C_2\)
- Repeat with different assignments of classes to \(C_1\) until the grouping with the highest information gain is found
- Once a node is determined to be a leaf node, it has to be assigned a category label \(C_l\)
- This is done by simply looking at the majority category of all examples in this leaf node \(N_l\)
- \(C_l = \arg\max_j P(C_j)\)
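The majority-vote rule for labelling a leaf is a one-liner; `leaf_label` is an illustrative name:

```python
from collections import Counter

def leaf_label(labels):
    """Assign the majority class of the examples reaching a leaf."""
    return Counter(labels).most_common(1)[0][0]

print(leaf_label(["C2", "C1", "C2", "C2", "C1"]))  # C2
```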
Overfitting
A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data.
Overfitting presents itself as complicated decision boundaries that depend strongly on individual training examples.
- In the figure, the first decision boundary fits the training data more closely, but it is overfitting
Stopping Criteria
Reaching a node with a pure sample is always possible, but usually not desirable, as it tends to cause over-fitting.
Three common ways to decide to stop splitting:
- Validation set
- Cross-validation
- Hypothesis testing
- This gives us three data splits: a training set, a validation set and a test set
- Keep splitting nodes level by level, or even node by node, using only the training data to learn decisions, until the error on the validation set stops going down
Cross-validation
In n-fold cross-validation, a dataset is split into n roughly equally sized partitions, such that each example is assigned to one and only one fold. At each iteration a hypothesis is learned using n−1 folds as the training set and predictions are made on the n-th fold. This is repeated until a prediction has been made for every example, giving an error rate for the entire dataset.
Cross-validation maximises the amount of data available to train and test on, at the cost of increased time to perform the evaluation.
- Split training data in a number of folds
- For each fold, train on all other folds and make predictions on the held-out test fold
- Combine all predictions and calculate error
- If error has gone down, continue splitting nodes, otherwise, stop
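The loop above can be sketched as follows. The model here is a deliberately trivial stand-in (always predict the majority training label) just to make the fold logic runnable; in the stopping criterion it would be replaced by the tree before and after a candidate split:

```python
from collections import Counter

def cross_val_error(X, y, n_folds, fit, predict):
    """Error rate over the whole dataset from n-fold cross-validation."""
    errors = 0
    for k in range(n_folds):
        # examples are assigned to folds round-robin; fold k is held out
        train = [(xi, yi) for i, (xi, yi) in enumerate(zip(X, y))
                 if i % n_folds != k]
        test = [(xi, yi) for i, (xi, yi) in enumerate(zip(X, y))
                if i % n_folds == k]
        model = fit(train)
        errors += sum(predict(model, xi) != yi for xi, yi in test)
    return errors / len(y)

# Trivial stand-in model: always predict the majority training label.
fit = lambda train: Counter(yi for _, yi in train).most_common(1)[0][0]
predict = lambda model, x: model

X = list(range(6))
y = ["A", "A", "A", "A", "B", "B"]
err = cross_val_error(X, y, n_folds=3, fit=fit, predict=predict)
print(err)
```

Comparing `err` before and after splitting a node gives the stopping test: continue while the cross-validated error keeps going down.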
Pruning
- First, fully train a tree without a stopping criterion
- After training, prune tree by eliminating pairs of leaf nodes for which the impurity penalty is small
- More complex methods are also possible
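A minimal sketch of the leaf-pair elimination step, under the assumption that each leaf stores the raw class counts it received (so impurity can be recomputed) and using Gini impurity; the dict-based tree and `min_gain` threshold are illustrative:

```python
def gini(counts):
    """Gini impurity from a dict of class counts."""
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values())

def prune(node, min_gain=0.01):
    """Bottom-up: collapse sibling leaf pairs whose split barely helps."""
    if "counts" in node:          # already a leaf
        return node
    node["left"] = prune(node["left"], min_gain)
    node["right"] = prune(node["right"], min_gain)
    if "counts" in node["left"] and "counts" in node["right"]:
        l, r = node["left"]["counts"], node["right"]["counts"]
        merged = {k: l.get(k, 0) + r.get(k, 0) for k in set(l) | set(r)}
        n_l, n_r = sum(l.values()), sum(r.values())
        gain = gini(merged) - (n_l * gini(l) + n_r * gini(r)) / (n_l + n_r)
        if gain < min_gain:       # impurity penalty is small: merge leaves
            return {"counts": merged}
    return node

# A split that separates the classes almost not at all.
tree = {"left": {"counts": {"A": 10, "B": 9}},
        "right": {"counts": {"A": 9, "B": 10}}}
pruned = prune(tree)
print("counts" in pruned)  # True: the near-useless split was removed
```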
Missing attributes
- It is common to have examples in your dataset with missing attributes/variables
- One way of training a tree in the presence of missing attributes is to remove all data points with any missing attribute
- A better method is to only remove data points that are missing the attribute required for the test at a given node
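The per-node strategy can be sketched as follows, assuming `None` marks a missing value; `usable_for_split` is an illustrative helper, not a library function:

```python
def usable_for_split(X, y, attr):
    """Keep only the examples that have a value for the tested attribute."""
    pairs = [(xi, yi) for xi, yi in zip(X, y) if xi[attr] is not None]
    return [xi for xi, _ in pairs], [yi for _, yi in pairs]

X = [[1.0, None], [2.0, 5.0], [None, 6.0]]
y = ["A", "B", "B"]

Xa, ya = usable_for_split(X, y, 0)  # only the third example is dropped
print(len(Xa))  # 2
```

Removing rows globally would discard all three examples here, since each is missing some attribute; filtering per node keeps two of them for this particular test.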
Random forests
- Variety of methods to create random forest:
- Use bootstrapping/bagging to initialise each tree with different data
- Use only a subset of variables at each node
- Use a random optimisation criterion at each node
- Project the features onto a different random manifold at each node
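The first two randomisation strategies can be sketched with the standard library alone; the dataset and the idea of feeding each bootstrap sample to a separate tree learner are illustrative:

```python
import random

random.seed(0)  # for reproducibility of the sketch

def bootstrap(X, y):
    """Sample len(X) examples with replacement (bagging) for one tree."""
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_feature_subset(n_features, k):
    """Pick k of the n variables to consider at a given node."""
    return random.sample(range(n_features), k)

X = [[i, i * 2, i * 3] for i in range(10)]
y = [i % 2 for i in range(10)]

forest_data = [bootstrap(X, y) for _ in range(5)]  # one sample per tree
subset = random_feature_subset(n_features=3, k=2)  # variables for one split
print(len(forest_data), len(subset))  # 5 2
```

Each tree then sees a slightly different dataset and considers different variables at each split, which decorrelates the trees and lets the averaged forest generalise better than any single tree.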