[MLE] Decision Trees

Overview

  • Decision Trees
  • Linear vs non-linear classifiers
  • Entropy
  • C4.5
  • Random Forests
  • Some fundamental concepts

The process of selecting a specific model, given a new input x, can be described by a sequential decision making process corresponding to the traversal of a binary tree (one that splits into two branches at each node). Here we focus on a particular tree-based framework called classification and regression trees (CART).

Basic Decision Trees

Decision trees apply a sequence of linear decisions, each of which often depends on only a single variable at a time. Such trees partition the input space into cuboid regions, gradually refining the level of detail of a decision until a leaf node is reached, which provides the final predicted label.

[Figure: an example decision tree and the corresponding partition of the input space]

As the figure above indicates, classification with a tree follows these rules (a minimal traversal sketch follows this list):

  • Start at the root node
  • Evaluate the relevant attribute of that node
  • Follow the branch corresponding to the attribute's value to a descendant node
  • The descendant node can then be considered the root node of a subtree
  • Repeat until a leaf node is reached, and assign the associated label
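To make the traversal concrete, here is a minimal sketch. The `Node` class, its attribute names and the example tree are illustrative assumptions, not part of the lecture material:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A node of a binary decision tree (illustrative structure)."""
    attribute: Optional[int] = None    # index of the attribute tested at this node
    threshold: Optional[float] = None  # go left if x[attribute] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # set only for leaf nodes


def predict(node: Node, x: list) -> str:
    """Walk the tree from the root until a leaf is reached."""
    while node.label is None:                  # internal node: keep descending
        if x[node.attribute] <= node.threshold:
            node = node.left                   # follow the branch matching the test
        else:
            node = node.right
    return node.label                          # leaf node: assign its label


# Tiny example tree that classifies points by their first coordinate.
tree = Node(attribute=0, threshold=0.5,
            left=Node(label="c1"), right=Node(label="c2"))
print(predict(tree, [0.3, 1.0]))  # -> c1
```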

Learning trees

  • Given a set of data D = {X, y}, we want to find a tree structure and a decision criterion for each node n
  • Each node splits D into two (or more) disjoint subsets \(D_{l,i}\)
  • If all labels \(y_{l,i}\) belong to the same class \(c_j \in C\), the subset is said to be pure; this signifies a leaf node and terminates (this part of) the tree generation procedure (a minimal purity check is sketched after this list)
  • Otherwise, the node will be used to further split the data
  • It is not always desirable to reach pure subsets, so termination strategies need to be a bit more sophisticated
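A minimal sketch of the purity check that terminates the recursion (the function name is my own):

```python
def is_pure(labels) -> bool:
    """A subset is pure if every example in it carries the same label."""
    return len(set(labels)) <= 1


print(is_pure(["c1", "c1", "c1"]))  # True  -> becomes a leaf node
print(is_pure(["c1", "c2", "c1"]))  # False -> the node splits the data further
```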

CART

  • CART = Classification And Regression Trees
  • Generic framework for designing decision tree algorithms
  • Six general questions determine the algorithm (not listed here)
  • We prefer simple, compact trees, following Occam’s Razor
  • To do so, we seek to minimise the impurity of the data reaching the descendant nodes

There are several ways to calculate impurity:

Entropy Impurity

The entropy impurity of a node N is \(i(N) = -\sum_j P(c_j)\log_2 P(c_j)\), where \(P(c_j)\) is the fraction of examples at N belonging to class \(c_j\).
Say node N has 10 examples of class C1, and 30 examples of class C2. Calculate the entropy impurity:
\(i(N) = -\left(\tfrac{10}{40}\log_2\tfrac{10}{40} + \tfrac{30}{40}\log_2\tfrac{30}{40}\right) \approx 0.8113\)
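The example can be checked in code; the helper name is mine, and the formula is the entropy impurity stated above:

```python
import math


def entropy_impurity(counts):
    """i(N) = -sum_j P(c_j) log2 P(c_j), with P(c_j) estimated from class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Node N with 10 examples of class C1 and 30 examples of class C2:
print(round(entropy_impurity([10, 30]), 4))  # -> 0.8113
```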

Gini/Variance Impurity

For a two-class problem, the variance impurity is \(i(N) = P(c_1)P(c_2)\). Its multi-class generalisation is the Gini impurity:

\(i(N) = \sum_{i \ne j} P(c_i)P(c_j) = 1 - \sum_j P(c_j)^2\)

Misclassification impurity

\(i(N) = 1 - \max_j P(c_j)\), i.e. the error rate when the node always predicts its majority class.
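The other two measures can be sketched the same way (function names are mine; the values shown are for the same 10/30 example as above):

```python
def gini_impurity(counts):
    """i(N) = 1 - sum_j P(c_j)^2."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def misclassification_impurity(counts):
    """i(N) = 1 - max_j P(c_j): the error of always predicting the majority class."""
    total = sum(counts)
    return 1.0 - max(counts) / total


print(round(gini_impurity([10, 30]), 4))               # -> 0.375
print(round(misclassification_impurity([10, 30]), 4))  # -> 0.25
```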

The three impurity measures can be compared graphically:

[Figure: comparison of the three impurity measures]
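An impurity measure is used to choose splits: at each node we prefer the test whose descendant nodes have the lowest weighted impurity, i.e. the largest impurity drop. A minimal, self-contained sketch for a threshold test on a single variable (all names are mine):

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy impurity of a set of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def weighted_child_impurity(left, right):
    """Impurity of the two descendant nodes, weighted by their sizes."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


def best_threshold_split(xs, labels):
    """Try every threshold on one variable and keep the split whose
    descendants have the lowest weighted impurity."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, labels) if x <= t]
        right = [y for x, y in zip(xs, labels) if x > t]
        if not left or not right:
            continue  # degenerate split: all data falls on one side
        score = weighted_child_impurity(left, right)
        if best is None or score < best[0]:
            best = (score, t)
    return best  # (weighted impurity of descendants, threshold)


xs = [0.1, 0.2, 0.8, 0.9]
labels = ["c1", "c1", "c2", "c2"]
print(best_threshold_split(xs, labels))  # -> (0.0, 0.2): a perfectly pure split
```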

Twoing Criterion

  • Strategy for dealing with large multi-class problems
  • Split data for all classes C at node N into groups with subsets of classes \(C_1\) and \(C_2\), where \(C_2 = C - C_1\)
  • Then treat the problem as a binary classification between the super-classes \(\widehat{C_1} = C_1\) and \(\widehat{C_2} = C_2\)
  • Repeat with different choices of \(C_1\) until you find the grouping with the highest information gain
  • Once a node is determined to be a leaf node, it has to be assigned a category label \(C_l\)
  • This is done by simply looking at the majority category of all examples in this leaf node \(N_l\)
  • \(C_l = \mathrm{argmax}_j\, P(C_j)\) (a one-line sketch follows this list)
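The majority-vote labelling is a one-liner (function name mine):

```python
from collections import Counter


def leaf_label(labels):
    """C_l = argmax_j P(C_j): the most frequent class among the examples in the leaf."""
    return Counter(labels).most_common(1)[0][0]


print(leaf_label(["c2", "c1", "c2", "c2"]))  # -> c2
```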

Overfitting

A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data.
It presents itself in complicated decision boundaries that depend strongly on individual training examples.
[Figure: a complex decision boundary that fits every training example vs a simpler, smoother boundary]

  • The first boundary appears more accurate on the training data, but it is overfitting

Stopping Criteria

Reaching nodes with pure subsets is always possible, but usually not desirable, as it tends to cause over-fitting.
Three common ways to decide when to stop splitting:

  • Validation set
  • Cross-validation
  • Hypothesis testing

Validation set

  • Split the data into a training set, a validation set and a test set
  • Keep splitting nodes level by level, or even node by node, using only the training data to learn the decisions, until the error on the validation set stops going down (a sketch follows this list)
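A hedged sketch of the validation-set criterion; `grow_one_level` and `error_rate` are hypothetical helpers, not functions given in the notes:

```python
def grow_with_validation(tree, train_data, val_data, grow_one_level, error_rate):
    """Split nodes level by level while the validation error keeps decreasing.

    Assumptions: `grow_one_level(tree, data)` returns a copy of the tree with its
    current leaves split one more level (or None if nothing can be split), and
    `error_rate(tree, data)` returns the misclassification rate on `data`.
    """
    best_error = error_rate(tree, val_data)
    while True:
        candidate = grow_one_level(tree, train_data)   # learn decisions on training data only
        if candidate is None:                          # no split left to try
            return tree
        candidate_error = error_rate(candidate, val_data)
        if candidate_error >= best_error:              # validation error stopped going down
            return tree
        tree, best_error = candidate, candidate_error
```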

Cross-validation

In n-fold cross-validation, a dataset is split into n roughly equally sized partitions, such that each example is assigned to one and only one fold. At each iteration a hypothesis is learned using n-1 folds as the training set and predictions are made on the held-out fold. This is repeated n times, until a prediction has been made for every example, and an error rate for the entire dataset is obtained.
Cross-validation maximises the amount of data available to train and test on, at the cost of increased time to perform the evaluation (a minimal sketch follows the list below).

  • Split the training data into a number of folds
  • For each fold, train on all other folds and make predictions on the held-out fold
  • Combine all predictions and calculate the error
  • If the error has gone down, continue splitting nodes; otherwise, stop
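A minimal sketch of the cross-validated error; `train` and `predict` stand in for whatever learner is being evaluated:

```python
def cross_validation_error(examples, labels, n_folds, train, predict):
    """Hold out each fold in turn, train on the remaining folds, and pool the predictions."""
    errors = 0
    for k in range(n_folds):
        train_X = [x for i, x in enumerate(examples) if i % n_folds != k]
        train_y = [y for i, y in enumerate(labels) if i % n_folds != k]
        held_out = [i for i in range(len(examples)) if i % n_folds == k]
        model = train(train_X, train_y)
        errors += sum(predict(model, examples[i]) != labels[i] for i in held_out)
    return errors / len(examples)  # error rate over the entire dataset
```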

Pruning

  • First fully train a tree, without a stopping criterion
  • After training, prune the tree by eliminating pairs of leaf nodes for which the impurity penalty is small (see the sketch after this list)
  • More complex methods are also possible
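A hedged sketch of the simplest version: collapse a node whose children are both leaves whenever merging them increases the impurity only slightly. It assumes a `Node` structure like the traversal sketch above, extended with a `labels` attribute holding the training labels that reached each node; these are my assumptions, not part of the notes:

```python
import math
from collections import Counter


def entropy_of(labels):
    """Entropy impurity of the labels reaching a node."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def prune(node, threshold):
    """Recursively merge leaf pairs whose impurity penalty is below `threshold`."""
    if node.label is not None:          # already a leaf
        return node
    node.left = prune(node.left, threshold)
    node.right = prune(node.right, threshold)
    if node.left.label is not None and node.right.label is not None:
        n_l, n_r = len(node.left.labels), len(node.right.labels)
        child_impurity = (n_l * entropy_of(node.left.labels)
                          + n_r * entropy_of(node.right.labels)) / (n_l + n_r)
        penalty = entropy_of(node.labels) - child_impurity   # impurity gained by merging
        if penalty < threshold:
            node.label = Counter(node.labels).most_common(1)[0][0]  # majority label
            node.left = node.right = None
    return node
```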

Missing attributes

  • It is common to have examples in your dataset with missing attributes/variables
  • One way of training a tree in the presence of missing attributes is to remove all data points with any missing attribute
  • A better method is to exclude a data point only when evaluating a test on an attribute it is actually missing, rather than removing it from the dataset entirely (a sketch follows this list)
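A small sketch of the per-node approach: when evaluating a test on one attribute, drop only the examples that are missing that particular attribute (names are mine; missing values are represented as None):

```python
def usable_for_attribute(examples, labels, attribute):
    """Keep only the examples that actually have a value for this attribute."""
    kept = [(x, y) for x, y in zip(examples, labels) if x[attribute] is not None]
    return [x for x, _ in kept], [y for _, y in kept]


examples = [[0.2, None], [0.7, 1.3], [None, 0.4]]
labels = ["c1", "c2", "c1"]
# When testing attribute 0, only the third example is dropped,
# not every example that has some missing value somewhere.
print(usable_for_attribute(examples, labels, 0))
# -> ([[0.2, None], [0.7, 1.3]], ['c1', 'c2'])
```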

Random forests


  • There is a variety of methods to create a random forest (a sketch follows this list):
  • Use bootstrapping/bagging to initialise each tree with different data
  • Use only a subset of the variables at each node
  • Use a random optimisation criterion at each node
  • Project the features onto a different random manifold at each node
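A hedged sketch of the first two ideas (bagging plus random feature subsets); `grow_tree` and `predict_tree` are hypothetical helpers, and for brevity the feature subset here is drawn once per tree rather than afresh at every node:

```python
import random
from collections import Counter


def bootstrap_sample(examples, labels):
    """Sample the dataset with replacement (bagging)."""
    idx = [random.randrange(len(examples)) for _ in range(len(examples))]
    return [examples[i] for i in idx], [labels[i] for i in idx]


def random_forest(examples, labels, grow_tree, n_trees=10, n_features=2):
    """Train each tree on its own bootstrap sample, restricted to a random subset of variables."""
    n_total = len(examples[0])
    forest = []
    for _ in range(n_trees):
        X, y = bootstrap_sample(examples, labels)
        features = random.sample(range(n_total), min(n_features, n_total))
        forest.append((features, grow_tree(X, y, features)))
    return forest


def forest_predict(forest, predict_tree, x):
    """Combine the trees by majority vote."""
    votes = [predict_tree(tree, x, features) for features, tree in forest]
    return Counter(votes).most_common(1)[0][0]
```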
