
Decision Trees

Introduction

Decision trees can provide an efficient classification option when the data cannot be separated linearly. The most commonly used decision tree classifier splits the dataset into hyperrectangles with sides parallel to the axes. A decision tree starts with the root node, which contains the entire dataset. The root node is then split into two descendant nodes, and the splits continue until a node becomes a leaf, typically one containing datapoints of only one class. At each split, the classifier creates nodes that are more homogeneous with respect to class than the parent node's subset of the datapoints. To do this, the classifier measures the impurity of each node and chooses the split that optimally decreases the impurity of the descendant nodes. Two common measures of impurity are Gini impurity and entropy impurity.
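To make the splitting step concrete, the rough sketch below brute-forces the axis-aligned threshold that gives the largest decrease in weighted Gini impurity. The function names and the exhaustive search are purely illustrative; real implementations (including scikit-learn's) use much more efficient search strategies.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels: 1 - sum_i p_i^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search every feature and threshold for the split that most
    decreases the weighted impurity of the two descendant nodes."""
    n_samples, n_features = X.shape
    parent_impurity = gini(y)
    best = None  # (impurity_decrease, feature, threshold)
    for feature in range(n_features):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            # Descendant impurities weighted by the fraction of samples they receive
            child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            decrease = parent_impurity - child_impurity
            if best is None or decrease > best[0]:
                best = (decrease, feature, threshold)
    return best
```

Growing the full tree then amounts to applying this search recursively to each descendant's subset of the data until a stopping criterion is met.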


Gini impurity is equal to one minus the sum of the squared probabilities of a datapoint in the node belonging to each class.
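Written out, with p_i denoting the fraction of the node's datapoints that belong to class i and C the number of classes:

\[ G = 1 - \sum_{i=1}^{C} p_i^{2} \]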



The maximum value occurs when the probability of the sample belonging to each class is the same; in a classification problem with three classes, this maximum is approximately 0.667. The minimum value, 0, occurs when the node contains only one class.
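The 0.667 figure follows directly from the definition, with p_i = 1/3 for each of the three classes:

\[ 1 - 3\left(\tfrac{1}{3}\right)^{2} = 1 - \tfrac{1}{3} = \tfrac{2}{3} \approx 0.667 \]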


Entropy impurity is computed using the entropy equation from Shannon's information theory.
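In the same notation (taking 0 log 0 = 0, and using base-2 logarithms so the value is measured in bits):

\[ H = -\sum_{i=1}^{C} p_i \log_2 p_i \]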



The maximum value likewise occurs when the probability of the sample belonging to each class is the same, and the minimum value, 0, occurs when the node contains only one class.
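With three equally likely classes, for example, the maximum entropy is

\[ -3 \cdot \tfrac{1}{3}\log_2\tfrac{1}{3} = \log_2 3 \approx 1.585 \text{ bits.} \]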


Entropy impurity is more expensive to compute, since it requires evaluating a logarithm for each class, so Gini impurity is more widely used.


In addition to determining the purity of each node, the classifier needs a stopping criterion. One option is to stop when a node is pure (all datapoints in the node belong to one class). Another option is to stop when a node contains a small enough subset of the data, regardless of whether the node is pure. By default, the decision tree classifier in scikit-learn stops when the node is pure. Additionally, the max_depth parameter allows control over the size of the tree to help prevent overfitting.
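As a minimal sketch of these options in scikit-learn (the dataset and the specific parameter values here are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default behaviour: nodes keep splitting until every leaf is pure
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
full_tree.fit(X_train, y_train)

# Alternative stopping criteria: cap the depth, or stop once a node holds
# fewer than a chosen number of samples, even if it is still impure
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
small_tree.fit(X_train, y_train)

print(full_tree.get_depth(), full_tree.score(X_test, y_test))
print(small_tree.get_depth(), small_tree.score(X_test, y_test))
```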


As with all classifiers, decision trees can overfit the data when the tree grows too large. When this happens, the tree learns details of the training set that do not generalize to new datapoints. The most common strategy to counter this is to grow the tree to a large size and then prune nodes.
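scikit-learn supports this grow-then-prune strategy through minimal cost-complexity pruning. Below is a rough sketch; picking the pruning strength ccp_alpha by accuracy on a held-out split is just one simple way to decide how much to prune.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths computed from the fully grown tree
path = DecisionTreeClassifier(random_state=0).fit(X_train, y_train) \
    .cost_complexity_pruning_path(X_train, y_train)

# Refit with each ccp_alpha and keep the value that does best on held-out data
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(f"best ccp_alpha={best_alpha:.4f}, "
      f"validation accuracy={best_score:.3f}, leaves={pruned.get_n_leaves()}")
```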






Image from: Murphy, K. P. (2021). Figures 16.4 and 16.5. In *Machine Learning: A Probabilistic Perspective*. MIT Press.
