
Decision Trees

Code | Part 2: Random Forest

Decision trees have high variance: even a small change in the training data can produce a drastically different tree, and even a tree with high training accuracy may fail to predict the class of a new datapoint. To address this, Random Forest trains many decision tree classifiers and combines their predictions. Each tree in the forest votes on the class of the new datapoint, and the class with the most votes becomes the final prediction. As the results below show, accuracy improves when Random Forest is used in place of a single tree.
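
To make the voting concrete, here is a minimal sketch of the idea: several shallow trees are trained on bootstrap samples of the training data, and their predictions are combined by majority vote. The iris data, the 25-tree count, and the default split below are illustrative assumptions, not taken from this article.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and split (iris assumed; not specified in the article)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
votes = []
for _ in range(25):
    # Bootstrap sample: draw n training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=2).fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

# Majority vote across the 25 trees for each test point
votes = np.array(votes)
y_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("Voting accuracy:", (y_pred == y_test).mean())

RandomForestClassifier does essentially this, with the extra twist that each split also considers only a random subset of features, which decorrelates the trees further.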


The number of trees used by scikit-learn's RandomForestClassifier is set with the n_estimators hyperparameter. The default value is 100; the example below uses 50 trees.


from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# X_train, X_test, y_train, y_test are assumed to come from the
# train/test split built in Part 1

# Train a Random Forest classifier with 50 trees
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=50)
clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
print(y_pred)

[2 1 0 1 0 2 1 0 2 1 0 0 1 0 1 1 2 0 1 0 0 1 2 0 0 2 0 0 0 2 1 2 2 0 1 1 1 1 1 0 0 1 2 0 0 0 1 0 0 0 1 2 2 0]

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Which classes are commonly misclassified?
print('Confusion Matrix')
print(metrics.confusion_matrix(y_test, y_pred, labels=None))

Accuracy: 0.9814814814814815
Confusion Matrix
[[23  0  0]
 [ 1 18  0]
 [ 0  0 12]]
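
Since n_estimators is a hyperparameter, a natural next step is to compare accuracy across several values. Here is a minimal sketch, again assuming the iris data and split used in the voting example above; the candidate values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Illustrative data and split (iris assumed; not specified in the article)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare test accuracy across several forest sizes
for n in [10, 50, 100, 200]:
    clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=n)
    clf.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(X_test))
    print(f"n_estimators={n}: accuracy={acc:.3f}")

Accuracy typically rises quickly with the first few dozen trees and then flattens out, so larger forests mainly cost training time rather than hurting accuracy.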