In this post we discuss different kinds of decision trees.

There are two basic types of decision tree, plus several ensemble methods built on top of them.

  • Classification Tree
    • The simplest kind. See Bootstrap and Bagging for details.
  • Regression Tree
    • Makes use of variance as the splitting criterion (a minimal sketch appears after this list).
  • AdaBoost with decision trees
    • same as random forest, except that the weights of each classifier and of each sample are re-calculated at every iteration (a sketch of the weight updates appears after this list)
  • GBDT (Gradient Boosted Decision Tree), also known as MART (Multiple Additive Regression Trees)
    • think of it as a residual net: each new tree fits what the previous trees got wrong
    • two kinds of implementation
      • the new tree is trained on the negative gradient of the loss
      • the new tree is trained on the difference (residual)
    • loop procedure (sketched in code after this list); initial setup: $y_{label} = y_{actual}$
      • $y_n = y_{label} - \hat{y}_{n-1}$, where $\hat{y}_{n-1}$ represents the ensemble prediction of all previous trees, and $y_n$ is the new tree learned from either the difference or the gradient
      • $\hat{y}_n = \hat{y}_{n-1} + y_n$, the new tree is then added to the ensemble
      • $y_{label}$ = the difference (residual) or the negative gradient, which becomes the target for the next iteration
    • shrinkage version
      • replace step 2 in the loop with $\hat{y}_n = \hat{y}_{n-1} + \mathrm{step} \cdot y_n$, where $\mathrm{step}$ is the learning rate; everything else is the same
      • requires more computational resources, since more trees are needed to reach the same training error
      • helps prevent overfitting, although there is no theory that proves it
    • other variant: stochastic sampling of features and bootstrap sampling.
    • Reference
  • XGBoost
    • makes use of a 2nd-order Taylor expansion of the loss
    • the 2nd-order expansion reduces the number of iterations needed and therefore speeds up training
    • adds two regularization terms to prevent the tree from overfitting
    • [image: the loss function]
    • [image: the regularization function]
    • where γ and λ are two hyperparameters: γ is a threshold value and λ is a smoothing factor (a sketch of the resulting leaf weight and split gain appears after this list)
    • Reference - Usage
    • Reference - Induction
    • Features I don’t understand
      • how parallel processing is implemented in XGBoost
      • Its relationship to GBDT
      • How it handles missing features
      • If the 2nd-order derivative is used, why not the 3rd order?
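
For the regression tree's variance criterion above, here is a minimal sketch (toy code of my own, single-feature case, function names are mine): a candidate split is scored by how much it reduces the variance of the target, and the threshold with the largest reduction wins.

```python
import numpy as np

def variance_reduction(y, y_left, y_right):
    # drop in weighted variance achieved by splitting y into two children
    n, n_l, n_r = len(y), len(y_left), len(y_right)
    return np.var(y) - (n_l / n) * np.var(y_left) - (n_r / n) * np.var(y_right)

def best_split(x, y):
    # scan a single feature for the threshold with the largest variance reduction
    best_gain, best_thr = -np.inf, None
    for thr in np.unique(x)[:-1]:
        mask = x <= thr
        gain = variance_reduction(y, y[mask], y[~mask])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```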
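
For the AdaBoost item, a toy sketch of the standard discrete AdaBoost weight updates (not the comparison with random forest itself): it assumes labels in {-1, +1} and uses depth-1 scikit-learn trees as weak classifiers; both the per-sample weights and the per-classifier weight alpha are re-computed at every iteration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # y must be encoded as +1 / -1
    n = len(y)
    w = np.full(n, 1.0 / n)                    # sample weights, updated every round
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)         # misclassified samples get heavier
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```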
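
The GBDT loop and its shrinkage variant, sketched for squared loss, where the negative gradient is exactly the residual, so the "difference based" and "gradient based" implementations coincide. This is toy code of my own on top of scikit-learn regression trees, with `step` as the shrinkage factor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, step=0.1, max_depth=3):
    f = np.full(len(y), y.mean())         # y_hat_0: initial ensemble prediction
    base = float(f[0])
    trees = []
    for _ in range(n_trees):
        residual = y - f                  # y_label - y_hat_{n-1}; negative gradient for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + step * tree.predict(X)    # shrinkage: y_hat_n = y_hat_{n-1} + step * y_n
        trees.append(tree)
    return base, trees

def gbdt_predict(base, trees, X, step=0.1):
    return base + step * sum(t.predict(X) for t in trees)
```

Setting `step = 1` recovers the plain loop; the stochastic variant mentioned above would additionally subsample rows and features before fitting each tree.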
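
For the XGBoost objective, a small sketch of the closed forms that fall out of the 2nd-order Taylor expansion with the standard regularizer $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$ (T = number of leaves, $w_j$ = leaf weights). The formulas follow the XGBoost paper; the code and variable names are mine. Here `g` and `h` are the per-sample first and second derivatives of the loss with respect to the current prediction. This is also the concrete sense in which γ acts as a threshold and λ as a smoothing factor.

```python
import numpy as np

def leaf_weight(g, h, lam):
    # optimal leaf weight from the 2nd-order expansion: w* = -G / (H + lambda);
    # lambda in the denominator shrinks the weight toward zero (the "smoothing")
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # gain of a candidate split; gamma is subtracted, so the split is only
    # worth keeping if the structural improvement exceeds gamma (the "threshold")
    def score(g, h):
        return np.sum(g) ** 2 / (np.sum(h) + lam)
    parent = score(np.concatenate([g_left, g_right]),
                   np.concatenate([h_left, h_right]))
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma
```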