Decision Tree Variants
In this post we discuss different decision tree variants.
There are two basic types of decision trees, plus several ensemble methods built on top of them.
- Classification Tree
- The simplest of the two. See Bootstrap and Bagging for details.
- Regression Tree
- Makes use of variance reduction as the splitting criterion: pick the split that most reduces the variance of the target in the child nodes.
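To make the variance criterion concrete, here is a minimal sketch in Python (the function name `variance_reduction` and the toy data are my own illustration, not from any library): a candidate split is scored by how much it lowers the target variance from the parent node to the weighted child nodes.

```python
import numpy as np

def variance_reduction(y, left_mask):
    """Drop in target variance from the parent node to the weighted child nodes."""
    y_left, y_right = y[left_mask], y[~left_mask]
    if len(y_left) == 0 or len(y_right) == 0:
        return 0.0  # degenerate split, nothing gained
    parent_var = np.var(y)
    child_var = (len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)) / len(y)
    return parent_var - child_var

# toy example: splitting feature 0 at threshold 2.5 separates the two target clusters
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 0.9, 3.0, 3.2])
print(variance_reduction(y, X[:, 0] < 2.5))  # large reduction -> good split
```

A regression tree greedily picks, at each node, the feature/threshold pair with the largest such reduction.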
- AdaBoost with decision trees
- same structure as a random forest, except that the weights of each classifier and of each sample are re-calculated at every iteration
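A rough sketch of that re-weighting, as a classic discrete AdaBoost loop over decision stumps (labels in {-1, +1}; the helper names `adaboost_fit` / `adaboost_predict` are my own, and this simplified loop is not the exact implementation of any library):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """y in {-1, +1}; returns a list of (alpha, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # sample weights, start uniform
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err == 0.0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)    # classifier weight, re-calculated each round
        w *= np.exp(-alpha * y * pred)           # sample weights: mistakes get heavier
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(alpha * stump.predict(X) for alpha, stump in ensemble))
```

The contrast with a random forest is exactly those two updates: a forest trains its trees independently on bootstrap samples, while AdaBoost re-weights both the samples and the trees every round.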
- GBDT (Gradient Boosted Decision Tree), also known as MART (Multiple Additive Regression Trees)
- think of it as a residual network: each new tree models the residual error of the current ensemble
- two kinds of implementation
- the new tree is trained on the negative gradient of the loss
- the new tree is trained on the difference (residual)
- loop procedure: initial setup $y_{label} = y_{actual}$ (a code sketch of this loop follows at the end of this item)
- $y_{n} = y_{label} - y_{1}^{n-1}$, where $y_1^{n-1}$ represents the ensemble prediction of all previous trees, and $y_n$ is the new tree learned from either the difference or the gradient
- $y_1^n = y_1^{n-1} + y_n$, i.e. the new tree is added to the ensemble
- $y_{label} = \text{difference or negative gradient}$, which becomes the target for the next iteration
- shrinkage version
- replace the update step $y_1^n = y_1^{n-1} + y_n$ in the loop with $y_1^n = y_1^{n-1} + \text{step} \cdot y_n$; everything else is the same
- requires more trees to reach the same training error, and therefore more computational resources
- empirically helps prevent overfitting, though there is no theory proving it
- other variants: stochastic sampling of features and bootstrap sampling of the training rows
- Reference
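To tie the loop and the shrinkage step together, below is a minimal sketch for the squared-error case, where the negative gradient equals the difference (residual); the function names and the use of scikit-learn's DecisionTreeRegressor are my own illustration, not a specific GBDT implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y_actual, n_trees=100, step=0.1, max_depth=3):
    """Gradient boosting for squared error: each tree fits the current residual."""
    pred = np.zeros_like(y_actual, dtype=float)  # y_1^{n-1}: ensemble prediction so far
    trees = []
    for _ in range(n_trees):
        y_label = y_actual - pred                # residual = negative gradient of the L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y_label)                     # new tree y_n learned from the difference
        pred += step * tree.predict(X)           # shrinkage: y_1^n = y_1^{n-1} + step * y_n
        trees.append(tree)
    return trees

def gbdt_predict(trees, X, step=0.1):
    return step * sum(tree.predict(X) for tree in trees)

# the stochastic variant mentioned above would additionally subsample rows (bootstrap)
# and/or columns (features) before fitting each tree
```

Setting `step = 1.0` recovers the plain, non-shrinkage loop.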
- XGBOOST
- makes use of a second-order Taylor expansion of the loss
- the second-order expansion needs fewer iterations to converge (a Newton-style step instead of a plain gradient step), which speeds up training
- adds two regularization terms to prevent the trees from overfitting
- the objective (loss) function: $\text{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k)$
- the regularization function: $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$, where $T$ is the number of leaves and $w_j$ are the leaf weights
- where $\gamma$ and $\lambda$ are two hyperparameters: $\gamma$ acts as a threshold on the gain required to split a node, and $\lambda$ is a smoothing factor on the leaf weights (see the worked expansion after the references below)
- Reference - Usage
- Reference - Induction
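For reference, a sketch of the standard second-order derivation behind the two regularization points above, where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction:

```latex
% second-order Taylor expansion of the objective around the current prediction
\mathrm{Obj}^{(t)} \approx \sum_i \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big]
                   + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% optimal weight of leaf j (G_j, H_j sum g_i, h_i over the samples in leaf j);
% \lambda sits in the denominator, i.e. it smooths/shrinks the leaf weights
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% gain of a candidate split; \gamma is the threshold the gain must exceed
\mathrm{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
                - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```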
- Features I don’t understand
- how parallel processing is implemented in XGBoost
- Its relationship to GBDT
- How it handles missing features
- If 2nd order derivative is used, why not 3rd order?
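Finally, a minimal usage sketch with the `xgboost` Python package (parameter values are arbitrary examples), mapping the quantities above onto its parameters: $\gamma \to$ `gamma`, $\lambda \to$ `reg_lambda`, the shrinkage step $\to$ `learning_rate`, and the stochastic sampling variants $\to$ `subsample` / `colsample_bytree`.

```python
import numpy as np
from xgboost import XGBRegressor

# toy regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = XGBRegressor(
    n_estimators=100,      # number of boosted trees
    learning_rate=0.1,     # shrinkage step
    max_depth=3,
    gamma=1.0,             # minimum split gain (the threshold gamma above)
    reg_lambda=1.0,        # L2 smoothing on the leaf weights (lambda above)
    subsample=0.8,         # stochastic sampling of rows
    colsample_bytree=0.8,  # stochastic sampling of features
)
model.fit(X, y)
print(model.predict(X[:5]))
```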