In this post we discuss different kinds of decision trees.

There are two basic types of decision tree, plus several ensemble methods built on top of them.

  • Classification Tree
    • The simplest kind. See Bootstrap and Bagging for details.
  • Regression Tree
    • Makes use of variance as the splitting criterion (a minimal sketch appears after this list).
  • AdaBoost with decision trees
    • same as random forest, except that the weights of each classifier and of each sample are re-calculated at every iteration (a sketch of the weight updates appears after this list)
  • GBDT (Gradient Boosted Decision Tree), also known as MART (Multiple Additive Regression Trees)
    • think of it as a residual net: each new tree fits what the previous trees got wrong
    • two kinds of implementation
      • the new tree is trained on the negative gradient of the loss
      • the new tree is trained on the difference (residual)
    • loop procedure (sketched in code after this list); initial setup: $y_{label} = y_{actual}$
      • $y_n = y_{label} - \hat{y}_{n-1}$, where $\hat{y}_{n-1}$ represents the ensemble prediction of all previous trees, and $y_n$ is the new tree learned from either the difference or the gradient
      • $\hat{y}_n = \hat{y}_{n-1} + y_n$, the new tree is then added to the ensemble
      • $y_{label}$ = the difference (residual) or the negative gradient, which becomes the target for the next iteration
    • shrinkage version
      • replace step 2 in the loop with $\hat{y}_n = \hat{y}_{n-1} + \mathrm{step} \cdot y_n$, where $\mathrm{step}$ is the learning rate; everything else is the same
      • requires more computational resources, since more trees are needed to reach the same training error
      • helps prevent overfitting, although there is no theory that proves it
    • other variant: stochastic sampling of features and bootstrap sampling.
    • Reference
  • XGBoost
    • makes use of a 2nd-order Taylor expansion of the loss
    • the 2nd-order expansion reduces the number of iterations needed and therefore speeds up training
    • adds two regularization terms to prevent the tree from overfitting
    • [image: the loss function]
    • [image: the regularization function]
    • where γ and λ are two hyperparameters: γ is a threshold value and λ is a smoothing factor (a sketch of the resulting leaf weight and split gain appears after this list)
    • Reference - Usage
    • Reference - Induction
    • Features I don’t understand
      • how parallel processing is implemented in XGBoost
      • Its relationship to GBDT
      • How it handles missing features
      • If the 2nd-order derivative is used, why not the 3rd order?
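
For the regression tree's variance criterion above, here is a minimal sketch (toy code of my own, single-feature case, function names are mine): a candidate split is scored by how much it reduces the variance of the target, and the threshold with the largest reduction wins.

```python
import numpy as np

def variance_reduction(y, y_left, y_right):
    # drop in weighted variance achieved by splitting y into two children
    n, n_l, n_r = len(y), len(y_left), len(y_right)
    return np.var(y) - (n_l / n) * np.var(y_left) - (n_r / n) * np.var(y_right)

def best_split(x, y):
    # scan a single feature for the threshold with the largest variance reduction
    best_gain, best_thr = -np.inf, None
    for thr in np.unique(x)[:-1]:
        mask = x <= thr
        gain = variance_reduction(y, y[mask], y[~mask])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```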
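
For the AdaBoost item, a toy sketch of the standard discrete AdaBoost weight updates (not the comparison with random forest itself): it assumes labels in {-1, +1} and uses depth-1 scikit-learn trees as weak classifiers; both the per-sample weights and the per-classifier weight alpha are re-computed at every iteration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # y must be encoded as +1 / -1
    n = len(y)
    w = np.full(n, 1.0 / n)                    # sample weights, updated every round
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)         # misclassified samples get heavier
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```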
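
The GBDT loop and its shrinkage variant, sketched for squared loss, where the negative gradient is exactly the residual, so the "difference based" and "gradient based" implementations coincide. This is toy code of my own on top of scikit-learn regression trees, with `step` as the shrinkage factor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, step=0.1, max_depth=3):
    f = np.full(len(y), y.mean())         # y_hat_0: initial ensemble prediction
    base = float(f[0])
    trees = []
    for _ in range(n_trees):
        residual = y - f                  # y_label - y_hat_{n-1}; negative gradient for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + step * tree.predict(X)    # shrinkage: y_hat_n = y_hat_{n-1} + step * y_n
        trees.append(tree)
    return base, trees

def gbdt_predict(base, trees, X, step=0.1):
    return base + step * sum(t.predict(X) for t in trees)
```

Setting `step = 1` recovers the plain loop; the stochastic variant mentioned above would additionally subsample rows and features before fitting each tree.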
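
For the XGBoost objective, a small sketch of the closed forms that fall out of the 2nd-order Taylor expansion with the standard regularizer $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$ (T = number of leaves, $w_j$ = leaf weights). The formulas follow the XGBoost paper; the code and variable names are mine. Here `g` and `h` are the per-sample first and second derivatives of the loss with respect to the current prediction. This is also the concrete sense in which γ acts as a threshold and λ as a smoothing factor.

```python
import numpy as np

def leaf_weight(g, h, lam):
    # optimal leaf weight from the 2nd-order expansion: w* = -G / (H + lambda);
    # lambda in the denominator shrinks the weight toward zero (the "smoothing")
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # gain of a candidate split; gamma is subtracted, so the split is only
    # worth keeping if the structural improvement exceeds gamma (the "threshold")
    def score(g, h):
        return np.sum(g) ** 2 / (np.sum(h) + lam)
    parent = score(np.concatenate([g_left, g_right]),
                   np.concatenate([h_left, h_right]))
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma
```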