利用者:Kidotaka/sandbox

勾配ブースティング（英: Gradient boosting）

勾配ブースティングは、回帰および分類問題のための機械学習技術です。これは、弱い予測モデル、通常は決定木の集合の形で予測モデルを作成します。他のブースティング手法と同様に段階的な方法でモデルを構築し、任意の微分可能損失関数の最適化を可能にすることでそれらを一般化します。

勾配ブースティングのアイデアは、ブースティングは適切な損失関数の最適化アルゴリズムとして解釈できるというLeo Breiman（英語版）による観察から生まれました。 ^[1] 明示的な回帰勾配ブースティングアルゴリズムは、Llew Mason、Jonathan Baxter、Peter BartlettおよびMarcus Frean ^[2]^[3] のより一般的な関数勾配ブースティングの観点と同時に、Jerome H. Friedmanによって開発されました。 ^[4] ^[5] 後者の2つの論文は、反復的な「関数勾配降下」アルゴリズムとしてのブースティングアルゴリズムの見方を紹介した。つまり、負の勾配方向を向く関数（弱い仮説）を繰り返し選択することによって、関数空間上でコスト関数を最適化するアルゴリズムです。ブースティングのこの機能的勾配ビューは、回帰と分類を超えた機械学習と統計の多くの分野でブースティングアルゴリズムの開発をもたらしました。

非公式の紹介

(この節では、Liによる勾配ブースティングについて説明します。^[6])

他のブースティング法のように、勾配ブースティングは弱い「学習器」を単一の強い学習器に反復的に結合します。 ${\hat {y}}=F(x)$ の値を予測するようにモデル $F$ に「教える」ことが目標である最小二乗法による回帰設定で説明するのが最も簡単です。平均二乗誤差 ${\tfrac {1}{n}}\sum _{i}({\hat {y}}_{i}-y_{i})^{2}$ を最小化することによって、ここで、 $i$ は、出力変数 $y$ の実際の値のサイズ $n$ のトレーニングセットに対してのインデックスです。

At each stage $m$ , $1\leq m\leq M$ , of gradient boosting, it may be assumed that there is some imperfect model $F_{m}$ (at the outset, a very weak model that just predicts the mean $y$ in the training set could be used). The gradient boosting algorithm improves on $F_{m}$ by constructing a new model that adds an estimator $h$ to provide a better model: $F_{m+1}(x)=F_{m}(x)+h(x)$ . To find $h$ , the gradient boosting solution starts with the observation that a perfect $h$ would imply

F_{m+1}(x)=F_{m}(x)+h(x)=y

または、同等に

h(x)=y-F_{m}(x)

.

Therefore, gradient boosting will fit $h$ to the residual $y-F_{m}(x)$ . As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_{m}$ . A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals $y-F(x)$ for a given model are the negative gradients (with respect to $F(x)$ ) of the squared error loss function ${\frac {1}{2}}(y-F(x))^{2}$ . So, gradient boosting is a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.

アルゴリズム

多くの教師あり学習では、一つの出力変数 $y$ と入力変数 $x$ のベクター described via a joint probability distribution $P(x,y)$ . Using a training set $\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}$ of known values of $x$ and corresponding values of $y$ , the goal is to find an approximation ${\hat {F}}(x)$ to a function $F(x)$ that minimizes the expected value of some specified loss function（英語版） $L(y,F(x))$ :

{\hat {F}}={\underset {F}{\arg \min }}\,\mathbb {E} _{x,y}[L(y,F(x))]

.

The gradient boosting method assumes a real-valued $y$ and seeks an approximation ${\hat {F}}(x)$ in the form of a weighted sum of functions $h_{i}(x)$ from some class ${\mathcal {H}}$ , called base (or weak) learners:

{\hat {F}}(x)=\sum _{i=1}^{M}\gamma _{i}h_{i}(x)+{\mbox{const}}

.

In accordance with the empirical risk minimization（英語版） principle, the method tries to find an approximation ${\hat {F}}(x)$ that minimizes the average value of the loss function on the training set, i.e., minimizes the empirical risk. It does so by starting with a model, consisting of a constant function $F_{0}(x)$ , and incrementally expands it in a greedy fashion:

F_{0}(x)={\underset {\gamma }{\arg \min }}{\sum _{i=1}^{n}{L(y_{i},\gamma )}}

,

F_{m}(x)=F_{m-1}(x)+{\underset {h_{m}\in {\mathcal {H}}}{\operatorname {arg\,min} }}\left[{\sum _{i=1}^{n}{L(y_{i},F_{m-1}(x_{i})+h_{m}(x_{i}))}}\right]

,

where $h_{m}\in {\mathcal {H}}$ is a base learner function.

Unfortunately, choosing the best function $h$ at each step for an arbitrary loss function $L$ is a computationally infeasible optimization problem in general. Therefore, we restrict our approach to a simplified version of the problem.

The idea is to apply a steepest descent step to this minimization problem. If we considered the continuous case, i.e. where ${\mathcal {H}}$ is the set of arbitrary differentiable functions on $\mathbb {R}$ , we would update the model in accordance with the following equations

F_{m}(x)=F_{m-1}(x)-\gamma _{m}\sum _{i=1}^{n}{\nabla _{F_{m-1}}L(y_{i},F_{m-1}(x_{i}))},

\gamma _{m}={\underset {\gamma }{\arg \min }}{\sum _{i=1}^{n}{L\left(y_{i},F_{m-1}(x_{i})-\gamma \nabla _{F_{m-1}}L(y_{i},F_{m-1}(x_{i}))\right)}},

where the derivatives are taken with respect to the functions $F_{i}$ for $i\in \{1,..,m\}$ . In the discrete case however, i.e. when the set ${\mathcal {H}}$ is finite, we choose the candidate function $h$ closest to the gradient of $L$ for which the coefficient $γ$ may then be calculated with the aid of line search on the above equations. Note that this approach is a heuristic and therefore doesn't yield an exact solution to the given problem, but rather an approximation. In pseudocode, the generic gradient boosting method is:^[4]^[7]

Input: training set $\{(x_{i},y_{i})\}_{i=1}^{n},$ a differentiable loss function $L(y,F(x)),$ number of iterations $M$ .

アルゴリズム:

定数によるモデル初期化:
$F_{0}(x)={\underset {\gamma }{\arg \min }}\sum _{i=1}^{n}L(y_{i},\gamma ).$
For m = 1 to M:
1. いわゆる 擬似残差の計算:
  $r_{im}=-\left[{\frac {\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}}\right]_{F(x)=F_{m-1}(x)}\quad {\mbox{for }}i=1,\ldots ,n.$
2. ベース学習器の学習 (e.g. 決定木) $h_{m}(x)$ to pseudo-residuals, i.e. train it using the training set $\{(x_{i},r_{im})\}_{i=1}^{n}$ .
3. Compute multiplier $\gamma _{m}$ by solving the following one-dimensional optimization problem:
  $\gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L\left(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})\right).$
4. Update the model:
  $F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x).$
Output $F_{M}(x).$

勾配ブースティング木

勾配ブースティングは通常、ベース学習器として固定サイズの決定木（特にCART木）で使用されます。この特別な場合のために、Friedmanは各学習器の適合の質を改善する勾配ブースティング方法への修正を提案します。

m番目のステップでの一般的な勾配ブースティングは、決定木 $h_{m}(x)$ を擬似残差に当てはめます。 $J_{m}$ を葉の数とします。ツリーは入力スペースを $J_{m}$ の互いに素な領域構文解析に失敗 (構文エラー): {\displaystyle R_{1m},\ldots,R_{J_{m}m}}} に分割し、各領域の定数値を予測します。指示関数を使用して、入力xに対する $h_{m}(x)$ の出力は合計として書くことができます。

Generic gradient boosting at the m-th step would fit a decision tree $h_{m}(x)$ to pseudo-residuals. Let $J_{m}$ be the number of its leaves. The tree partitions the input space into $J_{m}$ disjoint regions $R_{1m},\ldots ,R_{J_{m}m}$ and predicts a constant value in each region. Using the indicator notation, the output of $h_{m}(x)$ for input x can be written as the sum:

h_{m}(x)=\sum _{j=1}^{J_{m}}b_{jm}\mathbf {1} _{R_{jm}}(x),

where $b_{jm}$ is the value predicted in the region $R_{jm}$ .^[8]

Then the coefficients $b_{jm}$ are multiplied by some value $\gamma _{m}$ , chosen using line search so as to minimize the loss function, and the model is updated as follows:

F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x),\quad \gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})).

Friedman proposes to modify this algorithm so that it chooses a separate optimal value $\gamma _{jm}$ for each of the tree's regions, instead of a single $\gamma _{m}$ for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients $b_{jm}$ from the tree-fitting procedure can be then simply discarded and the model update rule becomes:

F_{m}(x)=F_{m-1}(x)+\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbf {1} _{R_{jm}}(x),\quad \gamma _{jm}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{x_{i}\in R_{jm}}L(y_{i},F_{m-1}(x_{i})+\gamma ).

木のサイズ

$J$ , the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction（英語版） between variables in the model. With $J=2$ (decision stump（英語版）s), no interaction between variables is allowed. With $J=3$ the model may include effects of the interaction between up to two variables, and so on.

Hastie et al.^[7] comment that typically $4\leq J\leq 8$ work well for boosting and results are fairly insensitive to the choice of $J$ in this range, $J=2$ is insufficient for many applications, and $J>10$ is unlikely to be required.

正則化

Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.

Another regulurization parameter is the depth of the trees. The higher this value the more likely the model will overfit the training data.

Shrinkage

An important part of gradient boosting method is regularization by shrinkage which consists in modifying the update rule as follows:

F_{m}(x)=F_{m-1}(x)+\nu \cdot \gamma _{m}h_{m}(x),\quad 0<\nu \leq 1,

where parameter（英語版） $\nu$ is called the "learning rate".

Empirically it has been found that using small learning rate（英語版）s (such as $\nu <0.1$ ) yields dramatic improvements in models' generalization ability over gradient boosting without shrinking ( $\nu =1$ ).^[7] However, it comes at the price of increasing computational time（英語版） both during training and querying: lower learning rate requires more iterations.

確率的勾配ブースティング

Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman（英語版）'s bootstrap aggregation ("bagging") method.^[5] Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement.^[9] Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.

Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Friedman^[5] obtained that $0.5\leq f\leq 0.8$ leads to good results for small and moderate sized training sets. Therefore, f is typically set to 0.5, meaning that one half of the training set is used to build each base learner.

Also, like in bagging, subsampling allows one to define an out-of-bag error（英語版） of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.^[10]^[11]

Number of observations in leaves

Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in the R gbm package^[10]). It is used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances.

Imposing this limit helps to reduce variance in predictions at leaves.

Penalize Complexity of Tree

Another useful regularization techniques for gradient boosted trees is to penalize model complexity of the learned model.^[12] The model complexity can be defined as the proportional number of leaves in the learned trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm to remove branches that fail to reduce the loss by a threshold. Other kinds of regularization such as an $\ell _{2}$ penalty on the leaf values can also be added to avoid overfitting.

Usage

Gradient boosting can be used in the field of learning to rank（英語版）. The commercial web search engines Yahoo^[13] and Yandex^[14] use variants of gradient boosting in their machine-learned ranking engines.

Names

The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM).^[4] Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".^[2]^[3] Friedman et al. describe an advancement of gradient boosted models as Multiple Additive Regression Trees (MART);^[15] Elith et al. describe that approach as "Boosted Regression Trees" (BRT).^[16]

A popular open-source implementation for R calls it a "Generalized Boosting Model",^[10] however packages expanding this work use BRT.^[17] Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.^[要出典]

参考文献

^ Breiman, L. (June 1997). “Arcing The Edge”. Technical Report 486 (Statistics Department, University of California, Berkeley).
^ ^a ^b Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (1999). "Boosting Algorithms as Gradient Descent" (PDF). In S.A. Solla and T.K. Leen and K. Müller (ed.). Advances in Neural Information Processing Systems 12. MIT Press. pp. 512–518.
^ ^a ^b Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space.
^ ^a ^b ^c Friedman, J. H. (February 1999). Greedy Function Approximation: A Gradient Boosting Machine.
^ ^a ^b ^c Friedman, J. H. (March 1999). Stochastic Gradient Boosting.
^ Cheng Li. “A Gentle Introduction to Gradient Boosting”. 2019年5月2日閲覧。}
^ ^a ^b ^c Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). “10. Boosting and Additive Trees”. The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384. ISBN 978-0-387-84857-0. オリジナルの2009-11-10時点におけるアーカイブ。
^ Note: in case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient $b_{jm}$ for the region $R_{jm}$ is equal to just the value of output variable, averaged over all training instances in $R_{jm}$ .
^ Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.
^ ^a ^b ^c Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package.
^ Learn Gradient Boosting Algorithm for better predictions (with codes in R)
^ Tianqi Chen. Introduction to Boosted Trees
^ Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking Archived 2010-08-07 at the Wayback Machine., page 14.
^ Yandex corporate blog entry about new ranking model "Snezhinsk" (in Russian)
^ Friedman, Jerome (2003). “Multiple Additive Regression Trees with Application in Epidemiology”. Statistics in Medicine 22 (9): 1365–1381. doi:10.1002/sim.1501. PMID 12704603.
^ Elith, Jane (2008). “A working guide to boosted regression trees”. Journal of Animal Ecology 77 (4): 802–813. doi:10.1111/j.1365-2656.2008.01390.x. PMID 18397250.
^ “Boosted Regression Trees for ecological modeling”. CRAN. CRAN. 31 August 2018閲覧。

外部リンク

How to explain gradient boosting

[Breiman1997-1] Breiman, L. (June 1997). “Arcing The Edge”. Technical Report 486 (Statistics Department, University of California, Berkeley).

[MasonBaxterBartlettFrean1999a-2] Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (1999). "Boosting Algorithms as Gradient Descent" (PDF). In S.A. Solla and T.K. Leen and K. Müller (ed.). Advances in Neural Information Processing Systems 12. MIT Press. pp. 512–518.

[MasonBaxterBartlettFrean1999b-3] Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space.

[Friedman1999a-4] Friedman, J. H. (February 1999). Greedy Function Approximation: A Gradient Boosting Machine.

[Friedman1999b-5] Friedman, J. H. (March 1999). Stochastic Gradient Boosting.

[6] Cheng Li. “A Gentle Introduction to Gradient Boosting”. 2019年5月2日閲覧。}

[hastie-7] Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). “10. Boosting and Additive Trees”. The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384. ISBN 978-0-387-84857-0. オリジナルの2009-11-10時点におけるアーカイブ。

[8] Note: in case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient $b_{jm}$ for the region $R_{jm}$ is equal to just the value of output variable, averaged over all training instances in $R_{jm}$ .

[9] Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.

[gbm-vignette-10] Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package.

[11] Learn Gradient Boosting Algorithm for better predictions (with codes in R)

[12] Tianqi Chen. Introduction to Boosted Trees

[13] Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking Archived 2010-08-07 at the Wayback Machine., page 14.

[snezhinsk-14] Yandex corporate blog entry about new ranking model "Snezhinsk" (in Russian)

[15] Friedman, Jerome (2003). “Multiple Additive Regression Trees with Application in Epidemiology”. Statistics in Medicine 22 (9): 1365–1381. doi:10.1002/sim.1501. PMID 12704603.

[16] Elith, Jane (2008). “A working guide to boosted regression trees”. Journal of Animal Ecology 77 (4): 802–813. doi:10.1111/j.1365-2656.2008.01390.x. PMID 18397250.

[17] “Boosted Regression Trees for ecological modeling”. CRAN. CRAN. 31 August 2018閲覧。

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]