Boosting(2) - Adaboost and Forward Stagewise

Boosting理论基础: 和前向分步算法的等价性

Forward stagewise

Consider an additive model like adaboost:
f(x) = \sum\limits_{m=1}^M \beta_m b(x; \gamma_m)
in which $b(x; \gamma_m)$ is the base model, $\gamma_m$ is the model’s parameters, $\beta_m$ is it’s coefficient.

To minimim the loss function, we have the forward stagewise algorithm, which minimize one model & coefficient at a time, that is, take the models already trained as constant, that’s the core concept of this algorithm:

Input: Training data $T = \lbrace (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(N)}, y^{(N)}) \rbrace$, lost function$L(y, f(x))$ and base model $b(x; \gamma)$
Output: additive model $f(x)$

  1. Initialize $f_0(x) = 0$
  2. For $m = 1,2,\cdots,M$:
    (1) Minimize loss function:
    $$ (\beta_m, \gamma_m) = arg\min\limits_{\beta, \gamma} \sum\limits_{i=1}^N L(y^{(i)}, f_{m-1}(x^{(i)}) + \beta b(x^{(i)};\gamma)) $$
    (2) Update $f(x)$:
    $$ f_m(x) = f_{m-1}(x) + \beta_m b(x^{(i)};\gamma_m)$$
  3. Get additive model:
    $$ f(x) = \sum\limits_{m=1}^M \beta_m b(x; \gamma_m) $$

Equivalence of adaboost and Forward stagewise

When using additive model & exponential loss function, forward stagewise algorithm is equivalent to adaboost. Here is the proof:

Assume the exponential loss function is:
$$ L(y, f(x)) = exp(-yf(x)) $$
In the $m$th iteration, we have:
f_m(x) = f_{m-1}(x) + \alpha_mG_m(x), \\
\text{in which } f_{m-1}(x) = \sum\limits_{m=1}^{M-1} \alpha_mG_m(x)
To minimize
(\alpha_m, G_m) & = arg\min\limits_{\alpha, G} \sum\limits_{i=1}^N exp \left\lbrace -y^{(i)}(f_{m-1}(x^{(i)}) + \alpha_m G_m(x^{(i)}))\right\rbrace \\
& = arg\min\limits_{\alpha, G} \sum\limits_{i=1}^N w_{mi} exp \left\lbrace -y^{(i)} \alpha_m G_m(x^{(i)}) \right\rbrace
in which $w_{mi} = exp(-y^{(i)} f_{m-1}(x^{(i)}))$, it depends on neither $\alpha$ nor $G$.

To prove the $\alpha_m^*, G_m^*$ achieved here is exatly the $\alpha_m, G_m$ in adaboost:

  1. For any $\alpha > 0$, $G_m^*(x)$ which maximize the loss function is the one who has the minimum error rate:
    G_m^*(x) = arg\min\limits_{G} \sum\limits_{i=1}^N w_{mi} I(y^{(i)} \neq G(x^{(i)}))
    That’s exatly what we train the base model to do, so $G_m^*(x) = G_m(x)$
  2. Then calculate $\alpha_m$
    L(\alpha, G_m) & = \sum\limits_{i=1}^N w_{mi} exp \left\lbrace -y^{(i)} \alpha G_m(x^{(i)}) \right\rbrace \\
    & = e^{-\alpha} \sum\limits_{=} w_{mi} + e^{\alpha} \sum\limits_{\neq} w_{mi} \\
    & = e^{-\alpha} \sum\limits w_{mi} - e^{-\alpha} \sum\limits_{\neq} w_{mi} + e^{\alpha} \sum\limits_{\neq} w_{mi} \\
    & = (e^{\alpha} - e^{-\alpha}) \sum\limits_{\neq} w_{mi} + e^{-\alpha} \sum\limits w_{mi}
    \end{align} $$
    in which $\sum\limits_{\neq}$ is abbreviation for $\sum\limits_{y^{(i)} \neq G(x^{(i)})}$, $\sum$ is abbreviation for $\sum\limits_{i=1}^N$
    Let the partial derivative=0, we have:
    $$\frac {\partial L(\alpha, G_m)} {\partial \alpha} = (e^{\alpha} + e^{-\alpha}) \sum\limits_{\neq} w_{mi} - e^{-\alpha} \sum\limits w_{mi} = 0$$
    s.t. $$(e^{\alpha} + e^{-\alpha}) \frac{\sum\limits_{\neq} w_{mi}}{\sum\limits w_{mi}} - e^{-\alpha} = 0$$
    s.t. $$(e^{\alpha} + e^{-\alpha}) \epsilon_m - e^{-\alpha} = 0$$
    s.t. $$ \alpha^* = \frac{1}{2}ln \frac{1-\epsilon_m}{\epsilon_m} $$
    in which $\epsilon_m = \frac{\sum\limits_{\neq} w_{mi}}{\sum\limits w_{mi}}$, which is the error rate of $G_m^*(x)$

Finally we could see that the $(\alpha_m, G_m)$ we get here is exatly the same as in adaboost.