Theoretical foundations of boosting: equivalence with the forward stagewise algorithm

### Forward stagewise

$$f(x) = \sum\limits_{m=1}^M \beta_m b(x; \gamma_m)$$
in which $b(x; \gamma_m)$ is a base model, $\gamma_m$ is its parameters, and $\beta_m$ is its coefficient.

To minimize the loss function we use the forward stagewise algorithm, which optimizes one base model and its coefficient at a time, treating the models already trained as constants. That is the core idea of this algorithm:

Input: Training data $T = \lbrace (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(N)}, y^{(N)}) \rbrace$, loss function $L(y, f(x))$, and base model $b(x; \gamma)$
Output: additive model $f(x)$

1. Initialize $f_0(x) = 0$
2. For $m = 1,2,\cdots,M$:
(1) Minimize the loss function:
$$(\beta_m, \gamma_m) = \arg\min\limits_{\beta, \gamma} \sum\limits_{i=1}^N L\left(y^{(i)}, f_{m-1}(x^{(i)}) + \beta b(x^{(i)};\gamma)\right)$$
(2) Update $f_m(x)$:
$$f_m(x) = f_{m-1}(x) + \beta_m b(x;\gamma_m)$$
3. Output the additive model:
$$f(x) = f_M(x) = \sum\limits_{m=1}^M \beta_m b(x; \gamma_m)$$
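As a concrete sketch of the two steps above, here is forward stagewise fitting with squared loss and regression stumps as the base model $b(x; \gamma)$. With squared loss, step (1) reduces to fitting the current residual $y - f_{m-1}(x)$, and $\beta_m$ can be absorbed into the stump's leaf values. All names below are illustrative, not from any library:

```python
import numpy as np

def fit_stump(x, r):
    """Choose gamma = (threshold, left, right) minimizing squared loss on residual r."""
    best_loss, best = np.inf, None
    for t in np.unique(x):
        mask = x < t
        if mask.all() or not mask.any():
            continue  # skip splits that put all samples on one side
        left, right = r[mask].mean(), r[~mask].mean()
        pred = np.where(mask, left, right)
        loss = ((r - pred) ** 2).sum()
        if loss < best_loss:
            best_loss, best = loss, (t, left, right)
    return best

def forward_stagewise(x, y, M=10):
    """f_m = f_{m-1} + b(x; gamma_m): each stump fits the residual of the
    models already trained, which are held constant (beta_m is absorbed
    into the stump's leaf values)."""
    f = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(M):
        t, left, right = fit_stump(x, y - f)   # step (1): minimize loss given f_{m-1}
        f += np.where(x < t, left, right)      # step (2): update f_m
        stumps.append((t, left, right))
    return f, stumps
```

On a toy step-shaped dataset such as `x = [0, 1, 2, 3]`, `y = [1, 1, 3, 3]`, a single stump already drives the residual to zero, and later rounds add zero-valued stumps.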

### Equivalence of AdaBoost and forward stagewise

When using an additive model with the exponential loss function, the forward stagewise algorithm is equivalent to AdaBoost. Here is the proof:

The exponential loss function is:
$$L(y, f(x)) = \exp(-yf(x))$$
In the $m$th iteration, we have:
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x), \\ \text{in which } f_{m-1}(x) = \sum\limits_{j=1}^{m-1} \alpha_j G_j(x)$$
To minimize
\begin{align} (\alpha_m, G_m) & = \arg\min\limits_{\alpha, G} \sum\limits_{i=1}^N \exp \left\lbrace -y^{(i)}\left(f_{m-1}(x^{(i)}) + \alpha G(x^{(i)})\right)\right\rbrace \\ & = \arg\min\limits_{\alpha, G} \sum\limits_{i=1}^N w_{mi} \exp \left\lbrace -y^{(i)} \alpha G(x^{(i)}) \right\rbrace \end{align}
in which $w_{mi} = \exp(-y^{(i)} f_{m-1}(x^{(i)}))$, which depends on neither $\alpha$ nor $G$.

To prove that the $(\alpha_m^*, G_m^*)$ obtained here is exactly the $(\alpha_m, G_m)$ in AdaBoost:

1. For any $\alpha > 0$, the $G_m^*(x)$ that minimizes the loss function is the one with the minimum weighted error rate. Since $y^{(i)}, G(x^{(i)}) \in \lbrace -1, +1 \rbrace$, each term $w_{mi} \exp(-y^{(i)} \alpha G(x^{(i)}))$ equals $w_{mi} e^{-\alpha}$ when the sample is classified correctly and $w_{mi} e^{\alpha}$ otherwise, so the loss decreases as the weighted error decreases. Hence:
$$G_m^*(x) = \arg\min\limits_{G} \sum\limits_{i=1}^N w_{mi} I(y^{(i)} \neq G(x^{(i)}))$$
That is exactly what we train the base model to do, so $G_m^*(x) = G_m(x)$.
2. Then calculate $\alpha_m$:
\begin{align} L(\alpha, G_m) & = \sum\limits_{i=1}^N w_{mi} \exp \left\lbrace -y^{(i)} \alpha G_m(x^{(i)}) \right\rbrace \\ & = e^{-\alpha} \sum\limits_{=} w_{mi} + e^{\alpha} \sum\limits_{\neq} w_{mi} \\ & = e^{-\alpha} \sum\limits w_{mi} - e^{-\alpha} \sum\limits_{\neq} w_{mi} + e^{\alpha} \sum\limits_{\neq} w_{mi} \\ & = (e^{\alpha} - e^{-\alpha}) \sum\limits_{\neq} w_{mi} + e^{-\alpha} \sum\limits w_{mi} \end{align}
in which $\sum\limits_{\neq}$ abbreviates $\sum\limits_{y^{(i)} \neq G_m(x^{(i)})}$, $\sum\limits_{=}$ abbreviates $\sum\limits_{y^{(i)} = G_m(x^{(i)})}$, and $\sum$ abbreviates $\sum\limits_{i=1}^N$.
Setting the partial derivative to zero, we have:
$$\frac {\partial L(\alpha, G_m)} {\partial \alpha} = (e^{\alpha} + e^{-\alpha}) \sum\limits_{\neq} w_{mi} - e^{-\alpha} \sum\limits w_{mi} = 0$$
i.e.
$$(e^{\alpha} + e^{-\alpha}) \frac{\sum\limits_{\neq} w_{mi}}{\sum\limits w_{mi}} - e^{-\alpha} = 0$$
i.e.
$$(e^{\alpha} + e^{-\alpha}) \epsilon_m - e^{-\alpha} = 0$$
which gives
$$\alpha_m^* = \frac{1}{2}\ln \frac{1-\epsilon_m}{\epsilon_m}$$
in which $\epsilon_m = \frac{\sum\limits_{\neq} w_{mi}}{\sum\limits w_{mi}}$, which is the weighted error rate of $G_m^*(x)$.
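A quick numeric sanity check of this closed form (toy numbers, not from the text): minimizing the decomposed loss $L(\alpha) = (e^{\alpha} - e^{-\alpha})\sum_{\neq} w_{mi} + e^{-\alpha}\sum w_{mi}$ by grid search recovers $\frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}$:

```python
import numpy as np

# Toy values (assumptions): total weight 1, weighted error rate eps = 0.3.
S_all, eps = 1.0, 0.3
S_wrong = eps * S_all

# The decomposed loss L(alpha) from the derivation above.
L = lambda a: (np.exp(a) - np.exp(-a)) * S_wrong + np.exp(-a) * S_all

# Closed-form minimizer: alpha* = 0.5 * ln((1 - eps) / eps).
a_star = 0.5 * np.log((1 - eps) / eps)

# Grid search agrees with the closed form to within the grid spacing.
grid = np.linspace(-2.0, 2.0, 2001)
a_grid = grid[np.argmin(L(grid))]
```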

Finally, we see that the $(\alpha_m, G_m)$ obtained here is exactly the same as in AdaBoost. Moreover, by definition $w_{m+1,i} = \exp(-y^{(i)} f_m(x^{(i)})) = w_{mi} \exp(-y^{(i)} \alpha_m G_m(x^{(i)}))$, which matches AdaBoost's weight update up to normalization, completing the equivalence.
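Putting the pieces together, here is a minimal AdaBoost sketch for 1-D data with sign-stump base classifiers, using exactly the $\alpha_m$ formula and the weight update from the derivation above (all names are illustrative, not a library API):

```python
import numpy as np

def adaboost_1d(x, y, M=10):
    """Minimal AdaBoost on 1-D inputs with labels y in {-1, +1}.
    Base classifiers are sign stumps G(x) = sign(s * (x - t))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initial weights w_{1i} = 1/N
    xs = np.unique(x)
    # Candidate thresholds: one below all points, plus midpoints between values.
    thresholds = np.r_[xs[0] - 1.0, (xs[:-1] + xs[1:]) / 2.0]
    model = []
    for _ in range(M):
        # Step 1: pick G_m minimizing the weighted error rate eps_m.
        eps, t, s = min((w[np.sign(s * (x - t)) != y].sum(), t, s)
                        for t in thresholds for s in (-1.0, 1.0))
        eps = max(eps, 1e-12)                      # guard log(0) on separable data
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_m = (1/2) ln((1-eps)/eps)
        G = np.sign(s * (x - t))
        w = w * np.exp(-alpha * y * G)             # w_{m+1,i} ∝ w_{mi} exp(-y alpha G)
        w /= w.sum()                               # AdaBoost's normalization
        model.append((alpha, t, s))
    return model

def adaboost_predict(model, x):
    """f(x) = sum_m alpha_m G_m(x); classify by sign(f(x))."""
    f = sum(a * np.sign(s * (x - t)) for a, t, s in model)
    return np.sign(f)
```

On a linearly separable toy set such as `x = [0..5]` with labels `[+1, +1, +1, -1, -1, -1]`, the first round already finds a zero-error stump at the class boundary, and the combined classifier reproduces the labels.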