### Boosting on different Loss Functions

In the last section, we use forward stagewise on an addictive model to solve the optimization of adaboost. The exponential loss function is relatively easy to handle, but other loss functions may not.

Recall from the last section, we have loss function like this:
$$L(y_i, f(x_i))$$

Consider optimizing an arbitary loss function using forward stagewise:

$$(\beta_m, \gamma_m) = arg\min\limits_{\beta, \gamma} \sum\limits_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i;\gamma)) \tag{1}$$

Now since $\beta b(x_i;\gamma)$ is relatively small compared to $f_{m-1}(x_i)$, using Taylor Series we could unfold it around $f_{m-1}(x_i)$:

$$L \approx \frac{1}{N} \sum\limits_{i=1}^N L(y_i, f_{m-1}(x_i)) + \beta \sum\limits_{i=1}^N \left. { \frac{\partial L(y_i, s)}{\partial s} } \right |_{s=f_{m-1}(x_i)} b(x_i;\gamma) \tag{2}$$

Add regularzation to (2) to avoid $b(x_i;\gamma)$ being too big, since we already have $\beta$ to adjust the term’s magnitude:

\begin{align} L & \approx \frac{1}{N} \sum\limits_{i=1}^N L(y_i, f_{m-1}(x_i)) + \frac{\beta}{2} \sum\limits_{i=1}^N \left. { 2 \frac{\partial L(y_i, s)}{\partial s} } \right |_{s=f_{m-1}(x_i)} b(x_i;\gamma) + b^2(x_i;\gamma) \\ \text{(Strip the constants)} & = \beta \sum\limits_{i=1}^N 2 \frac{\partial L}{\partial s} b(x_i;\gamma) + b^2(x_i;\gamma) \\ & = \beta \sum\limits_{i=1}^N (b(x_i;\gamma) + \frac{\partial L}{\partial s})^2 - (\frac{\partial L}{\partial s} )^2 \end{align} \tag{3}

Now we could minimize the loss function using the last section’s results:

1. Optimize $b(x_i;\gamma)$. From (3) we get:
$$\gamma_m = arg\min\limits_\gamma \beta \sum\limits_{i=1}^N \left(b(x_i;\gamma) + \left. { \frac{\partial L(y_i, s)}{\partial s} } \right |_{s=f_{m-1}(x_i)} \right)^2$$
That is, just train the base model $b(x_i;\gamma_m)$ with the loss function’s gradient $-\frac{\partial L(y_i, s)}{\partial s}$ in every step $m$, that’s why it calld gradient boosting:
$$\text{fit } b(x_i;\gamma_m) = - \left. { \frac{\partial L(y_i, s)}{\partial s} } \right |_{s=f_{m-1}(x_i)}$$

2. Optimize $\beta$
$$\beta_m = arg\min\limits_\beta \sum\limits_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i;\gamma_m))$$
Since we already have $y_i$, $f_{m-1}(x_i)$ and $b(x_i;\gamma_m)$, this is just a one-dimensional optimization problem, which is easy to handle.

The full algorithm is similar, see the reference of wiki. The base model used most frequently is decision trees.

Reference:

Wikipedia