
Ensemble learning

Ensemble learning combines multiple weak models to obtain better predictions. Weak learners are models that predict only slightly better than random guessing. By ensembling them in some way, we can achieve better bias and variance than any single one of them. This is also called meta-learning. The three most popular ensemble learning methods are bagging, boosting, and stacking.

  • Bagging is when we train the models in parallel. We sample with replacement to create a new dataset for each model; this method of sampling is called bootstrapping. Random Forest is a bagging method: each tree is trained in parallel on a random subset of the data, and the resulting predictions are averaged or voted on to produce the final classification (aggregation). Hence the name bootstrap aggregation (bagging).

  • Boosting is the method in which we train the models sequentially. Each model fits the error of the previous model, so the next learner improves on the mistakes of the previous learner.

  • Stacking is where we train multiple base models on the same problem and then build a new model on top of their combined outputs to improve performance. In stacking, an algorithm takes the outputs of the base models as input and learns how to best combine them. It is also called stacked generalization: since all base models are considered equally, the models are said to be stacked on top of each other.

In stacking, the outputs of the base models can be combined by linear regression for a regression problem, or by logistic regression for a classification problem. The stacked model does not use the input data directly but rather the learned outputs of its base models.
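As a rough sketch of this setup using scikit-learn (which is also used throughout this post), StackingClassifier trains the base models and then fits a logistic regression on their outputs; the choice of base estimators here is arbitrary and only for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# The base models see the data; the final logistic regression only sees their predictions
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression())
print(cross_val_score(stack, X, y, cv=5).mean())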

There are reasons for the improvement in performance after boosting. First, boosting generates a hypothesis whose error on the training set is small by combining many hypotheses whose errors may be large (but still better than random guessing). It does so by challenging the weak learners with the harder parts of the sample space: it forces the weak learning algorithm to change its hypotheses by changing the distribution over training examples as a function of the errors made by the previous hypotheses. Second, taking a weighted majority over many hypotheses, all of which are trained on different samples drawn from the same training set, can reduce the random variability of the combined hypothesis.

In this post, we will look at popular boosting algorithms: AdaBoost, gradient boosting, and XGBoost.

AdaBoost

AdaBoost is short for adaptive boosting. It uses an iterative approach to learn from the mistakes of weak classifiers and achieve the composite prediction of an equivalent strong one. The weak classifiers that are combined are called base models. The base models are usually very simple trees. When a tree has only two leaves (a single decision split), we call it a decision stump.

The AdaBoost algorithm is as follows. In the initial step, each data point in the dataset is assigned an equal weight. The data is then provided as input to the model, and the wrongly classified points are identified. Those points' weights are increased in the next round, so that they are more likely to be emphasized during training. The model therefore pays more attention to difficult observations and, through this reweighting feedback, gradually learns the tricky cases. Hence the word adaptive.

The boosting algorithm takes a training set of m examples \(S = ((x_1, y_1), ..., (x_m, y_m))\) where \(x_i\) is an instance from X and \(y_i\) is its label. We also have a learning algorithm called WeakLearn, which is called repeatedly for T iterations. In each round t, the booster gives WeakLearn a distribution \(D_t\) over the training set S. WeakLearn computes a classifier or hypothesis \(h_t: X \to Y\) that should correctly classify a fraction of the training set, trying to minimize the error \(\epsilon_t = Pr_{i \sim D_t} {[h_t(x_i) \neq y_i]}\). At the end, the booster combines all the weak hypotheses \(h_1, ..., h_T\) into a single final hypothesis \(h_{fin}\).

The initial distribution \(D_1\) is uniform over S, so \(D_1(i) = 1/m \forall i\). We then compute \(D_{t+1}\) from \(D_t\) and the latest weak hypothesis \(h_t\): we multiply the weight of example i by some number \(\beta_t \in [0,1)\) if \(h_t\) classifies \(x_i\) correctly, otherwise the weight is left unchanged. Then all the weights are normalized. Easy examples therefore end up with less and less weight, and AdaBoost can focus most of its weight on the examples that are hardest for WeakLearn.

\(\beta_t\) is a function of \(\epsilon_t\), and the final hypothesis \(h_{fin}\) is a weighted vote (a weighted linear threshold) of the weak hypotheses: for a given x, \(h_{fin}\) outputs the label y that maximizes the sum of the weights of the weak hypotheses predicting that label. The weight of hypothesis \(h_t\) is \(log(1/\beta_t)\), so that lower-error hypotheses get higher weights.

Here is the full algorithm:

  • Initialize \(D_1(i) = 1/m \forall i\)
  • For t = 1, 2, ..., T:
    • Call WeakLearn, give it distribution \(D_t\)
    • Get back a hypothesis \(h_t: X \to Y\)
    • Calculate the error of \(h_t\): \(\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)\). If \(\epsilon_t > 1/2\), set T = t - 1 and abort the loop
    • Set \(\beta_t = \epsilon_t / (1 - \epsilon_t)\)
    • Update the distribution:

    \(D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \beta_t\) if \(h_t(x_i) = y_i\), and \(D_{t+1}(i) = \frac{D_t(i)}{Z_t}\) otherwise (i.e., the multiplier is 1 for misclassified examples), where \(Z_t\) is the normalization constant.

  • Output the final hypothesis:
\[h_{fin}(x) = argmax_{y \in Y} \sum_{t: h_t(x) = y} log \frac{1}{\beta_t}\]

A key property of AdaBoost is that if WeakLearn's hypotheses are consistently slightly better than random guessing (error below 1/2), the training error of the final hypothesis drops to zero exponentially fast.
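To make the weight update concrete, here is a minimal from-scratch sketch of the algorithm above. It uses scikit-learn decision stumps to play the role of WeakLearn; the helper names adaboost_fit and adaboost_predict are just for illustration, not part of any library.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # D_1(i) = 1/m: start with a uniform distribution over the m examples
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)
    stumps, weights = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)  # WeakLearn: a decision stump
        stump.fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = D[pred != y].sum()                     # weighted error epsilon_t
        if eps == 0 or eps > 0.5:                    # abort if not better than chance
            break
        beta = eps / (1.0 - eps)
        D[pred == y] *= beta                         # shrink weights of correct examples
        D /= D.sum()                                 # normalize by Z_t
        stumps.append(stump)
        weights.append(np.log(1.0 / beta))           # hypothesis weight log(1/beta_t)
    return stumps, weights

def adaboost_predict(stumps, weights, X, classes):
    # Weighted vote: pick the label with the largest total weight of agreeing stumps
    votes = np.zeros((len(X), len(classes)))
    for stump, w in zip(stumps, weights):
        pred = stump.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += w
    return np.asarray(classes)[votes.argmax(axis=1)]

In practice we can simply use scikit-learn's AdaBoostClassifier, as in the example below on the iris dataset: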

from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the iris dataset and split it into train and test sets
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit an AdaBoost classifier of 50 base estimators and evaluate it on the held-out set
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model = abc.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 1.0

Gradient Boosting

In gradient boosting, the base models are short decision trees. It differs from AdaBoost in that each new tree simply tries to predict the error (as measured by the loss function) of the previous model, rather than reweighting the examples. This makes it slower to learn, but it learns better: the combined model is more robust and stronger than the individual trees. We can use mean squared error for a regression task and logarithmic loss for a classification task.

The learning rate controls how much of the error is fitted by the next model: the lower the rate, the slower the models learn. The number of individual trees is also a hyperparameter; the more trees we add, the higher the risk of overfitting.

Let’s consider the first model for input x and output y:

\[y = A_1 + (B_1 * x) + e_1\]

where \(e_1\) is the residual. This \(e_1\) is what the second tree fits:

\[e_1 = A_2 + (B_2 * x) + e_2\]

and \(e_2 = A_3 + (B_3 * x) + e_3\)

and so on. We can have hundreds of trees for a model. When we combine them, the combined model is:

\[y = A_1 + A_2 + A_3 + B_1 * x + B_2 * x + B_3 * x + e_3\]
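To see the residual chaining in code, here is a minimal hand-rolled sketch for a regression task. The synthetic data and the settings (100 trees of depth 2, learning rate 0.1) are made up for illustration; real gradient boosting libraries follow the same recipe with many refinements.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)  # start from a constant (zero) model
trees = []
for _ in range(100):
    residual = y - prediction                        # e_m: what the current ensemble gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)    # each new tree fits the previous residual
    trees.append(tree)
print("training MSE:", np.mean((y - prediction) ** 2))

For classification, we can use scikit-learn's GradientBoostingClassifier directly, as in the example below on the Pima Indians Diabetes dataset: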
# Import all relevant libraries
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")
# Load the dataset 
pima = pd.read_csv('diabetes.csv') 
pima.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(pima.drop('Outcome', axis=1),
                                                    pima['Outcome'], test_size=0.2)

# Scale the data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
# Define Gradient Boosting Classifier with hyperparameters
gbc = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, random_state=100, max_features=5)
# Fit train data to GBC
gbc.fit(X_train_transformed, y_train)
GradientBoostingClassifier(learning_rate=0.05, max_features=5, n_estimators=500,
                           random_state=100)
# Confusion matrix will give number of correct and incorrect classifications
print(confusion_matrix(y_test, gbc.predict(X_test_transformed)))
[[83 15]
 [24 32]]
# Accuracy of model
print("GBC accuracy is %2.2f" % accuracy_score(
    y_test, gbc.predict(X_test_transformed)))
GBC accuracy is 0.75

Stochastic Gradient Boosting

SGB is a hybrid of the boosting and bagging approaches. At each iteration, a random subsample of the dataset is chosen to fit a tree. SGB is also based on a steepest-gradient algorithm, which puts more emphasis on the data points that the current model predicts poorly. Finally, at each iteration, it uses small trees and majority voting to aggregate the prediction. SGB is robust to missing data and outliers.

Consider the training sample \(\{y_i, x_i\}_1^N\). We need to find a function \(H^*(x)\) that maps x to y while minimizing the loss function \(L(y, H(x))\). SGB approximates \(H^*(x)\) as:

\[H(x) = \sum_{m=0}^M \beta_m h(x; a_m)\]

where \(h(x;a_m)\) is the base model (which can be a tree), \(a_m\) are its parameters, and \(\beta_m\) is an expansion coefficient. \(\beta_m\) and \(a_m\) are fitted to minimize the loss function:

\[(\beta_m, a_m) = argmin_{\beta,a} \sum_{i=1}^N L(y_i, H_{m-1} (x_i) + \beta h(x_i; a))\]

and \(H_m(x) = H_{m-1} (x) + \beta_m h(x;a_m)\)

To solve for \((\beta_m, a_m)\), SGB first fits \(h(x;a)\) by least squares:

\[a_m = argmin_{a,\rho} \sum_{i=1}^N {[y_{im} - \rho h(x_i;a)]}^2\]

where \(y_{im} = - {[ \frac{\delta L(y_i, H(x_i))}{\delta H(x_i)} ]}_{H(x) = H_{m-1}(x)}\)

Then we can estimate \(\beta_m = argmin_{\beta} \sum_{i=1}^N L(y_i, H_{m-1}(x_i) + \beta h(x_i;a_m))\)

To improve performance, at each iteration SGB incorporates randomness by selecting a random subsample (drawn without replacement) to fit the regression tree. Tuning the parameters includes tuning M, the total number of trees (for example, from 50 up to 700); the learning rate, which decides how much of the loss is fitted by the next tree (for example, 0.01, 0.05, 0.1, 0.5); and L, the depth of each tree (for example, 3, 5, 7, 9).

To search for the best combination of parameters, we can use 10-fold cross validation, in which the training set is divided into 10 groups: nine are used for fitting and the remaining one for testing.
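As a sketch of how this tuning could look in code, scikit-learn's GradientBoostingClassifier becomes stochastic gradient boosting when subsample is set below 1.0, and GridSearchCV can run the 10-fold search over the ranges suggested above. The grid values are illustrative, and the example reuses the diabetes training split from the gradient boosting section:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 300, 700],      # M, the total number of trees
    "learning_rate": [0.01, 0.05, 0.1, 0.5],  # how much of the loss each tree fits
    "max_depth": [3, 5, 7, 9],                # L, the depth of each tree
}
sgb = GradientBoostingClassifier(subsample=0.5)  # random subsample at each iteration
search = GridSearchCV(sgb, param_grid, cv=10)    # 10-fold cross validation
search.fit(X_train_transformed, y_train)
print(search.best_params_)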

XGBoost

XGBoost stands for Extreme Gradient Boosting; it is a scalable machine learning system for tree boosting, available as an open source package. XGBoost's influence in the machine learning field is quite large: among the 29 challenge-winning solutions published at Kaggle in 2015, 17 used XGBoost. Among those, 8 solely used XGBoost, while most of the others combined XGBoost with neural net ensembles.

A generic XGBoost algorithm is:

  • Initialize the model with a constant: \(f_{(0)}(x) = argmin_{\theta} \sum_{i=1}^N L(y_i, \theta)\)
  • From m=1 to M:
    • Compute the gradients and Hessians: \(g_m(x_i) = {[ \frac{\delta L(y_i, f(x_i))}{\delta f(x_i)} ]}_{f(x) = f_{(m-1)}(x)}\) and \(h_m(x_i) = {[ \frac{\delta^2 L(y_i, f(x_i))}{\delta f(x_i)^2} ]}_{f(x) = f_{(m-1)}(x)}\)
    • Fit a base learner (weak learner, i.e., a tree) using the training set \(\{ x_i, - \frac{g_m(x_i)}{h_m(x_i)} \}_{i=1}^N\), so that: \(\phi_m = argmin_{\phi} \sum_{i=1}^N \frac{1}{2} h_m(x_i) {[ - \frac{g_m(x_i)}{h_m(x_i)} - \phi(x_i) ]}^2\) and \(f_m(x) = \alpha \phi_m(x)\)
    • Update the model: \(f_{(m)} (x) = f_{(m-1)} (x) + f_m (x)\)
  • Output \(f(x) = f_{(M)}(x) = \sum_{m=0}^M f_m(x)\)
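
As a usage sketch (assuming the open source xgboost package is installed, e.g. via pip install xgboost), the scikit-learn-style XGBClassifier can be dropped in on the diabetes split from the earlier example; the hyperparameter values here are arbitrary:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Fit an XGBoost model on the diabetes training split and evaluate it on the test split
xgb = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=3)
xgb.fit(X_train_transformed, y_train)
print("XGBoost accuracy is %2.2f" % accuracy_score(y_test, xgb.predict(X_test_transformed)))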