In recent months, I have seen an uptick in the number of companies using the well-known XGBoost algorithm. This powerful algorithm allows data scientists to use tree-based methods to easily model non-linearities in predictors. To do this, the algorithm builds a model incrementally, at each step choosing the variable and cut point that minimize a loss function under a set of selected hyperparameters. That sentence is heavy on jargon, but no worries: in this post I will explain what it all means and how to do XGBoost hyperparameter tuning.
Let’s start with tree-based methods. Tree-based methods are a group of machine learning algorithms that take a variable and split it at the point that minimizes a loss function. At each step, the model picks the variable and split that reduce the loss function the most. For example, assume the loss function is the mean squared error, which is the average of the squared differences between the modeled values and the actual values (a common choice for regression problems, whereas cross-entropy is generally used for classification problems), and that we are trying to estimate the number of items sold based on the average temperature during the day. Consider the data below:
Temperature | Items Sold |
43 | 2 |
47 | 4 |
57 | 3 |
60 | 6 |
65 | 1 |
71 | 5 |
74 | 4 |
75 | 9 |
75 | 3 |
81 | 4 |
To begin the algorithm, you pick a cut point between temperatures, assign each row to a group based on that cut point (whether the temperature is above or below the cut point), and then calculate the average items sold for each group. The average for each group becomes the prediction for every row of data in that group. As an example, let's look at a cut point of 58. The first three rows of the above data would belong to the first group and the remaining rows would belong to a second group. The average items sold in the first group is 3, whereas the average number of items sold in the second group is 4.57. The next step is to calculate the loss function, so for each row we take the items sold minus the average for its group, square that difference, and sum the results (you can divide by the number of data points to get an average, but this is inconsequential to the analysis because it does not change which cut point is best). This gives us a sum of squared errors of 39.71 at a cut point of 58, or a mean squared error of 3.97 after dividing by the 10 data points.
Temperature | Items Sold | Group | Squared Error @ 58 |
43 | 2 | 1 | 1.000 |
47 | 4 | 1 | 1.000 |
57 | 3 | 1 | 0.000 |
60 | 6 | 2 | 2.041 |
65 | 1 | 2 | 12.755 |
71 | 5 | 2 | 0.184 |
74 | 4 | 2 | 0.327 |
75 | 9 | 2 | 19.612 |
75 | 3 | 2 | 2.469 |
81 | 4 | 2 | 0.327 |
Total | | | 39.714 |
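The total in the table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `sse_for_cut` is my own, and it assumes that rows with a temperature below the cut point go to the first group and all other rows go to the second.

```python
temperature = [43, 47, 57, 60, 65, 71, 74, 75, 75, 81]
items_sold = [2, 4, 3, 6, 1, 5, 4, 9, 3, 4]


def sse_for_cut(x, y, cut):
    """Sum of squared errors when the rows are split at `cut` and each
    group is predicted by its own average."""
    below = [yi for xi, yi in zip(x, y) if xi < cut]
    above = [yi for xi, yi in zip(x, y) if xi >= cut]
    total = 0.0
    for group in (below, above):
        if group:  # an empty group contributes no error
            mean = sum(group) / len(group)
            total += sum((yi - mean) ** 2 for yi in group)
    return total


print(round(sse_for_cut(temperature, items_sold, 58), 3))  # 39.714
```

Dividing the result by `len(items_sold)` turns the sum into a mean squared error, which ranks cut points in exactly the same order.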
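Now that we understand how the algorithm works, the goal is to pick the cut point that minimizes our loss function. It turns out that the optimal cut point is any value strictly between 65 and 71: no observed temperature falls in that range, so every such cut point produces the same two groups. Any of these cut points results in a sum of squared errors of 36.8. The table below uses a cut point of 66.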
Temperature | Items Sold | Group | Squared Error @ 66 |
43 | 2 | 1 | 1.440 |
47 | 4 | 1 | 0.640 |
57 | 3 | 1 | 0.040 |
60 | 6 | 1 | 7.840 |
65 | 1 | 1 | 4.840 |
71 | 5 | 2 | 0.000 |
74 | 4 | 2 | 1.000 |
75 | 9 | 2 | 16.000 |
75 | 3 | 2 | 4.000 |
81 | 4 | 2 | 1.000 |
Total | | | 36.800 |
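The search for the best cut point can be sketched in Python as well. The code below tries the midpoint between each pair of neighboring temperatures and keeps the one with the smallest sum of squared errors. The helper names are hypothetical, and using midpoints as the candidates is an assumption made for illustration; any value between the same two temperatures gives the same groups.

```python
temperature = [43, 47, 57, 60, 65, 71, 74, 75, 75, 81]
items_sold = [2, 4, 3, 6, 1, 5, 4, 9, 3, 4]


def sse_for_cut(x, y, cut):
    """Sum of squared errors around each group's average for a given cut."""
    total = 0.0
    for group in ([yi for xi, yi in zip(x, y) if xi < cut],
                  [yi for xi, yi in zip(x, y) if xi >= cut]):
        if group:
            mean = sum(group) / len(group)
            total += sum((yi - mean) ** 2 for yi in group)
    return total


def best_cut(x, y):
    """Return (sse, cut) for the candidate cut with the lowest error."""
    values = sorted(set(x))
    # Candidate cuts: midpoints between neighboring distinct temperatures
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]
    return min((sse_for_cut(x, y, cut), cut) for cut in candidates)


sse, cut = best_cut(temperature, items_sold)
print(cut, round(sse, 1))  # 68.0 36.8
```

The midpoint 68 sits between 65 and 71, so it yields the same groups and the same total as the cut point of 66 used in the table.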
The tree-based method will do this same optimization process for any number of variables that are in your model. Now you may be asking yourself, “How do I know which variables to include in my tree and in what order?” This is a great question. A standard decision tree evaluates the candidate cut points of every available variable at each step and picks the variable and cut point that reduce the loss function the most. The algorithm will also consider other cut points of a variable it has already used to see if those cut points reduce the loss function even more. A random forest builds numerous decision trees, each on a random sample of the rows and with only a random subset of the variables considered at each split, and averages them together to get more stable predictions.
One of the biggest advantages of the XGBoost algorithm is that it builds trees sequentially: each new tree selects the variables and cut points that minimize the loss function on the residuals left by the prior trees. A residual is the difference between the actual and predicted value (with a squared error loss, fitting the residuals is the same as following the gradient of the loss). Once a tree has been fit to the residuals, its predictions are scaled by the learning rate before they are added to the model, which slows down how quickly the model adjusts for these residuals. A large learning rate will cause the model to learn quickly, whereas a small learning rate will cause the model to learn more slowly and take in more information from the trees added in subsequent rounds. Generally, a small learning rate can accompany a larger maximum depth parameter and vice versa.
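To make the role of residuals and the learning rate concrete, here is a minimal sketch of plain gradient boosting with one-split trees and a squared error loss, using the temperature data from earlier. It is not XGBoost's actual implementation, which adds regularization and other refinements. The function names are my own, and starting from the overall average is an assumption.

```python
temperature = [43, 47, 57, 60, 65, 71, 74, 75, 75, 81]
items_sold = [2, 4, 3, 6, 1, 5, 4, 9, 3, 4]


def fit_stump(x, targets):
    """Find the single cut on x that minimizes squared error on targets.
    Returns (cut, mean_below, mean_above)."""
    values = sorted(set(x))
    best = None
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        below = [t for xi, t in zip(x, targets) if xi < cut]
        above = [t for xi, t in zip(x, targets) if xi >= cut]
        mean_below = sum(below) / len(below)
        mean_above = sum(above) / len(above)
        sse = sum((t - mean_below) ** 2 for t in below)
        sse += sum((t - mean_above) ** 2 for t in above)
        if best is None or sse < best[0]:
            best = (sse, cut, mean_below, mean_above)
    return best[1:]


def boost(x, y, rounds, learning_rate):
    """Fit one-split trees to the residuals, one round at a time."""
    start = sum(y) / len(y)  # assumed starting prediction: overall average
    predictions = [start] * len(y)
    sse_by_round = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        cut, mean_below, mean_above = fit_stump(x, residuals)
        # Shrink the new tree's output by the learning rate before adding it
        predictions = [
            pi + learning_rate * (mean_below if xi < cut else mean_above)
            for xi, pi in zip(x, predictions)
        ]
        sse_by_round.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)))
    return predictions, sse_by_round


_, fast = boost(temperature, items_sold, rounds=5, learning_rate=1.0)
_, slow = boost(temperature, items_sold, rounds=5, learning_rate=0.1)
print([round(v, 2) for v in fast])
print([round(v, 2) for v in slow])
```

With a learning rate of 1, the first round reproduces the best single split from the earlier example. With a learning rate of 0.1, each round removes only a fraction of the remaining error, so more rounds are needed to reach the same fit on the training data.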
The max depth parameter sets the maximum number of levels of splits allowed in a single tree, that is, the largest number of consecutive splits between the top of the tree and any final group. In the example presented above, if a maximum depth of 2 were selected, the tree would first split the data into two groups to minimize the loss function and then look for the best split within each of those groups, producing up to four groups. At that point the maximum depth has been reached and this round of tree building is complete; the residuals are calculated, the learning rate is applied, and the next tree is fit to those residuals. If there were more variables in the data (which is normally the case), each split could use whichever variable reduces the loss function the most, so the same variable might be split more than once or a different variable might be used at each level. An important thing to note is that too large a maximum depth will cause your model to overfit and not generalize well to data that it was not trained on.
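As a rough illustration of a depth limit, the sketch below grows a single regression tree on the temperature data and counts its final groups. It assumes that depth counts levels of splits from the top of the tree, which is the convention XGBoost's `max_depth` follows. The function names are hypothetical, and the sketch always splits when it can, whereas XGBoost has additional settings that may stop a split earlier.

```python
temperature = [43, 47, 57, 60, 65, 71, 74, 75, 75, 81]
items_sold = [2, 4, 3, 6, 1, 5, 4, 9, 3, 4]


def sse(values):
    """Sum of squared errors around the average of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, max_depth):
    """rows is a list of (temperature, items_sold) pairs."""
    ys = [y for _, y in rows]
    xs = sorted({x for x, _ in rows})
    if max_depth == 0 or len(xs) < 2:
        return sum(ys) / len(ys)  # a leaf predicts the group average
    best = None
    for low, high in zip(xs, xs[1:]):
        cut = (low + high) / 2
        below = [r for r in rows if r[0] < cut]
        above = [r for r in rows if r[0] >= cut]
        loss = sse([y for _, y in below]) + sse([y for _, y in above])
        if best is None or loss < best[0]:
            best = (loss, cut, below, above)
    _, cut, below, above = best
    # Each side may be split again until the depth budget runs out
    return {
        "cut": cut,
        "below": build_tree(below, max_depth - 1),
        "above": build_tree(above, max_depth - 1),
    }


def count_leaves(tree):
    """Count the final groups (leaves) of a tree built by build_tree."""
    if not isinstance(tree, dict):
        return 1
    return count_leaves(tree["below"]) + count_leaves(tree["above"])


rows = list(zip(temperature, items_sold))
for depth in (0, 1, 2):
    print(depth, count_leaves(build_tree(rows, depth)))
```

On this data a depth of 1 gives two groups and a depth of 2 gives four. Each extra level can double the number of groups, which is why a large maximum depth makes it easy to fit the training data too closely.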
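How many rounds are necessary for your model? This parameter sets the number of boosting rounds, that is, the number of trees built one after another. The default value is 100 in some interfaces (for example, the scikit-learn wrapper for Python), but it differs between interfaces, so check the one you are using. Each round is one step of the gradient descent on the loss function, so this parameter caps the number of steps the algorithm can take. With too few rounds the algorithm may stop before it has converged, while too many rounds increase the training time and the risk of overfitting.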
Each of these parameters can be selected by performing a grid search with cross-validation on the training data to determine the set of parameters that minimizes the loss function. A grid search requires selecting a set of candidate values for each parameter and then running the algorithm on every combination to determine which set of parameters to use. For example, if you were considering learning rates of {0.1, 0.2, 0.5, 0.75, 1}, maximum depths of {2, 3, 4, 5, 6, 7}, and numbers of rounds of {50, 100, 200, 300, 500}, you would run a model for each of the 5 × 6 × 5 = 150 combinations of learning rate, maximum depth, and number of rounds and then select the set of parameters that minimizes the cross-validated loss.
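The grid described above can be enumerated with Python's standard library. This is a sketch rather than a full tuning pipeline: the key names, `expand_grid`, and `grid_search` are my own, and `cv_loss` stands in for whatever function you would write to train the model with a given set of parameters and return its cross-validated loss.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.1, 0.2, 0.5, 0.75, 1],
    "max_depth": [2, 3, 4, 5, 6, 7],
    "n_rounds": [50, 100, 200, 300, 500],
}


def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


def grid_search(grid, cv_loss):
    """Return the combination with the lowest loss.

    cv_loss is a placeholder: a function that takes a dict of parameters
    and returns the cross-validated loss of a model trained with them.
    """
    return min(expand_grid(grid), key=cv_loss)


combos = expand_grid(param_grid)
print(len(combos))  # 150
print(combos[0])
```

Because every combination requires training the model once per cross-validation fold, the cost grows quickly with each value added to the grid, so it is worth keeping the candidate lists short.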
The application below uses the usautoBI dataset from the CASdatasets R package. The dataset consists of 1,340 data points of automobile injury claims collected in 2002 by the Insurance Research Council (part of AICPCU and IIA). With this application, you can select values for the hyperparameters we discussed earlier in this post and see how they affect the model fit on both the training dataset and the test dataset. The last slider is used to select how much of the data to include in the training dataset.
In a well-fit model, we would expect the predictions in the Train Dataset Lift Chart to closely align with the actual data and the predictions in the Test Dataset Lift Chart to align with the actual data in the test dataset. The lift chart for the test dataset will have more volatility than the training dataset lift chart, but that is expected since the test lift chart is using data the model has not seen yet. If the predictions closely match the actual data in the training dataset lift chart but the test dataset lift chart indicates a poor fit, the model may be overfit and the hyperparameters should be reconsidered.
XGBoost is a versatile machine learning algorithm that can be used to predict continuous outcomes such as revenue, expenses, the sale price of a home, or the time it will take to settle an insurance claim. It can also be used for classification problems such as grouping customers, segmenting risks, and predicting whether credit card transactions are fraudulent.