In the past, I have used and tuned models without really knowing what they do. I mostly got away with it because most of them had just a few parameters to tune, such as learning rate, number of iterations, alpha or lambda, and it's easy to guess how those might affect the model.
So, when I came across LightGBM and XGBoost during a Kaggle challenge, I thought of doing the same with them too. I found it pretty complicated to understand the theory behind them so I tried to get away with using them as black-boxes.
But I soon found out that I couldn't. That's because they have a HUGE number of hyperparameters... ones that can make or break your model! To top it off, their default settings are often not the optimal ones. So, in order to use the models effectively, I had to get at least a high-level understanding of what each parameter represents and work out which ones might be the most important.
My aim is to give you that quick, high-level, working knowledge of Gradient Boosting Machines (GBMs) and to help you understand gradient boosting through LightGBM and XGBoost. This way you will be able to tell what's happening in the algorithm and which parameters you should tweak to make it better, and go directly to implementing them in your own analysis.
If you also want to know more about the theory and the math behind gradient boosting, that's great. I will first explain what gradient boosting is, using the parameters of XGBoost and LightGBM, and then point you to some excellent resources on the theory and the math behind GBMs.
Boosting refers to a group of algorithms that transform weak learners into strong learners.
Boosting algorithms are classified into:
Gradient boosting is a machine learning technique for regression and classification problems which constructs a prediction model in the form of an ensemble of weak prediction models.
Basically, Gradient boosting Algorithm involves three elements:
In the article “A Kaggle Master Explains Gradient Boosting”, the author quotes his fellow Kaggler Mike Kim:
My only goal is to gradient boost over myself of yesterday. And to repeat this every day with an unconquerable spirit.
With each passing day, we aim to improve ourselves by focusing on the mistakes of yesterday.
And you know what? — GBMs do that too!
GBMs do it by creating an ensemble of predictors. Each one of those predictors is sequentially built by focusing on the mistakes of the previous one.
An ensemble is simply a group of items viewed as a whole rather than individually.
Now, back to the explanation...
A GBM basically creates lots of individual predictors and each of them tries to predict the true label. Then, it gives its final prediction by averaging all those individual predictions (note however that it is not a simple average but a weighted average).
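To make the "each predictor focuses on the mistakes of the previous one" idea concrete, here is a minimal sketch in plain Python. It is not how LightGBM or XGBoost are implemented; it assumes a squared-error loss, a single input feature, and one-split "stumps" as the weak predictors. The function names (fit_stump, fit_gbm, predict_gbm) and the toy data are made up for illustration, and the weighting is done by scaling every stump's output with a learning rate before adding it to the running prediction.

```python
# Minimal gradient boosting sketch for regression with squared-error loss.
# Assumptions: one input feature, weak learner = a one-split decision stump.

def fit_stump(xs, residuals):
    """Find the threshold on xs that best fits the residuals with two constants."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max: the right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)            # start from the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the "mistakes" (negative gradient) are the residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken correction on top of what the ensemble already predicts.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(xs, ys, predict):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]                # toy target: y = x^2

model = fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1)
mean_y = sum(ys) / len(ys)
baseline_mse = mse(xs, ys, lambda x: mean_y)
gbm_mse = mse(xs, ys, lambda x: predict_gbm(model, x))
print(f"baseline MSE: {baseline_mse:.3f}, boosted MSE: {gbm_mse:.3f}")
```

Each stump on its own is a poor model, but because every new one is fitted to what the ensemble still gets wrong, the training error keeps shrinking as predictors are added.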
Q: “Averaging the predictions made by lots of predictors”... that sounds like Random Forest!
One important difference between the two is that the predictors used in Random Forest are independent of each other, whereas the ones used in gradient boosting machines are built sequentially, with each one trying to improve upon the mistakes made by its predecessor.
To see why this matters, you should look into the concepts of bagging and boosting. Check out this quick explanation to do that.
Q: Okay, so how does the algorithm decide the number of predictors to put in the ensemble?
So, let’s talk about these individual predictors now.
In theory, these predictors can be any regressor or classifier but in practice, decision trees give the best results.
The sklearn API for LightGBM provides a parameter, `boosting_type`, and the API for XGBoost has a parameter, `booster`, to change this predictor algorithm. You can choose from gbdt, dart, goss or rf (LightGBM), or from gbtree, gblinear or dart (XGBoost). [Note, however, that a decision tree almost always outperforms the other options by a fairly large margin. The good thing is that it is the default setting for this parameter, so you don't have to worry about it.]
We also want these predictors to be weak. A weak predictor is simply a prediction model that performs only slightly better than random guessing.
Q: Wait a second.. that seems backwards. Don’t we want to have strong predictors that can make good guesses?
Also note that if the predictors are just too weak, it might not even be possible to build a strong ensemble out of them.
Now back to creating a weak predictor.. this seems like a good area to hyperparameterise.
These are the parameters that we need to tune to make the right predictors (which are decision trees):
Note: These are also the parameters that you can tune to control overfitting.
The subtree marked in red has a leaf node with only 1 data point in it. So that subtree can't be generated, since 1 < `min_child_samples` in the above case.
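Here is a small plain-Python sketch of that rule, assuming a single numeric feature. The helper allowed_thresholds is hypothetical and only mimics the constraint; it is not LightGBM's actual split-finding code. A candidate split is kept only if both resulting children hold at least `min_child_samples` rows.

```python
def allowed_thresholds(xs, min_child_samples):
    """Return the split thresholds whose two children each keep enough rows."""
    allowed = []
    for threshold in sorted(set(xs))[:-1]:
        n_left = sum(1 for x in xs if x <= threshold)
        n_right = len(xs) - n_left
        if n_left >= min_child_samples and n_right >= min_child_samples:
            allowed.append(threshold)
    return allowed


xs = [1, 2, 3, 4, 5, 6]
loose = allowed_thresholds(xs, min_child_samples=1)   # every split is allowed
medium = allowed_thresholds(xs, min_child_samples=2)  # splits leaving 1 row in a leaf are dropped
strict = allowed_thresholds(xs, min_child_samples=3)  # only the balanced split survives
print(loose, medium, strict)
```

Raising the value leaves the tree with fewer ways to carve out tiny leaves, which is why this parameter also helps against overfitting.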
Even after we do all this, it might just happen that some trees in the ensemble are highly correlated.
Q: Excuse me, what do you mean by highly correlated trees?
To combat this problem, we subsample the data rows and columns before each iteration and train the tree on this subsample. These are the relevant parameters to look out for:
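The subsampling step itself can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the helper subsample_indices is made up, and I borrow the names subsample (fraction of rows) and colsample_bytree (fraction of columns) from the sklearn-style wrappers of both libraries. As far as I know, LightGBM only applies row subsampling when subsample_freq is also set to a value above 0.

```python
import random


def subsample_indices(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the rows and columns that one tree gets to see."""
    n_rows_kept = max(1, int(n_rows * subsample))
    n_cols_kept = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_rows_kept))  # sampled without replacement
    cols = sorted(rng.sample(range(n_cols), n_cols_kept))
    return rows, cols


rng = random.Random(0)
samples = []
for iteration in range(3):
    rows, cols = subsample_indices(
        n_rows=10, n_cols=4, subsample=0.8, colsample_bytree=0.5, rng=rng
    )
    samples.append((rows, cols))
    print(f"tree {iteration}: rows={rows}, cols={cols}")
```

Because every tree sees a slightly different slice of the data, the trees are less likely to end up as near copies of one another.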
That is it. Now you have a good overview of the whole story of how a GBM works. There are 2 more important parameters though which I couldn’t fit into the story. So, here they are —
You can use sklearn’s RandomizedSearchCV to find a good set of hyperparameters. It randomly samples a subset of all possible hyperparameter combinations and returns the best set among the ones it tried (which is hopefully close to the overall best).
But if you wish to go even further, you could look around the hyperparameter set that it returns using GridSearchCV. Grid search trains the model on every combination in the grid you give it and returns the best one. Note that since it tries every combination, it can be expensive to run.
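A minimal sketch of that two-step search is below. To keep it dependency-light I use sklearn's own GradientBoostingRegressor as a stand-in; since the LightGBM and XGBoost sklearn wrappers follow the same estimator interface, you should be able to swap one of them in. The synthetic dataset, the parameter ranges, and the "halve and double around the best value" grid are all my own assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic dataset so the example runs quickly.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Step 1: coarse random search over a wide space (only n_iter combinations are tried).
param_distributions = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4, 5],
    "subsample": [0.6, 0.8, 1.0],
}
random_search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)
best = random_search.best_params_

# Step 2: small exhaustive grid around the best random-search result.
param_grid = {
    "learning_rate": [best["learning_rate"] / 2, best["learning_rate"], best["learning_rate"] * 2],
    "max_depth": [max(1, best["max_depth"] - 1), best["max_depth"], best["max_depth"] + 1],
    "subsample": [best["subsample"]],
}
grid_search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=3,
)
grid_search.fit(X, y)

print("random search:", best)
print("grid search:  ", grid_search.best_params_)
```

The grid in step 2 has only 9 combinations, which is what keeps the exhaustive search affordable; widening it multiplies the cost quickly.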
GBMs are good at effectively modeling any kind of structured tabular data. Multiple winning solutions of Kaggle competitions use them. Here’s a list of Kaggle competitions where LightGBM was used in the winning model.
They are simpler to implement than many other stacked regression techniques, and they often give better results too.
There is another class of tree ensembles called Random Forests. While GBMs are a type of boosting algorithm, Random Forests are a bagging algorithm (did you check the link about bagging and boosting that I mentioned above?). So, despite being built from decision trees like GBMs, Random Forests are quite different from them. Random Forests are great because they will generally give you a good enough result with the default parameter settings, unlike XGBoost and LightGBM, which require tuning. But once tuned, XGBoost and LightGBM are much more likely to perform better.
[Figure: sample diagram of a Random Forest]
Alright. So now you know about the parameters that you need in order to successfully use XGBoost or LightGBM to model your dataset!
Before I finish off, here are a few links that you can follow to understand the theory and the math behind gradient boosting (in order of my preference) —
You can also take up Machine learning courses to understand these things better.