A Guide to Understanding Gradient Boosting Machines: LightGBM and XGBoost


Nityesh Agarwal
Tutorials
06th Feb, 2019

In the past, I have used and tuned models without knowing what they do. I have mostly been successful at this because most of them had just a few parameters that needed tuning, like the learning rate, the number of iterations, alpha or lambda, and it's easy to guess how they might affect the model.

So, when I came across LightGBM and XGBoost during a Kaggle challenge, I thought of doing the same with them too. I found the theory behind them pretty complicated, so I tried to get away with using them as black boxes.

But I soon found out that I couldn't. That's because they have a HUGE number of hyperparameters, ones that can make or break your model! To top it off, their default settings are often not the optimal ones. So, in order to use the models effectively, I had to get at least a high-level understanding of what each parameter represents and which ones might be the most important.

The motive behind this article

My aim is to give you that quick, high-level, working knowledge of Gradient Boosting Machines (GBMs) by walking you through the parameters of LightGBM and XGBoost. This way you will be able to tell what's happening in the algorithm and which parameters you should tweak to make it better, and then go directly to implementing them in your own analysis.

If you also want to know more about the theory and the math behind gradient boosting, that's great: I point you to some excellent resources at the end of this article. First, though, let me explain what gradient boosting is, using the parameters of XGBoost and LightGBM as a guide.

What is Boosting?

Boosting refers to a group of algorithms which transform weak learners into strong learners.
Well-known boosting algorithms and implementations include:

  • Gradient Boosting
  • XGBoost (an implementation of gradient boosting)
  • AdaBoost etc.


What is Gradient Boosting in Machine Learning?

Gradient boosting is a machine learning technique for regression and classification problems which constructs a prediction model in the form of an ensemble of weak prediction models.

Elements of the Gradient Boosting Algorithm

Basically, the gradient boosting algorithm involves three elements:

  • A loss function to be optimized.
  • A weak learner to make predictions.
  • An additive model to add weak learners to minimize the loss function.
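
To make these three elements concrete, here is a minimal from-scratch sketch in plain Python. It is not how LightGBM or XGBoost are implemented; it assumes a squared-error loss, a single numeric feature and one-split trees (stumps) as the weak learner. The function names and the toy data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Weak learner: a one-split regression tree (stump) on a single feature.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Loss function: mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_gbm(xs, ys, n_estimators=20, learning_rate=0.3):
    """Additive model: start from the mean, then add one stump per round."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


# Toy data, invented for this example.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

baseline_error = mse(ys, [sum(ys) / len(ys)] * len(ys))
model = fit_gbm(xs, ys)
boosted_error = mse(ys, [model(x) for x in xs])
print(baseline_error, boosted_error)
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error of the boosted model ends up below that of the constant baseline.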

In the article “A Kaggle Master Explains Gradient Boosting”, the author quotes his fellow Kaggler Mike Kim:

My only goal is to gradient boost over myself of yesterday. And to repeat this every day with an unconquerable spirit.

With each passing day, we aim to improve ourselves by focusing on the mistakes of yesterday.

And you know what? GBMs do that too!

An ensemble of predictors

GBMs do it by creating an ensemble of predictors. Each one of those predictors is sequentially built by focusing on the mistakes of the previous one.

What’s an ensemble?

It is simply a group of items viewed as a whole rather than individually.

Now, back to the explanation...

A GBM basically creates lots of individual predictors and each of them tries to predict the true label. Then, it gives its final prediction by averaging all those individual predictions (note, however, that it is not a simple average: the predictions are combined as a weighted sum).

Q: “Averaging the predictions made by lots of predictors”... that sounds like Random Forest!

  • That is in fact what an ensemble method is. And random forests and gradient boosting machines are just 2 types of ensemble methods.

One important difference between the two is that the predictors used in a Random Forest are independent of each other, whereas the ones used in gradient boosting machines are built sequentially, with each one trying to improve upon the mistakes made by its predecessor.

You should check out the concepts of bagging and boosting; this quick explanation is a good place to start.

Q: Okay, so how does the algorithm decide the number of predictors to put in the ensemble?

  • It does not. We do. And that brings us to our first important parameter, n_estimators: we pass the number of predictors that we want the GBM to build in the n_estimators parameter. The default is 100 in the sklearn APIs of both libraries.

So, let’s talk about these individual predictors now.

In theory, these predictors can be any regressor or classifier but in practice, decision trees give the best results.

The sklearn API for LightGBM provides the parameter boosting_type, and the API for XGBoost has the parameter booster, to change this predictor algorithm. You can choose from gbdt, dart, goss or rf (LightGBM), or gbtree, gblinear or dart (XGBoost). [Note, however, that the plain decision tree option (gbdt in LightGBM, gbtree in XGBoost) almost always outperforms the other options by a fairly large margin. The good thing is that it is the default setting for this parameter, so you don’t have to worry about it.]

Creating weak predictors

We also want these predictors to be weak. A weak predictor is simply a prediction model that performs better than random guessing, even if only slightly.

Q: Wait a second... that seems backwards. Don’t we want to have strong predictors that can make good guesses?

  • Nope. We want the individual predictors to be weak so that the overall ensemble becomes strong. This is because every predictor is going to focus on the observations that the one preceding it got wrong. When we use a weak predictor, these mispredicted observations tend to contain some learnable information which the next predictor can pick up. If the predictor were already strong, on the other hand, the mispredicted observations would likely be just noise or quirks of that sample of data. In such a case, the model would simply be overfitting to the training data.

Also note that if the predictors are just too weak, it might not even be possible to build a strong ensemble out of them.

Now back to creating a weak predictor... this seems like a good area to hyperparameterise.

These are the parameters that we need to tune to make the right predictors (which are decision trees):

  • max_depth (both XGBoost and LightGBM): This provides the maximum depth that each decision tree is allowed to have. A smaller value signifies a weaker predictor.
  • min_split_gain (LightGBM), gamma (XGBoost): Minimum loss reduction required to make a further partition on a leaf node of the tree. A lower value will result in deeper trees.
  • num_leaves (LightGBM): Maximum number of leaves for each base learner. A higher value results in larger, deeper trees.
  • min_child_samples (LightGBM): Minimum number of data points needed in a child (leaf). According to the LightGBM docs, this is a very important parameter to prevent overfitting.

Note: These are also the parameters that you can tune to control overfitting.

In the figure, the subtree marked in red has a leaf node with only 1 data point in it. So, that subtree can’t be generated, because 1 < `min_child_samples` in that case.
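
The way these constraints interact can be sketched in a few lines of plain Python. This is only an illustration of the rules described above, not code from either library: the helper names are hypothetical, the default of 20 for `min_child_samples` mirrors LightGBM's documented default, and the exact boundary handling (>= versus >) is an assumption of this sketch.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth d can have at most 2**d leaves."""
    return 2 ** max_depth


def split_allowed(n_left, n_right, gain, min_child_samples=20, min_split_gain=0.0):
    """Return True if a candidate split satisfies both constraints.

    n_left / n_right: number of data points that would land in each child.
    gain: loss reduction the split would achieve.
    """
    enough_data = n_left >= min_child_samples and n_right >= min_child_samples
    enough_gain = gain >= min_split_gain  # boundary handling is an assumption
    return enough_data and enough_gain


# A child with a single data point is rejected when min_child_samples is 20.
print(split_allowed(n_left=1, n_right=30, gain=0.5))
# Both children are large enough and the gain clears the threshold.
print(split_allowed(n_left=25, n_right=30, gain=0.5))
# Enough data, but the gain is below min_split_gain.
print(split_allowed(n_left=25, n_right=30, gain=0.01, min_split_gain=0.1))
# With max_depth=3, a tree can never have more than 8 leaves.
print(max_leaves_for_depth(3))
```

One practical consequence: if you set both max_depth and num_leaves, a num_leaves value above 2**max_depth can never be reached.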

Subsampling

Even after we do all this, it might just happen that some trees in the ensemble are highly correlated.

Q: Excuse me, what do you mean by highly correlated trees?

  • I mean decision trees that are similar in structure because they make similar splits on the same features. This means that the ensemble as a whole is going to store less information than it could have stored if the trees were different. So we want our trees to be as uncorrelated as possible.

To combat this problem, we subsample the data rows and columns before each iteration and train the tree on this subsample. These are the relevant parameters to look out for:

  • subsample (both XGBoost and LightGBM): This specifies the fraction of rows to consider at each subsampling stage. By default, it is set to 1, which means no subsampling.
  • colsample_bytree (both XGBoost and LightGBM): This specifies the fraction of columns to consider at each subsampling stage. By default, it is set to 1, which means no subsampling.
  • subsample_freq (LightGBM): This specifies how often bagging is performed: a value of k means that the rows are resampled every k iterations. By default, it is set to 0, which disables bagging. So make sure that you set it to some positive value (and subsample to less than 1) if you want to enable row subsampling.
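
As a rough illustration of what subsampling does before each iteration, the sketch below draws a fresh subset of row and column indices for every tree. It is plain Python with a hypothetical helper, not the libraries' own sampling code, and the fractions 0.8 and 0.6 simply stand in for subsample and colsample_bytree.

```python
import random


def subsample_indices(n_items, fraction, rng):
    """Pick round(fraction * n_items) distinct indices (at least one)."""
    k = max(1, round(fraction * n_items))
    return sorted(rng.sample(range(n_items), k))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_cols = 10, 5

for iteration in range(3):
    rows = subsample_indices(n_rows, 0.8, rng)  # plays the role of subsample=0.8
    cols = subsample_indices(n_cols, 0.6, rng)  # plays the role of colsample_bytree=0.6
    # A real GBM would now fit this iteration's tree on data[rows][:, cols].
    print(iteration, rows, cols)
```

Because every tree sees a slightly different slice of the data, the trees are less likely to make identical splits.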

That is it. Now you have a good overview of the whole story of how a GBM works. There are two more important parameters, though, which I couldn’t fit into the story. So, here they are:

  • learning_rate (both XGBoost and LightGBM): It is also called shrinkage. A smaller value slows learning down, which in turn requires more trees to be added to the ensemble. This gives the model a regularisation effect.
  • class_weight (LightGBM): This parameter is extremely important for multi-class classification tasks when we have imbalanced classes. I recently participated in a Kaggle competition where simply setting this parameter’s value to balanced caused my solution to jump from the top 50% of the leaderboard to the top 10%.
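
To give a feel for what the balanced setting does, here is a small sketch of the usual "balanced" heuristic, where each class gets the weight n_samples / (n_classes * class_count). The function name and the toy labels are made up for illustration; check the LightGBM docs for the exact behaviour of class_weight.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy imbalanced labels: 6 of class "a", 3 of "b", 1 of "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

The rare class "c" ends up with the largest weight, so mistakes on it cost the model more during training.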

You can check out the sklearn API documentation of both LightGBM and XGBoost for the full list of parameters.
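
Putting the parameters from this article side by side, a starting configuration might look like the sketch below. The values are illustrative assumptions, not tuned recommendations, and the commented usage lines assume that the lightgbm and xgboost packages are installed.

```python
# Illustrative values only (an assumption of this sketch), not recommendations.
lgbm_params = {
    "boosting_type": "gbdt",
    "n_estimators": 500,
    "learning_rate": 0.05,
    "max_depth": 6,
    "num_leaves": 31,
    "min_child_samples": 20,
    "min_split_gain": 0.0,
    "subsample": 0.8,
    "subsample_freq": 1,  # must be positive for row subsampling to happen
    "colsample_bytree": 0.8,
    "class_weight": "balanced",
}

xgb_params = {
    "booster": "gbtree",
    "n_estimators": 500,
    "learning_rate": 0.05,
    "max_depth": 6,
    "gamma": 0.0,  # XGBoost's counterpart of min_split_gain
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

# Parameter names that both sklearn-style APIs share.
shared = sorted(set(lgbm_params) & set(xgb_params))
print(shared)

# Usage, if the libraries are installed:
# from lightgbm import LGBMClassifier
# from xgboost import XGBClassifier
# lgbm_model = LGBMClassifier(**lgbm_params)
# xgb_model = XGBClassifier(**xgb_params)
```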

Finding the best set of hyperparameters

You can use sklearn’s RandomizedSearchCV to find a good set of hyperparameters. It randomly searches through a subset of all possible combinations of the hyperparameters and returns the best set among those it tried (which is hopefully at least something close to the best).

But if you wish to go even further, you could look around the hyperparameter set that it returns using GridSearchCV. Grid search will train the model using every combination in the grid you give it and return the best set. Note that since it tries every combination, it can be expensive to run.
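
The cost difference between the two strategies is easy to see by counting candidates. The sketch below uses only the standard library to mimic the idea; it is not sklearn's implementation, and the parameter grid is a made-up example.

```python
import itertools
import random

# Hypothetical search space: 3 values for each of 4 parameters.
param_space = {
    "max_depth": [3, 5, 7],
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
}


def grid_candidates(space):
    """Every combination, as a grid search would try."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def random_candidates(space, n_iter, seed=0):
    """A random subset of the grid, as a randomized search would try."""
    rng = random.Random(seed)
    return rng.sample(grid_candidates(space), n_iter)


grid = grid_candidates(param_space)
sampled = random_candidates(param_space, n_iter=10)
print(len(grid), len(sampled))  # 81 10
```

With cross-validation, each candidate is trained once per fold, so a 5-fold grid search over this space would mean 405 model fits against 50 for the randomized search.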

Where can you use these algorithms?

They are good at effectively modeling any kind of structured tabular data. Multiple winning solutions of Kaggle competitions use them. Here’s a list of Kaggle competitions where LightGBM was used in the winning model.

They are also simpler to implement than many stacked regression techniques, and they often give better results too.

There is another class of tree ensembles called Random Forests. While GBMs are a type of boosting algorithm, this is a bagging algorithm (did you check the link about bagging and boosting that I mentioned above?). So, despite being built from decision trees like GBMs, Random Forests are quite different from them. Random Forests are great because they will generally give you a good enough result with the default parameter settings, unlike XGBoost and LightGBM, which require tuning. But once tuned, XGBoost and LightGBM are much more likely to perform better.

[Figure: a sample Random Forest]

Alright. So now you know all about the parameters that you need in order to successfully use XGBoost or LightGBM to model your dataset!

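Before I finish off, here are a few links that you can follow to understand the theory and the math behind gradient boosting (in order of my preference):

  • “How to explain Gradient Boosting” by Terence Parr and Jeremy Howard: This is a very lengthy, comprehensive and excellent series of articles that tries to explain the concept to people with no prior knowledge of the math or the theory behind it.
  • “A Kaggle Master Explains Gradient Boosting” by Ben Gorman: A very intuitive introduction to gradient boosting. (P.S.: It is the very article that gave me the quote which I used in the beginning.)
  • “A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning” by Jason Brownlee: It has a little bit of history, lots of links to follow up on and a gentle explanation with no math (!).
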
You can also take up Machine learning courses to understand these things better.

Nityesh Agarwal

Blog author

Student pursuing a bachelor's degree in IT. I am interested in Deep Learning, startups and I contribute to open source. I sometimes write about my personal experiences with programming on Medium.
