For some time, XGBoost was considered the Kaggle killer, powering the winning models of most prediction competitions. Recently, Microsoft released its own gradient boosting framework, LightGBM, which is considerably faster than XGBoost. In this post, I'm going to touch on the interesting portions of LightGBM.
What is LightGBM?
Similar to XGBoost, LightGBM is a gradient-boosted, tree-based algorithm. Unlike other gradient-boosted trees, which grow horizontally (level-wise), LightGBM grows vertically (leaf-wise): each new split goes to the leaf with the largest loss reduction, rather than expanding a whole level of the tree at once.
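As a quick illustration, here is a minimal sketch of training LightGBM through its scikit-learn interface; the random data is a placeholder just so the snippet runs end-to-end.

import lightgbm as lgb
import numpy as np

# Placeholder data: 1,000 rows, 10 numeric features, binary labels
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Trees are grown leaf-wise: each new split goes to the leaf
# with the largest loss reduction, not level by level
model = lgb.LGBMClassifier(n_estimators=100)
model.fit(X, y)
preds = model.predict(X)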


Dealing with Non-Numeric Data
The nice thing about LightGBM is that it handles categorical features natively, so you don't need to one-hot encode them or otherwise convert strings into numeric columns yourself. If your data has a mix of numbers and strings, you can simply throw everything into the model to learn.
The one thing you do have to do, however, is mark the string columns as the category dtype. Below is an example of how to do it with a pandas DataFrame:
import pandas as pd

# Map each column to its dtype up front; string columns become 'category'
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8'
}
df_train = pd.read_csv('train.csv', nrows=2000000, dtype=dtypes)
Or, if you're creating new features, you have to recast the new columns to the category dtype:
# newFeatures is the list of newly created column names
for feature in newFeatures:
    df_train[feature] = df_train[feature].astype('category')
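Once the columns carry the category dtype, the scikit-learn wrapper picks them up automatically (the categorical_feature argument of fit defaults to 'auto'). Below is a rough sketch continuing from the snippets above; 'HasDetections' is a stand-in for whatever your label column is called.

import lightgbm as lgb

# Drop the row identifier and the (hypothetical) label column
X = df_train.drop(columns=['MachineIdentifier', 'HasDetections'])
y = df_train['HasDetections']

# Every column with dtype 'category' is treated as categorical,
# no one-hot encoding required
model = lgb.LGBMClassifier()
model.fit(X, y)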
Important Parameters to Tune
LightGBM has a huge array of parameters to tune, and I won't be listing them all here. I will, however, highlight the ones I think are important and that have helped me improve my models' predictions (a short example follows the list):
max_depth: Defines how deep each tree may grow
num_leaves: Defines the maximum number of leaves in a tree
max_bin: Defines the maximum number of bins your features will be bucketed into
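To make this concrete, here is a rough sketch of setting all three through the scikit-learn interface; the values shown are arbitrary starting points, not tuned recommendations.

import lightgbm as lgb
import numpy as np

# Placeholder data so the snippet runs on its own
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

model = lgb.LGBMClassifier(
    max_depth=7,      # cap on how deep each tree may grow
    num_leaves=70,    # max leaves per tree; keep well below 2**max_depth
    max_bin=255,      # max histogram bins each feature is bucketed into
    n_estimators=100,
)
model.fit(X, y)

Because trees grow leaf-wise, num_leaves is the main complexity control; a common rule of thumb is to keep it below 2**max_depth.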
For a more comprehensive read, see the official LightGBM parameters documentation!
Conclusion
In this short post, we've very briefly covered LightGBM, how it differs from other gradient boosting machines, and how to define categorical features for training.
An important thing to know is that LightGBM is very sensitive to overfitting, and should not be used on small datasets (fewer than about 10,000 rows).