LightGBM

For some time, XGBoost was considered the "Kaggle killer", serving as the winning model for most prediction competitions. More recently, Microsoft released its own gradient boosting framework, LightGBM, which trains significantly faster than XGBoost. In this post, I'm going to touch on the most interesting parts of LightGBM.

What is LightGBM?


Similar to XGBoost, LightGBM is a gradient-boosted, tree-based algorithm. Unlike other gradient-boosted trees, which grow horizontally, LightGBM grows vertically: it grows leaf-wise, splitting the leaf with the largest loss reduction, while others grow level-wise, expanding an entire level of the tree at a time.

Figure: LightGBM's leaf-wise growth, which allows for deeper, more vertical trees.
Figure: Other gradient-boosted algorithms grow level-wise, which results in wider, shallower trees.
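
To make the distinction concrete, here is a minimal sketch using LightGBM's scikit-learn-style API on synthetic data. The key point is that leaf-wise growth is controlled by num_leaves, while max_depth=-1 (LightGBM's default) leaves depth unrestricted:

import lightgbm as lgb
from sklearn.datasets import make_classification

# Synthetic data, just to have something to fit on.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Leaf-wise growth: num_leaves is the main complexity control.
# max_depth=-1 leaves depth unrestricted, so trees can grow deep
# along whichever branches reduce the loss the most.
model = lgb.LGBMClassifier(num_leaves=31, max_depth=-1)
model.fit(X, y)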

Dealing with Non-Numeric Data


The nice thing about LightGBM is that it handles categorical features natively: you do not have to one-hot or label encode string columns before training. If your data has a mix of numbers and strings, you can feed everything into the model to learn.

The one thing you do have to do, however, is mark the string columns as category. Below is an example of how to do this with a pandas DataFrame:

import pandas as pd

# Cast the string columns to 'category' and use compact numeric dtypes
# so the CSV loads with a smaller memory footprint.
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8'
}

df_train = pd.read_csv('train.csv', nrows=2000000, dtype=dtypes)

Or, if you're creating new features, you have to cast each new column's datatype to category:

# Newly engineered columns also need the 'category' dtype.
for feature in newFeatures:
    df_train[feature] = df_train[feature].astype('category')
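
Once the categorical dtypes are in place, training is straightforward. Here is a sketch; note that the target column name 'HasDetections' is an assumption for illustration, so substitute whatever label your dataset actually uses:

import lightgbm as lgb

# 'HasDetections' is a hypothetical label column used for illustration;
# replace it with your dataset's actual target.
target = 'HasDetections'
X = df_train.drop(columns=[target])
y = df_train[target]

# Columns with the 'category' dtype are picked up automatically by
# LightGBM, so no one-hot or label encoding is needed.
model = lgb.LGBMClassifier()
model.fit(X, y)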

Important Parameters to Tune


LightGBM has a huge array of parameters to tune, and I won't list them all here. I will, however, highlight the ones I think are important and that have helped improve my model's predictions; a short sketch of setting them follows the list.

  • max_depth: limits how deep each tree can grow
  • num_leaves: sets the maximum number of leaves in one tree (the main complexity control for leaf-wise growth)
  • max_bin: sets the maximum number of bins that feature values are bucketed into
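
Here is a sketch of setting these three parameters via the scikit-learn-style API; the values are illustrative starting points, not tuned recommendations:

import lightgbm as lgb

# Illustrative starting values, not tuned recommendations.
model = lgb.LGBMClassifier(
    max_depth=7,     # cap depth to rein in leaf-wise growth
    num_leaves=70,   # keep well below 2**max_depth (128 here)
    max_bin=255,     # more bins = finer splits, but slower training
)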

For a more comprehensive read, see the official LightGBM parameters documentation.

Conclusion


In this short post, we've very briefly covered LightGBM, how it differs from other gradient-boosted machines, and how to define categorical columns for training.

An important thing to know is that LightGBM is prone to overfitting, and it should not be used for small datasets (fewer than roughly 10,000 rows).
