Counts Based Featurization

While working on the Microsoft Malware Classification Challenge, I came across a way of representing features called Count Based Features (CBF).

CBF works well with very high cardinality features: it replaces each category in the data with the number of its occurrences. This representation is helpful because it extracts a simple, inherent property of the data: the count.

Below is a simple example of the data we start from when computing the CBF of a given feature:

Label Feature1
0 A
0 A
1 A
0 B
1 B
1 B
1 B

CBF can be computed in pandas in a single line:

Train.groupby('Feature1')['Feature1'].transform('count')

Assigning the result back to Feature1 gives you:

Label Feature1
0 3
0 3
1 3
0 4
1 4
1 4
1 4

As you can see, the categorical values are all converted to their count values!
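To try this end to end, here is a minimal sketch that rebuilds the toy table from this post and computes the counts. The names `train`, `test`, `counts` and `Feature1_count` are my own choices for illustration, and the second half (reusing the training counts on new data, with 0 for unseen categories) is just one reasonable convention, not the only way to do it.

```python
import pandas as pd

# Toy data from the example above: A occurs 3 times, B occurs 4 times.
train = pd.DataFrame({
    "Label": [0, 0, 1, 0, 1, 1, 1],
    "Feature1": ["A", "A", "A", "B", "B", "B", "B"],
})

# Count-based feature: each row gets the number of rows sharing its category.
train["Feature1_count"] = (
    train.groupby("Feature1")["Feature1"].transform("count")
)

# Assumption: to featurize new data, look up the counts learned on train.
counts = train["Feature1"].value_counts()

# Hypothetical new data; "C" never appears in train.
test = pd.DataFrame({"Feature1": ["A", "B", "C"]})

# Unseen categories map to NaN, which is filled with 0 here.
test["Feature1_count"] = test["Feature1"].map(counts).fillna(0).astype(int)

print(train)
print(test)
```

Keeping the count in a separate column, rather than overwriting Feature1, makes it easy to compare the original category with its count while you are still exploring the data.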
