Feature Engineering is one of the neglected portions of machine learning. Most discussion revolves around model training (parameter tuning, cross validation). While that is certainly important, feature engineering matters just as much, yet I can’t seem to find good resources that talk about it. I suspect this is because performing feature engineering requires expert knowledge of the data and what it represents.
A typical workflow would look something like this:
- Project Scoping / Data Collection
- Exploratory Analysis (EDA)
- Data Cleaning
- Feature Engineering
- Model Training
- Project Delivery / Insights
What is not Feature Engineering
- Data cleaning (Outlier detection, Missing values)
- Scaling and Normalization
- Feature Selection
I would classify these as data massaging, since you’re only changing data that already exists (Feature Selection being the exception, as it removes features rather than changing them). Feature Engineering, by contrast, is the creation of new data.
What is Feature Engineering
There are a few ways to create new features from existing ones.
Indicator Variables
Indicator variables are new variables that help you isolate data: they are discriminative and can help the model separate the data.
Examples:
- Threshold: If you’re studying data on alcohol consumption, you could create a new binary feature indicating whether the person is `>= 21` years old. The expert knowledge here is knowing where your data came from and what the minimum drinking age is in that country/state.
- Special Events: If you’re studying sales, there could be seasons with higher sales, so you could create features such as `isChristmas`, `isSinglesDay` or `isBlackFriday`. The expert knowledge is knowing what special events there are.
- Groupings: You can create artificial groups for the data; for example, in network traffic, you can group records according to protocol or source. The expert knowledge is knowing how to interpret the data and which groupings make sense.
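As a minimal sketch of how indicator features might look in pandas (the DataFrame and the column names `age` and `event_date` are hypothetical, not from any particular dataset):

```python
import pandas as pd

# Hypothetical data with an age column and an event date.
df = pd.DataFrame({
    "age": [18, 25, 34],
    "event_date": pd.to_datetime(["2023-12-25", "2023-11-11", "2023-06-01"]),
})

# Threshold indicator: assumes a minimum drinking age of 21.
df["isLegalDrinkingAge"] = (df["age"] >= 21).astype(int)

# Special-event indicator: flag rows that fall on Christmas Day.
df["isChristmas"] = (
    (df["event_date"].dt.month == 12) & (df["event_date"].dt.day == 25)
).astype(int)
```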
Interaction of Features
Features can interact with each other to create new variables. Interaction here means some mathematical operation between them.
Examples:
- Sum of Features: If you’re looking at sales of individual items, a new feature might be `overallSales`, where you add the sales of each item together.
- Product of Features: If you’re looking at wages and have features like `hourlyRate` and `workingHours`, you can create a new feature called `totalPay`.
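A quick sketch of both interactions in pandas (the per-item sales and wage columns below are made up for illustration):

```python
import pandas as pd

# Hypothetical per-item sales and wage columns.
df = pd.DataFrame({
    "salesItemA": [100, 250, 80],
    "salesItemB": [40, 10, 90],
    "hourlyRate": [15.0, 22.5, 18.0],
    "workingHours": [160, 140, 172],
})

# Sum of features: add per-item sales into one overall figure.
df["overallSales"] = df["salesItemA"] + df["salesItemB"]

# Product of features: pay is rate multiplied by hours worked.
df["totalPay"] = df["hourlyRate"] * df["workingHours"]
```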
The expert knowledge in these areas is knowing how the features interact with each other to produce new features. However, from unfortunate experience, I’ve seen some feature interactions that make absolutely no sense, but the model seems to think otherwise. An example I saw was a new feature created from the multiplication of `screenHorizontalSize` and `totalRAM`, which makes absolutely no sense, yet it gave a boost in prediction accuracy. Machine learning really is still a black box.
Feature Representation
Some features can be better represented in other formats that convey more information.
Examples:
- Date to integer: When given a `datetime` format string, it almost always makes sense to decompose it into its integer components such as `day`, `month` and `year`. Beyond that, you can create features such as `isWeekday` or `isPeakHour`.
- Sparse classes to Other: In a categorical feature, if some classes are hugely under-represented, they can be grouped together and classified as `Others`.
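Here is a sketch of both representations in pandas; the `timestamp` and `browser` columns, and the rarity threshold of 2, are assumptions for the example:

```python
import pandas as pd

# Hypothetical timestamps and a categorical column with rare classes.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-03-06 08:30", "2023-03-07 17:45", "2023-03-08 12:00",
        "2023-03-11 23:10", "2023-03-12 09:20",
    ]),
    "browser": ["Chrome", "Chrome", "Chrome", "Firefox", "NetFront"],
})

# Date to integers: decompose the datetime into components.
df["day"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month
df["year"] = df["timestamp"].dt.year
df["isWeekday"] = (df["timestamp"].dt.dayofweek < 5).astype(int)

# Sparse classes to Other: classes appearing fewer than 2 times
# (an arbitrary threshold) are lumped into "Others".
counts = df["browser"].value_counts()
rare = counts[counts < 2].index
df["browser"] = df["browser"].where(~df["browser"].isin(rare), "Others")
```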
External Data Augmentation
Another way to create new features is to bring in external data, such as geolocation information. This external data adds new features, which in turn can be used to interact with, re-represent, or isolate the existing features.
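In pandas this usually comes down to a join against an external lookup table. A sketch, where the `zip_code` key and the geolocation values are entirely made up (in practice the lookup might come from a public source such as a census extract):

```python
import pandas as pd

# Core dataset; zip_code is the join key (hypothetical values).
sales = pd.DataFrame({
    "store_id": [1, 2, 3],
    "zip_code": ["10001", "94105", "60601"],
    "revenue": [12000, 18500, 9700],
})

# External geolocation table with extra features per zip code.
geo = pd.DataFrame({
    "zip_code": ["10001", "94105", "60601"],
    "city": ["New York", "San Francisco", "Chicago"],
    "median_income": [70000, 120000, 65000],
})

# Left-join the external features onto the core dataset.
sales = sales.merge(geo, on="zip_code", how="left")
```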
Conclusion
Indicator Features, Feature Interactions, Feature Representation and External Data Augmentation are all ways to engineer new features, and all of them are distinct from data massaging.
Feature Engineering is extremely important in your Machine Learning workflow.