Feature Engineering is one of the neglected portions of machine learning. Most discussions revolve around Model Training (parameter tuning, cross validation). While that is really important, feature engineering is equally important, yet I can’t seem to find good resources that talk about it. I suspect this is because to perform feature engineering, you need expert knowledge of the data and what it represents.
A typical workflow would look something like this:
- Project Scoping / Data Collection
- Exploratory Analysis (EDA)
- Data Cleaning
- Feature Engineering
- Model Training
- Project Delivery / Insights
What is not Feature Engineering
- Data cleaning (Outlier detection, Missing values)
- Scaling and Normalization
- Feature Selection
I would classify these as data massaging, as you’re just changing the data (except for Feature Selection). Feature Engineering is the creation of new data.
What is Feature Engineering
There are a few ways to create new features from existing ones.
Indicator Variables
Indicator variables are new variables that help you isolate data. Such a feature is discriminative and can help separate the data.
- Threshold: If you’re studying data on alcohol consumption, you could create a new binary feature for whether the person is >=21 years old. The expert knowledge here is knowing where your data came from and what the minimum drinking age is in that country/state.
- Special Events: If you’re studying sales, there could be seasons that have higher sales, which you can flag with a feature such as isBlackFriday. The expert knowledge is knowing what special events there are.
- Groupings: You can create artificial groups for the data. For example, with network traffic, you can group the traffic according to protocol or source. The expert knowledge is knowing how to interpret the data and what grouping makes sense.
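To make the three kinds of indicator variables above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the threshold of 21, the hand-maintained Black Friday calendar, the protocol grouping, and the feature name isLegalAge are all hypothetical, and in practice you would pick them using your own knowledge of the data.

```python
from datetime import date

# All names and values below are hypothetical, chosen only for illustration.
LEGAL_DRINKING_AGE = 21  # assumption: varies by country/state
BLACK_FRIDAYS = {date(2023, 11, 24)}  # assumption: hand-maintained event calendar
PROTOCOL_GROUPS = {  # assumption: one possible grouping of network protocols
    "http": "web",
    "https": "web",
    "smtp": "mail",
    "imap": "mail",
}


def is_legal_age(age, threshold=LEGAL_DRINKING_AGE):
    """Threshold indicator: 1 if age is at or above the threshold, else 0."""
    return int(age >= threshold)


def is_black_friday(day):
    """Special-event indicator: 1 if the date is a known Black Friday, else 0."""
    return int(day in BLACK_FRIDAYS)


def protocol_group(protocol):
    """Grouping: map a protocol name to a coarser group, 'other' if unknown."""
    return PROTOCOL_GROUPS.get(protocol.lower(), "other")


# Attach the threshold indicator to a few made-up records.
people = [{"age": 19}, {"age": 21}, {"age": 40}]
for person in people:
    person["isLegalAge"] = is_legal_age(person["age"])
```

The functions themselves are trivial. The expert knowledge lives in the constants at the top, which is exactly the part a model cannot work out for you.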
Interaction of Features
Features can interact with each other to create new variables. Interaction here means some mathematical operation between them.
- Sum of Features: If you’re looking at sales of individual items, a new feature might be overallSales, where you add the sales of each item together.
- Product of Features: If you’re looking at wages and you have features like workingHours, you can create a new feature by multiplying such features together.
The expert knowledge in these areas is knowing how the features interact with each other to produce new features. However, from unfortunate experience, I’ve seen some feature interactions that make absolutely no sense, yet the model seems to think otherwise. An example I saw was a new feature created by multiplying totalRAM with another feature, which makes absolutely no sense, but it gave a boost in prediction accuracy. Machine Learning really is still a black box.
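As a rough illustration of sum and product interactions, here is a small sketch in plain Python. The item columns, hourlyRate, and totalPay are hypothetical names I made up for the example. Only overallSales and workingHours come from the examples above.

```python
import math


def sum_of_features(row, columns):
    """Interaction by addition: add up the values of the given columns."""
    return sum(row[c] for c in columns)


def product_of_features(row, columns):
    """Interaction by multiplication: multiply the values of the given columns."""
    return math.prod(row[c] for c in columns)


# Hypothetical per-item sales; overallSales is the sum of all items.
sales_row = {"itemA": 120, "itemB": 80, "itemC": 50}
sales_row["overallSales"] = sum_of_features(sales_row, ["itemA", "itemB", "itemC"])

# Hypothetical wage data; hourlyRate and totalPay are assumed names.
wage_row = {"workingHours": 40, "hourlyRate": 25}
wage_row["totalPay"] = product_of_features(wage_row, ["workingHours", "hourlyRate"])
```

Nothing stops you from multiplying two columns that have no business being multiplied, which is how the nonsensical interactions mentioned above come about.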
Feature Representation
For some features, you can better represent them in other formats that give more information.
- Date to integer: When given a datetime format string, it almost always makes sense to decompose it into its integer components such as year. More than that, you can create further features derived from those components.
- Sparse classes to Other: In a categorical feature, if some classes are hugely under-represented, they can be grouped together and classified as Other.
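Both representations above can be sketched with the Python standard library. This is only one way to do it and it rests on assumptions: the timestamp format, the extra derived features (dayOfWeek, isWeekend), and the min_count cut-off of 2 are all choices I made for the example, not rules.

```python
from collections import Counter
from datetime import datetime


def decompose_datetime(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a datetime string into integer components plus derived features."""
    moment = datetime.strptime(text, fmt)
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "dayOfWeek": moment.weekday(),  # Monday is 0, Sunday is 6
        "isWeekend": int(moment.weekday() >= 5),  # hypothetical derived feature
    }


def group_sparse_classes(values, min_count=2, other_label="Other"):
    """Replace classes seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


parts = decompose_datetime("2021-03-15 08:30:00")
browsers = group_sparse_classes(["chrome", "chrome", "lynx", "firefox", "chrome"])
```

What counts as under-represented depends on the size of your dataset, so treat min_count as something to tune rather than a fixed value.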
External Data Augmentation
Another way to create new features is to bring in new data, such as geolocation information. This external data can be used to add new features, which in turn can interact with, represent, or isolate existing features.
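A simple way to picture this is a lookup join against an external table. The sketch below is hypothetical throughout: the store IDs, the region labels, and the distanceToCityCentreKm values are made up, and a real project would load them from an actual geolocation source.

```python
# Hypothetical external table keyed by store ID; all values are made up.
STORE_GEO = {
    "S1": {"region": "north", "distanceToCityCentreKm": 2.5},
    "S2": {"region": "south", "distanceToCityCentreKm": 14.0},
}

# Fallback used when the external table has no entry for a key.
MISSING_GEO = {"region": "unknown", "distanceToCityCentreKm": None}


def augment_with_external(rows, lookup, key, missing):
    """Return new rows with the external fields merged in, matched on key."""
    return [{**row, **lookup.get(row[key], missing)} for row in rows]


sales = [
    {"storeId": "S1", "sales": 300},
    {"storeId": "S9", "sales": 120},  # no external entry for this store
]
augmented = augment_with_external(sales, STORE_GEO, "storeId", MISSING_GEO)
```

Once the external columns are in place, they are ordinary features, so you can build indicators or interactions on top of them like any other column.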
Indicator Features, Feature Interactions, Feature Representation, and External Data Augmentation are all ways to engineer new features. This is different from data massaging.
Feature Engineering is extremely important in your Machine Learning workflow.