10 Tips to build a better modeling dataset for tree-based machine learning models

Simply one-hot encoding all categorical features and imputing all missing values with zero may not be enough to make the model accurate.
Shiu-Tang Li · Apr 14

Silver Lake, Utah (photo by my wife, Yi)

Assume there's a business problem that can be converted to a machine learning problem with tabular data as its input, with clearly defined labels and metrics (say, RMSE for regression problems or ROC AUC for classification problems).
In the dataset there are a bunch of categorical variables, numerical variables, and some missing values, and a tree-based ML model is going to be built on top of it (decision trees, random forests, or gradient-boosted trees).
Are there tricks to improve the data before applying any ML algorithm? The process may vary a lot from dataset to dataset, but I'd like to point out some general principles that apply to many datasets, and also explain why.
Some knowledge of tree-based ML algorithms may help the reader better digest part of the materials.
Two Kaggle datasets will be used to demonstrate the ideas behind some of the tips:

- Sberbank Russian Housing Market dataset (the train table)
- Home Credit Default Risk dataset (the application_train table)
Tip #1. Don't select features simply based on their correlation with labels.
Let’s say we’d like to predict the net income of different stores in a restaurant chain.
We have variables like city, number of other restaurants in 10 miles, number of employees, restaurant square footage, … etc.
Assume ‘restaurant square footage’ has very small correlation with the label (net income).
There could be large restaurants in rural areas not making big money, and small restaurants in perfect locations in a big city that generate decent profits. Even though the correlation is small, when the 'restaurant square footage' feature is combined with the 'city' feature, it may still add value: larger restaurants could make more money in expensive areas, while the trend may not be clear in cheap areas.
With a tree model, after the data is split into two groups by the 'city' variable, 'restaurant square footage' can show its predictive power in some cities (the child nodes of that split).
This is an example of variable interactions, and tree-based models are good at capturing these interactions.
Using only correlation with labels to evaluate feature importance is like saying variable interactions are not important at all.
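A toy sketch of this effect (all the numbers and names below are made up for illustration): square footage can have only modest overall correlation with income, yet strong correlation within the expensive-city subset, which is exactly the structure a tree split on 'city' exposes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: square footage only drives income in 'expensive' cities
sqft = rng.uniform(500, 5000, n)
expensive = rng.integers(0, 2, n)  # 1 = expensive city, 0 = cheap city
income = np.where(expensive == 1, 0.05 * sqft, 0.0) + rng.normal(0, 20, n)

df = pd.DataFrame({'sqft': sqft, 'expensive': expensive, 'income': income})

# Overall correlation is diluted by the cheap-city half of the data...
overall = df['sqft'].corr(df['income'])

# ...but within the expensive-city subset it is strong
mask = df['expensive'] == 1
conditional = df.loc[mask, 'sqft'].corr(df.loc[mask, 'income'])
```

Filtering on correlation alone would undervalue `sqft`, even though it becomes highly predictive once the tree conditions on `expensive`.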
Tip #2. Impute missing values in a reasonable way.
For every node split in a tree-based model, only one feature and one threshold is used.
Say to predict the probability that a person has diabetes, condition ‘age < 60’ is used in splitting a node.
Then younger people and older people are separated in two groups.
If a person's age is missing and imputed as 0, then younger people and people without age info will be put into the same group, which does not make sense. In this case, imputing the missing ages with the average value could be a better choice; or, for certain tree-based algorithms (like XGBoost or LightGBM), we can just leave the missing values there and let the algorithm decide what to do at each node split.
Let me give another example with Home Credit Default Risk dataset.
The goal is to predict whether a client is able to repay a loan. Among all the features there are three of interest, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, indicating credit scores from external data sources, with scores ranging from 0.0 to 1.0. In this dataset, imputing missing values as 0 means assigning the lowest score to those records, which would be absurd.
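A minimal pandas sketch of the safer choice, using made-up values for 'age' and EXT_SOURCE_1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 40, np.nan, 61, np.nan],
    'EXT_SOURCE_1': [0.7, np.nan, 0.4, 0.9, np.nan],
})

# Bad idea: fillna(0) would push missing ages below every real age.
# Better: impute with the mean (or leave NaN for XGBoost/LightGBM,
# which learn which side of each split missing values should go to).
df['age'] = df['age'].fillna(df['age'].mean())

# For the external credit scores, 0 would mean "worst possible score",
# so mean imputation (or leaving NaN) is safer here as well.
df['EXT_SOURCE_1'] = df['EXT_SOURCE_1'].fillna(df['EXT_SOURCE_1'].mean())
```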
Tip #3. Encode ordinal variables properly.
For ordinal categorical variables, when we perform a node split in a tree model, in many cases it makes more sense to group records with higher levels into one group and records with lower levels into another.
For example, to predict whether a few combined factors cause cancer, we have experiment data from laboratory mice. One variable is 'sugar intake' with levels 'high', 'medium', and 'low'; we'd better encode them as 2, 1, 0 (or 0, 1, 2), not based on the volume of data in each level. (A reminder for Spark ML users: StringIndexer encodes variables simply based on data volume, which is not recommended for tree models.)
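An explicit order-preserving encoding might look like this (toy data):

```python
import pandas as pd

df = pd.DataFrame({'sugar_intake': ['high', 'low', 'medium', 'high']})

# Map levels to integers that preserve their natural order,
# NOT frequency-based indices like StringIndexer would produce
order = {'low': 0, 'medium': 1, 'high': 2}
df['sugar_intake_enc'] = df['sugar_intake'].map(order)
```

With this encoding, a single split like `sugar_intake_enc >= 1` cleanly separates low intake from medium-and-high, which a frequency-based encoding cannot guarantee.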
Tip #4. Encode categorical variables carefully.
If the categorical variables are ordinal, the reader can follow Tip #3 to encode them. If not, there are two common approaches to encode categorical variables:

1. Replace each category with the average label value (for regression or 0/1 classification) in that category.
2. One-hot encoding.

But they don't work well when there are too many distinct values in the categorical variable. With approach 1, the model could overfit. With approach 2, the training data size could become too large, and each one-hot encoded feature may have very weak predictive power (careful feature selection is required in this case). So it might be a good idea to group the categories further (ideally with enough domain knowledge).
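Both approaches, plus a crude rare-category grouping, can be sketched as follows (the city names, income values, and the count threshold of 2 are all made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'SLC', 'SLC', 'Provo', 'Ogden'],
    'income': [10.0, 12.0, 5.0, 7.0, 4.0, 3.0],
})

# Group rare categories first (hypothetical minimum count of 2)
counts = df['city'].value_counts()
df['city_grouped'] = df['city'].where(df['city'].map(counts) >= 2, 'OTHER')

# Approach 1: mean (target) encoding -- prone to overfitting on small categories
df['city_te'] = df['city_grouped'].map(
    df.groupby('city_grouped')['income'].mean())

# Approach 2: one-hot encoding
onehot = pd.get_dummies(df['city_grouped'], prefix='city')
```

Note that in practice the target encoding in approach 1 should be computed on out-of-fold data to limit leakage; the version above is the naive form.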
Tip #5. Use common sense to correct part of the data.
Let me take a few features in Sberbank Russian housing market data as examples.
- 'build_year' contains values like 0, 1, 2, or even values > 3000.
- 'kitch_sq' > 'life_sq': the kitchen floor area is larger than the living area?
- 'floor' > 'max_floor': say, in a 10-story building, an apartment located on the 15th floor?

Wrong values need to be replaced with null values.
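A minimal pandas sketch of this cleanup, using toy rows with the Sberbank column names and arbitrary cutoff years (1800 and 2025) for 'build_year':

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({  # toy rows mimicking the Sberbank columns
    'build_year': [1975, 0, 4965, 2001],
    'life_sq': [40, 35, 50, 30],
    'kitch_sq': [8, 60, 10, 6],
    'floor': [3, 2, 15, 4],
    'max_floor': [9, 5, 10, 12],
})

# Replace impossible values with nulls instead of keeping them
train.loc[(train['build_year'] < 1800) | (train['build_year'] > 2025),
          'build_year'] = np.nan
train.loc[train['kitch_sq'] > train['life_sq'], 'kitch_sq'] = np.nan
train.loc[train['floor'] > train['max_floor'], 'floor'] = np.nan
```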
(Or, you may consider removing these records if your data volume is large.)

Tip #6. Split data with different structures.
Let me use Sberbank Russian housing market data again as an example.
If you run

```python
import pandas as pd

train = pd.read_csv('train.csv')  # the Sberbank train table
train[(train['product_type'] == 'Investment')
      & (train['price_doc'] >= 4000000)
      & (train['price_doc'] <= 5000000)]['price_doc'].hist(bins=80)
```

and then replace 'Investment' in the code above with 'OwnerOccupier', you'll find that the two distributions look quite different. This is just one indicator that the two subsets have very different structures.
Actually, if you split the dataset into two different datasets based on ‘product_type’, you’ll find the model accuracy is improved.
For a tree model, doing this is like splitting the root node with condition ‘product_type = Investment’.
Sometimes your model could benefit from this trick, sometimes not (because each split reduces the data size), so use this trick carefully.
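The split itself can be sketched like this (toy rows standing in for the Sberbank table); you would then fit one model per segment:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Sberbank train table
train = pd.DataFrame({
    'product_type': ['Investment', 'OwnerOccupier', 'Investment', 'OwnerOccupier'],
    'full_sq': [45, 60, 80, 38],
    'price_doc': [4_500_000, 6_000_000, 7_200_000, 3_900_000],
})

# One sub-dataset (and later, one model) per product_type;
# this mimics forcing the root split on 'product_type'
segments = {name: grp.drop(columns='product_type')
            for name, grp in train.groupby('product_type')}
```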
Tip #7. Deal with highly correlated variables.
If there are several variables that are highly correlated with each other, here are some suggestions.
- If you want better interpretability and less model training time (while sacrificing a bit of model accuracy), you can select the one with the highest feature importance and discard the others.
- When doing feature selection, these correlated variables together may provide a lot of information, but because they're similar, it can happen that each individual feature has low feature importance. So select features with care when this happens.
- Logistic regression is affected by multicollinearity, but tree models are not. It won't hurt much if you just leave all the correlated features in your model.
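One common heuristic for the first suggestion, shown on toy data with an assumed 0.95 threshold: walk the upper triangle of the absolute correlation matrix and drop one variable from each highly correlated pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # perfectly correlated with 'a'
    'c': [5, 3, 8, 1, 9],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
```

In practice you'd pick which member of each pair to keep by feature importance rather than column order, as suggested above.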
Tip #8. Make sure the rows you're building your model on are truly DISTINCT.
Multiple records in the dataset could actually refer to the same entity.
Say we're building models at the person level, and we use 'name' to assign an ID to each person. This could lead to two types of errors: different people with the same name are treated as the same person, or the same person appears under different records (John Smith / John A. Smith) in the data.
There could be other issues with DOB / SSN.
Interested readers could google ‘record linkage’ for more info.
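A crude first-pass sketch of the idea (a real record-linkage pipeline would be far more careful): normalize the names, then block on the pair (normalized name, DOB). All names and dates below are invented.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['John Smith', 'john  smith', 'John A. Smith', 'Jane Doe'],
    'dob':  ['1980-01-02', '1980-01-02', '1980-01-02', '1991-07-15'],
})

# Crude normalization: lowercase, drop single-letter middle initials,
# collapse repeated whitespace
people['name_key'] = (people['name']
                      .str.lower()
                      .str.replace(r'\b[a-z]\.\s*', '', regex=True)
                      .str.replace(r'\s+', ' ', regex=True)
                      .str.strip())

# Treat (normalized name, DOB) as a first-pass entity key
dedup = people.drop_duplicates(subset=['name_key', 'dob'])
```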
Tip #9. Detect data issues from weird variable distributions.
Say we'd like to build a model using patient data, and we've collected the data from different sources (for example, different hospitals). Then we find that for hospital X, the average medical cost per patient is way lower than at the other hospitals. Confused by the result, we dive into the dataset and find there are record-linkage issues: a lot of records from hospital X should be grouped together into a smaller set with fewer patients.
This is just an easy example.
Be skeptical when you see weird stats in your data.
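A quick sanity check along these lines, with made-up claims data and an arbitrary "less than half the overall mean" flag:

```python
import pandas as pd

claims = pd.DataFrame({  # hypothetical per-record medical costs by source
    'hospital': ['A', 'A', 'B', 'B', 'X', 'X', 'X'],
    'cost': [1200, 900, 1100, 1300, 150, 180, 130],
})

# Per-source summary stats
stats = claims.groupby('hospital')['cost'].agg(['mean', 'count'])

# Flag sources whose mean cost deviates wildly from the overall mean
overall = claims['cost'].mean()
suspicious = stats[stats['mean'] < 0.5 * overall].index.tolist()
```

Anything flagged here is a prompt to investigate the source, not to delete the rows.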
Tip #10. How much data is enough for modeling?

Let me explain why it's tough to find a good answer to this question.
Let’s say we’d like to predict the house price of City A in 2019.
Say in our training dataset, we only have the label (house price).
If we assume that all records are collected independently, randomly drawn from the whole population with equal probability, then simply taking the average of the house prices (y1 + … + y_n)/n would minimize RMSE for the test set.
But in reality the above assumptions won't hold: the samples won't be i.i.d. (independent and identically distributed), the calculation of confidence intervals won't be very precise, we will also have a lot of other features in our dataset, our tree model will be more complicated than the estimator above, and the model predictions for each record won't be i.i.d. either.
Takeaway: Just get your hands dirty with the data first.
Don’t expect that the appropriate amount of data can be found before collecting them.
Thanks for reading! Comment below if you have any suggestions.
* * * * *

My other posts on Towards Data Science:

- Build XGBoost / LightGBM models on large datasets — what are the possible solutions?
- A step-by-step guide for creating advanced Python data visualizations with Seaborn / Matplotlib
- 10 Python Pandas tricks that make your work more efficient
- An interesting and intuitive view of AUC
- Plotting decision boundaries in 3D — Logistic regression and XGBoost
- XGBoost deployment made easy