Photo credit: Pixabay

ColumnTransformer Meets Natural Language Processing

How to combine several feature extraction mechanisms or transformations into a single transformer in a scikit-learn pipeline

Susan Li
Apr 24

Since publishing several articles on text classification, I have received inquiries about how to deal with mixed input feature types: for example, how to combine numeric, categorical and text features in a classification or regression model.
Therefore, I decided to write a post using an example to answer this question.
There are several different methods to append or combine different feature types. One method is scipy.sparse.hstack, which stacks sparse matrices horizontally. However, I will introduce another method, the new hot baby on the block: the ColumnTransformer from scikit-learn. If you would like to try it out, you will need to upgrade your scikit-learn to 0.20 or later.
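For contrast, the manual hstack approach might look something like this — a minimal sketch with invented columns, not code from this post:

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for a mixed-type frame (hypothetical columns)
df = pd.DataFrame({
    "brand": ["nike", "apple", "nike"],
    "description": ["running shoes", "new phone", "worn shoes"],
})

# Vectorize each feature type separately, then stack the resulting
# sparse matrices horizontally into one feature matrix
X_brand = OneHotEncoder().fit_transform(df[["brand"]])
X_text = TfidfVectorizer().fit_transform(df["description"])
X = sp.hstack([X_brand, X_text])
```

This works, but you have to track column order yourself and repeat the same bookkeeping at prediction time — exactly the boilerplate ColumnTransformer removes.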
The Data

An excellent data set for this demonstration comes from the Mercari Price Suggestion Challenge, in which we build a machine learning model to automatically suggest the right product prices. The data can be found here.
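The raw download is a tab-separated file. A tiny stand-in frame with the same columns (the values below are invented for illustration) looks like:

```python
import pandas as pd

# A miniature stand-in for the Mercari training data; the real file
# would be loaded with pd.read_csv(..., sep="\t")
df = pd.DataFrame({
    "name": ["Red Cotton T Shirt", "Leather Horse Statue"],
    "item_condition_id": [3, 1],
    "category_name": ["Men/Tops/T-shirts", None],
    "brand_name": [None, None],
    "price": [10.0, 44.0],
    "shipping": [1, 0],
    "item_description": ["No description yet", "New with tags."],
})
```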
Table 1

Our data contains heterogeneous data types: numeric, categorical and text. We want to use different pre-processing steps and transformations for those different types of columns. For example, we may want to one-hot encode the categorical features and apply a TfidfVectorizer to the text features. "price" is the target feature that we will predict.
Data Pre-processing

Target feature — price

df.describe()

Figure 1

Remove rows with price = 0 and explore the distribution of the target.

Figure 2

The target feature price is right-skewed. Since linear models like normally distributed data, we will transform price to make it more normally distributed.
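The transform and its inverse take only a couple of lines; np.log1p and np.expm1 are exact inverses, which matters later when converting predictions back to dollars (the values here are invented):

```python
import numpy as np

prices = np.array([3.0, 10.0, 26.0, 2000.0])  # right-skewed, like "price"
log_prices = np.log1p(prices)                 # log(1 + price); safe at 0

# After modeling on the log scale, invert predictions back to dollars
recovered = np.expm1(log_prices)
```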
Figure 3

df["price"] = np.log1p(df["price"])

Feature Engineering

Fill missing "category_name" with "other" and convert "category_name" to the category data type.
Fill missing "brand_name" with "unknown".

Determine the popular brands and set the rest to "other".

Fill missing "item_description" with "None".

Convert "item_condition_id" to the category data type.

Convert "brand_name" to the category data type.
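The steps above can be sketched as follows on a toy frame — the popularity cutoff for brands is illustrative, not taken from the post:

```python
import pandas as pd

# Toy frame with the relevant columns (values invented)
df = pd.DataFrame({
    "category_name": ["Men/Tops", None, "Women/Shoes"],
    "brand_name": ["Nike", "Nike", None],
    "item_description": ["soft tee", None, "worn once"],
    "item_condition_id": [1, 3, 2],
})

df["category_name"] = df["category_name"].fillna("other").astype("category")

df["brand_name"] = df["brand_name"].fillna("unknown")
popular = df["brand_name"].value_counts().index[:1]  # cutoff is illustrative
df.loc[~df["brand_name"].isin(popular), "brand_name"] = "other"
df["brand_name"] = df["brand_name"].astype("category")

df["item_description"] = df["item_description"].fillna("None")
df["item_condition_id"] = df["item_condition_id"].astype("category")
```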
Our features and target:

target = df.price.values

features = df[['name', 'item_condition_id', 'category_name', 'brand_name', 'shipping', 'item_description']].copy()

Split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

The following is how to apply ColumnTransformer.
Surprisingly, it is very simple.
Encode "item_condition_id" & "brand_name".

Apply CountVectorizer to "category_name" & "name".

Apply TfidfVectorizer to "item_description".

We can keep the remaining "shipping" feature by setting remainder='passthrough'. Its values are appended to the end of the transformed output:
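A sketch of what that transformer might look like — the encoder choice (OneHotEncoder) and the transformer names are my assumptions, since the original gist is not reproduced here. Note that the text vectorizers expect 1-D input, so they get a single column name as a string, while the encoder gets a list of columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    [
        # Encode the low-cardinality categorical columns (list of columns)
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["item_condition_id", "brand_name"]),
        # Text columns: pass a string, not a list, and use one
        # vectorizer per column
        ("category", CountVectorizer(), "category_name"),
        ("name", CountVectorizer(), "name"),
        ("description", TfidfVectorizer(), "item_description"),
    ],
    remainder="passthrough",  # "shipping" is appended untouched at the end
)

# Tiny invented frame just to show the fit/transform round trip
df = pd.DataFrame({
    "item_condition_id": [1, 3],
    "brand_name": ["Nike", "unknown"],
    "category_name": ["Men/Tops", "Women/Shoes"],
    "name": ["red tee", "leather boots"],
    "item_description": ["soft cotton tee", "worn once"],
    "shipping": [1, 0],
})
X = preprocessor.fit_transform(df)
```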
Model & Evaluation

We will combine this pre-processing step based on the ColumnTransformer with a regression model in a Pipeline to predict the price.
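A sketch of that pipeline — the regressor (Ridge) is my guess, not necessarily the estimator used in the notebook, and the data here is a minimal invented frame:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Minimal invented data; in the post this would be the Mercari frame
df = pd.DataFrame({
    "brand_name": ["Nike", "other", "Nike", "other"],
    "item_description": ["soft tee", "leather boots", "running shoes", "statue"],
    "price": [10.0, 44.0, 25.0, 60.0],
})

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand_name"]),
    ("desc", TfidfVectorizer(), "item_description"),
])

model = Pipeline([
    ("preprocess", preprocessor),   # ColumnTransformer step
    ("regressor", Ridge()),         # linear model on the combined features
])

# Train on log1p(price), as above, and invert predictions with expm1
model.fit(df, np.log1p(df["price"]))
pred = np.expm1(model.predict(df))
```

Because the ColumnTransformer sits inside the Pipeline, the same fitted vocabularies and encodings are reapplied automatically at prediction time.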
The Jupyter notebook can be found on GitHub.
Enjoy the rest of the week!

References:

Pipelines and composite estimators – scikit-learn documentation
Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. scikit-learn.org

Introducing the ColumnTransformer: applying different transformations to different features in a…
Real-world data often contains heterogeneous data types. When processing the data before applying the final prediction… jorisvandenbossche.github.io