Illustration

The dataset I’m going to use for this illustration can be found on Kaggle via this link.
House Sales in King County, USA: Predict house price using regression

The goal of this illustration is to go through the steps involved in writing our own custom transformers and pipelines to pre-process the data leading up to the point where it is fed into a machine learning algorithm to either train the model or make predictions.
There may very well be better ways to engineer features for this particular problem than depicted in this illustration since I am not focused on the effectiveness of these particular features.
The goal of this illustration is to familiarize the reader with the tools they can use to create transformers and pipelines that allow them to engineer and pre-process features any way they want, for any dataset, as efficiently as possible.
Let’s get started then.
This dataset contains a mix of categorical and numerical independent variables which, as we know, will need to be pre-processed separately and in different ways.
This means that initially they’ll have to go through separate pipelines to be pre-processed appropriately and then we’ll combine them together.
So the first step in both pipelines would have to be to extract the appropriate columns that need to be pushed down for pre-processing.
To write a class and let Python know that it inherits from one or more classes, we list the parent classes in parentheses after the class name. For every transformer we write, we get to inherit most of its behavior from the TransformerMixin and BaseEstimator base classes.
Below is the code for our first custom transformer called FeatureSelector.
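A minimal sketch of what such a FeatureSelector could look like, assuming the input is a pandas data frame. The toy data frame and its values are hypothetical and only there to show the behavior. The mixin is listed before BaseEstimator, which is the ordering scikit-learn's own estimators follow.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(TransformerMixin, BaseEstimator):
    """Select a subset of columns from a pandas DataFrame."""

    def __init__(self, feature_names):
        # Store the argument unchanged so get_params/set_params keep working.
        self.feature_names = feature_names

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so calls can be chained.
        return self

    def transform(self, X, y=None):
        # Return only the columns requested at initialization.
        return X[self.feature_names]


# Toy frame with hypothetical values.
houses = pd.DataFrame({
    "bedrooms": [3, 2],
    "bathrooms": [1.0, 2.5],
    "waterfront": [0, 1],
})
selector = FeatureSelector(["bedrooms", "bathrooms"])
selected = selector.fit_transform(houses)  # fit_transform comes from TransformerMixin
print(list(selected.columns))
```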
The transform method for this transformer simply extracts and returns a pandas data frame with only those columns whose names were passed to it as an argument during its initialization.
As you can see, we put BaseEstimator and TransformerMixin in parentheses while declaring the class to let Python know our class is going to inherit from them.
Like all the transformers we’re going to write, the fit method only needs to return self.
The transform method is what we’re really writing to make the transformer do what we need it to do.
In this case it simply means returning a pandas data frame with only the selected columns.
Now that the transformer that will handle the first step in both pipelines has been written, we can write the transformers that will handle the other steps in their appropriate pipelines, starting with the pipeline that will handle the categorical features.
Categorical Pipeline

Below is a list of features our custom transformer will deal with, and how, in our categorical pipeline.
date : The dates in this column are of the format ‘YYYYMMDDT000000’ and must be cleaned and processed to be used in any meaningful way.
The constructor for this transformer will allow us to specify a list of values for the parameter ‘use_dates’ depending on whether we want to create a separate column for the year, month and day, or some combination of these values, or simply disregard the column entirely by passing in an empty list.
By not hard coding the specifications for this feature, we give ourselves the ability to try out different combinations of values whenever we want without having to rewrite code.
waterfront : Whether the house is waterfront property or not.
Convert to binary: Yes or No.
view : How many times the house has been viewed.
Most of the values are 0.
The rest are very thinly spread between 1 and 4.
Convert to binary: Yes or No.
yr_renovated : The year the house was renovated in.
Most of the values are 0, presumably for never, while the rest are very thinly spread between some years.
Convert to binary: Yes or No.

Once all these features are handled by our custom transformer in the aforementioned way, they will be converted to a NumPy array and pushed to the next and final transformer in the categorical pipeline: a simple scikit-learn one-hot encoder which returns a dense representation of our pre-processed data.
Below is the code for our custom transformer.
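A minimal sketch of a categorical transformer along these lines. It assumes the date strings follow the ‘YYYYMMDDT000000’ format described above, that ‘use_dates’ holds some of the hypothetical values "year", "month" and "day", and that any value greater than 0 maps to "Yes". The toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CategoricalTransformer(TransformerMixin, BaseEstimator):
    """Split the date column and binarize waterfront, view and yr_renovated."""

    # Character positions of each date part in 'YYYYMMDDT000000'.
    _DATE_SLICES = {"year": (0, 4), "month": (4, 6), "day": (6, 8)}

    def __init__(self, use_dates=("year", "month", "day")):
        self.use_dates = use_dates

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()  # leave the caller's data frame untouched
        date = X["date"].astype(str)
        # Add one column per requested date part; an empty list adds none.
        for part in self.use_dates:
            start, stop = self._DATE_SLICES[part]
            X[part] = date.str.slice(start, stop)
        X = X.drop(columns="date")
        # Assumption: anything above 0 counts as "Yes".
        for column in ("waterfront", "view", "yr_renovated"):
            X[column] = np.where(X[column] > 0, "Yes", "No")
        return X.to_numpy()


# Toy rows with hypothetical values.
houses = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "waterfront": [0, 1],
    "view": [0, 3],
    "yr_renovated": [0, 1991],
})
result = CategoricalTransformer(use_dates=["year", "month"]).fit_transform(houses)
print(result)
```

The date parts are kept as strings here so that the one-hot encoder in the next step treats them as categories.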
Numerical Pipeline

Below is a list of features our custom numerical transformer will deal with, and how, in our numerical pipeline.
bedrooms : Number of bedrooms in the house.
Pass as it is.
bathrooms : Number of bathrooms in the house.
The constructor for this transformer will have a parameter ‘bath_per_bed’ that takes in a Boolean value.
If True, then the transformer will create a new column by computing bathrooms/bedrooms to calculate the number of bathrooms per bedroom and drop the original bathroom column.
If False, then it will just pass the bathroom column as it is.
sqft_living : Size of the living area of the house in square feet.
Pass as it is.
sqft_lot : Total size of the lot in square feet.
Pass as it is.
floors : Number of floors in the house.
Pass as it is.
condition : Discrete variable describing the condition of the house with values from 1–5.
Pass as it is.
grade : Overall grade given to the housing unit, based on King County grading system with values from 1–13.
Pass as it is.
sqft_basement : Size of the basement in the house in square feet if any.
0 for houses that don’t have basements.
Pass as it is.
yr_built : The year the house was built in.
The constructor for this transformer will have another parameter ‘years_old’ that also takes in a Boolean value.
If True, then the transformer will create a new column with the age of the house in 2019, computed by subtracting the year it was built in from 2019, and it will drop the original yr_built column.
If False, then it will just pass the yr_built column as it is.
Once all these features are handled by our custom numerical transformer in the numerical pipeline as mentioned above, the data will be converted to a NumPy array and passed to the next step in the numerical pipeline, an Imputer (SimpleImputer in current scikit-learn releases), which is another kind of scikit-learn transformer.
The Imputer will compute the column-wise median and fill in any NaN values with the appropriate median values.
From there the data will be pushed to the final transformer in the numerical pipeline, a simple scikit-learn StandardScaler.
Below is the code for the custom numerical transformer.
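A minimal sketch of a numerical transformer with the two Boolean parameters described above. The names of the new columns and the handling of houses with zero bedrooms (turning the resulting infinite ratio into NaN so the imputer can fill it) are assumptions of this sketch, and the toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class NumericalTransformer(TransformerMixin, BaseEstimator):
    """Optionally derive bathrooms per bedroom and house age in 2019."""

    def __init__(self, bath_per_bed=True, years_old=True):
        self.bath_per_bed = bath_per_bed
        self.years_old = years_old

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()  # leave the caller's data frame untouched
        if self.bath_per_bed:
            X["bath_per_bed"] = X["bathrooms"] / X["bedrooms"]
            X = X.drop(columns="bathrooms")
        if self.years_old:
            X["years_old"] = 2019 - X["yr_built"]
            X = X.drop(columns="yr_built")
        # Assumption: a ratio with zero bedrooms becomes NaN, to be imputed later.
        X = X.replace([np.inf, -np.inf], np.nan)
        return X.to_numpy()


# Toy rows with hypothetical values; the second house has no bedrooms.
houses = pd.DataFrame({
    "bedrooms": [2, 0],
    "bathrooms": [1.0, 1.5],
    "yr_built": [1999, 2009],
})
result = NumericalTransformer().fit_transform(houses)
print(result)
```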
Combining the pipelines together

Now that we’ve written our numerical and categorical transformers and defined what our pipelines are going to be, we need a way to combine them horizontally.
We can do that using the FeatureUnion class in scikit-learn.
We can create a feature union class object in Python by giving it two or more pipeline objects consisting of transformers.
Calling the fit_transform method for the feature union object pushes the data down the pipelines separately, and then the results are combined and returned.
In our case, since the first step for both of our pipelines is to extract the appropriate columns for each pipeline, combining them using feature union and fitting the feature union object on the entire dataset means that the appropriate set of columns will be pushed down the appropriate set of pipelines and combined together after they are transformed! Isn’t that awesome?
I didn’t even tell you the best part yet.
It can parallelize the computation for us! That’s right, when its n_jobs parameter is set to more than one job, it’ll transform the data in parallel and put it back together! So it will most likely be faster than any script that deals with this kind of preprocessing linearly, where it’s most likely a little more work to parallelize it.
We don’t have to worry about doing that manually anymore.
Our FeatureUnion object will take care of that as many times as we want.
All we have to do is call fit_transform on our full feature union object.
Below is the code that creates both pipelines using our custom transformers and others and then combines them together.
Now you might have noticed that I didn’t include any machine learning models in the full pipeline.
The reason for that is that I simply can’t.
The FeatureUnion object takes in pipeline objects containing only transformers.
A machine learning model is an estimator.
The workaround is to make another Pipeline object, pass my full pipeline object as the first step and add a machine learning model as the final step.
The full preprocessed dataset, which will be the output of the first step, will simply be passed down to my model, allowing it to function like any other scikit-learn pipeline you might have written! Here’s the code for that.
We simply fit the pipeline on an unprocessed dataset and it automates all of the preprocessing and fitting with the tools we built.
The appropriate columns are split, then they’re pushed down the appropriate pipelines where they go through 3 or 4 different transformers each (7 in total!) with arguments we decide on, and the pre-processed data is put back together and pushed down the model for training! Calling predict does the same thing for the unprocessed test data frame and returns the predictions!
Here’s a simple diagram I made that shows the flow for our machine learning pipeline.
Simple flow diagram for our pipeline

In addition to fit_transform, which we got for free because our transformer classes inherited from the TransformerMixin class, we also have get_params and set_params methods for our transformers without ever writing them, because our transformer classes also inherit from the BaseEstimator class.
These methods will come in handy because we wrote our transformers in a way that allows us to manipulate how the data will get preprocessed by providing different arguments for parameters such as use_dates, bath_per_bed and years_old.
Just using the simple product rule, that’s about 108 parameter combinations I can try for my data just for the preprocessing part, which I can set using set_params without ever re-writing a single line of code!
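A minimal sketch of reading and changing those parameters on a pipeline. It uses a trimmed version of a numerical transformer with the bath_per_bed and years_old parameters; the step names and toy rows are hypothetical. Inside a pipeline, a nested parameter is addressed as the step name, two underscores, then the parameter name.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class NumericalTransformer(TransformerMixin, BaseEstimator):
    """Trimmed transformer with two switchable preprocessing options."""

    def __init__(self, bath_per_bed=True, years_old=True):
        self.bath_per_bed = bath_per_bed
        self.years_old = years_old

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        if self.bath_per_bed:
            X["bath_per_bed"] = X["bathrooms"] / X["bedrooms"]
            X = X.drop(columns="bathrooms")
        if self.years_old:
            X["years_old"] = 2019 - X["yr_built"]
            X = X.drop(columns="yr_built")
        return X.to_numpy()


numerical_pipeline = Pipeline(steps=[
    ("num_transformer", NumericalTransformer()),
    ("std_scaler", StandardScaler()),
])

# get_params (inherited from BaseEstimator) exposes nested parameters.
default_flag = numerical_pipeline.get_params()["num_transformer__bath_per_bed"]
print(default_flag)

# set_params changes the preprocessing without touching the class code.
numerical_pipeline.set_params(
    num_transformer__bath_per_bed=False,
    num_transformer__years_old=False,
)

# Toy rows with hypothetical values.
houses = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "bathrooms": [1.0, 1.5, 3.0],
    "yr_built": [1999, 2009, 1955],
})
features = numerical_pipeline.fit_transform(houses)
print(features.shape)
```

The same double-underscore names are what a grid search over the pipeline would use as keys in its parameter grid.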
Since this pipeline functions like any other pipeline, I can also use GridSearchCV to tune the hyper-parameters of whatever model I intend to use with it!

There you have it.
Now you know how to write your own fully functional custom transformers and pipelines on your own machine to automate handling any kind of data, the way you want it, using a little bit of Python magic and scikit-learn.
There is obviously room for improvement, such as validating that the data coming from the source is in the form you expect before it ever gets to the pipeline, and giving the transformers the ability to handle and report unexpected errors.
However, just using the tools in this article should make your next data science project a little more efficient and allow you to automate and parallelize some tedious computations.
If there is anything that I missed, or something was inaccurate, or if you have absolutely any feedback, please let me know in the comments.
I would greatly appreciate it.