I will run multiple experiments and compare them with the base model.
I will cover that later, in the Results section.
Train/Test Splitting
We need to split our data into a train set and a test set so that we can validate the accuracy of our model before we push it into production.
Two things are important to note when creating the train and test sets:
The distribution of the train set and the test set should be similar.
A bad example would be a train set in which all the flats are from the east region and a test set in which all the flats are from the west region.
Clearly, a model trained under such conditions would not be very effective.
The distribution of the test set and of real-life data should be similar.
The test set serves as a safe estimate of how well your system will perform before you roll it out to production.
If your test set does not capture the distribution of real-life data, the results you get from it will not give you much confidence.
There is a good chance that your system will perform badly in production.
To produce good train and test sets, we can use stratified sampling.
Before we do stratified sampling, we need to pick an important key feature and make sure that the train and test sets have a similar distribution of that feature.
For example, in this project I have chosen location as the key feature for stratified sampling.
The flats in the train and test sets should have a similar distribution in terms of location.
That means if your train set has 50% of its flats in central, 30% in west and 20% in east, you should maintain this ratio in your test set.
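The ratio-preserving split described above can be sketched with scikit-learn's train_test_split and its stratify argument. This is only an illustration: the toy DataFrame, its column names (region, floor_area, price) and the 80/20 split are assumptions of mine, not the project's actual data or settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 flats with a made-up "region" column
# in the 50% / 30% / 20% mix used in the example above.
flats = pd.DataFrame({
    "region": ["central"] * 50 + ["west"] * 30 + ["east"] * 20,
    "floor_area": range(100),
    "price": range(300_000, 400_000, 1_000),
})

# stratify= keeps the region proportions the same in both splits.
train_set, test_set = train_test_split(
    flats, test_size=0.2, stratify=flats["region"], random_state=42
)

# Share of each region in the two splits.
train_ratio = train_set["region"].value_counts(normalize=True)
test_ratio = test_set["region"].value_counts(normalize=True)
print(train_ratio.round(2).to_dict())
print(test_ratio.round(2).to_dict())
```

Without stratify, a random split only matches the region mix on average, so a small test set can drift away from it by chance.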
Experiment
A huge part of data science is about carrying out experiments: creating new features, changing models, tuning hyperparameters and so on.
The effectiveness of all these activities needs to be measured by running a lot of experiments.
In this section, I am going to cover a few tips and tools that can help you run your experiments more effectively.
MLflow
MLflow is a platform that helps you streamline your machine learning development.
There are multiple sub-projects under MLflow, such as:
MLflow Tracking: helps you track the parameters, results and metrics of your experiments and compare them through an interactive UI
MLflow Projects: helps you package your code into reproducible runs through Docker and Conda
MLflow Models: helps you package the model so that you can share it easily with others
In this project, I have used MLflow to track my experiments and log the model after training, so that I can easily share it with others to reproduce my results.
The snippet below shows the interactive UI provided by MLflow:
[Image: MLflow UI]
It is good practice to leave a description of any changes or extra processing that you have done in the experiment.
This will help you later when you analyze the results.
Use a smaller sample size for experiments
I learnt this advice from an amazing instructor, Jeremy Howard.
He is a co-founder of fast.ai, which carries the mission of making AI education free and accessible to anyone.
fast.ai offers an awesome free online course about machine learning.
Do check out their courses if you are interested in getting your hands dirty and kick-starting your machine learning journey.
So what does this actually mean?
In the earlier section, we explained the train/test split, which lets us get feedback on our model.
Now we can go one step further and create an even smaller sample from our training set, meant purely for experimenting.
Why are we doing this?
When we are designing our models or adding new features, we want quick feedback on everything we try.
Waiting 5–10 minutes to experiment on one thing is not going to be productive.
Hence, with a smaller sample size, you can run your experiments faster, gain feedback and make changes.
How small should the sample size be?
Rule of thumb: it should be small but representative enough that you can run an experiment in less than 20 seconds and still get a good rough estimate of the results.
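One way to draw such a small but representative sample is to take the same fraction from every region, so the experiment sample keeps the region mix of the full training set. The sketch below assumes a pandas DataFrame with a region column; the synthetic data and the 5% fraction are placeholders of mine, and the right fraction for you is whatever keeps a run under the 20-second mark.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full training set.
rng = np.random.default_rng(0)
train_set = pd.DataFrame({
    "region": rng.choice(["central", "west", "east"], size=10_000, p=[0.5, 0.3, 0.2]),
    "price": rng.normal(400_000, 50_000, size=10_000),
})

# Draw 5% from every region so the small sample keeps the same region mix.
experiment_set = train_set.groupby("region").sample(frac=0.05, random_state=0)

print(len(train_set), len(experiment_set))
print(experiment_set["region"].value_counts(normalize=True).round(2).to_dict())
```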
Create our machine learning model
Sklearn provides a lot of handy modules that help us create our models easily.
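You can create multiple models and run all of them in a single experiment. See the code below for reference:

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import ElasticNet, Lasso
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler
    from sklearn.svm import SVR

    # RobustScaler scales features using statistics that are robust to outliers
    elasticnet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.00001, random_state=1))
    svm = make_pipeline(RobustScaler(), SVR(C=140))
    lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
    GBoost = GradientBoostingRegressor(loss='huber', min_samples_split=10, n_estimators=200)

    # Notes describing what changed in this experiment
    experiment_description = [
        'Add in sale year and sales month feature',
        'Add more data from 2015',
    ]

    # run_experiment is the author's own helper; its definition is not shown in this article
    run_experiment([elasticnet, svm, lasso, GBoost], experiment_description)

The code might look very minimal, but it requires a great amount of knowledge of the theory behind those models.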
For instance, you need to know the mathematical theory behind each model, what its hyperparameters are, and where and when to use it.
Each model deserves a whole article to explain the theory and mechanism behind it thoroughly.
Look up the resources online; there is a ton of useful material out there.
Ensemble
This is a very useful technique in machine learning for boosting the accuracy of your overall model.
It creates a bunch of different models (for instance random forest, elastic net, XGBoost) and combines their results to improve the overall result.
One simple and common ensemble technique is to average the predictions of all the models.
You might be wondering how this helps.
The rationale is that different models learn the patterns of the data set from different perspectives.
By gathering and averaging the results, we could see some performance gain.
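Here is a minimal sketch of simple averaging. It assumes synthetic data from make_regression and three arbitrarily chosen regressors with placeholder hyperparameters; none of this is the project's actual data or settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the flat price data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    ElasticNet(alpha=0.1, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

# One column of test-set predictions per fitted model.
predictions = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in models]
)

# The ensemble prediction is the plain mean across the models.
averaged = predictions.mean(axis=1)
print(predictions.shape, averaged.shape)
```

Every model gets the same weight here, whether it is the strongest or the weakest of the group.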
Stacking
The ensemble technique that I am using here is stacking.
This is a slightly more advanced technique than averaging the models.
Nevertheless, what it does under the hood is very simple!
Below is the simple architecture of stacking:
[Image from Wikimedia Commons]
We have multiple models learning on the data and outputting many different predictions.
Instead of averaging the predictions, we send them to a meta learner, which is another machine learning model that gives us the final prediction.
The meta learner decides the weights on the predictions made by the different models.
Some models might perform better than others, so the meta learner will optimize and find the best combination.
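A minimal sketch of this idea, assuming scikit-learn 0.22 or later (which provides StackingRegressor). The synthetic data, the base models, their hyperparameters and the choice of LinearRegression as the meta learner are my own placeholders, not necessarily what the project uses.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data standing in for the flat price data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models: each one produces its own prediction.
base_models = [
    ("elasticnet", make_pipeline(RobustScaler(), ElasticNet(alpha=0.1, random_state=1))),
    ("lasso", make_pipeline(RobustScaler(), Lasso(alpha=0.1, random_state=1))),
    ("gboost", GradientBoostingRegressor(n_estimators=50, random_state=1)),
]

# The meta learner is trained on cross-validated predictions of the base models.
stack = StackingRegressor(
    estimators=base_models, final_estimator=LinearRegression(), cv=5
)
stack.fit(X_train, y_train)
stacked_predictions = stack.predict(X_test)

# With a linear meta learner, the coefficients are the learnt weight per base model.
print(stack.final_estimator_.coef_)
```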
As daunting as the whole process may sound, it actually requires very little code, as the heavy lifting is done by our awesome sklearn library!
Metrics
Time to see some results!
But first we need to understand the metric before analyzing the results.
The metric that we are using here is Root Mean Square Logarithmic Error (RMSLE).
I believe most of you have heard of and learnt Root Mean Square Error (RMSE) before.
The difference between the two metrics is that we take the log of the target value and the predicted value before calculating the root mean square.
See the equation below:
[Image: Equation of root mean square logarithmic error]
Why are we using RMSLE instead of RMSE?
By taking the log, we look only at the relative (percentage) difference between the real and predicted values.
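Written out, the two metrics look as follows. The original equation image is not reproduced here, so this form is an assumption: it follows the common convention of adding 1 before taking the log so that zero values are handled. The symbols are chosen for illustration, with p_i the predicted price, a_i the actual price and n the number of flats.

```latex
\begin{align*}
\mathrm{RMSE}  &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - a_i\right)^2} \\
\mathrm{RMSLE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}
                = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log\frac{p_i + 1}{a_i + 1}\right)^2}
\end{align*}
```

The last form shows that each term depends only on the ratio between the predicted and actual values, which is why RMSLE measures relative rather than absolute error.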
[Image: Comparison between RMSE and RMSLE]
RMSE gives a very big penalty if the difference between the target value and the predicted value is numerically big.
RMSLE, in contrast, looks at the percentage difference and gives roughly the same loss score whenever the percentage error is the same, regardless of how expensive the flat is.
In our project, the prices of different flats vary widely, and we would like all errors to be treated on a percentage basis; that is the main reason why RMSLE is used.
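To make the difference concrete, here is a small sketch with two made-up flats, each predicted 10% too high. The prices are hypothetical, and the rmsle function assumes the common log(1 + x) form of the metric, which may differ slightly from the exact formula used in the project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error on the raw values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean square error on log(1 + value); assumes non-negative inputs."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Two hypothetical flats, each predicted 10% too high: (actual, predicted).
cheap = ([200_000], [220_000])
expensive = ([1_000_000], [1_100_000])

# RMSE grows with the price level; RMSLE stays almost identical.
print("RMSE :", rmse(*cheap), rmse(*expensive))
print("RMSLE:", round(rmsle(*cheap), 4), round(rmsle(*expensive), 4))
```

Under RMSE the expensive flat contributes five times the error of the cheap one, even though both predictions are off by the same percentage.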
Results
[Image: Comparison of results between the effects of feature creation]
The lower the loss score, the smaller the difference between the target and predicted values, and hence the better the result.
From the comparison above, we can see that, among all the feature creations, mapping flat location to region is the most helpful.
By combining all the extra features that we created, the overall improvement is quite significant.
Compare the first and last columns to see the improvement in the loss score.
The data that we use for all these experiments is only a very small subset of the full data.
We can run a full experiment (using all the train data) and compare ‘the most basic model’, Model A, with ‘the model with all the extra features’, Model B, to verify that our feature creation is actually helpful.
See the results below:
[Image: Comparison between most basic models and models with all extra features]
Model B performs much better across all the models that we are using.
So feature creation definitely helps a lot!
I would like to highlight the best performing model among these, which is the stacking model.
The stacking model of Model B has a loss score 15% lower than that of Model A, which is an amazing improvement.
Conclusion
In a nutshell, I have highlighted pretty much the whole process of building a flat price predictor from scratch.
You can download the source code from this GitHub repository.
Besides the machine learning code, I have also created a web application to perform real-time prediction.
Click here to see the demo.
The web application code is in the same repository, and you can refer to it if you find it useful for your use case.
Leave me a comment or ping me on LinkedIn if you have any doubts or questions regarding any part of the process; I would be happy to discuss.
Enjoy the reading, and thanks.
About the author
Sie Huai Gan is a software engineer at Visa.
He is a full-stack software engineer working in the Data Product & Development team.
His day-to-day job involves building big data pipelines to process clients' data and generate insights for them.
He has been guiding people from different backgrounds and helping them kick-start their software engineering careers.
If you need any career advice or business consultation on how to improve your business with software technology or AI, connect with him on LinkedIn.
You can also book an appointment with him online.