It’s quite simple: it is an energy performance rating of a household, and there is an example in the following figure.
It’s like the rating that can be found on appliances, for example, and it needs to be refreshed for each new tenant or landlord.
The data source opened by the Ministry of Housing, Communities & Local Government is very complete (more than 15 million certificates), and it holds more than just a simple rating (data dictionary): there is information on the glazing, the energy consumption, the floor area, etc.
Nomis data
The Nomis website “is a service provided by the Office for National Statistics, ONS, to give free access to the most detailed and up-to-date UK labour market statistics from official sources”, and this website holds a lot of information on UK citizens collected during the different censuses.
The information has a good level of detail, and the census data are mostly used to create the ACORN groups that I defined in my article on the smart meter.
From this portal I extracted information on UK citizens at a district level on their:
- Occupation
- Qualification
- NS-SEC (National Statistics Socio-economic Classification)
- Population
Let’s start now with Dataiku DSS.
Data processing in DSS
All the data extracted for this project are CSV files; the following animation illustrates the process to create a dataset in DSS.
Files can easily be dragged and dropped into DSS, and you get an overview of the data, its quality, etc.
In the free version a SQL database can be connected too, and it’s very easy.
For this project the idea is to connect the EPC data and the Nomis data, so a big part of the processing is to aggregate the certificates at a district level so that they can be connected to the Nomis data.
Here is an overview of the complete process.
The following figure shows the process to prepare the EPC dataset so that it can be connected with the Nomis data.
There is a preparation phase (the brush icon) that selects the right columns and processes the postcode to get the district code.
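The postcode step can be sketched in plain Python. This is only an illustration, not the actual DSS prepare recipe: it assumes that "district" means the postcode district (the outward part of a UK postcode, i.e. everything before the last three characters), and the function name `postcode_district` is hypothetical.

```python
def postcode_district(postcode: str) -> str:
    """Return the outward part of a UK postcode, e.g. 'SW1A 1AA' -> 'SW1A'.

    Assumption: the district used for the aggregation is the postcode
    district, and the inward part of a postcode is always 3 characters.
    """
    # Remove any whitespace and normalise the case.
    compact = "".join(postcode.split()).upper()
    # A full UK postcode has between 5 and 7 characters without the space.
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"unexpected postcode length: {postcode!r}")
    # Drop the 3-character inward code to keep the district.
    return compact[:-3]


districts = [postcode_district(p) for p in ["SW1A 1AA", "m1 1ae", "EC1A1BB"]]
```

Malformed postcodes raise an error here; in a real pipeline it may be preferable to flag them and keep going.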
A second step, which could have been merged with the previous one, prepares the dataset to find the most recent EPC of each household (by putting the inspection date in the correct format).
To get the last inspection date of each household in the dataset, there is a group by function (the square, triangle and circle icon on the process image); the following animation illustrates the process.
Now that we have the cleaned EPC data, and a list of the last inspection date for each household in another table, these two datasets are joined with the join function (the join logo in the process); there is a presentation of the join menu, where you can select the join key and the columns to keep.
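For readers who prefer code to icons, the same three steps (date parsing, group by, join) could look like this in pandas. It is a sketch on toy data, not an export of the DSS flow, and the column names `household_id`, `inspection_date` and `rating` are assumptions rather than the real EPC column names.

```python
import pandas as pd

# Toy certificates: household A has been inspected twice.
epc = pd.DataFrame({
    "household_id": ["A", "A", "B"],
    "inspection_date": ["2015-03-01", "2018-06-12", "2017-01-20"],
    "rating": ["E", "C", "D"],
})

# Step 1: put the inspection date in a proper date format.
epc["inspection_date"] = pd.to_datetime(epc["inspection_date"])

# Step 2: group by household to get its last inspection date.
latest = epc.groupby("household_id", as_index=False)["inspection_date"].max()

# Step 3: join back on both keys to keep only the most recent certificate.
latest_epc = epc.merge(latest, on=["household_id", "inspection_date"], how="inner")
```

If a household had two certificates on the same day, both rows would survive the join, so a real pipeline may need an extra tie-break.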
Finally, there is a group by function per district, type of household, type of heating system and EPC rating.
The final step is a Python script that aggregates the information at the district level with the pivot function of pandas (count of EPCs per rating and type of household), but I could have used the pivot function of DSS.
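The original script is not reproduced here, but a pivot of this kind can be sketched as follows. The column names (`district`, `property_type`, `rating`, `n_certificates`) and the counts are invented for the illustration.

```python
import pandas as pd

# Toy output of the group by step: one row per district, type and rating.
counts = pd.DataFrame({
    "district": ["M1", "M1", "M1", "SW1A"],
    "property_type": ["Flat", "Flat", "House", "Flat"],
    "rating": ["C", "D", "C", "B"],
    "n_certificates": [10, 4, 7, 3],
})

# One row per district, one column per (type of household, rating) pair.
wide = counts.pivot_table(
    index="district",
    columns=["property_type", "rating"],
    values="n_certificates",
    aggfunc="sum",
    fill_value=0,
)

# Flatten the two-level column index into names such as "Flat_C".
wide.columns = [f"{ptype}_{rating}" for ptype, rating in wide.columns]
wide = wide.reset_index()
```

The result has one line per district, which is the shape needed to join it with other district-level tables.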
Et voilà, we have the EPC data aggregated at a district level, which gives us knowledge of the ratings and the kinds of household at this level.
This data can easily be joined with the Nomis data with the join function.
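Outside DSS, this join could be written with pandas as in the sketch below. The table names, the columns and all the numbers are made up; the only assumption that matters is that both tables share a `district` key.

```python
import pandas as pd

# Toy district-level tables (all values are invented).
epc_by_district = pd.DataFrame({
    "district": ["M1", "SW1A", "LS1"],
    "Flat_C": [10, 0, 5],
})
nomis_by_district = pd.DataFrame({
    "district": ["M1", "SW1A"],
    "population": [12000, 3000],
})

# Left join on the district; validate checks that the key is unique on
# both sides, and indicator adds a "_merge" column telling where rows matched.
joined = epc_by_district.merge(
    nomis_by_district,
    on="district",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Districts with certificates but no census row.
unmatched = joined.loc[joined["_merge"] == "left_only", "district"].tolist()
```

Checking the unmatched districts is a cheap way to spot postcode parsing problems before going further.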
I used some in-house functions of DSS to do the join and the group by, but I could have used:
- a Python or R script to build the dataset
- a SQL script if the data were in SQL tables
- Hive or Impala in the case of a “big data” configuration
Now that the data are available, let’s make an analysis and a dashboard to wrap up some results.
Data analysis in DSS
The analysis is going to be super high level; it’s only here to show the features of DSS.
This is an animation of a dashboard that I built with DSS.
Building this dashboard is much easier than building a Tableau one (in my opinion), but both share the very useful drag-and-drop approach to build each graph.
There is also a way to analyse the data directly from the dataset with the lab tool, where in-house functions can be used (to determine correlations, for example) or a script can be written; in this case I chose Python to make some plots with seaborn.
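As an idea of what such a script can contain, here is a minimal correlation check with pandas. The values are invented for the example and say nothing about the real EPC data; the column names are assumptions too.

```python
import pandas as pd

# Invented sample: floor area (m2) and energy consumption of five households.
sample = pd.DataFrame({
    "total_floor_area": [45.0, 60.0, 82.0, 110.0, 150.0],
    "energy_consumption": [310.0, 280.0, 295.0, 240.0, 220.0],
})

# Pairwise Pearson correlation between the numeric columns.
correlations = sample.corr()
area_vs_consumption = correlations.loc["total_floor_area", "energy_consumption"]
```

The `correlations` table can then be passed to seaborn's `heatmap` function to get a plot instead of a table.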
The analysis part is really cool and I think it could fit a lot of needs, but the other really impressive part is the ML part, to build models based on the processed data.
Model serving in DSS
There are multiple ways to build a model, but first let’s define a purpose for this part:
“Build an energy rating estimator based on the location, the total floor area and the type of heating of the household”
DSS gives you three “levels” to build a model but, let’s be honest, it’s the same interface; your journey in the model configuration just starts higher in the menu hierarchy.
In the menu to build the model, it is possible to:
- prepare the training and testing sets
- choose the evaluation metric
- pick the features
- choose the models and the parameters for the grid search
- compare the models after the testing part
There is a simple animation that gives an overview of the features of the model builder.
The interesting part is that you can use pre-built functions (I presume they are scikit-learn functions) or write your own Python code.
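To give an idea of what these menu steps correspond to in code, here is a rough scikit-learn equivalent. It is not what DSS runs internally (I don't know that); the toy data, the column names and the choice of a decision tree are all assumptions made for the illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: in this fake sample the rating only depends on the heating type.
rows = []
for i in range(40):
    gas = i % 2 == 0
    rows.append({
        "district": "M1" if i % 4 < 2 else "SW1A",
        "total_floor_area": 40 + 3 * i,
        "heating_type": "gas boiler" if gas else "electric heaters",
        "rating": "C" if gas else "E",
    })
data = pd.DataFrame(rows)

# Prepare the training and testing sets, and pick the features.
features = ["district", "total_floor_area", "heating_type"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["rating"],
    test_size=0.25, random_state=0, stratify=data["rating"],
)

# One-hot encode the categorical features, keep the floor area as is.
preprocess = ColumnTransformer(
    [("categories", OneHotEncoder(handle_unknown="ignore"),
      ["district", "heating_type"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Choose the evaluation metric and the parameters for the grid search.
search = GridSearchCV(
    model,
    param_grid={"tree__max_depth": [2, 4, 8]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the best model on the testing set.
test_accuracy = search.score(X_test, y_test)
```

Comparing several models would mean repeating the search with other estimators and looking at their test scores side by side.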
The tool to test the model is really impressive in terms of visualisation of the process and the results.
A good point is the visualisation of the results, for example for a decision tree, which becomes really easy to understand (and decompose) with this tool.
Once you have found the right model, there is an API builder to embed the model.
In this space you can define some test requests to see the model in action.
The feature to expose the model is not activated in the free version (or I missed how to use it), but it seems pretty intuitive.
Now it’s time to conclude.
Feedback on the project
I highly recommend that people working in, or interested in, the energy sector dive into the EPC data, because it is a very good source of knowledge on the housing market in the UK.
In general, and it’s very painful for me, the most French guy abroad, to say this, the UK government is doing a great job of collecting and sharing data, and there are very interesting datasets on the government platform that could be used by data scientists (in France we are very late on this topic but things are changing slowly).
As for Dataiku DSS, it is a great tool for data scientists, experienced or not; I can feel that this tool has been designed by data scientists for data scientists, and there are so many features that I didn’t use, like all the collaboration part, the deep learning, etc.
There are multiple in-house functions that make the data processing easier, and that’s really cool, but it could become a burden if, for example, Dataiku decides to drop these features (or make them premium). If data people don’t know how to do a join, a group by, etc. themselves, transferring the data pipeline could be difficult. That said, I really like the fact that Dataiku doesn’t lock the user into its in-house functions and leaves other ways to manipulate the data (with SQL, for example).
Can DSS fill MY needs in my day-to-day job? NO, because I currently have multiple tools at my disposal to do my job, and I need flexibility on the data side and on the development side to experiment and deploy things. But this tool is definitely worth trying, because it can fill the needs of data teams who don’t have my needs (and they are numerous around the world).
Originally published at the odd data guy.