Rating London Properties by their “Pub Score”: An Alternative Lens on the London Housing Market

Interestingly (or not), 42 is the ASCII code for the asterisk (*), which in programming is often used as a wildcard meaning ‘all’ or ‘everything’.

To keep my searches under that 42-page results limit, I therefore initially limited myself to looking for properties within one area — Bermondsey.

I want to make use of my LISA (Lifetime ISA), so properties would have to be <£450k in value, which narrowed my search down to a manageable number of results.

A simple yet effective combination of Beautiful Soup and Requests later and I have a list of links for each of the properties within my search criteria.

I then iterated down the list of links, again using Beautiful Soup and Requests to parse the information from said links.

In order not to impose too much strain on the site, I left a 5 second delay between each request.

A simple version of the code for obtaining the price can be seen below:

```python
import time

import requests
from bs4 import BeautifulSoup


def get_soup(url, params=None):
    # fetch the page and parse it into a BeautifulSoup object
    r = requests.get(url, params=params)
    c = r.content
    soup = BeautifulSoup(c, 'html.parser')
    return soup


prices = []
for index, row in listing_df.iterrows():
    url = row['property_link']
    time.sleep(5)  # be polite: 5 second delay between requests
    soup = get_soup(url)
    prices.append(soup.find('div', attrs={'class': 'propertyHeaderPrice'}))
```

This was great, and worked a treat — I was able to get details such as date added, number of rooms, property type, price, nearest stations etc.

Save this as a csv/into a database, whack the script into a scheduler, and you’re good to go.
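For the ‘save it’ step, something as simple as the line below does the trick, assuming the scraped details have been collected into a pandas DataFrame called listing_df (the filename is just illustrative):

```python
# dump the scraped listings to disk; the filename is illustrative
listing_df.to_csv('bermondsey_listings.csv', index=False)
```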

However, I wasn’t content with only obtaining data for one specific part of London, oh no.

I wanted all of it, mainly so I could have some sort of average London property price tracker over a set amount of time, as well as compare various metrics throughout London.

The problem was twofold:
1. I couldn’t simply search for all of London, as this would create too many results to iterate through, and I wouldn’t be able to get more than 42 pages’ worth.
2. I didn’t want to go through every property in London individually.

There are about 25,000 properties listed within some area of ‘London’ on the site, and I didn’t want to single-handedly bring down their website by submitting 25,000 get requests.

I started thinking of ways in which I could break down the search process into smaller chunks to be able to see all of the properties per chunk in fewer than 42 pages.

My initial thought was boroughs, although having run a script to obtain the number of properties in each borough, quite a few boroughs had too many properties.

I then looked at districts, i.e. the first part of your postcode before the space, e.g. SE8, EC2V, E1 etc.

It turns out that only CR0 (Croydon central) had too many properties, so I thought this was acceptable (not sure I wanted to live in Croydon anyway…).

The second problem was submitting 25,000 requests, so I tried to see whether I could obtain as much of the key information as possible for each property from the summary pages, which display 25 properties at a time.

If successful, this would reduce the number of requests down to ~1,000.

Most of the key information was there, except longitude and latitude.

Luckily, the site has a ‘view on map’ function, which I figured must have coordinates in order to accurately plot each property.

This map view allowed 500 properties at a time, which wasn’t a problem (except for Croydon) and, what’s more, contained the data in a machine-readable format at a certain point within the source code for the page.

This meant that all I had to do for each district was iterate through each page of results to get most of the details, then change the URL slightly to access the map source code, from which I could parse the coordinates with a little help from Beautiful Soup and a bit of JSON parsing.

Luckily each property has a unique identifier, so the two datasets were easyish to match.
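As a rough sketch of that matching step (not the exact code, and the DataFrame names are illustrative), a straightforward pandas merge on that identifier is all it takes:

```python
# summary_df: details scraped from the results pages
# coords_df: latitude/longitude parsed from the map view
# both are assumed to carry the listing's unique identifier in 'listing_id'
london_df = summary_df.merge(coords_df, on='listing_id', how='left')
```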

But what about Croydon?

Although not sure that I wanted to live in Croydon, the perfectionist in me wanted to have a complete dataset.

It turns out that there were fewer than 42 pages, so obtaining most of the data was relatively simple.

However, only 500 properties are allowed to be shown on the map at any one time, creating a problem.

Luckily, having examined the url when zooming in/out, I realised that I could use a viewport argument when requesting the map url, effectively splitting Croydon in two.

A rather substantial amount of engineering later, Croydon is back in business.

Having whacked this on the scheduler, all I had to do was wait (the process took about an hour and a quarter to run in total).

Step 2: EDA

Having obtained and cleaned the data, it looked something like this:

Which is all well and good, but if you’re like me, you’re more of a visual person, so I decided to produce some charts.

Step 3: Initial Visualisations

Matplotlib

Matplotlib is great once you get the hang of it, and can be used reasonably effectively to produce some basic charts, such as:
- histograms
- bar plots
- line charts

Looking at these charts, there are a few interesting points: there are more 2-bed properties within my search range in London than properties of any other size, and there appears to be some clustering of property prices around round numbers.
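As a rough illustration of the histogram step (a sketch, assuming the cleaned data lives in a DataFrame called london_df with a numeric 'price' column):

```python
import matplotlib.pyplot as plt

# distribution of asking prices across the cleaned dataset
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(london_df['price'].dropna(), bins=50)
ax.set(title='Distribution of asking prices', xlabel='Price (£)', ylabel='Number of properties')
plt.show()
```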

A bit more work, and you can plot the number/percentage of houses at each room size per district:

One of the things I wanted to look into was how the price changed per house type, such as maisonette, flat, terraced house etc.

When plotting this, there appeared to be quite a few house categories…

This required me to build a mapping function to map each one to a parent property type.

The property types I had were:

```python
property_types_to_match = ['apartment', 'barn conversion', 'block of apartments', 'chalet',
                           'character property', 'cluster house', 'coach house', 'cottage',
                           'detached house', 'duplex', 'end of terrace house', 'flat',
                           'ground floor flat', 'ground maisonette', 'house',
                           'link detached house', 'maisonette', 'manor house', 'mews house',
                           'penthouse', 'property', 'semi-detached house', 'terraced house',
                           'town house']
```

which I decided I wanted to match to:

```python
['apartment', 'duplex', 'house', 'maisonette', 'other']
```

Firstly, I’m very intrigued that there is a ‘manor house’ or a ‘chalet’ within my price range.

Secondly, luckily, pandas has a handy ‘map’ function, into which I fed a dictionary mapping each property type to its parent property type.
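A minimal sketch of that mapping step, assuming the scraped type sits in a 'property_type' column (the dictionary below is a small illustrative subset rather than the full mapping):

```python
# illustrative subset of the property type -> parent type dictionary
parent_type_map = {
    'flat': 'apartment',
    'ground floor flat': 'apartment',
    'penthouse': 'apartment',
    'duplex': 'duplex',
    'terraced house': 'house',
    'semi-detached house': 'house',
    'maisonette': 'maisonette',
    'ground maisonette': 'maisonette',
    'chalet': 'other',
    'manor house': 'other',
}

# anything not covered by the dictionary falls back to 'other'
london_df['parent_property_type'] = (
    london_df['property_type'].map(parent_type_map).fillna('other')
)
```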

Grouping by parent property type and plotting vs price and quantity, we can see that maisonettes are the cheapest house type, with the ‘other’ category being the most expensive.

There are also more flats for sale on the site than any other type of property within London:

These types of chart are great for showing some high-level trends, but I wanted to see if there was a better way to present the data.

My next thought? EVERYONE LOVES MAPS.

Step 4: Maps

Enter Folium and geopandas.

Each has a different place in mapping data:
- Folium enables you to create interactive maps that cluster/de-cluster based on zoom level
- Geopandas allows you to plot data in dataframes by geographic region

Let’s first look at Folium. Admittedly, installing Folium was a bit of a ball-ache, but it’s definitely worth it.

Having installed it, use the three lines below to import some of its features:

```python
import folium
from folium.plugins import MarkerCluster
from folium.plugins import FastMarkerCluster
```

From there, creating maps is pretty simple given you have a dataframe with coordinates:

```python
# simple marker map
london_map = folium.Map(location=[51.4986, -0.0691], zoom_start=10)
for index, row in london_df.iterrows():
    folium.Marker([row['latitude'], row['longitude']],
                  popup=row['address']).add_to(london_map)

# standard cluster map
london_cluster_map = folium.Map(location=[51.4986, -0.0691], zoom_start=10)
london_cluster_map.add_child(
    MarkerCluster(london_df[['latitude', 'longitude']].values.tolist()))

# fast cluster map
london_fast_cluster_map = folium.Map(location=[51.4986, -0.0691], zoom_start=10)
london_fast_cluster_map.add_child(
    FastMarkerCluster(london_df[['latitude', 'longitude']].values.tolist()))
```

The simple marker map simply plots a marker at each property location, whereas the standard and fast cluster maps introduce, you guessed it, some sort of clustering by location.

The difference is functionality: a standard cluster map allows you to include popups etc., whereas a fast cluster map doesn’t.

Having said that, for >25,000 points, a standard cluster map took way too long to plot, so I settled for the fast cluster map, which, for London, looks a little something like this:

Unfortunately, I’m unaware of how to embed interactive maps into Medium articles, so a static image will have to do.

Zoomed in, it looks like this:

Which, you have to admit, if you’re a data nerd like myself, is pretty cool.

Any sort of interactivity and you’ve got me hooked — maybe that’s why I became interested in Excel in the first place at the genesis of my coding career.

OK, now on to GeoPandas.

In order to get full use out of the geopandas package, I needed to obtain a shapefile of the districts in London. Luckily, a combination of data from the data.london.gov.uk website (https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london) and various other open datasets made this pretty easy to obtain.

Again, strangely, installing geopandas was a bit of a pain, but once you’ve done it, simply import it and read in your shapefile:

```python
import geopandas as gpd

district_shape_file = 'london_postal_districts.shp'
map_df = gpd.read_file(district_shape_file)
```

This will give you a standard pandas dataframe, but will recognise the geometry within the shapefile, and give you something like this:

The next thing we need to do is combine it with a grouped-by-district version of your other dataset, an example of which is below:

```python
map_df = map_df.set_index('postdist')

mean_district_price = london_df.groupby('district')['price'].mean()
mean_district_price_df = pd.DataFrame(mean_district_price)
mean_district_price_df = mean_district_price_df.reindex(
    district_list, fill_value=mean_district_price_df['price'].max())

map_df = pd.concat([map_df, mean_district_price_df], axis=1)
```

Great, we’re ready to start plotting something.

If you simply use ‘map_df.plot()’ as below, you’ll get the outline of the districts, which is reassuring:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16, 9))
map_df.plot(ax=ax)
ax.set(title='districts')
plt.show()
```

Plotting the number of properties per district tells us the same as the data within the matplotlib charts — that Croydon has more properties than any other district: which, in itself, isn’t a very interesting plot.
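For the district-level plots that follow, the only real change is passing a column name and a colormap to geopandas’ plot function; a sketch, assuming map_df now carries the joined 'price' column:

```python
import matplotlib.pyplot as plt

# choropleth of mean asking price per district
fig, ax = plt.subplots(figsize=(16, 9))
map_df.plot(column='price', cmap='viridis', legend=True, ax=ax)
ax.set(title='Mean asking price per district')
ax.axis('off')
plt.show()
```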

What are interesting plots, from my point of view anyway, are:
- the average house price per district, which tells me I can’t afford anywhere in the City or Richmond:
- the number of properties listed for sale vs the total number of properties within each district, which shows the level of residential property development within each district:

Cool, so now I have some cool graphs and maps, now what… What other things do I look for when assessing whether a house is in a good location or not?

Step 5: Repeat with more data

Transport links.

Great, to the TFL website, where they have a file containing the lat/long of all of the stations on the network.

Next I wanted to try to find the three closest stations to each of the properties in my dataset, and the distance they are away from the property.

First, I looked at distance. Because of the curvature of the earth, I had to use the Haversine formula (https://en.wikipedia.org/wiki/Haversine_formula), which allowed me to find the actual distance between two sets of coordinates.

A quick implementation of this in Python looks something like:

```python
from math import radians, sin, cos, asin, sqrt


def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 3956  # radius of the earth in miles (use 6371 for kilometres)
    return c * r
```

Now, for the next bit, I know there is a quicker, more efficient way of calculating the closest 3 stations, but I couldn’t quite put my finger on it, so I effectively did a rather large cross join between stations and properties:

no_properties ~ 25000
no_stations ~ 400
no_calculations = 400 * 25000 = 10,000,000

So that’s 10 million calculations, which take roughly 12–13 minutes to complete on my machine.

If anyone does have a more efficient way of calculating this, please do let me know! Anyway, having calculated the distances to each station, I then took the 3 smallest values and, voilà, you have the distances to the three closest stations.
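One candidate answer, which I haven’t benchmarked here, is scikit-learn’s BallTree with the haversine metric; a sketch, assuming both london_df and stations_df have 'latitude'/'longitude' columns in degrees:

```python
import numpy as np
from sklearn.neighbors import BallTree

# the haversine metric expects (lat, lon) pairs in radians
stations_rad = np.radians(stations_df[['latitude', 'longitude']].values)
properties_rad = np.radians(london_df[['latitude', 'longitude']].values)

tree = BallTree(stations_rad, metric='haversine')
# query returns great-circle distances on the unit sphere plus the indices
# of the 3 nearest stations; scale by the earth's radius to get miles
dist, idx = tree.query(properties_rad, k=3)
dist_miles = dist * 3956
```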

In order to augment this dataset, I also used the Google Distance Matrix API to obtain a travel time to and from work for each of the properties.

Google lets you have a certain amount of credit for free, which is great! A request to the API looks like this:

```python
coords = list(zip(london_df['listing_id'],
                  london_df['latitude'].tolist(),
                  london_df['longitude'].tolist()))

responses = []
for listing_id, lat, long in coords:
    url = 'https://maps.googleapis.com/maps/api/distancematrix/json'
    params = {'units': 'imperial',
              'origins': str(lat) + ',' + str(long),
              'destinations': str(work_lat) + ',' + str(work_long),
              'mode': 'transit',
              'arrival_time': epoch_time,
              'key': google_routes_API_key}
    r = requests.get(url, params=params)
    response_json = r.json()
    responses.append(response_json)

responses_dict = dict(zip(london_df['listing_id'].tolist(), responses))
```

OK, transport metrics achieved.
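For completeness, a sketch of how a travel time could then be pulled out of each stored response (my reconstruction of the standard Distance Matrix JSON layout, with no error handling beyond a status check):

```python
# each response nests the result under rows -> elements; 'duration' holds
# the travel time in seconds under 'value'
travel_times = {}
for listing_id, response in responses_dict.items():
    element = response['rows'][0]['elements'][0]
    if element.get('status') == 'OK':
        travel_times[listing_id] = element['duration']['value']  # seconds
```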

What next? Pubs.

Everyone likes a good pub, especially when it has a good selection of beers/wines, and when it’s close enough to stumble back from but far enough away to not keep you up when you want to go to sleep.

I managed to stumble across a POI dataset for the whole of the UK, containing the latitude and longitude of pretty much all points of interest in the UK, including pubs, stationery shops, cinemas etc.

This was a gold mine, so I went through and extracted all of the points of interest within the category ‘Pubs / Restaurants’.

I understand that this includes restaurants, which is a bit frustrating, as I don’t really care how close I am to a Wasabi/Nando’s etc., but it’ll have to do.

Luckily, as well as lat/long, the dataset includes postcode, so I then went through and found all of the pubs that were contained within the districts I had data for (all 298 of them).

This amounted to ~1400 pubs.

For each pub, I wanted to find a measure of ‘how good a pub it is’ — re-enter Google.

Google has a ‘places’ API, which allows you to enter the name/address of any place in the world, and it will try to return various pieces of data it has about said place from its extremely large Google Maps database.

The query looks something like the sketch shown below. Unfortunately, my dataset had a few missing/NaN values within various columns, so that took a bit of cleaning.

For example, where the pub had a NaN value in the name column, I replaced it with the parent company name, such as 'Wetherspoons' or 'O’Neils' etc.

I then ran these pubs through the Places API, and was able to get various metrics for 99% of the pubs in my dataset, the most important of which were average review rating, number of reviews and price level.
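A rough sketch of such a lookup, using the Places API ‘Find Place’ endpoint (this is a reconstruction rather than my exact query; pub_name and google_places_API_key are placeholders):

```python
import requests

# reconstruction of a Places 'Find Place' lookup; pub_name and
# google_places_API_key are placeholders
url = 'https://maps.googleapis.com/maps/api/place/findplacefromtext/json'
params = {'input': pub_name,
          'inputtype': 'textquery',
          'fields': 'name,rating,user_ratings_total,price_level',
          'key': google_places_API_key}
r = requests.get(url, params=params)
place = r.json()['candidates'][0]  # assumes at least one candidate came back
```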

I decided that there were three measurements I needed to calculate to obtain a ‘pub score’ for each house.

These were:
- quality of nearby pubs
- proximity of nearby pubs
- quantity of nearby pubs

I also decided to look at pubs within 1 mile of each house, so again did a rather large cross join between pubs and houses using the haversine formula to obtain distances.

I then only took pubs that were within 1 mile of each house as the individual datasets from which to obtain each house’s pub metric.

Pub Quality Score.

This was simply calculated as an average of the google ratings of the pubs within 1 mile of each house.

This was then normalised across properties, so that a property near the lowest rated pub achieved a score of 0 and one near the highest rated pub achieved a score of 1.

Pub Proximity Score.

This was a bit trickier.

I decided that 300m was the optimal distance to be away from a pub, as it is within stumbling distance, but should be far enough away not to hear any late night noise.

In order to calculate the score, I took 300m from each distance measurement, so that if a pub was 300m away, it would obtain a score of 0.

I then took the absolute value of this, ensuring any pubs closer than 300m were penalised the same amount as any that were an equal distance further away.

These scores were then averaged over the pubs within 1 mile of each house, and were normalised over all properties in my dataset.

Pub Quantity Score.

This was simply calculated as the number of pubs within 1 mile of each house, and normalised over all properties in the dataset.

Total Pub Score.

In order to aggregate these calculated metrics into one ‘pub score’, I had to have a think about which value was the most important to me.

Is being close to one top-rated pub better than being close to 5 badly rated pubs? I hypothesised yes.

Therefore, I decided that the order of importance was Quality, Proximity, Quantity.

In order to calculate the score, I used weightings of 2 for Quality, 1 for Proximity and 0.5 for Quantity, and aggregated them up as follows:

Total Pub Score = (2 x Normalised Quality) + (0.5 x Normalised Quantity) - (1 x Normalised Proximity)

I also used the rank function out of interest, so I could see the top and bottom ranked properties in my dataset based on their pub score.
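A minimal sketch of how those three metrics and the total score could be assembled with pandas, assuming pub_distances_df holds one row per property/pub pair within 1 mile, with the pub’s Google rating and its distance in metres (column names are illustrative):

```python
# assumed columns: listing_id, rating, distance_m
per_property = pub_distances_df.groupby('listing_id').agg(
    quality=('rating', 'mean'),                                   # average rating of nearby pubs
    proximity=('distance_m', lambda d: (d - 300).abs().mean()),   # mean |distance - 300m|
    quantity=('rating', 'size'),                                  # number of pubs within 1 mile
)

# min-max normalise each metric across all properties
normalised = (per_property - per_property.min()) / (per_property.max() - per_property.min())

normalised['pub_score'] = (2 * normalised['quality']
                           + 0.5 * normalised['quantity']
                           - normalised['proximity'])
normalised['pub_rank'] = normalised['pub_score'].rank(ascending=False)
```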

In order to visualise this data, I dumped it all into a geopandas dataframe and used the handy shapely ‘Point’ class, which converts lat/long coordinates into actual plottable points.

I also decided I wanted the colour to change depending on the value of the total score, so I used the very handy LinearSegmentedColormap from matplotlib.colors in conjunction with the ‘cmap’ argument of the ‘plot’ function.

OK, so describing it didn’t work, here’s the code, which, when used in combination with the districts map file imported previously, produced the following plot:
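A sketch of that plotting step, reconstructed from the description above rather than taken verbatim (the DataFrame, column names and colour endpoints are illustrative):

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from shapely.geometry import Point

# pub_score_df is assumed to hold 'latitude', 'longitude' and 'pub_score' per property
geometry = [Point(xy) for xy in zip(pub_score_df['longitude'], pub_score_df['latitude'])]
pub_score_gdf = gpd.GeoDataFrame(pub_score_df, geometry=geometry)

# custom colormap running from low scores to high scores (colours are illustrative)
score_cmap = LinearSegmentedColormap.from_list('pub_score', ['red', 'yellow', 'green'])

fig, ax = plt.subplots(figsize=(16, 9))
map_df.plot(ax=ax, color='lightgrey', edgecolor='white')  # district outlines from the shapefile
pub_score_gdf.plot(ax=ax, column='pub_score', cmap=score_cmap, markersize=2)
ax.set(title='Pub score per property')
plt.show()
```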

This shows that the area around Streatham/Dulwich/Croydon is the best place to buy a property, if all you care about is pubs…

I’ve also plotted the distributions for some of the districts within the dataset, which show that central Croydon has properties with a large variety of scores, whereas somewhere like Notting Hill (W2) appears to have properties with more consistently high scores.

Here, n corresponds to the number of properties within each district in the dataset:

Next Steps:
- investigate how to find the nearest ‘x’ points more efficiently
- look at data for schools/catchment areas and how that correlates with price
- look into plotting journey-time isochrones with Python — any ideas would be great!

Hope you’ve found this interesting!
