(Source: Louie Martinez)

As a part of the final IBM Capstone Project, we get a taste of what data scientists go through in real life.
The objectives of the final assignment were to define a business problem, look for data on the web, and use Foursquare location data to compare different districts within the wards (municipalities) of Tokyo to figure out which neighborhood is suitable for starting a restaurant business.
As prepared for the assignment, I go through the problem design, data preparation, and final analysis sections step by step.

Discussion and Background of the Business Problem:

Problem Statement: Prospects of a Lunch Restaurant Close to Office Areas in Tokyo, Japan.

Tokyo, where I am currently staying, is the most populous metropolitan area in the world.
Currently ranked 3rd in the global economic power index, Tokyo is definitely one of the best places to start up a new business.
During the daytime, especially in the morning and lunch hours, office areas provide huge opportunities for restaurants.
Reasonably priced shops (about $8 for a lunch meal) are usually full during the lunch hours (11 am to 2 pm). Given this scenario, we will go through the benefits and pitfalls of opening a breakfast-and-lunch restaurant in densely packed office areas.
The profit margin for a decent restaurant usually lies within the 15−20% range, but it can go as high as 35%, as discussed here.
The core of Tokyo is made of 23 wards (municipalities), but I will later concentrate on the 5 busiest business wards of Tokyo, namely Chiyoda (千代田区), Chuo (中央区), Shinjuku (新宿区), Shibuya (渋谷区) and Shinagawa (品川区), to target daily office workers.
We will go through each step of this project and address them separately.
For this week I just describe the initial data preparation and future steps to start the battle of neighborhoods in Tokyo.

Target Audience

What type of clients or group of people would be interested in this project?

- Business personnel who want to invest in or open a restaurant. This analysis will be a comprehensive guide to starting or expanding restaurants that target the large pool of office workers in Tokyo during lunch hours.
- Freelancers who would love to have their own restaurant as a side business. This analysis will give an idea of how beneficial it is to open a restaurant and what the pros and cons of this business are.
- New graduates, to find reasonable lunch/breakfast places close to the office.
- Budding data scientists who want to implement some of the most used exploratory data analysis techniques to obtain the necessary data, analyze it, and finally tell a story out of it.

Scraping Tokyo Wards Table from Wikipedia

I first make use of the Special Wards of Tokyo page from Wikipedia and scrape its table to create a data-frame.
For this I've used the requests and Beautifulsoup4 libraries to create a data-frame containing the name of each of the 23 wards of Tokyo, its area, its population, and its 1st major district.
We start by downloading the page and parsing the table; after a little manipulation, the data-frame is obtained.

Data-frame from Wikipedia Table.
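
The original snippet is not reproduced here, so the following is only a minimal sketch of the scraping step under some assumptions: the page URL, the column names, and the sample values are placeholders of mine, and the real Wikipedia table needs more cleanup than this. The demo parses a small inline HTML table so that it runs without network access.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Assumed URL of the "Special wards of Tokyo" page; it is not given in the text.
WIKI_URL = "https://en.wikipedia.org/wiki/Special_wards_of_Tokyo"


def fetch_html(url):
    """Download a page. Needs network access, so the demo below does not call it."""
    import requests  # imported lazily so the offline demo runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def table_to_dataframe(html):
    """Turn the first <table> in `html` into a DataFrame, one row per <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # Header cells are <th>; body cells are <td>.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(header):  # skip malformed or spanning rows
            records.append(cells)
    return pd.DataFrame(records, columns=header)


# Tiny stand-in for the real page, with placeholder values, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Ward</th><th>Area_SqKm</th><th>Population</th><th>Major_District</th></tr>
  <tr><td>Chiyoda</td><td>11.66</td><td>59,000</td><td>Nagatacho</td></tr>
  <tr><td>Chuo</td><td>10.21</td><td>147,000</td><td>Nihombashi</td></tr>
</table>
"""

wards = table_to_dataframe(SAMPLE_HTML)
# Strip thousands separators before converting the numeric columns.
wards["Population"] = wards["Population"].str.replace(",", "", regex=False).astype(int)
wards["Area_SqKm"] = wards["Area_SqKm"].astype(float)
print(wards)
```

To run it against the live page, one would pass `fetch_html(WIKI_URL)` to `table_to_dataframe` and then adapt the header handling to the actual table layout.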

Getting Coordinates of Major Districts: Geopy Client

The next objective is to get the coordinates of these 23 major districts using the geocoder class of the Geopy client.
Four of the returned coordinates are completely wrong (Bunkyo, Koto, Ota, Edogawa) because the names of these districts are written a little differently from the way they appear in this data-frame (e.g. Hongō vs. Hongo), so I had to replace these coordinates with values acquired from a Google search.
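
As a rough illustration of the geocoding step and of the manual correction, here is a sketch that is not the original code. The helper names (`make_nominatim_geocoder`, `geocode_districts`), the `overrides` mechanism, the query format, and the coordinates are my own assumptions. The geocoder is passed in as a plain callable, so the demo can use an offline stand-in instead of a live Nominatim lookup.

```python
from types import SimpleNamespace

import pandas as pd


def make_nominatim_geocoder(user_agent="tokyo_explorer"):
    """Return geopy's Nominatim geocode function (needs geopy and network access)."""
    from geopy.geocoders import Nominatim  # lazy import: only needed for live lookups

    return Nominatim(user_agent=user_agent).geocode


def geocode_districts(districts, geocode, overrides=None):
    """Look up (latitude, longitude) for each district name.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude`, or None when nothing is found.
    `overrides` maps district names to hand-checked coordinates that are used
    instead of calling the geocoder.
    """
    overrides = overrides or {}
    rows = []
    for name in districts:
        if name in overrides:
            lat, lng = overrides[name]
        else:
            location = geocode(f"{name}, Tokyo, Japan")
            lat, lng = (location.latitude, location.longitude) if location else (None, None)
        rows.append({"Major_District": name, "Latitude": lat, "Longitude": lng})
    return pd.DataFrame(rows)


# Offline stand-in for a live geocoder; the coordinates are approximate placeholders.
FAKE_RESULTS = {
    "Nagatacho, Tokyo, Japan": SimpleNamespace(latitude=35.676, longitude=139.740),
}


def fake_geocode(query):
    return FAKE_RESULTS.get(query)


coords = geocode_districts(
    ["Nagatacho", "Hongo"],
    fake_geocode,
    overrides={"Hongo": (35.708, 139.760)},  # hand-entered value, approximate
)
print(coords)
```

For live lookups, `make_nominatim_geocoder()` can be passed in place of `fake_geocode`. Keeping the corrections in a dictionary makes it easy to see which districts were fixed by hand.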
After a little more playing around with pandas, I could get one well-arranged data-frame.

Average Land Price in Major Wards of Tokyo: Web Scraping

Another factor that can later guide the decision on which district would be best for opening a restaurant is the average land price of the 23 wards.
I get this information by scraping the 'land market value area in Tokyo' web page, in the same way as the Wiki page before.
As I want to consider the 5 busiest business municipalities of Tokyo mentioned in the problem statement, the data-frame is restricted to those wards.

Using Foursquare Location Data:

Foursquare data is very comprehensive, and it powers location data for Apple, Uber, etc.
For this business problem I have used, as a part of the assignment, the Foursquare API to retrieve information about the popular spots around these 5 major districts of Tokyo.
The popular spots returned depend on the highest foot traffic and thus on the time when the call is made.
So we may get different popular venues at different times of the day.
The call returns a JSON file, and we need to turn that into a data-frame.
Here I've chosen 100 popular spots for each major district within a radius of 1 km.
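
The JSON-to-data-frame step can be sketched as follows. This is a hedged example, not the original code: it assumes the response shape of the legacy Foursquare v2 `venues/explore` endpoint (`response.groups[0].items[*].venue`), and the credentials, version date, venue names, and coordinates are placeholders. The demo flattens a hand-written payload, so no API call is made.

```python
import pandas as pd

# Assumed legacy v2 endpoint; the text does not state which endpoint was used.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_params(lat, lng, client_id, client_secret, version="20180605",
                 radius=1000, limit=100):
    """Query parameters for one district: 100 venues within a 1 km radius."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,  # API version date, a placeholder value
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def venues_to_dataframe(district, payload):
    """Flatten one explore response into rows of venue name, position, and category."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "District": district,
            "Venue": venue["name"],
            "Venue_Lat": venue["location"]["lat"],
            "Venue_Long": venue["location"]["lng"],
            "Venue_Category": categories[0]["name"] if categories else None,
        })
    return pd.DataFrame(rows)


# Hand-written payload with made-up venues, mimicking the assumed response shape.
SAMPLE_PAYLOAD = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Ramen", "location": {"lat": 35.690, "lng": 139.700},
                   "categories": [{"name": "Ramen Restaurant"}]}},
        {"venue": {"name": "Example Bar", "location": {"lat": 35.691, "lng": 139.701},
                   "categories": [{"name": "Bar"}]}},
    ]}]}
}

venues_df = venues_to_dataframe("Shinjuku", SAMPLE_PAYLOAD)
print(venues_df)
```

Calling `venues_to_dataframe` once per district and combining the results with `pd.concat` would give a single data-frame for all 5 districts.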
The result is a data-frame of venues built from the JSON file that Foursquare returned.

Visualization and Data Exploration:

Folium Library and Leaflet Map:

Folium is a Python library that can create interactive Leaflet maps from coordinate data.
Since I am interested in restaurants as popular spots, I first create a data-frame that keeps only the rows of the previous data-frame whose 'Venue_Category' column contains the word 'Restaurant'.
The next step is to use this data-frame to create a Leaflet map with Folium to see the distribution of the most visited restaurants in the 5 major districts.

Figure 1: Circular marks represent the most frequently visited restaurants in the 5 major districts of Tokyo (Nihombashi: green, Nagatacho: red, Shibuya: orange, Shinjuku: magenta, Shinagawa: blue), according to Foursquare data.
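
The filtering and mapping steps described above can be sketched roughly as below. The sample rows, the column names other than 'Venue_Category', the map center, and the marker settings are assumptions of mine; only the district colors come from the Figure 1 caption. Folium is imported inside the function so that the filtering part runs even where Folium is not installed.

```python
import pandas as pd

# Made-up sample rows standing in for the Foursquare venue data-frame.
venues = pd.DataFrame({
    "District": ["Shinjuku", "Shinjuku", "Nagatacho", "Shinagawa", "Shibuya"],
    "Venue": ["Example Ramen", "Example Bar", "Example Soba", "Example Mart", "Example Spot"],
    "Venue_Lat": [35.690, 35.691, 35.676, 35.609, 35.662],
    "Venue_Long": [139.700, 139.701, 139.740, 139.730, 139.704],
    "Venue_Category": ["Ramen Restaurant", "Bar", "Japanese Restaurant",
                       "Convenience Store", None],
})

# Keep rows whose category contains the word "Restaurant"; missing categories are dropped.
is_restaurant = venues["Venue_Category"].str.contains("Restaurant", na=False)
restaurants = venues[is_restaurant].reset_index(drop=True)
print(restaurants)

# Marker colors per district, as listed in the Figure 1 caption.
DISTRICT_COLORS = {
    "Nihombashi": "green", "Nagatacho": "red", "Shibuya": "orange",
    "Shinjuku": "magenta", "Shinagawa": "blue",
}


def restaurant_map(df, center=(35.68, 139.76), zoom_start=12):
    """Draw one circle marker per restaurant on a Leaflet map (requires folium)."""
    import folium  # lazy import: only needed when the map is actually built

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    for row in df.itertuples(index=False):
        folium.CircleMarker(
            location=[row.Venue_Lat, row.Venue_Long],
            radius=4,
            popup=f"{row.Venue} ({row.District})",
            color=DISTRICT_COLORS.get(row.District, "gray"),
            fill=True,
            fill_opacity=0.7,
        ).add_to(fmap)
    return fmap
```

Calling `restaurant_map(restaurants).save("restaurants.html")` would write the interactive map to a file that opens in a browser.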

Exploratory Data Analysis:

There are 134 unique venue categories, and Ramen Restaurants top the charts, as the plot of venue frequencies shows.

Figure 2: Most frequent venues around Shinjuku, Shibuya, Nagatacho, Nihombashi, and Shinagawa, according to Foursquare data.

Now, as that reminds me of ramen, it is definitely time to take a break.
Ramen restaurants are the most frequently visited places around the 5 major districts of Tokyo.
Yum!

After delicious ramen, let's get back to exploring the data a little more.
To find the top 5 venues of each district, we proceed as follows:

1. Create a data-frame with pandas one-hot encoding for the venue categories.
2. Use pandas groupby on the District column and obtain the mean of the one-hot encoded venue categories.
3. Transpose the data-frame from step 2 and arrange it in descending order.
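
The three steps above can be sketched with pandas as follows. This is an illustration on made-up rows rather than the original code; the helper name `top_venues` and the output column names are hypothetical, and the demo asks for the top 2 instead of the top 5 because the sample has only four categories.

```python
import pandas as pd

# Made-up sample rows; the real data-frame has about 100 venues per district.
venues = pd.DataFrame({
    "District": ["Shibuya"] * 6 + ["Shinagawa"] * 6,
    "Venue_Category": [
        "Bar", "Bar", "Bar", "Cafe", "Cafe", "Ramen Restaurant",
        "Convenience Store", "Convenience Store", "Convenience Store",
        "Ramen Restaurant", "Ramen Restaurant", "Bar",
    ],
})

# Step 1: one-hot encode the venue categories and put the district back in front.
onehot = pd.get_dummies(venues["Venue_Category"], dtype=float)
onehot.insert(0, "District", venues["District"])

# Step 2: the mean of each one-hot column per district is that category's frequency.
grouped = onehot.groupby("District").mean()


def top_venues(grouped, n=5):
    """Step 3: transpose, sort each district's column, and keep the n largest."""
    n = min(n, grouped.shape[1])
    transposed = grouped.T
    rows = {
        district: transposed[district].sort_values(ascending=False).head(n).index.tolist()
        for district in transposed.columns
    }
    columns = [f"Top_{i + 1}" for i in range(n)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


top = top_venues(grouped, n=2)
print(top)
```

Because each venue contributes exactly one 1 in its row, the per-district means add up to 1, which makes the frequencies directly comparable across districts.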
These steps output the top 5 venues of each district.
From the several data-frames that I had to create for exploratory data analysis, I used one to plot which district has restaurants among the most frequently visited places; Nagatacho of Chiyoda ward comes out on top with 56 restaurants.

Figure 4: Number of restaurants among the most common venues in the 5 districts of Tokyo.
We can also look at violin plots, which show how a numeric variable is distributed across categories; I used the seaborn library to show the distribution of 4 major types of restaurants in the different districts.

Figure 5: Lots of Japanese and Chinese restaurants in Nagatacho, whereas Shinagawa has many Ramen restaurants.

Now that we have a broad overview of the different types of venues, especially restaurants, around the 5 major districts of Tokyo, it is time to cluster the districts using K-Means.

Clustering the Districts

Finally, we try to cluster these 5 districts based on their venue categories using K-Means clustering.
The expectation is that districts with similar venue categories end up in the same cluster.
With K-Means, the 5 districts of Tokyo are divided into 3 clusters based on the most common venues obtained from Foursquare data.
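
A minimal sketch of the clustering step with scikit-learn is shown below. The frequency values are invented to mimic the pattern reported in this article and are not the real Foursquare numbers; the choice of `n_init=10` and `random_state=0` is also mine.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented per-district category frequencies (each row sums to 1), for illustration only.
grouped = pd.DataFrame(
    {
        "Bar": [0.40, 0.35, 0.05, 0.05, 0.05],
        "Cafe": [0.30, 0.35, 0.10, 0.10, 0.10],
        "Ramen Restaurant": [0.10, 0.10, 0.25, 0.30, 0.15],
        "Japanese Restaurant": [0.10, 0.10, 0.50, 0.45, 0.10],
        "Convenience Store": [0.10, 0.10, 0.10, 0.10, 0.60],
    },
    index=["Shinjuku", "Shibuya", "Nagatacho", "Nihombashi", "Shinagawa"],
)

# Fit K-Means with 3 clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(grouped.to_numpy())

# Attach the cluster label to each district.
clusters = pd.Series(kmeans.labels_, index=grouped.index, name="Cluster")
print(clusters)
```

The label numbers themselves are arbitrary; what matters is which districts share a label. With only 5 points the grouping is easy to check by eye, which is a useful sanity test before trusting the method on more districts.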
We can represent these 3 clusters in a Leaflet map using the Folium library.

Figure 6: 5 major districts of Tokyo segmented into 3 clusters based on the most common venues.
The size of the circles represents the number of restaurants among the most common venues for each district, which is highest at Nagatacho and lowest at Shibuya, as shown in Figure 4.

Results and Discussion:

We have reached the end of the analysis, where we got a sneak peek of the 5 major wards of Tokyo. As the business problem started with the benefits and drawbacks of opening a lunch restaurant in one of the busiest districts, the data exploration concentrated mostly on restaurants.
I have used data from web resources like Wikipedia, Python libraries like Geopy, and the Foursquare API to set up a very realistic data-analysis scenario.
We have found out that:

- Ramen restaurants top the charts of most common venues in the 5 districts.
- Nagatacho district in Chiyoda ward and Nihombashi in Chuo ward are dominated by restaurants as the most common venue, whereas the Shibuya and Shinjuku areas are dominated by bars, pubs, and cafes.
- Nagatacho has the maximum number of restaurants as the most common venue, whereas the Shibuya area has the least.
- Since the clustering was based only on the most common venues of each district, Shinjuku and Shibuya fall under the same cluster, and Nagatacho and Nihombashi fall under another cluster. Shinagawa is separated from both of these clusters because convenience stores stand out as the most common venue (with a very high frequency).
- According to this analysis, the Shinagawa area will provide the least competition for an upcoming lunch restaurant, as convenience stores are the most common venue in this area and the frequency of restaurants as a common venue is very low compared to the remaining districts. As the web-scraped data also shows, the average land price in and around Shinagawa is much cheaper compared to the districts close to central Tokyo. So this region could potentially be a target for starting quality restaurants.

One drawback of this analysis is that the clustering is based entirely on the most common venues obtained from Foursquare data.
Land price, distance of the venues from the closest stations, the number of potential customers, and the benefits and drawbacks of Shinagawa being a port region could all play a major role, so this analysis is definitely far from conclusive.
However, it gives us some very important preliminary information on the possibilities of opening restaurants around the major districts of Tokyo.
Another pitfall of this analysis is that it considers only one major district of each ward of Tokyo; taking into account all the areas under the 5 major wards would give us an even more realistic picture.
Furthermore, these results could also vary if we used some other clustering technique such as DBSCAN.

Conclusion

To conclude this project, we have got a small glimpse of what real-life data-science projects look like.
I've made use of some frequently used Python libraries to scrape web data, used the Foursquare API to explore the major districts of Tokyo, and saw the results of the segmentation of districts on a Folium Leaflet map.
The potential of this kind of analysis for a real-life business problem is discussed in great detail.
Some of the drawbacks and chances for improvement that would give an even more realistic picture are also mentioned.
Finally, since my analysis concentrated mostly on the possibilities of opening a restaurant targeting the huge pool of office workers, some of the results obtained are surprisingly close to what I expected after staying 5 years in Tokyo, especially cafes, bars, and pubs as the most frequent venues around the Shinjuku and Shibuya areas, and Japanese restaurants around the Nihombashi and Nagatacho areas!
Hopefully, this kind of analysis will give you initial guidance for taking on more real-life challenges using data science.
Stay strong and cheers!

Find the code on GitHub.
Find me on LinkedIn.