Tuning-In To NYC’s Musical Neighborhoods

by Michael Cascio, Apr 4

Abstract

Machine learning allows for the creation of computational models capable of identifying patterns in multi-dimensional datasets.
This project aims to leverage venue data from Foursquare’s ‘Places API’ and a machine learning algorithm called ‘k-means clustering’ to identify New York City neighborhoods with similar ‘music profiles’.

Introduction

Background

Music is a form of art that has been, and probably always will be, deeply embedded within the cultural activity of cities, communities, and groups of people more generally.
Music is a means of communication, expression, and sometimes even protest, with the power to peacefully bring together large numbers of like-minded people, influence popular culture, and hypnotize you with a memorable lyric that you end up subconsciously singing in the shower for weeks on end, even after being consciously disappointed in yourself for doing so.

Problem

Cities are, in part, composed of musical entities such as record shops, instrument vendors, concert halls, amphitheaters, and more, that cater not only to the music needs of local citizens but also to those of tourists from around the world.
For bigger cities, music entities can be spread apart, resulting in an ecosystem of hip niche neighborhoods that evolve and change over time.
This ecosystem is often learned by humans looking for a cool music scene through either natural life experience (wandering/flaneur) or recommendations in the form of internet reviews, comments, and conversations with people in real life.
This project aims to quantify a ‘music profile’ for neighborhoods in a major metropolitan city, New York City, to identify clusters of similar music scenes.

Stakeholders

Different parties may be interested in a model that is able to quantify neighborhood similarity based on the types of music outlets available.
Such a model would be able to assure renters and home buyers who prefer to live where the music is happening that their next home is properly located.
Future music venue start-ups can utilize the model to identify neighborhoods lacking live music venues and ensure they are investing in an area that is not saturated.
Future music retail vendors, sellers of things like records and instruments, can similarly utilize the model to ensure they are launching a business where competition is in their favor.

Methodology

Data Sources

NYU Spatial Data Repository: I am using the ‘2014 New York City Neighborhood Names’ dataset hosted by NYU’s Spatial Data Repository as the basis for the neighborhood names and associated location centroids.
The image below shows a sample of this information:

[Image: DataFrame created from NYU’s ‘2014 New York City Neighborhood Names’]

Foursquare ‘Places API’: I will be using Foursquare’s ‘Places API’ to acquire data related to ‘venues’ (as defined by Foursquare) categorized as being associated with music.
It is important to note that Foursquare defines a ‘venue’ as a place that one can go to, or check-in to, and that a ‘venue’ is not necessarily a music venue but can be any establishment such as a restaurant or type of retail shop.
Each Foursquare ‘venue’ is assigned a ‘category’ and each ‘category’ is associated with a particular ‘categoryID’.
The image to the right shows the ‘categoryID’ values provided by Foursquare that will be used to acquire music-related venues within New York City:

[Image: Foursquare Music-Related Venue CategoryIDs]

Data Retrieval

Neighborhood Name & Location Centroid Data: The ‘2014 New York City Neighborhood Names’ dataset hosted by NYU’s Spatial Data Repository was easy to download as a JSON file and import into a Jupyter Notebook:

[Image: Importing the newyork_data.json file]

The ‘Borough’, ‘Neighborhood’, ‘Latitude’, and ‘Longitude’ values associated with each neighborhood were then converted from JSON to a Pandas DataFrame that serves as the foundation of the analysis.
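The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the GeoJSON-style layout (`features`, `properties`, `geometry`) is an assumption about the file, and the two sample records and the helper name `neighborhoods_to_frame` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the contents of newyork_data.json.
# The nested layout and the coordinate values are illustrative assumptions.
newyork_data = {
    "features": [
        {
            "properties": {"borough": "Bronx", "name": "Wakefield"},
            "geometry": {"coordinates": [-73.85, 40.89]},
        },
        {
            "properties": {"borough": "Queens", "name": "Astoria"},
            "geometry": {"coordinates": [-73.92, 40.77]},
        },
    ]
}


def neighborhoods_to_frame(data):
    """Flatten each feature into one row of the neighborhood table."""
    rows = []
    for feature in data["features"]:
        # GeoJSON stores coordinates as [longitude, latitude].
        lon, lat = feature["geometry"]["coordinates"]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    columns = ["Borough", "Neighborhood", "Latitude", "Longitude"]
    return pd.DataFrame(rows, columns=columns)


neighborhoods = neighborhoods_to_frame(newyork_data)
print(neighborhoods)
```

With the real file, `newyork_data` would come from `json.load` on the downloaded JSON rather than from an inline dictionary.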
[Image: Creating a DataFrame out of the JSON data]

Foursquare Music-Related Venue Data: As mentioned in the Data Sources section of this report, Foursquare has numerous ‘Venue Categories’ that are used to identify each type of venue.
A ‘get’ request to the ‘api.foursquare.com/v2/venues/search?’ endpoint that provides a category ID will return venues of that category.
The example code below sends a ‘get’ request to Foursquare that asks for one venue with the ‘Music Store’ category (categoryID = ‘4bf58dd8d48988d1fe941735’):

[Image: Example Foursquare Places API Request]

A preliminary dataset of music-related venues associated with each New York City neighborhood was created by repeatedly sending ‘get’ requests to the previously mentioned endpoint, making sure the results are specific to venues with music-related ‘category IDs’.
For each neighborhood, we can include all of the selected category IDs in a single ‘get’ request by passing them as comma separated values.
Shown below is a function that creates the required URL, along with an example:

[Image: Dynamically creating API request URLs]

The following function loops over the neighborhoods and sends a ‘get’ request to Foursquare for each one, requesting all music-related venues.
While looping through each neighborhood from the NYU dataset, the function appends each music related venue entry to a list and, after looping through each neighborhood, creates a DataFrame of all of the results.
Included for each entry in the dataset are neighborhood name and location, and venue name, location, and category.
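As a rough sketch of the two steps just described (building one URL per neighborhood with comma-separated category IDs, then looping and collecting rows), something like the following could work. The function names, the `version` date, and the `limit` value are assumptions for illustration; the 1,000 m radius comes from this report. The network call is passed in as `fetch_json` so the example runs offline with a stub; in practice it might be `lambda url: requests.get(url).json()`.

```python
from urllib.parse import urlencode

import pandas as pd

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lng, category_ids, client_id, client_secret,
                     version="20180605", radius=1000, limit=50):
    """Build one search URL that covers several category IDs at once."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                          # assumed API version date
        "ll": f"{lat},{lng}",
        "radius": radius,                      # meters around the centroid
        "limit": limit,                        # assumed page size
        "categoryId": ",".join(category_ids),  # comma-separated IDs
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def collect_venues(neighborhoods, category_ids, fetch_json, **credentials):
    """Loop over neighborhoods and gather one row per returned venue."""
    rows = []
    for hood in neighborhoods.itertuples(index=False):
        url = build_search_url(hood.Latitude, hood.Longitude,
                               category_ids, **credentials)
        payload = fetch_json(url)
        for venue in payload["response"]["venues"]:
            location = venue.get("location", {})
            categories = venue.get("categories", [])
            rows.append({
                "Neighborhood": hood.Neighborhood,
                "Neighborhood Latitude": hood.Latitude,
                "Neighborhood Longitude": hood.Longitude,
                "Venue": venue["name"],
                "Venue Latitude": location.get("lat"),
                "Venue Longitude": location.get("lng"),
                "Venue City": location.get("city", "N/A"),
                "Venue State": location.get("state", "N/A"),
                "Venue Category": categories[0]["name"] if categories else "N/A",
            })
    return pd.DataFrame(rows)


def fake_fetch(url):
    """Offline stand-in for the HTTP request; returns one made-up venue."""
    return {"response": {"venues": [{
        "name": "Example Records",
        "location": {"lat": 40.77, "lng": -73.92, "state": "NY"},
        "categories": [{"name": "Record Shop"}],
    }]}}


hoods = pd.DataFrame([{"Neighborhood": "Astoria",
                       "Latitude": 40.77, "Longitude": -73.92}])
venues = collect_venues(hoods, ["4bf58dd8d48988d1fe941735"], fake_fetch,
                        client_id="ID", client_secret="SECRET")
print(venues)
```

Filling missing fields with ‘N/A’ at collection time keeps every row the same shape, which makes the later filtering steps simpler.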
[Image: Recursively retrieving music-related venues for each New York City neighborhood]

The resulting preliminary venue DataFrame includes 9,442 venues that were pulled from Foursquare:

[Image: 9,442 venues were pulled from Foursquare]

Since I had some issues with exceeding Foursquare’s API rate limits, a copy of the preliminary dataset was saved to csv after it was acquired, so that future development would not require re-requesting information from Foursquare.
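One way to express this caching idea is a small helper that only calls the API when no csv exists yet. This is a sketch under assumptions: `load_or_fetch` and the file name are hypothetical, and `keep_default_na=False` is my own choice so that literal strings such as ‘N/A’ are not turned into missing values when the file is read back.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Return the cached DataFrame if the csv exists, else fetch and cache it."""
    if os.path.exists(csv_path):
        # keep_default_na=False keeps strings like 'N/A' as plain text.
        return pd.read_csv(csv_path, keep_default_na=False)
    frame = fetch()
    frame.to_csv(csv_path, index=False)
    return frame


calls = []


def fake_fetch():
    """Stand-in for the expensive Foursquare retrieval step."""
    calls.append(1)
    return pd.DataFrame({"Venue": ["Example Hall"], "Venue City": ["N/A"]})


cache_path = os.path.join(tempfile.mkdtemp(), "venues.csv")
first = load_or_fetch(cache_path, fake_fetch)   # triggers the fetch
second = load_or_fetch(cache_path, fake_fetch)  # served from the csv
print(len(calls), "fetch call(s)")
```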
[Image: Write data to csv]

[Image: Sample of csv file]

Exploratory Data Analysis

The series of images below are meant to capture my process for exploring the data retrieved from Foursquare in an effort to better understand what kind of venues were actually pulled during my requests.
In a perfect world, each entry would be music-related and located in New York City, but that needed to be verified.
The questions below informed how the preliminary venue data was pre-processed, as shown in the Data Pre-Processing section of this report.

Question: What states are the venues located in?
Answer: Most entries pulled from the API request included a ‘state’ parameter equal to either ‘New York’ or ‘NY’. Some entries included a ‘state’ parameter equal to ‘CA’, ‘MA’, or ‘NJ’ and will need to be removed.

[Image: Showing the Venue State counts]

Question: What venue categories are the entries in?
Answer: There are 149 unique venue categories included in this dataset.
Some of the categories are not music related, which was a result of using higher level ‘venue categories’ defined by Foursquare.
Non-music related categories will be removed.

[Image: Showing the Venue Category counts]

Question: How many venues did not have their ‘city’ parameter filled out?
Answer: The image below shows that there were quite a few venues that did not have a ‘city’ parameter filled out.
At first I thought this would not be an issue because I still have a latitude and longitude associated with each entry.
Upon further analysis, it was determined that entries with no ‘city’ parameter were no longer active establishments and thus will be removed.

[Image: Showing venues that do not have a Venue City parameter]

Question: Are there any null values in the dataset?
Answer: No.

[Image: Checking for null values in the data]

Question: How many unique venues were retrieved?
Answer: In the preliminary dataset, there are fewer unique venue names than there are entries in total.
This means that there are venues associated with more than one neighborhood, which is the result of queries that overlapped because the radius was set to 1,000 m in the API request.
This will be accepted because the venue is within walking distance of the neighborhood centroid and can influence that neighborhood’s scene.
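The checks behind these questions can be approximated with a handful of pandas calls. The sketch below runs on a made-up five-row table; the column names follow the ones used in this report, but the rows are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the preliminary venue DataFrame (values are invented).
venues = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood A", "Hood B", "Hood B", "Hood C"],
    "Venue": ["Hall A", "Shop B", "Hall A", "Club C", "Shop D"],
    "Venue City": ["New York", "N/A", "New York", "New York", "Hoboken"],
    "Venue State": ["NY", "New York", "NY", "NY", "NJ"],
    "Venue Category": ["Concert Hall", "Record Shop", "Concert Hall",
                       "Jazz Club", "Music Store"],
})

# Which states do the venues fall in?
state_counts = venues["Venue State"].value_counts()

# Which venue categories came back, and how often?
category_counts = venues["Venue Category"].value_counts()

# Which entries had no city filled in (stored as the string 'N/A')?
missing_city = venues[venues["Venue City"] == "N/A"]

# Are there any true null values?
null_total = int(venues.isnull().sum().sum())

# Fewer unique names than rows means some venues appear in several neighborhoods.
unique_names = venues["Venue"].nunique()

print(state_counts, category_counts, sep="\n\n")
print(len(missing_city), null_total, unique_names, len(venues))
```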

[Image: Checking for duplicate Venue Names]

Data Pre-Processing

Data Cleaning

The preliminary dataset was cleaned according to the answers listed in the Exploratory Data Analysis section above.
First, venues located in states other than “New York” or “NY” were removed.
Entries with “Venue State” equal to “New York” were changed to “NY.”

[Image: Removing venues not in New York state]

Entries returned by Foursquare with no ‘Venue City’ and given the ‘N/A’ treatment were also removed:

[Image: Removing venues with no Venue City parameter]

A list of music-related venue categories was created based on the unique venue categories included in the preliminary dataset.
This list was used to filter out the non-music-related entries that snuck into our request.
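Taken together, the three cleaning steps above might look something like this. It is a minimal sketch: `clean_venues` is a hypothetical helper, the sample rows are invented, and the category list is a short stand-in for the longer list built from the real data.

```python
import pandas as pd


def clean_venues(venues, music_categories):
    """Apply the state, city, and category filters in sequence."""
    # 1. Keep only New York state entries and normalize the label to 'NY'.
    cleaned = venues[venues["Venue State"].isin(["NY", "New York"])].copy()
    cleaned["Venue State"] = "NY"
    # 2. Drop entries whose city was never filled in.
    cleaned = cleaned[cleaned["Venue City"] != "N/A"]
    # 3. Keep only the categories judged to be music-related.
    cleaned = cleaned[cleaned["Venue Category"].isin(music_categories)]
    return cleaned.reset_index(drop=True)


raw = pd.DataFrame({
    "Venue": ["Hall A", "Shop B", "Club C", "Cafe D", "Shop E"],
    "Venue City": ["New York", "N/A", "Brooklyn", "New York", "Hoboken"],
    "Venue State": ["NY", "NY", "New York", "NY", "NJ"],
    "Venue Category": ["Concert Hall", "Record Shop", "Jazz Club",
                       "Coffee Shop", "Music Store"],
})
music_categories = ["Concert Hall", "Record Shop", "Jazz Club", "Music Store"]

ny_music_venues = clean_venues(raw, music_categories)
print(ny_music_venues)
```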

[Image: Removing venues that are not music-related]

The image below shows the total number of entries and number of unique entries in the ny_music_venues dataframe.
As previously mentioned, some venues are assigned to multiple neighborhoods because the venue is within 1,000 meters of the neighborhood’s centroid location.

[Image: Checking for duplicate venues]

One-Hot-Encoding Venue Categories

In order to use Foursquare’s category values to find similar neighborhoods based on music venues, a one-hot-encoding representation of each entry was created using Pandas’ ‘get_dummies’ function.
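For readers unfamiliar with `get_dummies`, here is a small sketch of the encoding step, followed by one way the encoded rows can be summed into per-neighborhood counts. The four rows and the neighborhood labels are invented for illustration.

```python
import pandas as pd

# Toy version of the cleaned venue table (values are invented).
ny_music_venues = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood A", "Hood A", "Hood B"],
    "Venue Category": ["Jazz Club", "Jazz Club", "Lounge", "Concert Hall"],
})

# One column per category; a 1 marks the category of each venue row.
onehot = pd.get_dummies(ny_music_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", ny_music_venues["Neighborhood"])

# Summing the indicator columns gives venue counts per neighborhood.
venue_counts = onehot.groupby("Neighborhood").sum()
print(venue_counts)
```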
The result was a DataFrame of New York City music-related venues where each entry’s venue category is represented by a value of 1 in the column of the matching venue category, as shown below:

[Image: One-Hot-Encoding categorical variables]

Data Visualization

Venue counts were determined for each venue category and neighborhood in New York City using the one-hot-encoded DataFrame:

[Image: Total amount of venues of each category in each neighborhood]

Using the DataFrame of venue counts shown above, horizontal bar plots were created for select venue categories to help visualize the top 25 neighborhoods with the most of each particular venue.
The plots were generated using the following loop and matplotlib:

[Image: Code for recursively plotting top neighborhoods with venues of particular category]

[Images: Neighborhoods With Most ‘Concert Halls’; Neighborhoods With Most ‘Music Venues’; Neighborhoods With Most ‘Nightclubs’; Neighborhoods With Most ‘Jazz Clubs’; Neighborhoods With Most ‘Piano Bars’]

Feature Generation

The encoded dataset of music-related venues in New York City was then used to quantify a music profile for each neighborhood.
For each venue category, the percent distribution of venues across each neighborhood was calculated.
This information would then be used to fit a K-Means clustering algorithm to the data in an effort to determine neighborhoods of similar music venue profile.
First, the total number of venues for each category was determined:

[Image: Creating a dictionary of venue category and total count]

Then, for each venue category, the percentage of venues in each neighborhood was calculated with respect to the total number of venues of that category in the dataset.
So it’s clear, the value shown in the “Lounge” column for Astoria represents the percentage of lounges in the dataset that are located in Astoria.
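The calculation amounts to dividing each column of the count table by that column's total. A minimal sketch with made-up counts (the neighborhood names are real, the numbers are not):

```python
import pandas as pd

# Made-up venue counts per neighborhood and category.
venue_counts = pd.DataFrame(
    {"Jazz Club": [3, 1, 0], "Lounge": [1, 1, 2]},
    index=pd.Index(["Astoria", "Chelsea", "Harlem"], name="Neighborhood"),
)

# Total number of venues of each category across the whole dataset.
category_totals = venue_counts.sum(axis=0).to_dict()

# Share of each category's venues that falls in each neighborhood, in percent.
venue_pct = venue_counts.div(pd.Series(category_totals), axis="columns") * 100
print(venue_pct)
```

In this toy table, Astoria holds 1 of the 4 lounges, so its “Lounge” value is 25.0, and every column sums to 100.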

[Image: Percentage of entities of particular venue in a particular neighborhood]

With the above, a DataFrame showing the top five music venue categories for each neighborhood was created:

[Image: Showing the top five venue categories per neighborhood]

Results

Cluster Modeling

Scikit-learn’s K-Means clustering was used to determine similar neighborhoods based on music venue percentage.
The image below shows the data being scaled and the K-Means model being created:

[Image: Clustering neighborhood venue data]

A new DataFrame was created by merging neighborhood location data with cluster labels and top venue categories.
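A rough sketch of the scale, cluster, and merge steps is shown below. Several details are assumptions: the report does not say which scaler was used, so `StandardScaler` is a guess; the toy data uses two clusters, whereas the appendix suggests the real model produced fifteen (Cluster 0 through Cluster 14); and the neighborhood labels, coordinates, and percentages are invented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented percentage features for four hypothetical neighborhoods.
venue_pct = pd.DataFrame(
    {"Jazz Club": [50.0, 45.0, 5.0, 0.0],
     "Opera House": [0.0, 0.0, 50.0, 50.0]},
    index=pd.Index(["Hood A", "Hood B", "Hood C", "Hood D"],
                   name="Neighborhood"),
)

# Scale each feature to zero mean and unit variance (assumed scaler).
scaled = StandardScaler().fit_transform(venue_pct)

# Fit K-Means; n_clusters=2 suits the toy data only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

labels = pd.DataFrame({
    "Neighborhood": venue_pct.index,
    "Cluster Label": kmeans.labels_,
})

# Hypothetical location table to merge the labels onto.
locations = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood B", "Hood C", "Hood D"],
    "Latitude": [40.70, 40.71, 40.80, 40.81],
    "Longitude": [-73.90, -73.91, -73.95, -73.96],
})
merged = locations.merge(labels, on="Neighborhood", how="inner")
print(merged)
```

Fixing `random_state` makes the label assignment repeatable between runs, which helps when the labels are later used to color a map.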

[Image: Merging neighborhood location and cluster data]

Cluster Visualization

The following code uses folium to visualize neighborhoods of similar music profile by coloring each neighborhood point based on cluster label:

[Image: Code to generate a folium leaflet map with neighborhoods colored by cluster]

[Image: Map of New York City showing clusters]

Cluster Evaluation

The following code iterates through and prints the results of each cluster:

[Image: Code to iterate through and print each cluster]

The resulting clusters can be seen in the Clusters section of this document’s Appendix.
Each cluster shows a list of neighborhoods with their respective top venue categories.
We can compare the resulting clusters to the bar plots in the Data Visualization section and get a sense that the clusters are properly grouping neighborhoods based on music-related venue counts.
It is interesting to see that some clusters are very small, sometimes only holding a single neighborhood, and appear to have identified a niche music profile.
Examples of this are:

Cluster 1: Coney Island, Music Festival (Coney Island Music Festival)
Cluster 2: Lincoln Square, Opera House (Metropolitan Opera House)

Other clusters, such as Cluster 4 and Cluster 7, are very large and appear to be grouping neighborhoods with assortments of live-music venue types such as Music Venue, Nightclub, Rock Club, Lounge, etc.
Within each of Clusters 9, 11, 12, 13, and 14, the 1st Top Venue Category is the same for every neighborhood: Jazz Club, Recording Studio, Rock Club, Jazz Club, and Piano Bar, respectively.
It’s interesting to see that Cluster 9 and 12 are both mainly Jazz-oriented but were clustered separately because their other top venue categories differ, meaning a different music profile.
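Observations like these can be checked programmatically by summarizing each cluster. The sketch below is illustrative only: the column names “Cluster Label” and “1st Top Venue Category” are assumed, and the five rows are invented rather than taken from the real results.

```python
import pandas as pd

# Invented merged table; column names are assumptions for this sketch.
merged = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood B", "Hood C", "Hood D", "Hood E"],
    "Cluster Label": [0, 0, 1, 1, 2],
    "1st Top Venue Category": ["Jazz Club", "Jazz Club", "Nightclub",
                               "Lounge", "Opera House"],
})

# Print the members of each cluster in turn.
for label, group in merged.groupby("Cluster Label"):
    print(f"Cluster {label}: {', '.join(group['Neighborhood'])}")

# Size of each cluster and how many distinct top categories it contains.
summary = merged.groupby("Cluster Label").agg(
    neighborhoods=("Neighborhood", "count"),
    distinct_top_categories=("1st Top Venue Category", "nunique"),
)
print(summary)
```

A cluster with `neighborhoods == 1` is a single-neighborhood niche, and a cluster with `distinct_top_categories == 1` is one where every member shares the same top category.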

Conclusion

Machine learning and clustering algorithms can be applied to multi-dimensional datasets to find similarities and patterns in the data.
Clusters of neighborhoods of similar music profile, or any profile, can be generated using high-quality venue location data.
The emphasis is on ‘high-quality’ because analysis models are only as good as the input into them (garbage in, garbage out).
Luckily, Foursquare offers a robust ‘Places API’ service that, although (as we have seen) not perfect (nothing is), can be leveraged in similar studies and model-making.
This project is by no means finished and could be expanded on in a number of different ways.
Foursquare’s API could be further interrogated to retrieve and consider more music-related venues in New York City.
New datasets of music-related venues can be acquired and potentially merged with what was retrieved from Foursquare.
The DBSCAN clustering algorithm, better at maintaining dense clusters and ignoring outliers, could be implemented and compared to KMeans.
The clustering model could become the basis for a recommendation system aimed to provide neighborhoods of similar music profile to users.
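To give a flavor of the DBSCAN idea mentioned above, here is a tiny sketch on synthetic two-dimensional points, not on the project's data. The `eps` and `min_samples` values are chosen for this toy example only; real neighborhood features would likely need scaling and their own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one isolated point (synthetic data).
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, -20.0],
])

# Points with at least 3 neighbors (including themselves) within 0.5 form clusters.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(features)
print(labels)
```

Unlike K-Means, DBSCAN does not force every point into a cluster: the isolated point receives the label -1, which marks it as noise.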
I look forward to continuing to explore and leverage music-related datasets in the future.

Project GitHub: https://github.com/cascio/IBM_Data_Science_Capstone
Personal LinkedIn: https://linkedIN.com/in/mscascio

References

2014 New York City Neighborhood Names, NYU Spatial Data Repository
‘Places API’ Documentation, Foursquare

Appendix

Clusters

[Images: neighborhood tables for Cluster 0 through Cluster 14]