An understanding of the commuting patterns of the Bay Area should allow us to suggest transit agency mergers where they are appropriate, and to keep agencies separate where they are not.

Methodology

Kaggle has posted the commute flows from every census tract in the United States to every other census tract, with the 2010 US Census as the original source. The ‘flow’ values in this data, i.e. the number of commuters travelling from a given origin census tract to a given destination census tract, can be used to calculate the distances between census tracts in feature space.

We define the distance d(i,j) in feature space between census tracts i and j as:

[Equation: definition of d(i,j)]

where:

[Equation: symbol definitions]

This implies that the feature space distance between two census tracts with no commuters flowing between them is equal to the maximum commute flow in the dataset, i.e. those census tracts are the furthest apart in feature space. It also means that the pair of census tracts with the largest commute flow has a feature space distance of ~1, i.e. those census tracts are the closest in feature space.

The directionality of commute flows presented a challenge when creating the distance matrix. Clustering algorithms generally take a dataset's positions in feature space, calculate the distance matrix, and then establish the linkage; for this problem, however, we define the distance matrix directly from the source data. This results in the unusual property that the feature space distance from A to B is likely to differ from the feature space distance from B to A, as more commuters will travel in one direction than the other. For this project, we decided to examine origin-to-destination commute flows only, as this produced clusters that were clearly defined in both feature space and real space, while the inverse produced clusters that overlapped significantly in real space.

Once we had calculated the distance between every pair of census tracts, we fed this into
SciPy’s hierarchical clustering algorithm to determine the relationship between census tracts in the Bay Area. (For this analysis, the Bay Area is defined as all census tracts located between 37 and 38.5 latitude and between -123 and -121.5 longitude.) The number of clusters is not determined in advance, but the validity and consistency of the clusters can be assessed by calculating the average silhouette score of all data points in the sample. This is shown below for varying values of k:

[Figure: Bay Area Silhouette Scores]

The dual peak in the silhouette score chart indicates both regional and local coherence in commuting patterns. Optimal values of k are found at 3, and again at 9, 10, and 11.

Let’s start by looking at k=3:

[Figure: Bay Area Census Tract Clustering (k=3)]

This clustering represents the first internally cohesive sub-regional division of the Bay Area.
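
For readers who want to experiment with the methodology described above, here is a minimal sketch of the pipeline on a made-up six-tract flow matrix. Several parts are assumptions rather than the method used in this analysis: the distance form d(i,j) = F_max / (1 + f(i,j)) is only one expression consistent with the two properties stated earlier (zero flow maps to the maximum flow, the largest flow maps to roughly 1); each origin tract's row of distances is treated as its feature vector, which is one possible way to keep the origin-to-destination direction; average linkage is an arbitrary choice; and the helper names `flow_distance_matrix` and `mean_silhouette` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform


def flow_distance_matrix(flows):
    """Convert an origin-by-destination flow matrix into distances.

    ASSUMED form: d(i, j) = F_max / (1 + f(i, j)), where F_max is the
    largest flow. Zero flow gives F_max; the largest flow gives just under 1.
    """
    flows = np.asarray(flows, dtype=float)
    f_max = flows.max()
    dist = f_max / (1.0 + flows)
    np.fill_diagonal(dist, 0.0)  # assumption: a tract is at distance 0 from itself
    return dist


def mean_silhouette(square_dist, labels):
    """Average silhouette score from a square, symmetric distance matrix."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton cluster scores 0 by convention
            continue
        a = square_dist[i, same].mean()  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            square_dist[i, labels == other].mean()
            for other in cluster_ids
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


# Made-up flows: rows are origin tracts, columns are destination tracts.
# Tracts 0-2 commute among themselves, as do tracts 3-5.
flows = np.array([
    [0, 80, 60, 0, 0, 0],
    [70, 0, 90, 0, 0, 0],
    [50, 100, 0, 0, 0, 0],
    [0, 0, 0, 0, 40, 30],
    [0, 0, 0, 45, 0, 35],
    [0, 0, 0, 20, 25, 0],
])

# The matrix is asymmetric, so each origin's row serves as its feature vector.
features = flow_distance_matrix(flows)
pairwise = pdist(features, metric="euclidean")  # condensed, symmetric distances
tree = linkage(pairwise, method="average")
square = squareform(pairwise)

# Cut the tree at several values of k (k >= 2) and score each partition.
for k in range(2, 5):
    labels = fcluster(tree, t=k, criterion="maxclust")
    print(k, round(mean_silhouette(square, labels), 3))
```

Transposing `flows` before building the features would give the destination-to-origin view, which is the inverse case discussed earlier. On real tract data, the k values at which the printed score peaks are the candidates worth mapping.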