Visualizing Citi Bike NYC Network Data using Python and Gephi

Connor Hanafee
Feb 17

As someone who enjoys both networks and data visualization, I decided to play with some data that would allow me to explore both topics.

The following is a network-based approach for visualizing a random sample of NYC Citi Bike traffic from October 2018.

Citi Bike generously publishes all of their ridership data on their website.

I chose October because I started thinking about the project in November 2018.

Citi Bike website: Citi Bike: NYC's Official Bike Sharing System (www.citibikenyc.com)

Gephi is a great software package for visualizing graphs and doing exploratory data analysis on them.

Gephi website: Gephi – The Open Graph Viz Platform (gephi.org)

There are already many great examples of Citi Bike analyses out there, so I will not reinvent that wheel.

I am writing this to share my little bit of code that helped me get network data ready with node attributes for input to Gephi and show how I used Gephi visualizations to make basic inferences on the Citi Bike network.

Note: I use the terms graph and network interchangeably.

Import, subset and sample the data

    import pandas as pd
    import networkx as nx

    # import data, get columns of interest and sample 10k rows
    bike_data = pd.read_csv("citibike_oct_2018.csv")
    bd_sub = bike_data[['start station name', 'end station name',
                        'start station latitude', 'start station longitude',
                        'end station latitude', 'end station longitude',
                        'gender', 'hour']].sample(10000)
    bd_sub.head()

In this code block, I imported the Citi Bike data from a CSV file (downloaded from their website), grabbed the columns I needed, and sampled 10,000 rows (bike rides).
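
As far as I can tell, the raw monthly trip files do not include an `hour` column, so the selection above presumably relies on one derived beforehand. Here is a minimal sketch of that step, assuming the raw file has a `starttime` timestamp column (an assumption, not something stated above). The helper name `add_hour_and_sample` is hypothetical, and `random_state` is added only so the sample is reproducible.

```python
import pandas as pd

def add_hour_and_sample(trips, n, seed=42):
    """Derive an 'hour' column from 'starttime' and draw a reproducible sample.

    Assumes 'starttime' holds timestamps (strings or datetimes); that column
    name is an assumption about the raw file.
    """
    trips = trips.copy()  # leave the caller's frame untouched
    trips['hour'] = pd.to_datetime(trips['starttime']).dt.hour
    # random_state fixes the sample so reruns return the same rows
    return trips.sample(n=min(n, len(trips)), random_state=seed)

# tiny made-up frame standing in for the real CSV
toy = pd.DataFrame({
    'starttime': ['2018-10-01 08:15:00', '2018-10-01 17:40:00', '2018-10-02 23:05:00'],
    'start station name': ['A', 'B', 'A'],
    'end station name': ['B', 'C', 'C'],
})
sampled = add_hour_and_sample(toy, n=2)
```

With the real file, the same helper would be called with `n=10000` before selecting the columns of interest.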

The data looks like this:

[Figure: bd_sub headshot]

A few notes:

I sampled 10,000 rows because it is computationally reasonable and still captures the pertinent information for the purposes of this analysis.

I will rename the start and end station names to Source and Target respectively.

Each source and target pair or row of data encodes an edge in the graph we will eventually make.

Also, Gephi likes having the names Source and Target explicitly outlined for its input data.

For this analysis, I kept the graph (network) in multigraph form.

What that means is that instead of having one single (possibly weighted) edge representing relationships between two nodes in a graph, there can be multiple parallel edges.

Specifically, two nodes in the graph will have as many edges between them as they have co-occurrences in the rows of a dataset.

Creating a weighted graph with single edges that have a corresponding edge weight (sum of number of edges) can be useful for computing certain metrics, but I really like multigraphs for visualization.
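
To make the difference concrete, here is a small sketch of collapsing a multigraph into a weighted graph, where each weight is the number of parallel edges. The helper `collapse_to_weighted` and the toy rides are made up for illustration and are not part of the analysis code.

```python
import networkx as nx

def collapse_to_weighted(multi):
    """Collapse parallel edges of a MultiGraph into single weighted edges."""
    weighted = nx.Graph()
    weighted.add_nodes_from(multi.nodes(data=True))
    for u, v in multi.edges():
        if weighted.has_edge(u, v):
            weighted[u][v]['weight'] += 1
        else:
            weighted.add_edge(u, v, weight=1)
    return weighted

# made-up rides: three between A and B, one between B and C
M = nx.MultiGraph()
M.add_edges_from([('A', 'B'), ('A', 'B'), ('B', 'A'), ('B', 'C')])
W = collapse_to_weighted(M)
```

The multigraph keeps all four rides as separate edges, while the weighted graph has two edges whose weights carry the counts.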

Create nodes dataframe with each node’s respective lat long coordinates

The challenging part (for me at least) of this analysis was getting latitude and longitude as node attributes.

Edge attributes are easy because they are just corresponding data for a row in a dataframe.

Node attributes require a little more wrangling.

The following code blocks show how to build a graph in NetworkX from pandas and import lat-long coordinates as attributes.

Source nodes

    # rename original start and end station names to Source and Target
    bd_sub.rename(index=str, columns={'start station name':'Source', 'end station name':'Target'}, inplace=True)

    # build source nodes df
    source_nodes_df = pd.DataFrame(bd_sub['Source'].unique())
    source_nodes_df.rename(columns={0:'Source'}, inplace=True)
    source_nodes_latlong = bd_sub[['Source', 'start station latitude', 'start station longitude']]
    source_input_nodes = source_nodes_df.merge(source_nodes_latlong, left_on='Source', right_on='Source', how='right')
    source_nodes = source_input_nodes.drop_duplicates(['Source'], keep='first').rename(columns={'Source':'node', 'start station latitude':'latitude', 'start station longitude':'longitude'})

Target nodes

    # Build Target nodes dataframe
    target_nodes_df = pd.DataFrame(bd_sub['Target'].unique())
    target_nodes_df.rename(columns={0:'Target'}, inplace=True)
    target_nodes_latlong = bd_sub[['Target', 'end station latitude', 'end station longitude']]
    target_input_nodes = target_nodes_df.merge(target_nodes_latlong, left_on='Target', right_on='Target', how='right')
    target_nodes = target_input_nodes.drop_duplicates(['Target'], keep='first').rename(columns={'Target':'node', 'end station latitude':'latitude', 'end station longitude':'longitude'})

Merge source and target data frames

    # pd.concat stacks the two node tables; DataFrame.append was removed in pandas 2.0
    gephi_nodes = pd.concat([source_nodes, target_nodes], sort=False).drop_duplicates(['node'], keep='first')
    gephi_nodes.head()

[Figure: Nodes with lat long coordinates]

Now let’s build the actual graph using NetworkX.
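
Before building the graph, it can be worth cross-checking the node table. The sketch below rebuilds it in a single pass and confirms that every station in the edge list has coordinates. This is an alternative formulation, not the code used above: `build_node_table` is a hypothetical helper and the coordinates are made up.

```python
import pandas as pd

def build_node_table(edges):
    """Return one row per station with its latitude and longitude.

    Expects the renamed 'Source'/'Target' columns plus the four
    coordinate columns from the trip data.
    """
    cols = ['node', 'latitude', 'longitude']
    src = edges[['Source', 'start station latitude', 'start station longitude']].copy()
    tgt = edges[['Target', 'end station latitude', 'end station longitude']].copy()
    src.columns = cols
    tgt.columns = cols
    stacked = pd.concat([src, tgt], ignore_index=True)
    return stacked.drop_duplicates('node', keep='first').reset_index(drop=True)

# made-up edge list standing in for bd_sub
edges = pd.DataFrame({
    'Source': ['A', 'B', 'A'],
    'Target': ['B', 'C', 'C'],
    'start station latitude': [40.70, 40.72, 40.70],
    'start station longitude': [-74.00, -73.99, -74.00],
    'end station latitude': [40.72, 40.75, 40.75],
    'end station longitude': [-73.99, -73.98, -73.98],
})
nodes = build_node_table(edges)

# stations that appear in the edge list but have no row in the node table
stations = set(edges['Source']) | set(edges['Target'])
missing = stations - set(nodes['node'])
```

An empty `missing` set means every node in the graph will receive a latitude and longitude.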

There are many ways to create graphs in NetworkX.

I chose to use the pandas option, but we still need to input the node attributes.

    # build graph using subset of original dataset
    # from_pandas_edgelist is the NetworkX 2.x name for the pandas constructor
    # nx.MultiGraph() keeps one edge per ride; nx.Graph() would collapse
    # repeated station pairs into a single edge
    G = nx.from_pandas_edgelist(df=bd_sub, source='Source', target='Target', edge_attr='gender', create_using=nx.MultiGraph())

NetworkX requires the node attributes to be in dictionary format, so a little more wrangling is in order.

    # Build a dictionary for latitude values
    lat_dict = pd.Series(gephi_nodes.latitude.values, index=gephi_nodes.node).to_dict()

    # Set the node attribute (NetworkX 2.x argument order: graph, values, name)
    nx.set_node_attributes(G, lat_dict, 'latitude')

    # repeat for longitude
    long_dict = pd.Series(gephi_nodes.longitude.values, index=gephi_nodes.node).to_dict()
    nx.set_node_attributes(G, long_dict, 'longitude')

Finally, write the graph into a GEXF file so we can open it in Gephi to do some visualization.

Read more about GEXF here: https://gephi.org/gexf/format/

    nx.write_gexf(G, "file_name.gexf")

Visualizations in Gephi

At this point, the data is in GEXF file format and ready for input into Gephi.
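
Before opening the file in Gephi, a quick round trip in Python can confirm that the coordinates made it into the file. This is a minimal sketch on a toy graph with made-up coordinates, written to a temporary directory rather than to the real output file.

```python
import os
import tempfile
import networkx as nx

# toy graph with made-up coordinates standing in for the Citi Bike graph
G_toy = nx.Graph()
G_toy.add_node('Station A', latitude=40.7, longitude=-74.0)
G_toy.add_node('Station B', latitude=40.8, longitude=-73.9)
G_toy.add_edge('Station A', 'Station B')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.gexf')
    nx.write_gexf(G_toy, path)
    G_back = nx.read_gexf(path)

# nodes that lost a coordinate on the way through the file
missing = [n for n, attrs in G_back.nodes(data=True)
           if 'latitude' not in attrs or 'longitude' not in attrs]
```

If `missing` is non-empty for the real graph, the usual cause is a station name in the edge list that has no matching key in the attribute dictionaries.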

I will not walk through the steps to get to the visualizations as there are a number of good Gephi tutorials already available.

I ran a clustering algorithm (native to Gephi) and colored the nodes based on what cluster they belong to.

I used 3 different layouts to make inferences about the Citi Bike ridership data.
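
For readers who would rather stay in Python for the clustering step, here is a rough stand-in. It uses NetworkX's greedy modularity maximization, which is not necessarily the algorithm I ran in Gephi, and stores the result as a node attribute so it can travel with the GEXF file. The attribute name `community` and the toy stations are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# two tight groups of made-up stations joined by a single ride
G_toy = nx.Graph()
G_toy.add_edges_from([
    ('A1', 'A2'), ('A1', 'A3'), ('A2', 'A3'), ('A1', 'A4'), ('A2', 'A4'), ('A3', 'A4'),
    ('B1', 'B2'), ('B1', 'B3'), ('B2', 'B3'), ('B1', 'B4'), ('B2', 'B4'), ('B3', 'B4'),
    ('A1', 'B1'),
])

# each community is a set of nodes; number them and attach the label to each node
communities = greedy_modularity_communities(G_toy)
membership = {node: i for i, group in enumerate(communities) for node in group}
nx.set_node_attributes(G_toy, membership, 'community')
```

On this toy graph the two groups come back as separate communities, with the single bridging ride between A1 and B1 crossing the boundary.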

[Figure: Geo-located nodes colored by community]

Gephi comes with a geographic layout function that you can use if you have the latitude and longitude coordinates of the nodes.

That is why I went through the trouble of making them a node attribute with Python.
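
To give a feel for what a geographic layout does with those attributes, here is a minimal sketch of one common projection, spherical Mercator, turning latitude and longitude into x, y positions. Which projection Gephi applied in my case is not something I recorded, so treat the choice as an assumption. `mercator_xy` is a hypothetical helper and the coordinates are made up.

```python
import math

def mercator_xy(latitude, longitude):
    """Project degrees of latitude/longitude to unitless Mercator x, y."""
    x = math.radians(longitude)
    y = math.log(math.tan(math.pi / 4 + math.radians(latitude) / 2))
    return x, y

# made-up coordinates roughly in the New York area
south = mercator_xy(40.70, -74.00)
north = mercator_xy(40.80, -73.95)
```

Stations further north get a larger y and stations further east a larger x, so the layout preserves the relative positions seen on a map.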

This layout is nice in this instance because it allows us to see how the community detection algorithm works on the Citi Bike data.

It seems pretty clear cut that the communities of stations in the Citi Bike network are largely based on geographic location.

Lower Manhattan stations (pink) are lumped together, Brooklyn stations (orange) are lumped together, etc.

However, there does seem to be overlap between the communities.

Let’s use another layout to visually investigate that overlap some more.

[Figure: Circular layout with same community coloring]

The circular layout is nice because it allows us to see which communities have overlap between them.

Each station is laid out around the circle and colored by the cluster it belongs to.

This graph clearly shows that most rides stay within the general vicinity of the starting station.

That is indicated by the strong, almost gravitational, clusters around the edge of the circle.

The noteworthy overlaps happen between the three sections of Manhattan.

The dominant pink dash across the middle indicates that most cross-community traffic leaves lower Manhattan going towards Midtown.

Notice there is not an eye-catching dash between lower and upper Manhattan.
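
The split between within-community and cross-community rides can also be counted directly rather than read off the picture. Here is a minimal sketch, assuming each node carries a `community` attribute (a hypothetical name; the column Gephi writes may be called something else). The helper `community_traffic` and the toy rides are made up.

```python
from collections import Counter
import networkx as nx

def community_traffic(graph, attr='community'):
    """Count edges by the unordered pair of communities they connect."""
    counts = Counter()
    for u, v in graph.edges():
        pair = tuple(sorted((graph.nodes[u][attr], graph.nodes[v][attr])))
        counts[pair] += 1
    return counts

# made-up rides between stations in three communities
M = nx.MultiGraph()
M.add_nodes_from(['L1', 'L2'], community='lower')
M.add_nodes_from(['M1'], community='midtown')
M.add_nodes_from(['U1'], community='upper')
M.add_edges_from([('L1', 'L2'), ('L1', 'L2'), ('L2', 'L1'),
                  ('L1', 'M1'), ('L2', 'M1'), ('M1', 'U1')])

counts = community_traffic(M)
# share of rides whose two endpoints sit in the same community
within = sum(n for (a, b), n in counts.items() if a == b)
share_within = within / M.number_of_edges()
```

In this toy data half of the rides stay inside one community, and there is no `('lower', 'upper')` entry at all, which is the numeric counterpart of a missing dash across the circle.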

Let’s use another layout to characterize the stations that cross communities.

[Figure: Radial layout with high degree nodes on outside of radius]

This is a radial layout where each radius is a different community and the nodes on each radius are ordered by ascending degree (number of connections in a network), starting from the middle.

This means that the highest degree nodes in each community are on the outer end of the radius.

Visually inspecting the graph, you can see that the high degree nodes are the ones that tend to interact with each other across communities.
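
The ordering behind that layout is easy to reproduce outside Gephi. The sketch below groups nodes by community and sorts each group by ascending degree, so the last station in each list is the one that would sit on the outer end of its radius. It assumes a `community` node attribute (a hypothetical name), and `radial_order` is a made-up helper.

```python
import networkx as nx

def radial_order(graph, attr='community'):
    """Group nodes by community and sort each group by ascending degree."""
    spokes = {}
    for node, community in graph.nodes(data=attr):
        spokes.setdefault(community, []).append(node)
    for community in spokes:
        # ties are broken by name so the order is deterministic
        spokes[community].sort(key=lambda n: (graph.degree(n), str(n)))
    return spokes

# made-up rides; parallel edges count toward degree in a MultiGraph
M = nx.MultiGraph()
M.add_nodes_from(['L1', 'L2', 'L3'], community='lower')
M.add_nodes_from(['M1', 'M2'], community='midtown')
M.add_edges_from([('L1', 'L2'), ('L1', 'L3'), ('L1', 'M1'), ('L1', 'M1'), ('M1', 'M2')])

spokes = radial_order(M)
```

Here L1 and M1 end up at the outer ends of their radii, and they are also the only two stations with rides crossing between the communities.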

How is this information useful?

Well, to be honest, I think that making pretty visualizations is fun in itself, but from an insights standpoint, these visualizations can be useful for exploratory data analysis.

Imagine a hypothetical situation where we are trying to figure out the best places to market a new product.

We could use a bikeshare system’s ridership data to select optimal locations for placing advertisements.

After all, people will have to dock their bikes at a station and walk around for a bit.

Good visualizations will allow us to characterize the nature of a system, then narrow in on its specifics.

For example, with the Citi Bike network visualizations, we understand that a relatively small subset of bike stations accounts for a large number of cross-cluster trips.

Therefore, it could be useful to investigate that specific subset for specific targeting, which can be done with network statistics.
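
As one example of such a statistic, the sketch below ranks stations by how many of their rides cross a community boundary and reports how concentrated that traffic is in the top stations. It assumes a `community` node attribute (a hypothetical name). `cross_community_ranking` and the toy rides are made up, so the numbers say nothing about the real network.

```python
from collections import Counter
import networkx as nx

def cross_community_ranking(graph, attr='community'):
    """Rank nodes by how many of their edges cross a community boundary."""
    crossing = Counter()
    for u, v in graph.edges():
        if graph.nodes[u][attr] != graph.nodes[v][attr]:
            crossing[u] += 1
            crossing[v] += 1
    return crossing.most_common()

# made-up rides between stations in three communities
M = nx.MultiGraph()
M.add_nodes_from(['L1', 'L2'], community='lower')
M.add_nodes_from(['M1'], community='midtown')
M.add_nodes_from(['U1'], community='upper')
M.add_edges_from([('L1', 'M1'), ('L1', 'M1'), ('L1', 'M1'),
                  ('L2', 'M1'), ('L1', 'L2'), ('M1', 'U1')])

ranking = cross_community_ranking(M)
# share of all cross-community ride endpoints that fall on the top two stations
total = sum(count for _, count in ranking)
share_top2 = sum(count for _, count in ranking[:2]) / total
```

On the real graph, the top of this ranking would be the short list of stations worth a closer look for targeting.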

If you have anything to add, I would love to hear it.

Connor.
