Plotting Spatial Data in R

Visualize neighborhoods with a high concentration of businesses in San Francisco

Aditya Tandel · Mar 8

I recently got an opportunity to work with spatial data and wanted to share my analysis of one such dataset.

The data consisted of various registered businesses in the San Francisco Bay Area, which can be found here.

An updated version can be found here.

Spatial data is data associated with locations. Typically it is described by a coordinate reference system, with latitude and longitude coordinates.
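As a minimal illustration (with made-up coordinates), spatial data can be as simple as a data frame with latitude and longitude columns:

```r
# A toy data frame of businesses with coordinates (made-up values,
# expressed as WGS84 latitude/longitude)
toy_biz <- data.frame(
  name = c("Cafe A", "Shop B"),
  Latitude = c(37.7846, 37.7750),
  Longitude = c(-122.3999, -122.4183)
)
str(toy_biz)
```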

The goal of this exercise was to find pockets of neighborhoods in San Francisco with a high concentration of businesses.

You would need to get a key from Google’s Geolocation API to use their maps.

I used the ggmap package in R to plot this data.

Then I narrowed down my analysis on one particular high concentration neighborhood to see how businesses were dispersed within that area.

First… a quick scan of the dataset

str(biz)
head(biz, 25)
summary(biz)

For the purpose of this exercise I was only concerned with the neighborhood, address, and date columns and, most importantly, the location column, which contained latitude and longitude data for each business.

Names of the businesses and their codes (which are assigned by the city for registered businesses) were not considered for now.

After basic data cleaning, such as eliminating duplicates and nulls, I extracted only the records pertaining to the city of SF and eliminated records for adjoining cities in the Bay Area.
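Those cleaning steps might look something like this (a sketch, using dplyr and a toy data frame with made-up rows):

```r
library(dplyr)

# Toy stand-in for the raw dataset: one duplicate row, one row with nulls
biz <- data.frame(
  City = c("San Francisco", "San Francisco", "Oakland", NA),
  Business.Location = c("1 Market St (37.79, -122.39)",
                        "1 Market St (37.79, -122.39)",  # exact duplicate
                        "2 Broadway (37.80, -122.27)",
                        NA),
  stringsAsFactors = FALSE
)

biz_clean <- biz %>%
  distinct() %>%                       # drop exact duplicate rows
  filter(!is.na(Business.Location))    # drop rows with missing location
```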

Identify data pertaining to San Francisco only

There were a few ways to achieve this: filter the dataset by city, by Business.Location, or by zip code. I chose the zip code logic because the other two fields contained inconsistent spellings of the San Francisco city name, which could easily be missed. I have, however, included commands for all three filtering methods.
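To see why matching on zip codes is more robust, here is a small sketch (toy addresses and a shortened zip pattern):

```r
# Toy address strings with inconsistent city spellings
addresses <- c("1 Market St San Francisco, CA 94105",
               "50 Main St SF CA 94110",
               "2 Broadway Oakland, CA 94607")

# City-name patterns have to enumerate every spelling variant;
# the zip pattern keys on the stable numeric part instead.
sf_mask <- grepl("94105|94110", addresses)
sf_mask
# [1]  TRUE  TRUE FALSE
```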

By zip

sf_biz_zip <- biz %>%
  filter(grepl(pattern = "94016|94105|94110|94115|94119|94123|94127|94132|94139|94143|94147|94156|94161|94171|94102|94107|94108|94109|94111|94112|94114|94116|94117|94118|94120|94121|94122|94124|94125|94126|94129|94130|94131|94133|94134|94137|94140|94141|94142|94144|94145|94146|94151|94153|94154|94158|94159|94160|94162|94163|94164|94172|94177|94188", Business.Location))

By city

sf_biz_city <- biz %>%
  filter(grepl(".*San Francisco.*|.*SAN FRANCISCO.*|.*SF.*|.*S SAN FRAN.*|.*Sf.*|.*San+francisco.*|.*S+san+fran.*", City))

By Business.Location

sf_biz_loc <- biz %>%
  filter(grepl(".*San Francisco.*|.*SAN FRANCISCO.*|.*SF.*|.*S SAN FRAN.*|.*Sf.*|.*San+francisco.*|.*S+san+fran.*", Business.Location))

Converting date objects

Next I wanted to eliminate businesses that had ceased to exist, using the end date recorded for each location. The date fields, however, were stored as factors, so I converted them to POSIXct, which generally makes further analysis of dates easier.

sf_biz_zip$Business.Start.Date <- as.POSIXct(sf_biz_zip$Business.Start.Date, format = "%m/%d/%Y")
sf_biz_zip$Business.End.Date <- as.POSIXct(sf_biz_zip$Business.End.Date, format = "%m/%d/%Y")
sf_biz_zip$Location.Start.Date <- as.POSIXct(sf_biz_zip$Location.Start.Date, format = "%m/%d/%Y")
sf_biz_zip$Location.End.Date <- as.POSIXct(sf_biz_zip$Location.End.Date, format = "%m/%d/%Y")

Filter out inactive businesses

Businesses that had ceased to exist were eliminated: I kept only locations with no end date and a start date before December 1, 2018.

sf_biz_active_zip <- sf_biz_zip %>% filter(is.na(Location.End.Date))
sf_biz_active_zip <- sf_biz_active_zip %>% filter(Location.Start.Date < "2018-12-01")

Stripping out coordinates from the Business Location field

The Business.Location column contained addresses along with the coordinate information, so the latitude and longitude needed to be extracted.

sf_biz_active_zip <- sf_biz_active_zip %>%
  separate(Business.Location, c("Address", "Location"), sep = "[(]")
sf_biz_active_zip <- sf_biz_active_zip %>% filter(!is.na(Location))
sf_biz_active_zip <- separate(data = sf_biz_active_zip, col = Location, into = c("Latitude", "Longitude"), sep = ",")

Other characters needed to be cleaned out too.

sf_biz_active_zip$Longitude <- gsub(sf_biz_active_zip$Longitude, pattern = "[)]", replacement = "")

I then converted the latitude and longitude variables to numeric, which helps when plotting the data and avoids type errors.

sf_biz_active_zip$Latitude <- as.numeric(sf_biz_active_zip$Latitude)
sf_biz_active_zip$Longitude <- as.numeric(sf_biz_active_zip$Longitude)

Now the fun part… visualizing the data

The resulting dataset had 88,785 records, which needed to be plotted on a Google map.

Interpreting that many records on a map would be overwhelming, to say the least! Although sampling would be one way to proceed, I instead found the top 10 neighborhoods with the largest number of businesses and plotted one of them on the map.

viz <- sf_biz_active_zip %>%
  group_by(Neighborhoods.Analysis.Boundaries) %>%
  tally() %>%
  arrange(desc(n))
colnames(viz)[2] <- "Total_Businesses"
viz <- viz[1:10, ]

I then created a bar chart of these top 10 neighborhoods.

fin_plot <- ggplot(viz, aes(x = Neighborhoods.Analysis.Boundaries, y = Total_Businesses)) +
  geom_bar(stat = "identity", fill = "#00bc6c")
fin_plot <- fin_plot + geom_text(aes(label = Total_Businesses), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1), plot.title = element_text(hjust = 0.5))
fin_plot <- fin_plot + ggtitle("Top 10 neighborhoods by business count")

Let's look at the Financial District/South Beach neighborhood in more detail, since it has the largest number of active businesses.

Registering a Google Maps key

I installed the ggmap, digest, and glue packages and then registered with the Google API to get a Geolocation API key.

install.packages(c("ggmap", "digest", "glue"))
register_google(key = "<google maps key>")

Google provides terrain, satellite, and hybrid maps, among other types.

I chose to use the terrain map.

A simple Google search can give you the city coordinates for San Francisco.

sf <- c(lon = -122.3999, lat = 37.7846)
map <- get_map(location = sf, zoom = 14, scale = 2)

By adjusting the zoom you can get a closer look; the two images below use different zoom sizes.

fin_map <- ggmap(map) + geom_point(aes(Longitude, Latitude), data = fin_dis)
fin_map <- fin_map + ggtitle("Concentration of businesses in Fin. District and South Beach") + xlab("Longitude") + ylab("Latitude") + theme(plot.title = element_text(hjust = 0.5))

Zoom out view

Zoom in view

A better visualization

A heatmap will probably make the visualization more intuitive.

fin_heatmap <- ggmap(map) + stat_density2d(data = fin_dis, aes(x = Longitude, y = Latitude, fill = ..density..), geom = 'tile', contour = F, alpha = .5)
fin_heatmap <- fin_heatmap + ggtitle("Concentration of businesses in Fin. District and South Beach") + xlab("Longitude") + ylab("Latitude") + theme(plot.title = element_text(hjust = 0.5))

Conclusion

Areas around the Powell Street BART station, Union Square, and the Embarcadero BART station have a relatively large number of businesses, whereas areas around South Beach and Lincoln Hill are sparsely populated.

Similarly, other individual neighborhoods can be plotted to understand the distribution of businesses there.
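For example, another neighborhood could be pulled out the same way (a sketch using a toy data frame with made-up coordinates; the get_map/ggmap calls require a registered API key, so they are shown commented out):

```r
library(dplyr)

# Toy stand-in for sf_biz_active_zip (hypothetical rows)
sf_biz_active_zip <- data.frame(
  Neighborhoods.Analysis.Boundaries = c("Mission", "Mission", "Marina"),
  Latitude = c(37.7599, 37.7601, 37.8030),
  Longitude = c(-122.4148, -122.4150, -122.4360)
)

mission <- sf_biz_active_zip %>%
  filter(Neighborhoods.Analysis.Boundaries == "Mission")

# Center the map on the neighborhood's mean coordinates
center <- c(lon = mean(mission$Longitude), lat = mean(mission$Latitude))

# map <- get_map(location = center, zoom = 15, scale = 2)
# ggmap(map) + geom_point(aes(Longitude, Latitude), data = mission)
```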

This was a fairly straightforward way of visualizing spatial data.

I welcome any feedback and constructive criticism.

Thank you for reading!
