Web Scraping Google Sheets with RSelenium

We can check this via `length(webElems)`.

There are actually 49 tables in total, but since we started on the first one above there are only 48 links.

Rather than hard-coding `1:48`, it is better to generate the sequence via code, as more tables will be added in the future.
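Generating the index from the list itself can be done with `seq_along()`. A minimal sketch, using a plain stand-in list in place of the real web elements:

```r
# stand-in for the real list of web elements found by RSelenium
webElems <- list("tab1", "tab2", "tab3")

# seq_along() adapts automatically when tabs are added or removed,
# and safely yields integer(0) for an empty list (unlike 1:length(x))
idx <- seq_along(webElems)
idx
#> [1] 1 2 3
```
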

```r
# Create NULL object to be used in the for loop
big_df <- NULL

for (i in seq_along(webElems)) {
  # choose the i'th Month/Year tab
  webElem <- webElems[[i]]
  webElem$clickElement()

  # Find the tableElem again, otherwise you get a StaleElementReference
  ## TO DO: look into WebDriverWait:
  ## https://stackoverflow.com/questions/5709204/random-element-is-no-longer-attached-to-the-dom-staleelementreferenceexception
  tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # the inner table frame

  # switch to the table frame
  remDr$switchToFrame(tableElem)
  Sys.sleep(3)

  # parse html with the XML package
  doc <- htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)

  # Extract data from the HTML table in the HTML document
  table_tmp <- readHTMLTable(doc)
  Sys.sleep(3)

  # put this into a format you can use
  table <- table_tmp[[1]][-2, -1]
  table <- table[-1, ]

  # rename the columns
  colnames(table) <- c("team_name", "team_size", "start_time", "end_time",
                       "total_time", "puzzels_solved")

  # add the city name to a column
  table$city <- rep("Montreal", nrow(table))

  # add the Month/Year this table was extracted from
  today <- Sys.Date() %m-% months(i + 1)
  table$date <- today

  # concatenate each table together
  big_df <- dplyr::bind_rows(big_df, table)

  # switch back to the main frame
  remDr$switchToFrame(NULL)

  # once you "click" the arrow element it is "held down"; as far as I know
  # there is no way to "unclick", so to prevent it from scrolling too far
  # I make sure not to take too long by keeping the sleep short
  arrowElem <- arrowElems[[1]]
  arrowElem$clickElement()

  # give it "just enough time" to scroll right
  Sys.sleep(0.3)

  # switch back to the outer frame to re-start the loop
  remDr$switchToFrame(NULL)
}

temp1 <- dplyr::bind_rows(table_tmp, big_df)
```

The problem here is that the for loop eventually fails at the end: it tries to click the right arrow when the tab bar is already as far right as it can go, so it won't download the last few tables (~5).
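The `StaleElementReference` TO DO note above could also be handled without `WebDriverWait` (which RSelenium does not provide) by re-finding the element in a small retry loop. A sketch under that assumption; `find_with_retry` is a hypothetical helper, not part of the RSelenium API:

```r
# hypothetical helper: re-find an element a few times before giving up,
# pausing between attempts so the DOM has time to settle
find_with_retry <- function(remDr, using, value, attempts = 5, pause = 1) {
  for (a in seq_len(attempts)) {
    elem <- tryCatch(
      remDr$findElement(using = using, value = value),
      error = function(e) NULL  # swallow the stale/NoSuchElement error
    )
    if (!is.null(elem)) return(elem)
    Sys.sleep(pause)
  }
  stop("element not found after ", attempts, " attempts: ", value)
}

# usage inside the loop would then be, e.g.:
# tableElem <- find_with_retry(remDr, "id", "pageswitcher-content")
```
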

Typically one would handle such conditions with something like:

```r
check <- try(expression, silent = TRUE)
# or: suppressMessages(try(expression, silent = TRUE))
if (any(class(check) == "try-error")) {
  # do stuff
}
```

and it usually works fine, including when using Selenium.
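An equivalent, slightly more idiomatic pattern uses `tryCatch()`, so the success/failure flag is computed in one step. A sketch, where the wrapped expression stands in for whatever risky call (such as the arrow click) you want to guard:

```r
# wrap a risky expression and return TRUE/FALSE instead of
# inspecting class(check) afterwards
safely_ran <- function(expr) {
  tryCatch({ force(expr); TRUE }, error = function(e) FALSE)
}

safely_ran(log(1))              # TRUE: the call succeeded
safely_ran(stop("end of tabs")) # FALSE: the error was caught, loop can continue
```
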

The issue encountered here, however, is that clicking the arrow once would always jump to the last visible sheets, skipping everything in the middle.

Therefore my workaround to get the remaining sheets was this:

```r
# ctrl+c from the google sheet, then use read.delim() to assign
# the clipboard contents to an object
march    <- read.delim("clipboard")
february <- read.delim("clipboard")
january  <- read.delim("clipboard")
december <- read.delim("clipboard")
november <- read.delim("clipboard")

# add the city and date
january$city  <- rep("montreal", nrow(january))
january$date  <- rep("2015-01-30", nrow(january))
february$city <- rep("montreal", nrow(february))
february$date <- rep("2015-02-15", nrow(february))
march$city    <- rep("montreal", nrow(march))
march$date    <- rep("2015-03-15", nrow(march))
december$city <- rep("montreal", nrow(december))
december$date <- rep("2014-12-15", nrow(december))
november$city <- rep("montreal", nrow(november))
november$date <- rep("2014-11-15", nrow(november))

# clean up the column names
january  %<>% janitor::clean_names()
february %<>% janitor::clean_names()
march    %<>% janitor::clean_names()
december %<>% janitor::clean_names()
november %<>% janitor::clean_names()

# concatenate these five dataframes together
xyz <- bind_rows(march, february, january, december, november)

# reorder the columns
xyz %<>% dplyr::select(team_name, team_size, start_time, end_time,
                       total_time, puzzles, city, date)

# rename this column to match temp1
xyz <- rename(xyz, puzzels_solved = puzzles)

# change to a character
xyz$puzzels_solved <- as.character(xyz$puzzels_solved)

# add NA for team size
xyz$team_size <- rep(NA, nrow(xyz))

# convert the character dates into actual dates
xyz$date <- as.Date(xyz$date)

# concatenate this onto the larger dataframe
temp2 <- bind_rows(temp1, xyz)

# save the object
write_csv(temp2, "puzzeld_pint_raw.csv")
```

Method #2

Big thanks to Nate on SO for pointing out an alternative solution that solves the task of scraping the tables, but not the task of exception handling in the above sense.

```r
# table-yielding function; just for readability in the loop
create_table <- function(remDr) {
  # parse html with the XML package
  doc <- XML::htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)

  # Extract data from the HTML table in the HTML document
  table_tmp <- XML::readHTMLTable(doc)
  Sys.sleep(3)

  # put this into a format you can use
  table <- table_tmp[[1]][-2, -1]

  # add a check for a column mismatch
  table_fields <- as.character(t(table[1, ]))
  if (!any(grepl("size", tolower(table_fields)))) {
    table <- table[-1, ]
    # rename the columns
    colnames(table) <- c("team_name", "start_time", "end_time",
                         "total_time", "puzzels_solved")
    table$team_size <- NA_integer_
    table <- table[, c("team_name", "team_size", "start_time", "end_time",
                       "total_time", "puzzels_solved")]
  } else {
    table <- table[-1, ]
    # rename the columns
    colnames(table) <- c("team_name", "team_size", "start_time", "end_time",
                         "total_time", "puzzels_solved")
  }

  # add the city name to a column
  table$city <- rep("Montreal", nrow(table))

  # add the Month/Year this table was extracted from
  today <- Sys.Date()
  lubridate::month(today) <- lubridate::month(today) + 1
  table$date <- today

  # return the table
  table
}

# II. Scraping the content
# 1. use selenium to generate the pages
# 2. use create_table to extract the table
big_df <- NULL
for (k in seq_along(sheet_url)) {
  # 1. navigate to the page
  remDr$navigate(sheet_url[k])
  # remDr$screenshot(display = TRUE)  # in case one wants to see progress

  # 2. extract and concatenate each table together
  table  <- create_table(remDr)
  big_df <- dplyr::bind_rows(big_df, table)

  # inform progress
  cat(paste0('Gathered table for: ', sheet_months[k], '\n'))
}

# close the session
remDr$close()
```

To perform the task, the first step was to generate the links to all spreadsheets in the document.

To do this:

1. Navigate once to the document
2. Extract the source code
3. Extract the sheet months and URLs (via the `gid` digit) using regex

Once this is done, loop through the URLs, then gather and bind the tables. There is a small function called `create_table` which returns the final table in the proper format, with a safety measure for the number of columns (some of the spreadsheets do not have the `team_size` field; in those cases it is set to `NA_integer_`).
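Step 3 can be sketched with base-R regex on a toy source string. Here `page_source` is a stand-in (the real one would come from `remDr$getPageSource()[[1]]`), and the base URL and document key are placeholders:

```r
# toy stand-in for the page source of the published spreadsheet
page_source <- '<a href="#gid=1111">Mar 2015</a> <a href="#gid=2222">Feb 2015</a>'

# pull out every gid=... occurrence
gids <- regmatches(page_source, gregexpr("gid=[0-9]+", page_source))[[1]]

# pull out the month labels next to them
sheet_months <- regmatches(page_source,
                           gregexpr("[A-Z][a-z]{2} [0-9]{4}", page_source))[[1]]

# rebuild one URL per sheet (placeholder document key)
base_url  <- "https://docs.google.com/spreadsheets/d/KEY/pubhtml"
sheet_url <- paste0(base_url, "#", gids)
```
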

If you find this article useful, feel free to share it with others or recommend it! As always, if you have any questions or comments, feel free to leave your feedback below, or you can always reach me on LinkedIn.

Till then, see you in the next post.
