Web Scraping using R

So I went to my URL, fired up Firebug in the browser, and soon figured out that the names of the hikes are wrapped in the ".trailname" CSS class. Using this CSS class, I can extract all the trail names on the webpage.

There are 2 functions that we will use here:

html_nodes: use this function to extract the nodes that we want (in this case, nodes with ".trailname" as the CSS class)
html_text: use this function to extract the text between the HTML nodes (in this case, our trail names)

    # Scraping trail names using the CSS class 'trailname'
    trail_names_html <- html_nodes(trails_webpage, '.trailname')
    trail_names <- html_text(trail_names_html)
    head(trail_names)

Output:

    [1] "Abby Grind"               "Admiralty Point"
    [3] "Al’s Habrich Ridge Trail" "Aldergrove Regional Park"
    [5] "Alice Lake"               "Ancient Cedars Trail"

Similarly, I will now do this for all the other attributes of each trail: Region, Difficulty, Time, Distance, Season. Each of these attributes has its own CSS class: i-name, i-time, i-difficulty, i-distance, i-schedule.

    # Trail region
    trail_region_html <- html_nodes(trails_webpage, '.i-name')
    trail_region <- html_text(trail_region_html)
    head(trail_region)

Output:

    [1] "Fraser Valley East" "Tri Cities"         "Howe Sound"
    [4] "Surrey and Langley" "Howe Sound"         "Whistler"

    # Trail difficulty
    trail_diff_html <- html_nodes(trails_webpage, '.i-difficulty')
    trail_diff <- html_text(trail_diff_html)
    head(trail_diff)

Output:

    [1] "Intermediate" "Easy"         "Intermediate" "Easy"         "Easy"
    [6] "Intermediate"

    # Trail season
    trail_season_html <- html_nodes(trails_webpage, '.i-schedule')
    trail_season <- html_text(trail_season_html)
    head(trail_season)

Output:

    [1] "year-round"       "year-round"       "July – October"   "year-round"
    [5] "April – November" "June – October"

One thing to note: when we extract the time, it comes back as a character string, e.g. "1.5 hours", "3 hours". We want it in numeric form. To convert it, I used the stringr library and its `str_extract` function. The logic is that I used a regular expression to match the numeric pattern and extracted the match from the HTML text. For regular expression help, you can refer to the `stringr` cheatsheet.

So this is what I did to convert it to numeric form:

    library(stringr)

    # Extracting trail times
    trail_time_html <- html_nodes(trails_webpage, '.i-time')
    trail_time <- html_text(trail_time_html)
    head(trail_time)
    # "1.5 hours" "1.5 hours" "5 hours" "2 hours"

    # The extracted data is character; pull out the number and convert it to numeric
    trail_time <- as.numeric(str_extract(trail_time, pattern = "-*\\d+\\.*\\d*"))
    head(trail_time, 25)

Output:

     [1]  1.50  1.50  5.00  2.00  2.00  2.00  3.50  5.00
     [9]  5.00  1.50  1.00  5.00 11.00  3.00  1.00  2.00
    [17]  1.50  1.00  0.50  3.50  0.25  5.00  4.00  2.00
    [25]  3.50

I did the same thing for the trail distance, since that information is also in character form, e.g. "4km":

    # Trail distance
    trail_dist_html <- html_nodes(trails_webpage, '.i-distance')
    trail_dist <- html_text(trail_dist_html)
    head(trail_dist)
    trail_dist <- as.numeric(str_extract(trail_dist, pattern = "-*\\d+\\.*\\d*"))
    head(trail_dist, 25)

Output:

     [1]  4.0  5.0  7.0  5.0  6.0  5.0  6.1 12.0 10.0  3.0  2.6 10.0 29.0  8.0  2.5
    [16]  5.0  4.0  4.2  1.0  6.0  0.8  7.5  7.0  8.0 10.0

Step 3: Now that we have all our information, let's collate it into a data frame and export it to a .csv file. To do this, we use the write_csv function from the readr library.

    library(readr)

    # Combining all the extracted features of the trails
    trails_df <- data.frame(
      Name = trail_names,
      Region = trail_region,
      Difficulty = trail_diff,
      Distance = trail_dist,
      HikeTime = trail_time,
      Season = trail_season
    )
    str(trails_df)
    write_csv(trails_df, "vancouver_trails.csv")

Step 4: Analysis of Data

For this part, I used the ggplot2 library to visualise the data. The plots were:

Regions vs Hike Time & Difficulty Level
Regions vs Distance & Difficulty Level
Region vs Seasons with Difficulty & Hike Time
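The plotting code for Step 4 is not shown in the post, so here is a minimal sketch of how the first plot (Regions vs Hike Time & Difficulty Level) could be drawn with ggplot2. The chart type (points, with flipped axes), the axis labels, and the names `trails_sample` and `p` are my assumptions, not necessarily what was used for the original figures. The data frame holds only the first six trails from the output printed earlier, so that the example runs on its own; in practice you would pass the full `trails_df`.

```r
library(ggplot2)

# First six trails, copied from the head() outputs shown earlier.
# (Illustrative subset only; use the full trails_df for the real plot.)
trails_sample <- data.frame(
  Name = c("Abby Grind", "Admiralty Point", "Al's Habrich Ridge Trail",
           "Aldergrove Regional Park", "Alice Lake", "Ancient Cedars Trail"),
  Region = c("Fraser Valley East", "Tri Cities", "Howe Sound",
             "Surrey and Langley", "Howe Sound", "Whistler"),
  Difficulty = c("Intermediate", "Easy", "Intermediate",
                 "Easy", "Easy", "Intermediate"),
  Distance = c(4, 5, 7, 5, 6, 5),
  HikeTime = c(1.5, 1.5, 5, 2, 2, 2)
)

# Assumed chart type: one point per trail, coloured by difficulty.
# coord_flip() puts the region names on the vertical axis so they stay readable.
p <- ggplot(trails_sample, aes(x = Region, y = HikeTime, colour = Difficulty)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(
    title = "Regions vs Hike Time & Difficulty Level",
    x = "Region",
    y = "Hike time (hours)"
  )

print(p)
```

Swapping `HikeTime` for `Distance` in `aes()` would give a rough equivalent of the second plot (Regions vs Distance & Difficulty Level).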
