Because the elements we search for share a regular style, they are likely to have the same CSS classes.
On this first web page, I want to find a wrapper that contains all the relevant information about a single item.
When I navigate to the web page, right-click, and select Inspect, I can see that the wrapper I am interested in is an li tag with a series of classes:

<li class="Grid-col u-size-1-4-l u-size-3-12-m u-size-6-12 u-size-1-5-xl">

To help find all of these list items while I am in the browser, I can switch to the Console tab in the developer tools.
When I am here I can write a test query.
To do so, I write two dollar signs, $$, and pass my query as a quoted string within parentheses, like this:

$$('li.u-size-1-5-xl')

This brings up a list with a length of 40, as shown.
I quickly validate that each list item corresponds to the appropriate beach ball on the web page by scrolling through the list items and visually inspecting the corresponding highlights in the browser.
A working practice I encourage is to check the last item in the list and confirm that it is the last item you are interested in on the page.
In our case it is! If this were not the case, you could try different CSS queries in the console.
This is a good practice to adopt, as we can perform the initial validation in the browser rather than within our Python scripts.
It is also very easy to see what is going on, because we can look at the CSS query and the browser page at the same time! We can now pass this query directly into the find_elements_by_css_selector method.
I create a variable which holds all the beach_balls found.
This variable beach_balls points to a list which we can iterate over.
print(type(beach_balls))
<class 'list'>

Using this same approach, I can write a simple for loop to extract the information I am interested in.
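The element-collection step can be sketched end to end. The selector is the one validated in the console; the FakeDriver stub below is only there so the snippet runs without a browser. With real Selenium (3.x) you would call the same method on a webdriver.Chrome() instance.

```python
# Sketch of the element-collection step. With real Selenium it would be:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   beach_balls = driver.find_elements_by_css_selector("li.u-size-1-5-xl")
# The stub below stands in for the driver so the pattern runs anywhere.

class FakeDriver:
    def find_elements_by_css_selector(self, selector):
        # The console query $$('li.u-size-1-5-xl') matched 40 items
        return ["<li>"] * 40

driver = FakeDriver()
beach_balls = driver.find_elements_by_css_selector("li.u-size-1-5-xl")
print(type(beach_balls))  # <class 'list'>
print(len(beach_balls))   # 40
```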
Here, I use find_element_by_css_selector (note: element, not elements) to find the tags and classes pertaining to the other pieces of information contained within the original wrapper.
When I find the appropriate element for the description and the other fields, I use the .text attribute to extract the text and the lstrip method for a simple clean-up of the string.
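The clean-up step can be isolated into a tiny helper. This is only a sketch: the sample price string is made up, and .lstrip() with no arguments removes just the leading whitespace that .text sometimes carries over.

```python
def clean_field(raw_text):
    """Remove the leading whitespace that .text can carry over from the page."""
    return raw_text.lstrip()

# Example of the kind of string .text might return for a price field
print(clean_field("   $9.97"))  # $9.97
```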
This code is working well, but the really useful aspect of web scraping is the automation it provides.
To demonstrate, I will scrape just 2 pages of beach balls from Walmart.
I have written a while loop that will iterate twice, based on the condition I have provided.
It is important to note that not all the information is available for every item.
For example, the shipping information is missing in a few instances.
Normally, when you encounter this situation, you should write a condition within the for loop so that all the information matches up.
However, this Walmart page is organised as such, that when information is missing, empty padding fills the space.
This means no conditional checks are required within the for loop, but be careful.
Multiple Page Scrape

With each iteration, I add the relevant item to the appropriate list.
At the end of the first iteration of the for loop, I click onto the next page.
I find the tag and classes which correspond to the next page and use the .click() method to navigate to it.
The script should end on the third page if everything has worked as intended.
There should have been two iterations, according to my while condition.
The script has worked as intended.
Below, the third page has loaded, as shown by the green icon.
Finally, I will write the output to a CSV file by zipping my lists together and giving them sensible column names using the pandas DataFrame constructor.
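Zipping the lists into a DataFrame and writing the CSV can look like the sketch below; the column names and the two sample rows are made up for illustration, standing in for the lists the scraping loop actually fills.

```python
import pandas as pd

# Stand-in data; in the real script these lists are filled by the scraping loop
names = ["Intex Beach Ball", "Bestway Beach Ball"]
prices = ["$4.97", "$6.48"]
shipping = ["2-day shipping", ""]

# zip pairs the lists row by row; columns= supplies the sensible names
df = pd.DataFrame(list(zip(names, prices, shipping)),
                  columns=["name", "price", "shipping"])
df.to_csv("beach_balls.csv", index=False)
print(df.shape)  # (2, 3)
```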
To conclude, go and get yourself a beach ball and head down to the beach!