Scraping Job Posting Data from Indeed using Selenium and BeautifulSoup

Once you have it downloaded, just unzip it and put the executable in your project folder, and that's it!

Overview

Let's look at the major steps needed so that we can divide the work into Python functions:

1. Get all the job posting links.
2. Click each link and parse the text from the job posting page.
3. Store the parsed text data.

In step 1, since our Indeed search results can span many pages, we'll need to visit each of these pages and obtain all the job posting links from it. These links can then be used to open the actual job posting pages. We can write a grab_job_links() function to get all the job posting links for a given search result page, and then use another function, get_urls(), to loop through all the search result pages and collect every job posting link. To simplify the functions further, we can separate out a get_soup() function that obtains the BeautifulSoup soup object for a given URL. Adding in BeautifulSoup to get the soup object, the full get_soup() function looks like this:

```python
from selenium import webdriver
from bs4 import BeautifulSoup

def get_soup(url):
    """Open the page with Selenium and return its HTML as a soup object."""
    driver = webdriver.Firefox()
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    driver.close()
    return soup
```

Grabbing Job Links

Given a search result page, how do we get the links for all the found jobs, excluding the sponsored ones? We can do it in two simple steps. For example, to get the result page count, we need:

```python
soup.find(name='div', attrs={'id': "searchCount"}).get_text()
```

Getting All Job Links

Since there are many search result pages, we need to loop through them to grab all the job links and store them in a list:

```python
for i in range(2, num_pages + 1):
    num = (i - 1) * 10
    base_url = '{}&l={}&start={}'.format(query, location, num)
    try:
        soup = get_soup(base_url)
        urls += grab_job_links(soup)
    except:
        continue
```

The base_url above follows the format of an Indeed search URL: the queried job title goes in the first pair of curly brackets, the queried location in the second, and the starting index of the job postings in the third
(i.e. the first page has job postings 0 to 9, the second page has 10 to 19, and so on).

Extracting Text

With the above work done, we should now have all the job posting links. The syntax is quite similar to before:

```python
def get_posting(url):
    """Return the lowercased title and full text of a job posting page."""
    soup = get_soup(url)
    title = soup.find(name='h3').get_text().lower()
    posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
    return title, posting.lower()
```

Again, we just need to find the containing tag and extract the text using the get_text() method.

Handling Exceptions

Finally, we can loop through all the captured job posting links and extract the text.
