(The identity is masked so that the site creators don’t make the site harder to scrape in the future.)

Data Understanding

The Scope of the Data

I narrowed the search down to data-related job postings, then expanded it to jobs that require the use of data-related software.
Because the website constantly updates its postings, running the same set of requests at different times may fetch different sets of postings.
To work from a fixed snapshot, I downloaded the source code of each posting and worked on the files offline.
Another issue ensued: storage.
Downloading the source code for each posting may occupy substantial disk space.
Fortunately, Python's standard library includes the gzip module, which compresses the data when storing it and decompresses it when loading it.
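A minimal sketch of the compress-and-load step could look like the following. The function names and the `.html.gz` file naming are illustrative assumptions, not the code from the original project.

```python
import gzip
from pathlib import Path


def save_compressed(html: str, path: Path) -> None:
    """Write a page's HTML source to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(html)


def load_compressed(path: Path) -> str:
    """Read the HTML source back from a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()
```

Opening the file in text mode (`"wt"` / `"rt"`) lets `gzip` handle the encoding, so the rest of the pipeline can keep working with ordinary strings.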
Here’s the code to search and to crawl across the different results pages for each keyword, followed by the code to download each job posting’s source code and store it in compressed format:

Search using keywords and crawl across all results pages

Download each job posting’s source code and save the compressed version

1,640 job postings (possibly with duplicates) occupying relatively large disk space even after compression

Data Storage using PostgreSQL

After storing the source code, I retrieved it offline and extracted the raw data that I needed.
To hone my SQL skills and to get a feel for how to create a PostgreSQL database on Amazon Web Services (AWS), I decided to store my raw data on the AWS free-tier Relational Database Service (RDS).
# These are the raw features used in extracting the data
['job_id', 'job_title', 'emp_type', 'emp_seniority', 'job_categories', 'emp_min_exp', 'job_skills_req', 'job_desc', 'job_req', 'search_query', 'salary_freq', 'salary_low', 'salary_high', 'company_name', 'company_info', 'link']

Uncompress and extract the relevant data before populating a new SQL Table.
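As a rough sketch of how the feature list above could drive the table definition, the snippet below builds the `CREATE TABLE` and `INSERT` statements from the column names. The table name `job_postings`, the choice of `TEXT` for every column, and the `%s` placeholders (the parameter style used by psycopg2) are assumptions for illustration. The original post does not state which driver or column types were used.

```python
from typing import Sequence

RAW_FEATURES = [
    'job_id', 'job_title', 'emp_type', 'emp_seniority', 'job_categories',
    'emp_min_exp', 'job_skills_req', 'job_desc', 'job_req', 'search_query',
    'salary_freq', 'salary_low', 'salary_high', 'company_name',
    'company_info', 'link',
]


def build_create_table(table: str, columns: Sequence[str]) -> str:
    """Build a CREATE TABLE statement that stores every raw feature as TEXT."""
    cols = ",\n    ".join(f"{c} TEXT" for c in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n    {cols}\n);"


def build_insert(table: str, columns: Sequence[str]) -> str:
    """Build a parameterised INSERT; values are passed separately to the driver."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({col_list}) VALUES ({placeholders});"


# Hypothetical table name, for illustration only.
create_sql = build_create_table("job_postings", RAW_FEATURES)
insert_sql = build_insert("job_postings", RAW_FEATURES)
```

Keeping the values out of the SQL string and passing them to the driver as parameters avoids quoting problems with free-text fields such as the job description.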
Querying the data using pgAdmin.
Looks like everything is in now!

Can I do better?

Definitely yes!

During the crawling for the links to the job postings, I could have used Scrapy instead of Selenium to extract all the links.
I wasn’t aware of that method at the time.
Instead of converting the list of links to a Pandas DataFrame and calling its built-in .to_csv() method, perhaps I could have used Python's csv library to write that list to a .csv file.
However, I find the former approach much easier.
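For comparison, a minimal sketch of the csv-module alternative is shown here. The function name `write_links` and the single `link` header column are assumptions for illustration.

```python
import csv
from typing import Iterable


def write_links(links: Iterable[str], path: str) -> None:
    """Write one link per row to a CSV file, with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["link"])
        writer.writerows([link] for link in links)
```

This avoids building a DataFrame just to write a single column, at the cost of a few more lines than `.to_csv()`.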
There was a small chance that the HTML source code for a job posting could not be saved because its filename was too long.
The code could trim the filename before creating the output file.
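One way the trimming could be done is sketched below. It assumes a 255-byte limit per filename (common on many file systems, but not universal) and a `.html.gz` suffix. It also appends a short hash of the full name so that two long names sharing the same prefix do not collide. All names here are hypothetical.

```python
import hashlib

MAX_NAME_BYTES = 255  # assumed per-filename limit; check your file system


def safe_filename(name: str, suffix: str = ".html.gz",
                  max_bytes: int = MAX_NAME_BYTES) -> str:
    """Return name + suffix, trimmed to fit within max_bytes when encoded as UTF-8."""
    full = name + suffix
    if len(full.encode("utf-8")) <= max_bytes:
        return full
    # A short digest of the untrimmed name keeps trimmed names distinct.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:10]
    budget = max_bytes - len(suffix.encode("utf-8")) - len(digest) - 1
    # Cut on bytes, then drop any partial multi-byte character at the end.
    trimmed = name.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return f"{trimmed}-{digest}{suffix}"
```

Short names pass through unchanged, so only the rare over-long posting title is affected.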
Elasticsearch could be used to ingest and store the HTML source code files, making it an alternative to the PostgreSQL approach.
If there is any issue with the code, or in any way I can improve my coding skill or article writing skill, please leave your comment below so that I can learn from the community the best practices and rules of thumb.
Thank you for reading! Stay tuned for Part 2, where we access the database, fetch the data and work on analysis!

Project after project, I am fascinated by how things work using the data gathered and the models built upon them, and by how deploying those models can solve real-life problems.
You can reach me via my LinkedIn.