Pipelines are where the action happens.
Spider & Items

Again, the goal of this project is to check that my webpages contain the proper DTM/Launch scripts and notify me via email if they do not. My spider works off of a .csv file with 3 columns:

- DTM/Launch Property Name
- the URL of the page to check
- the correct production DTM/Launch script

When my scraped data is output to me I want to see the URL or path of the page that was checked, its property name, and whether or not the script provided in the csv was found on the page.
+-----------+----------------------+----------+
| Property  | url                  | script   |
+-----------+----------------------+----------+
| Ally Home | https://www.ally.com | …js      |
+-----------+----------------------+----------+

(Example of my csv, using Ally.com here as an example)

In script_check.py below you can see that I start off by building a new class called ScriptSpider that inherits from the scrapy Spider class.
The name variable here is important as this is how scrapy will refer to my spider and what I will use to run it later.
For example, to run this spider locally I will use scrapy crawl script_check.
The http_user variable is my API key to use a Scrapinghub Splash server instead of a local Docker server. This is required to run spiders that use Splash instances on scrapinghub.com, but not for Docker.
Line 24 is where I load the data from my .csv file. Scrapy can open .csv files the same as any other Python script can. However, in order to run my spider in the cloud, I need to add my data file into the Python package that the scrapinghub deploy tool creates.
You can see this in analytics-crawler > analytics_check > setup.py. Because my csv file is now part of the package, I need to use pkgutil to access it.
Hence the slight inconvenience of all the decoding, splitting, and stripping.
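The decode/split/strip step might look something like this minimal sketch; the package and file names in the commented-out `pkgutil.get_data` call are hypothetical, and here raw csv bytes stand in for the packaged file:

```python
import pkgutil

def parse_csv_bytes(raw: bytes):
    """Decode raw csv bytes, skip the header row, and strip each field."""
    rows = []
    for line in raw.decode("utf-8").splitlines()[1:]:
        if line.strip():
            rows.append(tuple(field.strip() for field in line.split(",")))
    return rows

# Inside the deployed package this would be (names are illustrative):
# raw = pkgutil.get_data("analytics_check", "data.csv")
raw = b"Property,url,script\nAlly Home, https://www.ally.com , //assets.example.com/launch.min.js\n"
print(parse_csv_bytes(raw))
```

The result is the list of tuples the spider iterates over in start_requests.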
Once I have a proper list of URLs, or rather a list of tuples which contain URLs, I can use them in the scrapy start_requests method.
Here I am iterating over my list of tuples.
Scrapy requests each URL and uses Splash to render the full page contents. The Request method takes the following arguments: the URL to scrape, the callback method to call when that page is successfully loaded, and an optional dictionary of metadata to use with the request.
This meta dict is where the arguments to Splash are provided, and any extra data needed later is passed along (my channel and script from the csv for example).
In the splash args I am telling Splash to wait 1 second for the page to load before scraping it and not to bother loading png images.
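That meta dict could be sketched roughly as below, following scrapy-splash conventions for the splash key; the exact key names for the extra data (channel, script) are assumptions:

```python
def build_request_meta(channel: str, script: str) -> dict:
    """Splash rendering args plus extra data passed through to the callback."""
    return {
        "splash": {
            "endpoint": "render.html",
            "args": {
                "wait": 1.0,   # give the page 1 second to load before rendering
                "images": 0,   # skip loading images
            },
        },
        # carried through untouched; read back in parse() via response.meta
        "channel": channel,
        "script": script,
    }

meta = build_request_meta("Ally Home", "//assets.example.com/launch.min.js")
```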
The callback function parse takes the individual responses generated by start_requests and is where specific data is extracted from the response and transformed.
I am checking the HTML body of the response for the presence of a DTM/Launch script starting on line #57.
I search the response text for a script element with a src attribute that matches the corresponding script from my csv.
If there is a match, then the script exists on the page.
Otherwise, it does not.
I set this as a boolean value.
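The core of that check can be sketched as a small function; the regex shown is one plausible way to do the match, not necessarily the exact one in the spider:

```python
import re

def script_found(body: str, expected_src: str) -> bool:
    """True if a <script> tag whose src matches the expected script is in the page."""
    pattern = r'<script[^>]*\bsrc=["\']' + re.escape(expected_src) + r'["\']'
    return re.search(pattern, body) is not None

page = '<head><script src="//assets.example.com/launch.min.js"></script></head>'
```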
parse must return either a Python dict or a scrapy Item. I first organize the data I want to return as a dict, then transform it into a scrapy item using ItemLoader. You can see my predefined item called ScriptCheckItem in items.py. Instances of ScriptCheckItem are sent to my pipeline file for processing.
Script Check Spider

Pipelines

Once my spider has collected, cleaned, and transformed my scraped data into items, it's time to do something with them. This is done in scrapy pipelines.
I am going to do 2 things with my data: print out a table of the results in the terminal console, and if there are any pages where scripts are not found, email that table to myself.
You can see the file pipeline.py below.
Scrapy pipelines must have a process_item method, and optionally have the open_spider, close_spider, and from_crawler methods.
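Since a pipeline is just a plain class implementing those hooks, the interface can be sketched like this (class and attribute names here are illustrative, not the author's):

```python
class ScriptCheckPipeline:
    """Bare-bones sketch of the hooks Scrapy calls on a pipeline."""

    def open_spider(self, spider):
        # Called once when the spider starts: set up state.
        self.false_items = []

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # (or raise DropItem to discard it).
        if not item.get("found", True):
            self.false_items.append(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: report or send results.
        print(f"{len(self.false_items)} pages missing their script")
```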
Terminal Table

The table output is simple thanks to the terminaltables library.
Give it lists, get a table.
I spiced things up by making the True values green and the False values red using the termcolor library.
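Under the hood, termcolor's colored() just wraps a value in ANSI escape codes; a minimal stand-in for the coloring step, with rows shaped the way terminaltables expects, might look like this:

```python
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def colorize(found: bool) -> str:
    """Mimic termcolor.colored(): green for True, red for False."""
    return (GREEN if found else RED) + str(found) + RESET

# Rows like these can be handed to terminaltables, e.g.
# print(AsciiTable(table_data).table)
table_data = [
    ["Property", "url", "script found"],
    ["Ally Home", "https://www.ally.com", colorize(True)],
]
```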
Terminal table checking Ally.com and Ally… Note the timestamps: Scrapy works FAST. My actual spider checks 50+ pages in under a minute. The homepage check here is a test False value.

Email

I also want to be notified via email if any item returns False (a page is found where the DTM/Launch script is either incorrect or not present at all).
To accomplish this, I use Amazon Simple Email Service (SES) and Jinja2 templates to format the email.
Note that my template.html file is another external resource file (like my csv) that must be explicitly included in the Python package in order to be used on scrapinghub. You can see it included in my setup.py.
In my EmailPipeline class, the __init__ and from_crawler methods populate my Amazon SES credentials from my project settings file (which is populated with credential values from my environment).
The open_spider method creates an empty list to be filled with False items should the process_item method come across any.
If there are False items that I need to be notified of, the email is built and sent in the close_spider method.
In the close_spider method, if there are False items, I first load the template.html file and convert it into a Jinja template.
This allows me to programmatically and concisely build HTML for my email.
Jinja templating offers many of the same programming paradigms as Python (for loops, if statements, familiar data types, etc.).
If you know basic Python and HTML you will find it very easy to work with.
You can see in my template.html file that I am iterating over an item called data; I pass the list of all items (using all items here just for testing purposes) as the data kwarg in my template.render call on line #76.
Jinja iterates over this list adding rows (<tr> elements) to a <table> element.
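A condensed sketch of that render step, assuming jinja2 is installed; the inline template string stands in for template.html and the field names are illustrative:

```python
from jinja2 import Template

# Inline stand-in for template.html: loop over `data`, one <tr> per item.
template = Template(
    "<table>"
    "{% for row in data %}"
    "<tr><td>{{ row.property }}</td><td>{{ row.found }}</td></tr>"
    "{% endfor %}"
    "</table>"
)

rendered = template.render(data=[{"property": "Ally Home", "found": False}])
print(rendered)
```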
Note that email CSS must be added inline as not all email clients support external CSS or CSS declared in a style element.
Boto3 is Amazon's AWS SDK for Python and is required for using SES. After importing boto3, I simply followed the steps outlined in the documentation to send an email.
When sending emails in SES, the sender and receiver addresses must be verified via the AWS console before they can be used.
Sorry, no sending spam.
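The SES call from the documentation boils down to something like the sketch below; the addresses, subject, and region are placeholders, and boto3 is only imported inside the send function since it needs AWS credentials to run:

```python
def build_ses_email(sender, recipient, subject, html_body):
    """Keyword arguments for SES send_email; both addresses must be SES-verified."""
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Message": {
            "Subject": {"Data": subject},
            "Body": {"Html": {"Data": html_body}},
        },
    }

def send_alert(payload, region="us-east-1"):
    import boto3  # AWS SDK for Python; picks up credentials from the environment
    ses = boto3.client("ses", region_name=region)
    return ses.send_email(**payload)

payload = build_ses_email(
    "alerts@example.com", "me@example.com",
    "DTM/Launch script check failed", "<table>...</table>",
)
```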
Side note: I read on the Boto3 GitHub discussion board that the SDK is named Boto3 after the boto, the Amazon river dolphin.
Alert Email showing all pages scraped.
Sent via my scrapy pipeline.
Scrapinghub.com

So running this script locally is all well and good, but I need it to run automatically at set intervals and remove myself from the equation. Scrapinghub.com is a website that allows you to deploy scrapy spiders to the cloud. (Again, not free.) I think the scrapinghub.com folks wrote the scrapy library, or vice versa. It is an extremely convenient way to schedule and run spiders.
A lot of the files in my repository are generated automatically during the scrapinghub deploy process (shub deploy).
Scrapinghub is very customizable in terms of settings.
You can pass arguments to your spiders at run time (think specifying pipelines) and set global environment variables (think credentials).
You can go individually inspect requests, scraped items, errors, and run a virtual console.
Definitely check out Scrapinghub if you are at all interested in taking your web scraping projects to the next level.
Scheduling my spider to run every hour on the hour in Scrapinghub.
If it finds pages without DTM/Launch, I will be sent an email notification.
To check more pages, I just add them to my csv data file and redeploy the spider to scrapinghub via shub deploy.
Conclusion

Is this overkill? Maybe. Does it solve the problem of needing a check in place to make sure the correct analytics scripts are present on my pages? Yes. There is unlimited potential in web scrapers, so even if this particular application does not appeal to you, I guarantee that learning to scrape the web effectively can help you in some way.