We provided more explanation for how we use the logs from our web application and contrasted our previous experience of writing our own log viewer.
Not everything is a Data Frame
Having forsaken the Google Analytics service, we ended up having no process to perform analytics on our logs.
Our first attempt at solving the issue was to download a log and try to load it into a Data Frame with hilarious outcomes.
‘Don’t just run to Pandas every time!’
We cannot just run to Pandas and therefore we needed to come up with a new approach.
Fasten your seat belt and sit back!
Photo by Goh Rhy Yan on Unsplash
Readers who are not Software Engineers or Data Scientists might find the following passages contain strong programming language with explicit images of program code and unhelpful explanations written in a pseudo-English Language called ‘Python3’.
Our new approach
The exercise in hand is web log analytics: the traffic from our site is available to us in text format, and we can retrieve the files from our S3 bucket.
Previously we ran to Pandas first thing, and that didn’t work out.
We needed a new approach.
The steps
1. The log files are on AWS in the S3 service.
2. We need to retrieve them and store them in a processing context (local).
3. A job to iterate over each file and parse out the details.
4. Store the parsed out details in a Data Frame. Yes, we are running to Pandas again!!
5. Produce some analytics to illustrate the use case with some visuals.
Getting the log files
It turns out that AWS provides a Python library called Boto3, which you can install from the Anaconda Navigator.
S3 functions written to handle the log analytics for www.
org
With Boto3 we defined three small functions:
s3(): Calling this function provides a listing of all S3 buckets owned by the operators of the site.
We need to get a list of all the log files within our bucket.
This function will provide a list of all files and returns a list object.
The s3_download function takes a list object as an argument.
We call s3_download with a list of the S3 objects we wish to download.
s3_reconcile function that saves S3-related AWS charges
A final function, s3_reconcile(), provides an important service.
S3 will receive a few new files every day.
It would be expensive to download all the log files each time we want some analytics about our web traffic.
Our function asks AWS for the current list of objects, gets the list of objects already stored locally, compares the two, and then downloads only the new files from AWS.
We need to save money.
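The reconcile step described above can be sketched roughly as follows. This is a minimal illustration and not the article's actual code: the signature `s3_reconcile(client, bucket, local_dir)`, the helper `list_remote_keys`, and the choice to compare on file basenames are all assumptions. `client` is expected to be a Boto3 S3 client; only its `get_paginator("list_objects_v2")` and `download_file` calls are used.

```python
import os


def list_remote_keys(client, bucket):
    """Return every object key in the bucket, following pagination."""
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def s3_reconcile(client, bucket, local_dir):
    """Download only the objects that are not yet present in local_dir.

    Assumption: a remote key and a local file are treated as the same
    object when their basenames match.
    """
    os.makedirs(local_dir, exist_ok=True)
    already_local = set(os.listdir(local_dir))
    downloaded = []
    for key in list_remote_keys(client, bucket):
        filename = os.path.basename(key)
        # Skip "folder" placeholder keys and files we already hold.
        if not filename or filename in already_local:
            continue
        client.download_file(bucket, key, os.path.join(local_dir, filename))
        downloaded.append(key)
    return downloaded
```

With Boto3 installed and credentials configured, a call might look like `s3_reconcile(boto3.client("s3"), "my-log-bucket", "logs")`, where the bucket and directory names are placeholders. Passing the client in as an argument also makes the comparison logic easy to exercise without touching AWS.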
Parsing the log files
Our job first exploits s3_reconcile() from above to refresh or synchronize the S3 bucket with the local drive.
Next we build a file list and iterate over the list to parse out all the log records.
com/mlexperience/0e2d1c12427e8646e3b6da6ff179a05f
The code for this aspect is long, and therefore we will offer an abridged version here.
Using the OS library we get a list of files in our target directory.
We define a function called parse.
Our function takes a single line of a log record as an argument called ‘text’.
The files are ‘tab’ separated so we only need to use the split() command to chunk the record up.
The function returns a Python Dictionary.
Each line of the web log becomes a Python dictionary.
The field ‘message’ needs further processing and the code handles that later on.
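A parse function of the kind described above might look like the sketch below. The field names are an assumption based on the tab-separated layout that Papertrail documents for its archives; the article's own list lives in the code image and is not reproduced here. The `maxsplit` argument is also an addition, so that any tab inside the final ‘message’ field does not spill into extra columns.

```python
# Assumed field names, following Papertrail's documented archive layout.
FIELD_LIST = [
    "id",
    "generated_at",
    "received_at",
    "source_id",
    "source_name",
    "source_ip",
    "facility_name",
    "severity_name",
    "program",
    "message",
]


def parse(text):
    """Turn one tab-separated log line into a dictionary."""
    # Split at most len(FIELD_LIST) - 1 times so 'message' keeps the remainder.
    values = text.rstrip("\n").split("\t", len(FIELD_LIST) - 1)
    # zip() stops at the shorter sequence, so short lines yield fewer keys.
    return dict(zip(FIELD_LIST, values))
```

The ‘message’ value is left as a raw string at this stage, matching the description that it is processed later.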
Moving further down along the code
Line 94 shows we are using the gzip library in Python to open a gzip file.
Line 104 shows some cleaning and the final parsing of the ‘message’ key.
Line 91 shows how we iterate over the list of files stored on the local drive.
The s3_reconcile() function downloads all the log files from S3 in gz compressed format.
The code continues to iterate over the files in our library until it has opened them, read each one line by line and parsed the contents into a list of dictionary objects in Python.
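The loop described here can be sketched with the standard library alone. The names `read_log_lines` and `load_records` are hypothetical, as is the idea of passing the line parser in as an argument; the ‘filename’ key mirrors the extra field the article says it stores for each record.

```python
import gzip
import os


def read_log_lines(log_dir):
    """Yield (filename, line) for every line of every .gz file in log_dir."""
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".gz"):
            continue
        path = os.path.join(log_dir, name)
        # Mode "rt" decompresses and decodes to text in one step.
        with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield name, line.rstrip("\n")


def load_records(log_dir, parse_line):
    """Parse every non-empty line into a dictionary and collect them in a list."""
    records = []
    for name, line in read_log_lines(log_dir):
        if not line:
            continue
        record = parse_line(line)
        record["filename"] = name  # remember which file the record came from
        records.append(record)
    return records
```

Reading line by line through a generator keeps only one file open at a time, which matters little for a small site but costs nothing.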
The final step is to translate the list of dictionaries into a Data Frame so we can do analytics.
Analytics from Pandas
In our article ‘Not everything is a Data Frame — Don’t just run to Pandas every time!’ we explained why we cannot start with Pandas: the order of the records in the log files has meaning.
Now with our parsing strategy all sorted out we are free to exploit Pandas for Analytics.
The code in this section requires a further explanation.
The code snippets that build the Data Frame
First consider lines 63–73 from the above image.
We define a list object called ‘field_list’.
The ‘field_list’ sets out the known field names from the log file as defined by Papertrail.
Mostly the fields are okay except for ‘message’ which needs additional treatment.
We do not know the field names in the ‘message’ content at the start and therefore we have to build a dynamic list as we go along.
Line 137, above left, makes an object called ‘extra’ a unique list.
Lines 138–139 add the extra field names to the ‘field_list’.
We also add, in lines 140 and 141, two extra fields: one to store the filename and one called ‘additional’, which stores the browser information from the log record.
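The dynamic field list can be illustrated with a short sketch. It assumes the ‘message’ content is a run of `key=value` pairs with optional double quotes around values, which is a guess based on the ‘fwd’ and method fields mentioned later, not something the article states. The names `parse_message` and `collect_extra_fields` are hypothetical.

```python
import re

# Assumed shape of 'message': key=value pairs, values optionally double-quoted.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S*)')


def parse_message(message):
    """Split a message string into a dictionary of its key=value pairs."""
    return {key: value.strip('"') for key, value in PAIR.findall(message)}


def collect_extra_fields(parsed_messages):
    """Return the sorted, de-duplicated field names seen across all messages."""
    extra = []
    for fields in parsed_messages:
        extra.extend(fields.keys())
    return sorted(set(extra))
```

The final column list would then be the known fields, followed by the collected extras, followed by ‘filename’ and ‘additional’.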
Lines 142 through 149 show the construction of a Data Frame from a Python Dictionary, adding column names, cleaning up the date field and doing some plots.
An illustration of the plot produced is below.
The x-axis represents the activity date.
The y-axis represents the number of responses made by the web application.
We can see, day by day, the number of files served up by our application.
On April 30th, we see about 55 downloads.
55 downloads do not represent 55 page views since our landing page has at least 20 different files that load client side.
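Daily counts like those in the plot can be derived from the list of dictionaries in a few lines of Pandas. This is a sketch under assumptions: the function name `daily_response_counts` is hypothetical, and ‘generated_at’ is assumed to be the date field holding a timestamp that `pd.to_datetime` can parse.

```python
import pandas as pd


def daily_response_counts(records, date_field="generated_at"):
    """Count how many log records fall on each calendar day."""
    df = pd.DataFrame(records)
    # Clean up the date field: convert timestamp strings to datetimes.
    df[date_field] = pd.to_datetime(df[date_field])
    # Group on the calendar date and count the rows in each group.
    return df.groupby(df[date_field].dt.date).size()
```

Calling `.plot(kind="bar")` on the returned Series, with Matplotlib installed, gives a chart with the activity date on the x-axis and the number of responses on the y-axis.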
An output to the console after the job has completed.
Traffic has been light to our site and thankfully we do not depend on traffic to generate a revenue stream from the site.
Another example from the job output using the groupby() function of Pandas
We see an illustration of some requests for /robots.txt and the count by method and by ‘fwd’ IP address.
We can ask Pandas how often /robots.txt got requested in the sampled logs.
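Questions of this kind map onto short Pandas expressions. The sketch below assumes the Data Frame has columns named ‘path’, ‘method’, ‘fwd’ and ‘filename’; the column names and the function name `path_summary` are illustrative, not taken from the article's code.

```python
import pandas as pd


def path_summary(df, path):
    """Summarise the requests for one path.

    Returns the total number of hits, the hit count per (method, fwd)
    pair, and the sorted list of log files containing the path.
    """
    hits = df[df["path"] == path]
    by_client = hits.groupby(["method", "fwd"]).size()
    files = sorted(hits["filename"].unique())
    return len(hits), by_client, files
```

A call such as `path_summary(df, "/robots.txt")` answers how often the path was requested, who asked for it, and which log files contain the request, all from one filtered frame.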
The answer is 27 times.
Ask another question.
What log files contain the request for ‘/robots.txt’?
With Pandas providing such capabilities it is no wonder we want to ‘just run to Pandas every time!’
Making sense of Web Logs
Closing and retrospective
Photo by Mantas Hesthaven on Unsplash
Our view was that much material about Web Log analysis with Python is already available.
We stated the importance of the logs as telling the service operators what is happening with their infrastructure.
This article introduced using Python with Boto3 to automate log file retrieval.
We created some functions to show how AWS Boto3 works with S3.
We presented the log file parsing strategy, and we showed how to use Python Pandas to produce some basic visuals and insights.
We have completed our job now; all that remains is to wash our hands and hit the road.
Did you enjoy this article? Do you have ideas for the next steps? Post your thoughts in the responses section.
It was fun getting our hands dirty and writing about an end to end analytics exercise.
Thanks for reading.
Further work ideas:
- Add the Python script to the Cron scheduler on an Ubuntu VM over in AWS land.
- Add a MongoDB instance and write the parsed out values to a Database.
- Create an interactive Dashboard using Plotly or Vega and publish the Dashboard as part of the core application.
- Your ideas?
Photo by Stephane YAICH on Unsplash
No Pandas were harmed during this exercise.