Do you think it is immune to the tone of the question?
If you are reading this, you probably know that asking Google “Why chicken is bad for you?” and “Why chicken is good for you?” will give you two very different sets of results with little overlap, at least on the first page.
These are questions asking for a specific facet of the subject.
You ask what is bad, and Google tries to answer exactly that.
If you put the same questions to experts from different camps, say Dr. Peter Attia and Dr. Michael Greger, they are still bound to answer the specific facet that was asked about, irrespective of whether they think chicken is good or bad on the whole.
What about questions that have different tones but stem from the same underlying question?
For example, the questions “Is chicken healthy?” and “Is chicken unhealthy?” stem from the same underlying uncertainty: you are not sure whether it is healthy or not.
You are asking if it is healthy or unhealthy on the whole.
Given the keyword difference, it is a no-brainer that the search results won’t be identical.
But to what extent do they differ?
We are about to find out.
TL;DR: We will Google the above-mentioned questions, use Beautiful Soup to scrape the content of the top results that most of us are likely to click, build a summarizer using Python's NLTK to create a four-line summary of each of these top links, and look at what we end up with.
Long version: I used SEOquake to export the URLs in the search results for each question to a CSV file and then read them into a pandas DataFrame.
Google search results vary based on your location, past search history, etc.
So, you will most likely get different results than I did.
We now have a DataFrame with the Google queries as column headers and the corresponding top URLs as column values.
Scraping: I looped through these URLs, used Beautiful Soup to parse the HTML and extract the content of each page that was enclosed in paragraph tags (<p>), and then used a handful of regular expressions to weed out unwanted text.
I kept this scraper general enough to work on all of the websites and extract content that is just good enough to feed into our summarizer.
That way we can scrape all websites using one for loop.
Extracting every bit of the exact content would require us to dig into the class names and ids used on each of these HTML pages.
For our purposes, this isn’t required, and our generalized scraper is sufficient to do a good job.
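
As a rough illustration of the generalized approach described above, here is a minimal sketch of the paragraph-extraction step. The function name `extract_paragraph_text`, the two regular expressions, and the sample HTML are all assumptions made for this example; the actual cleanup patterns used for this post are not listed here. It works on an HTML string so it can be run without a network connection.

```python
import re

from bs4 import BeautifulSoup


def extract_paragraph_text(html):
    """Return the text inside all <p> tags of an HTML string, lightly cleaned.

    Hypothetical helper: the cleanup patterns below are examples only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Keep only what sits inside paragraph tags; headings, navigation, etc. are skipped.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    text = " ".join(p for p in paragraphs if p)
    text = re.sub(r"\[\d+\]", " ", text)  # drop citation markers such as [12]
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


# Made-up page used only to show the behaviour.
sample_html = """
<html><body>
  <h1>Title</h1>
  <p>Chicken is a lean source of protein.[1]</p>
  <div>Navigation junk</div>
  <p>It also contains   B vitamins.</p>
</body></html>
"""

print(extract_paragraph_text(sample_html))
```

In a real run, the HTML string would come from fetching each URL in the DataFrame inside the single for loop, and the same function would be applied to every page.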
Summarizer:
1. Tokenize the text of each website into a list of individual sentences using a sentence tokenizer (nltk.sent_tokenize), and into a list of individual words using NLTK's word tokenizer.
2. Count the number of occurrences of each word in the text, ignoring stopwords, using a simple for loop with an if check.
3. Calculate the frequency of each word (its number of occurrences divided by the number of occurrences of the most frequent word).
4. Calculate the score of each sentence by adding up the frequencies of the words in it, and store the scores in a dictionary.
5. Using a heap, pick out the 4 sentences with the largest scores. This is our four-line summary.
6. Store these summaries in a DataFrame and export them as CSV.
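
The scoring steps above can be sketched roughly as follows. To keep the example self-contained and runnable without downloading NLTK data, a simple regex splitter and a tiny hand-written stopword set stand in for NLTK's sentence tokenizer, word tokenizer, and stopword list; those stand-ins, the function name `summarize`, and the sample text are assumptions for illustration, not the code used for this post.

```python
import heapq
import re

# Tiny illustrative stopword set; a stand-in for NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "the", "to"}


def summarize(text, n_sentences=4):
    """Return the n highest-scoring sentences of text (hypothetical helper)."""
    # Stand-in for a sentence tokenizer: split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Count occurrences of each non-stopword with a plain for loop and if check.
    counts = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    if not counts:
        return sentences[:n_sentences]

    # Frequency = occurrences divided by occurrences of the most frequent word.
    max_count = max(counts.values())
    freq = {word: count / max_count for word, count in counts.items()}

    # Sentence score = sum of the frequencies of its words, stored in a dictionary.
    scores = {}
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        scores[sentence] = sum(freq.get(word, 0.0) for word in words)

    # Use a heap to pick the sentences with the largest scores.
    return heapq.nlargest(n_sentences, scores, key=scores.get)


# Made-up text used only to exercise the function.
sample_text = (
    "Chicken is rich in protein. Protein helps muscles. "
    "The weather is nice. Chicken protein is lean."
)
for line in summarize(sample_text, n_sentences=2):
    print(line)
```

Because the score is a plain sum, longer sentences tend to score higher; that is a property of this simple scheme rather than something the sketch corrects for.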
So what do we have?
When you ask “Is chicken healthy?”, 4 out of the top 5 links tell you it is healthy, whereas if you ask “Is chicken unhealthy?”, all of the top 5 links tell you it is unhealthy.
So be a little more mindful of your questions, even to Google.
Instead of expecting Google to give you the full picture, ask specific questions and build the full picture yourself, or search specific websites that you trust to give you good answers.
Thanks for reading!
Reference: Usman Malik, Text Summarization with NLTK in Python.