Or about how the host would respond to their late replies?

Topic #2 suggests that the host is good and kind, and the place was comfortable and clean.
Topic #3 suggests that the host is a great help and the place again was great and clean.
Topic #4 suggests that the guests enjoyed their journey.
maybe including the weed?

Topic #5 is again fairly vague.
Good and bad

The good: topics are discovered automatically from the data itself, without any labelled data required.

The bad: there is no right way to decide the number of topics upfront unless you have prior knowledge.
It takes a lot of trial and error.
At its best, LDA can only provide a rough idea of topics that exist within the data.
Mapping a set of words to an abstract topic is a subjective guessing game.
TF-IDF

TF-IDF is short for Term Frequency-Inverse Document Frequency.
This scoring mechanism is commonly used in information retrieval and text mining for reflecting the relevance of words in a document.
There are 2 parts to this score:

- Term Frequency — the number of times a word is found within a document
- Inverse Document Frequency — the inverse of the number of documents in the collection that contain the word
The term inverse is important to note, as we are not interested in words that appear frequently across all documents.
A word which appears frequently within a document but is rarely found in the rest of the collection will have a high TF-IDF score, as this word is relevant to that document.
For example, we can expect sleep in an article that discusses “Benefits of 8 hours of sleep” to have a high TF-IDF score as the word would be frequently mentioned in the article but might not be used as frequently in other articles.
In contrast, words like the, good, how are common words which can be used in various articles.
These words would have a low TF-IDF score.
It is also worth mentioning that the same word in different documents will have a different TF-IDF score.
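As a sketch of how the score comes together (using the plain log form of idf here; real libraries differ slightly in smoothing and normalisation), a toy, dependency-free computation looks like this:

```python
import math
from collections import Counter

def tf_idf(word, doc, docs):
    """TF-IDF of `word` in `doc` relative to the collection `docs`.

    tf  = count of `word` in `doc` / total words in `doc`
    idf = log(total docs / docs containing `word`)
    """
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# Toy collection: "sleep" is frequent in the first document but absent
# elsewhere, while "the" appears in every document and so scores zero.
docs = [
    "the benefits of sleep sleep sleep".split(),
    "the stock market report".split(),
    "the weather today".split(),
]
print(tf_idf("sleep", docs[0], docs))  # high score
print(tf_idf("the", docs[0], docs))    # 0.0: appears in every document
```

Note how the idf term is what pushes ubiquitous words like the to zero: with this plain log form, a word found in every document gets log(1) = 0 regardless of how often it appears.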
Implementation

Data preparation
- Symbols and stop words were removed
- Tokens were stemmed using the Snowball algorithm (an improvement on Porter)
- TF-IDF vectors were created for each review using bigrams

Example output

['great host', 'perfect host', 'public transport', 'high recommend', 'place clean', 'get around', 'make sure', 'stay amsterdam', 'recommend stay', 'host place']

The top 10 relevant keywords from all guest reviews indicate that the host is great, the place is clean and they highly recommend this place.
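A rough, dependency-free sketch of the keyword-extraction step above (a crude suffix stripper stands in for the Snowball stemmer, the smoothed idf form used by common vectorisers avoids zeroing terms found in every review, and the tiny corpus is made up for illustration):

```python
import math
import re
from collections import Counter

def stem(token):
    # Crude suffix stripper standing in for the Snowball stemmer
    return re.sub(r"(ing|ly|ed|s)$", "", token)

def bigrams(text):
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def top_bigrams(reviews, k=3):
    """Rank bigrams by their TF-IDF scores summed across all reviews."""
    docs = [bigrams(r) for r in reviews]
    n = len(docs)
    df = Counter(b for doc in docs for b in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for b, count in tf.items():
            # Smoothed idf, so frequent-but-relevant bigrams keep a score
            idf = math.log((1 + n) / (1 + df[b])) + 1
            scores[b] += (count / len(doc)) * idf
    return [b for b, _ in scores.most_common(k)]

reviews = [
    "great host and the place was clean",
    "great host, highly recommend this place",
    "the host was great and very helpful",
]
print(top_bigrams(reviews))  # "great host" ranks first
```

Summing per-document TF-IDF scores across the collection is one simple way to turn a per-document relevance measure into a corpus-level keyword ranking like the top-10 list above.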
There were also frequent mentions of public transportation.
Compared to LDA, the keywords extracted from TF-IDF are less ambiguous.
But there are still keywords like make sure and get around which are a little too vague to interpret.
Good and bad

The good: relevant keywords were extracted using statistical methods.
Simple to implement.
The bad: semantic meanings of different words are not taken into consideration.
Terms like clean apartment and clean flat share the same meaning semantically, but TF-IDF treats them as two different strings.
Text Summarisation

Text Summarisation is used to find the most informative sentences in a document or collection of documents.
Extractive Summarisation, the most popular approach, involves selecting the sentences that best represent the information in the document or collection of documents.
A commonly used technique for Extractive Summarisation is TextRank, a graph-based algorithm built on PageRank (think Google!).
Sentences are ranked by their importance based on the similarity of one sentence to another.
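The ranking step can be sketched with a plain power-iteration PageRank over a sentence-similarity matrix; the similarity values below are made up for illustration:

```python
def textrank(sim, d=0.85, iters=50):
    """Score items by power-iteration PageRank over a similarity matrix.

    `sim` is a square matrix where sim[i][j] is the similarity between
    sentences i and j; `d` is the usual PageRank damping factor.
    """
    n = len(sim)
    # Normalise each row so outgoing weights sum to 1
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([w / total if total else 0.0 for w in row])
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - d) / n + d * sum(rows[j][i] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores

# Toy 3-sentence similarity matrix: sentence 0 is similar to both others,
# so it should come out as the most "informative" sentence.
sim = [
    [0.0, 0.8, 0.7],
    [0.8, 0.0, 0.1],
    [0.7, 0.1, 0.0],
]
scores = textrank(sim)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # sentence 0 ranks highest
```

The intuition is the same as for web pages: a sentence that is similar to many other highly-ranked sentences accumulates score, so the top-ranked sentences are the ones most representative of the whole review.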
Implementation

Data preparation
- Symbols and stop words were removed
- GloVe embeddings pre-trained on Wikipedia + Gigaword 5 (100 dimensions) were downloaded and extracted
- A similarity matrix was built by applying GloVe embeddings to each sentence in the reviews and calculating the similarity between each pair of sentences using cosine distance
- The TextRank algorithm was applied to get sentence rankings

Example output

['HOST: <HOST> was very accomodating, has prepared everything you will need for your stay in the city, you get to have great and fun conversations with him, you will be for sure well taken care of!','Not only was the room comfortable, colourful, light, quiet, and equipped with everything we could possibly need – and <HOST>'s flat spotless and beautifully furnished and in a great location – but <HOST> himself is the perfect host, spending the first hour of our arrival talking to us about Amsterdam, answering our many questions, showing us how to get around.
','He was friendly, extremely helpful & went the extra mile to make sure my friend and I were at home at his place.
','His attention to details and kindness make his place an excellent alternative for those considering a bed and breakfast in Amsterdam!.I strongly advise to consider his place: Great location, an affordable price, a clean and organized room and a great host.
','I traveled first time to Amsterdam with a friend and we stayed at <HOST>´s.
He was an excelent host with helping to find out routes and gave lots of tips how to handle things in Amsterdam.
The place was very clean and quiet.
We recomment <HOST>´s room.
']

Immediately I see some resemblance between the top 5 most informative sentences and the top 10 relevant keywords from TF-IDF:

1. Great host / perfect host
- <HOST> was very accomodating, has prepared everything you will need for your stay in the city, you get to have great and fun conversations with him, you will be for sure well taken care of
- He was an excelent host with helping to find out routes and gave lots of tips how to handle things in Amsterdam
- His attention to details and kindness make his place an excellent alternative for those considering a bed and breakfast in Amsterdam
- He was friendly, extremely helpful & went the extra mile to make sure my friend and I were at home at his place.

2. Place clean
- <HOST>'s flat spotless and beautifully furnished and in a great location
- a clean and organized room
- The place was very clean and quiet

3. High recommend
- We recomment <HOST>´s room.
- I strongly advise to consider his place

Good and bad

The good: the approach being unsupervised means no labelled training data is required.
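For reference, the similarity-matrix step described above can be sketched as follows; the toy 4-dimensional vectors stand in for the 100-dimensional GloVe-based sentence vectors used in the article:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities; the diagonal is left at 0 so a
    sentence does not vote for itself when TextRank is applied."""
    n = len(vectors)
    return [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Toy sentence vectors (in the article these come from GloVe embeddings)
embeddings = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.0, 0.9, 0.8, 0.0],
]
sim = similarity_matrix(embeddings)
```

The resulting matrix is symmetric, with values near 1 for sentences whose embeddings point in similar directions; it is exactly this matrix that TextRank ranks over.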
Extraction In Action (How have I used it for my holiday?)

Having tried all 3 approaches above, I found the Text Summarisation approach to be the most insightful, human-readable and interpretable, with the least amount of ambiguity.
The section below showcases how I applied it while planning my recent trip to beautiful Lisbon.
After short-listing 5 properties that I wanted to stay in, I copied the URLs into my Jupyter Notebook for extraction.
The workflow involved:
- extracting reviews submitted in the last 12 months for each listing
- performing the same text cleaning process as discussed above
- applying Text Summarisation using the TextRank algorithm as discussed above
- visualising the top 5 most informative sentences from each listing

Top 5 most informative review sentences from Airbnb listings

14 reviews written in the last 12 months for Listing#888141, with summarised text highlighted

Voila! Without summarising the reviews, I would have had to read through 64 reviews for these 5 listings.
Hits: All 5 summaries covered the main points of concern: the host, the location, and the cleanliness and comfort of the place.
The summary for Listing#21042405, in particular, was insightful as it pointed out that keys had to be collected from a different location.
Misses: A guest from Listing#888141 complained that the place had no A/C and was really hot during their visit.
This comment was not picked up in the summarisation, most likely because that guest was the only one to make such a complaint, so it carried less weight compared to the other comments.
The End

Thanks for reading. I've enjoyed this wee project that combines 2 of my favourite things: travel and Data Science.
Hopefully, you have enjoyed this read and found this application interesting and practical as well.
Jupyter notebooks used can be found here on Github.