Photo Credit: Pixabay

Building a Content Based Recommender System for Hotels in Seattle

How to use the description of a hotel to recommend similar hotels.
Susan Li
Mar 31

The cold start problem is a well-known and well-researched problem for recommender systems, in which the system is unable to recommend items to users. It arises in three different situations, i.e., for new users, for new products, and for new websites.
Content-based filtering is a method that solves this problem.
Our system first uses the metadata of new products when creating recommendations, while visitor actions are secondary for a certain period of time.
It recommends a product to a user based on the category and description of the product.
Content-based recommendation systems may be used in a variety of domains, ranging from web pages and news articles to restaurants, television programs, and hotels.
The advantage of content-based filtering is that it doesn’t have a cold-start problem.
Whether you are just starting out with a new website or adding new products, items can be recommended right away.
Let’s assume we are starting a new online travel agency (OTA). We have signed up thousands of hotels that are willing to sell on our platform, and we are starting to see traffic from our website users, but we don’t have any user history. Therefore, we are going to build a content-based recommendation system that analyzes hotel descriptions to identify hotels that are of particular interest to the user.
Using cosine similarity, we would like to recommend hotels based on the ones a user has already booked or viewed.
We would recommend the hotels with the largest similarity to those the user has previously booked, viewed, or shown interest in.
Our recommender system is highly dependent on defining an appropriate similarity measure.
Eventually, we either select a subset of hotels to display to the user or determine the order in which to display them.
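To make the similarity-based ranking concrete, here is a minimal sketch with made-up numbers. The hotel names and the three-dimensional feature vectors are purely illustrative assumptions, not values from the data set; in the article the vectors come from the hotel descriptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors, e.g. weights for "airport", "downtown", "breakfast".
hotels = {
    "Airport Inn": [0.9, 0.1, 0.2],
    "Downtown Suites": [0.1, 0.9, 0.1],
    "Runway Lodge": [0.8, 0.0, 0.3],
}

def rank_similar(viewed, hotels):
    """Return the other hotels sorted from most to least similar to `viewed`."""
    target = hotels[viewed]
    scores = [(name, cosine_similarity(target, vec))
              for name, vec in hotels.items() if name != viewed]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_similar("Airport Inn", hotels)
print(ranking)  # "Runway Lodge" comes first, "Downtown Suites" second
```

Displaying a subset then amounts to slicing the top of this ranking, while displaying everything in order uses the full sorted list.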
The Data

It’s very hard to find publicly available hotel description data, so I collected the descriptions myself from each hotel’s homepage for over 150 hotels in the Seattle area, including downtown business hotels, boutique hotels, bed and breakfasts, airport business hotels, inns near the universities, motels in the middle of nowhere, and so on.
The data can be found here.
    import pandas as pd
    import numpy as np
    from nltk.corpus import stopwords
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    import re
    import random
    import plotly.graph_objs as go
    import plotly.plotly as py  # plotly < 4; later versions moved this module to chart_studio.plotly
    import cufflinks
    pd.options.display.max_columns = 30
    from IPython.core.interactiveshell import InteractiveShell
    import plotly.figure_factory as ff
    InteractiveShell.ast_node_interactivity = 'all'
    from plotly.offline import iplot
    cufflinks.set_config_file(world_readable=True, theme='solar')

    df = pd.read_csv(...)  # the loading call and file path are truncated in the source; read_csv is an assumption
    df.head()
    print('We have ', len(df), 'hotels in the data')

Table 1

Have a look at a few hotel name and description pairs.
    print_description(10)

Figure 1

    print_description(100)

Figure 2

EDA

Token (vocabulary) Frequency Distribution Before Removing Stop Words
unigram_distribution.py
Figure 3

Token (vocabulary) Frequency Distribution After Removing Stop Words
unigram_distribution_stopwords_removed.py
Figure 4

Bigrams Frequency Distribution Before Removing Stop Words
bigrams_distribution.py
Figure 5

Bigrams Frequency Distribution After Removing Stop Words
bigrams_distribution_stopwords_removed.py
Figure 6

Trigrams Frequency Distribution Before Removing Stop Words
trigrams_distribution.py
Figure 7

Trigrams Frequency Distribution After Removing Stop Words
trigrams_distribution_stopwords_removed.py
Figure 8

Everyone knows Seattle’s Pike Place Market; it is way more than a public farmers market.
It is a historic, vibrant tourist attraction made up of hundreds of farmers, craftspeople, and small businesses.
The hotel industry thrives on location; tourists look for a hotel that is as close as possible to downtown and/or the must-visit attractions of the city.
Therefore, every hotel brags about the market if it is not too far away.
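The n-gram frequency counts behind the distributions above can be sketched roughly as follows. This is not the code from the article's embedded gists, which are not shown here; the helper name `get_top_ngrams` and the three sample descriptions are assumptions for illustration. It relies on scikit-learn's `CountVectorizer`, which the article imports.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngrams(corpus, n=20, ngram_range=(1, 1), stop_words=None):
    """Return the n most frequent n-grams in `corpus` as (ngram, count) pairs."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    bag_of_words = vec.fit_transform(corpus)  # documents x vocabulary counts
    totals = np.asarray(bag_of_words.sum(axis=0)).ravel()  # total count per term
    freqs = [(term, int(totals[idx])) for term, idx in vec.vocabulary_.items()]
    # Sort by descending count, breaking ties alphabetically for stable output.
    return sorted(freqs, key=lambda pair: (-pair[1], pair[0]))[:n]

# Made-up descriptions standing in for df['desc'].
docs = [
    "Located near Pike Place Market in downtown Seattle",
    "A short walk to Pike Place Market and the waterfront",
    "Free shuttle to the airport and free breakfast",
]

# Unigrams before removing stop words, then bigrams after removing them.
print(get_top_ngrams(docs, n=5))
print(get_top_ngrams(docs, n=2, ngram_range=(2, 2), stop_words="english"))
```

Setting `ngram_range=(3, 3)` gives the trigram counts in the same way, and passing the resulting pairs to any bar-chart function reproduces the style of plot shown in the figures.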
Hotel Description Word Count Distribution

    df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
    desc_lengths = list(df['word_count'])
    print("Number of descriptions:", len(desc_lengths),
          "\nAverage word count", np.average(desc_lengths),
          "\nMinimum word count", min(desc_lengths),
          "\nMaximum word count", max(desc_lengths))

word_count_distribution.py
Figure 9

Many hotels use their descriptions to their full potential and know how to write captivating descriptions that appeal to travelers’ emotions and drive direct bookings.
Their descriptions may be longer than others.
Text Preprocessing

The text is pretty clean, so we don’t have a lot to do, but we clean it just in case.
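The article's preprocessing code is not shown here, so the following is only a sketch of what such a cleaning step might look like: lowercase the text, replace punctuation with spaces, drop other symbols, and remove stop words. The function name `clean_text`, the regular expressions, and the tiny stop-word set are all assumptions; the article imports NLTK's English stop-word list, which would be the natural replacement for `STOPWORDS`.

```python
import re

# Punctuation that should become a space so neighboring words stay separated.
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;]")
# Anything that is not a digit, lowercase letter, space, '#', '+' or '_' is dropped.
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")
# Tiny illustrative stop-word set (assumption); swap in NLTK's list for real use.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "with"}

def clean_text(text):
    """Lowercase, strip punctuation and symbols, and remove stop words."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(clean_text("Located in the heart of Seattle, steps from Pike Place Market!"))
# located heart seattle steps from pike place market
```

With the DataFrame from earlier, this could be applied as `df['desc'].apply(clean_text)`; the name of any column holding the cleaned text would be your own choice.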
Modeling

Create a TF-IDF matrix of unigrams, bigrams, and trigrams for each hotel.
Compute the similarity between all hotels using sklearn’s linear_kernel (equivalent to cosine similarity in our case, because TfidfVectorizer normalizes each vector to unit length by default).
Define a function that takes a hotel name as input and returns the top 10 recommended hotels.
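The three modeling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's own implementation: the four hotel names and descriptions are invented, and the body of `recommendations` is a guess at a reasonable design. Only `TfidfVectorizer`, `linear_kernel`, and the function name come from the article.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini data set standing in for the hotel names and descriptions.
names = ["Airport Hilton", "Airport Inn", "Capitol Hill B&B", "Downtown Boutique"]
descriptions = [
    "Airport hotel with free shuttle, conference center and meeting rooms",
    "Budget hotel near the airport with free shuttle and parking",
    "Historic bed and breakfast mansion on Capitol Hill with garden",
    "Boutique hotel downtown near Pike Place Market and the waterfront",
]

# Step 1: TF-IDF matrix of unigrams, bigrams, and trigrams.
tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
tfidf_matrix = tf.fit_transform(descriptions)

# Step 2: pairwise similarity. Rows are L2-normalized by default, so the plain
# dot product from linear_kernel equals the cosine similarity.
similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
assert np.allclose(similarities, cosine_similarity(tfidf_matrix))

index_by_name = {name: i for i, name in enumerate(names)}

# Step 3: look up a hotel by name and return the most similar other hotels.
def recommendations(name, top_n=10):
    idx = index_by_name[name]
    order = np.argsort(-similarities[idx])  # most similar first
    return [names[i] for i in order if i != idx][:top_n]

print(recommendations("Airport Hilton", top_n=2))
```

Using `linear_kernel` rather than `cosine_similarity` skips a redundant normalization pass, which is a small saving at 150 hotels but matters more as the catalog grows.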
Recommendations

Let’s make some recommendations!

    recommendations('Hilton Seattle Airport & Conference Center')

The following are recommended by Google for “Hilton Seattle Airport & Conference Center”:

Figure 10

Three out of the four hotels recommended by Google were also recommended by us.
The following are recommended by tripadvisor for “Hilton Seattle Airport & Conference Center”:

Figure 11

Not bad either.
Try a bed & breakfast.

    recommendations("The Bacon Mansion Bed and Breakfast")

The following are recommended by Google for “The Bacon Mansion Bed and Breakfast”:

Figure 12

Cool!

The following are recommended by tripadvisor for “The Bacon Mansion Bed and Breakfast”, which did not impress me.
Figure 13

The Jupyter notebook can be found on Github. If you prefer, this is an nbviewer version.
Have a productive week!