Exploring & Machine Learning for Airbnb Listings in Toronto

Photo credit: Inside Airbnb

Airbnb does not provide open data in the sense of giant databases or dumps that we can work with. However, Inside Airbnb compiles publicly available information about a city's listings from the Airbnb website, and provides filters and key metrics so we can see how Airbnb is being used in major cities around the world. Inside Airbnb is an independent, non-commercial set of tools and data that is not associated with or endorsed by Airbnb or any of Airbnb's competitors.

The summaries on the Inside Airbnb site are not enough for our purposes, though, so we are going to download the underlying data for our own analysis.

I will be working with the Toronto data, because I live here and know some of the neighbourhoods. You are welcome to choose any city you prefer.

We are going to look at Airbnb listings and calendars, and do some exploratory analysis aimed at predicting listing prices, both from the perspective of someone hypothetically working at Airbnb and from that of a consumer. Let's get started!

Calendar

    import pandas as pd

    calendar = pd.read_csv('calendar.csv.gz')
    print('We have', calendar.date.nunique(), 'days and', calendar.listing_id.nunique(), 'unique listings in the calendar data.')

    calendar.date.min(), calendar.date.max()

The calendar covers a one-year time frame, that is, price and availability for every day of the next year. In our case, from 2018-10-16 to 2019-10-15.

Figure 1

Availability on the Calendar

When we look at the calendar data, we may want to ask questions like: how busy will Airbnb hosts in Toronto be over the next year?

    calendar.available.value_counts()

f (false) means not available, t (true) means available.
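As a quick illustration of reading the t/f flag, here is a minimal sketch on a toy frame. The rows are made up and only the column names follow the calendar file; `toy_calendar` and `busy_share` are hypothetical names used for this example.

```python
import pandas as pd

# Toy stand-in for calendar.csv.gz (made-up rows, same column names as the article).
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 2, 2, 3, 3],
    'date': ['2018-10-16', '2018-10-17'] * 3,
    'available': ['t', 'f', 'f', 'f', 't', 'f'],
})

# normalize=True returns fractions of rows instead of raw counts.
shares = toy_calendar['available'].value_counts(normalize=True)

# 'f' means not available, i.e. the listing-day is booked or blocked.
busy_share = shares.get('f', 0.0)
print(f'{busy_share:.0%} of listing-days are unavailable')
```

On the real data, the same call gives the overall share of unavailable listing-days before any per-date breakdown.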
To find the daily average availability over the year, we recode the available column as 0 if the listing is available and 1 if it is not.

    import matplotlib.pyplot as plt

    # .copy() avoids pandas' SettingWithCopyWarning when adding a column
    calendar_new = calendar[['date', 'available']].copy()
    calendar_new['busy'] = calendar_new.available.map(lambda x: 0 if x == 't' else 1)
    calendar_new = calendar_new.groupby('date')['busy'].mean().reset_index()
    calendar_new['date'] = pd.to_datetime(calendar_new['date'])

    plt.figure(figsize=(10, 5))
    plt.plot(calendar_new['date'], calendar_new['busy'])
    plt.title('Airbnb Toronto Calendar')
    plt.ylabel('% busy')
    plt.show()

Figure 2

The busiest month in Toronto was October, which has just passed. The next busy period seems to start after April and extend into the summer. This is all in line with our experience and expectations.

Price on the Calendar

How does price change over the year, month by month?

We remove the "$" symbol and the thousands separators from the price column and convert it to numeric, and convert date to a datetime type.

    calendar['date'] = pd.to_datetime(calendar['date'])
    # regex=False treats '$' as a literal character rather than a regex end-of-string anchor
    calendar['price'] = calendar['price'].str.replace(',', '', regex=False)
    calendar['price'] = calendar['price'].str.replace('$', '', regex=False)
    calendar['price'] = calendar['price'].astype(float)

    mean_of_month = calendar.groupby(calendar['date'].dt.strftime('%B'), sort=False)['price'].mean()
    mean_of_month.plot(kind='barh', figsize=(12, 7))
    plt.xlabel('average monthly price')

Figure 3

Airbnb prices in Toronto increase in July, August and October.
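The price-cleaning and monthly-averaging steps can be checked on a handful of made-up values. This is only a sketch: the strings and dates below are invented, and it strips "$" and "," in a single regex pass rather than two separate calls.

```python
import pandas as pd

# Made-up price strings in the same format as the calendar's price column.
raw = pd.Series(['$1,250.00', '$85.00', None, '$120.00'])

# Inside a character class, '$' is a literal dollar sign; missing values stay NaN.
clean = raw.str.replace(r'[$,]', '', regex=True).astype(float)

# Made-up dates, one per price, to mimic grouping by month name.
dates = pd.to_datetime(pd.Series(['2019-07-01', '2019-07-02', '2019-08-01', '2019-08-02']))

# mean() skips NaN by default, so the missing August price is ignored.
monthly_mean = clean.groupby(dates.dt.strftime('%B'), sort=False).mean()
print(monthly_mean)
```

Keeping missing prices as NaN instead of filling them with 0 matters here, because a zero would drag the monthly average down.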
July, August and October are also the best months to visit Toronto, so this matches expectations.

How does price change by day of the week?

    # Series.dt.weekday_name was removed in pandas 1.0; day_name() is its replacement
    calendar['dayofweek'] = calendar.date.dt.day_name()
    cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    # Average price per weekday, ordered Monday to Sunday
    price_week = calendar[['dayofweek', 'price']].groupby('dayofweek').mean().reindex(cats)
    price_week.plot()
    ticks = list(range(0, 7, 1))  # points on the x axis where you want the label to appear
    labels = "Mon Tues Weds Thurs Fri Sat Sun".split()
    plt.xticks(ticks, labels)

Figure 4

Fridays and Saturdays are over $10 more expensive than the rest of the week.

Listings

Number of listings in each neighbourhood

    listings = pd.read_csv('listings.csv.gz')
    print('We have', listings.id.nunique(), 'listings in the listing data.')
    listings.groupby(by='neighbourhood_cleansed').count()[['id']].sort_values(by='id', ascending=False).head(10)

Figure 5

The neighbourhood with the most listings is Waterfront Communities-The Island, with almost four times as many as the second-ranked neighbourhood (Niagara). The header map above shows this too.

Review score rating

    import seaborn as sns

    plt.figure(figsize=(12, 6))
    ratings = listings.review_scores_rating.dropna()
    # histplot + rugplot replace the deprecated sns.distplot(..., rug=True)
    sns.histplot(ratings, kde=True, stat='density')
    sns.rugplot(ratings)
    sns.despine()
    plt.show()

Figure 6

    listings.review_scores_rating.describe()

Figure 7

As expected, most reviewers leave high scores.

Exploring the price

The price column needs some cleaning: we remove the "$" symbol and the thousands separators, then convert it to numeric.

    listings['price'] = listings['price'].str.replace(',', '', regex=False)
    listings['price'] = listings['price'].str.replace('$', '', regex=False)
    listings['price'] = listings['price'].astype(float)
    listings['price'].describe()

Figure 8

The most expensive Airbnb listing in Toronto is $12,933/night. From the listing URL, it seems legitimate as far as I can tell: an art collector's penthouse in Toronto's most stylish neighbourhood.
Nice!

Source: Airbnb

To avoid being affected by extreme cases, I decided to remove listings that exceed $600/night, as well as 7 listings priced at $0, for the following exploratory analysis.

Listings price distribution after removing outliers

    listings.loc[(listings.price <= 600) & (listings.price > 0)].price.hist(bins=200)
    plt.ylabel('Count')
    plt.xlabel('Listing price in $')
    plt.title('Histogram of listing prices')

Figure 9

Neighbourhood vs. price

[Code: neighbourhood vs. price]

Figure 10

Not only does Waterfront Communities-The Island have the highest number of listings, it also has the highest median price. Milliken has the lowest median price.

Property type vs. price

[Code: property_type vs. price]

Figure 11

When we look at the median price for each property type, we have to be careful. We can't say "The most expensive property type is Aparthotel, and Tent and Parking Space have a higher median price than Apartment and Castle," because Aparthotel, Tent and Parking Space have only one listing each.

Room type vs. price

[Code: room_type vs. price]

Figure 12

It goes without saying that entire home/apt has a much higher median price than the other room types.

    listings.loc[(listings.price <= 600) & (listings.price > 0)].pivot(columns='room_type', values='price').plot.hist(stacked=True, bins=100)
    plt.xlabel('Listing price in $')

Figure 13

Entire home/apt also has the largest number of listings. Inside Airbnb has pointed out that entire homes or apartments that are highly available year-round for tourists probably don't have the owner present, could be illegal, and, more importantly, are displacing residents. We will put our worries aside for the moment.

Bed type vs. price

[Code: bed_type vs. price]

Figure 14

There is no surprise here.

Amenities

The amenities text field needs a little cleaning.

    import numpy as np

    # Strip the surrounding braces (regex character class), then the double quotes
    listings.amenities = listings.amenities.str.replace("[{}]", "", regex=True).str.replace('"', "", regex=False)
    listings['amenities'].head()

Figure 15

Top 20 most common amenities.

    (pd.Series(np.concatenate(listings['amenities'].map(lambda amns: amns.split(","))))
        .value_counts().head(20)
        .plot(kind='bar'))
    ax = plt.gca()
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
    plt.show()

Figure 16

Wifi, heating, essentials, kitchen and smoke detector are among the most common amenities.

Amenities vs. price top 20

    amenities = np.unique(np.concatenate(listings['amenities'].map(lambda amns: amns.split(","))))
    # Split each row before testing membership so that, e.g., "TV" does not also match "Cable TV"
    amenity_prices = [(amn, listings[listings['amenities'].map(lambda amns: amn in amns.split(","))]['price'].mean()) for amn in amenities if amn != ""]
    amenity_srs = pd.Series(data=[a[1] for a in amenity_prices], index=[a[0] for a in amenity_prices])
    amenity_srs.sort_values(ascending=False)[:20].plot(kind='bar')
    ax = plt.gca()
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
    plt.show()

Figure 17

Interesting: amenities do seem to have some relationship with price.

Number of beds vs. price

    listings.loc[(listings.price <= 600) & (listings.price > 0)].pivot(columns='beds', values='price').plot.hist(stacked=True, bins=100)
    plt.xlabel('Listing price in $')

Figure 18

The vast majority of listings have one bed, and one-bed listings span a very wide range of prices.
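To put numbers on statements like "vast majority" and "wide range", a per-bed-count summary can complement the histogram. The sketch below uses a toy frame with made-up values; `toy` and `summary` are hypothetical names, and only the `beds` and `price` column names come from the listings data.

```python
import pandas as pd

# Toy stand-in for the filtered listings frame (made-up values).
toy = pd.DataFrame({
    'beds':  [1, 1, 1, 1, 2, 2, 0],
    'price': [45.0, 90.0, 150.0, 400.0, 120.0, 180.0, 99.0],
})

# Count, median and price range for each bed count.
summary = toy.groupby('beds')['price'].agg(['count', 'median', 'min', 'max'])

# Share of all listings that fall into each bed count.
summary['share'] = summary['count'] / summary['count'].sum()
print(summary)
```

Reading the count column next to the median also guards against over-interpreting groups that contain only a few listings.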
There are also listings that have no bed.

    sns.boxplot(y='price', x='beds', data=listings.loc[(listings.price <= 600) & (listings.price > 0)])
    plt.show()

Figure 19

Interestingly, the median price for no-bed listings is higher than for 1-bed and 2-bed listings, and the median price for 10-bed listings is very low.

Numeric features

We select several numeric features and explore them together.

    col = ['host_listings_count', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'number_of_reviews', 'review_scores_rating', 'reviews_per_month']
    sns.set(style="ticks", color_codes=True)
    sns.pairplot(listings.loc[(listings.price <= 600) & (listings.price > 0)][col].dropna())
    plt.show()

Figure 20

    corr = listings.loc[(listings.price <= 600) & (listings.price > 0)][col].dropna().corr()
    plt.figure(figsize=(6, 6))
    sns.set(font_scale=1)
    sns.heatmap(corr, cbar=True, annot=True, square=True, fmt='.2f', xticklabels=col, yticklabels=col)
    plt.show()

Figure 21

There is some good news: the number of bedrooms and accommodates seem to be correlated with price. Accommodates, beds and bedrooms are also correlated with each other, so we will keep only one of them for the model.

Modeling Listing Prices

Data pre-processing and feature engineering

Clean up the price feature.
This is the feature we are going to model and predict.

    # Assumes listings['price'] still holds the raw strings (e.g. listings.csv.gz freshly reloaded);
    # the .str accessor fails if the column has already been converted to float
    listings['price'] = listings['price'].str.replace(',', '', regex=False)
    listings['price'] = listings['price'].str.replace('$', '', regex=False)
    listings['price'] = listings['price'].astype(float)
    # .copy() so that later column assignments do not operate on a view
    listings = listings.loc[(listings.price <= 600) & (listings.price > 0)].copy()

Term document matrix for the amenities feature.

[Code: amenities]

Replace the values in the following features with 0 if "f" and 1 if "t".

    columns = ['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic',
               'is_location_exact', 'requires_license', 'instant_bookable',
               'require_guest_profile_picture', 'require_guest_phone_verification']
    for c in columns:
        # Missing values stay NaN
        listings[c] = listings[c].map({'f': 0, 't': 1})

Clean up the other monetary columns in the same way.

    for c in ['security_deposit', 'cleaning_fee']:
        listings[c] = listings[c].fillna(value=0)
        listings[c] = listings[c].replace('[$,)]', '', regex=True).astype(float)

The following are the numeric features we will be using.

[Code: numeric_features]

Fill the missing values in the numeric features with the median.

    for col in listings_new.columns[listings_new.isnull().any()]:
        listings_new[col] = listings_new[col].fillna(listings_new[col].median())

Process and add the categorical features.

    for cat_feature in ['zipcode', 'property_type', 'room_type', 'cancellation_policy', 'neighbourhood_cleansed', 'bed_type']:
        listings_new = pd.concat([listings_new, pd.get_dummies(listings[cat_feature])], axis=1)

Add the term document matrix that we created earlier from the amenities feature.

    listings_new = pd.concat([listings_new, df_amenities], axis=1, join='inner')

Data pre-processing and feature engineering done!

Random Forest Regressor

[Code: RandomForestRegressor]

Feature importance of Random Forest

    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    coefs_df['coefs'] = rf.feature_importances_
    coefs_df.sort_values('coefs', ascending=False).head(20)

Figure 22

LightGBM

[Code: LightGBM]

Feature importance of LightGBM

    feat_imp = pd.Series(clf.feature_importances_, index=X.columns)
    feat_imp.nlargest(20).plot(kind='barh', figsize=(10, 6))

Figure 23

The feature importances produced by these two models are similar.

The best result we achieved is an RMSE of less than 60 dollars, with LightGBM.

The Jupyter notebook can be found on Github. Enjoy the rest of the week.
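For readers who want to try the fit, RMSE and feature-importance steps without the full notebook, here is a minimal sketch on synthetic data. It is not the notebook's model code: the data, the feature names and the hyperparameters (`n_estimators=50`, `test_size=0.2`) are all assumptions made for illustration, and only `RandomForestRegressor` and `feature_importances_` are taken from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: price depends mostly on 'accommodates', a little on 'bathrooms'.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    'accommodates': rng.integers(1, 9, n),
    'bathrooms': rng.integers(1, 4, n),
    'noise': rng.normal(size=n),  # carries no signal
})
y = 30 * X['accommodates'] + 10 * X['bathrooms'] + rng.normal(scale=5, size=n)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)

# RMSE is the square root of the mean squared error, in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))

# Importances sum to 1; sort them so the strongest feature comes first.
feat_imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(f'RMSE: {rmse:.2f}')
print(feat_imp)
```

Because RMSE is expressed in the target's units, a value reported in dollars can be read directly as a typical prediction error per night.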
