# Predicting Airbnb prices with machine learning and deep learning

## Experimentation with XGBoost and tuning neural networks

Laura Lewis · May 22

## Project aims and background

Airbnb is a home-sharing platform that allows home-owners and renters (‘hosts’) to put their properties (‘listings’) online, so that guests can pay to stay in them.
Hosts are expected to set their own prices for their listings.
Although Airbnb and other sites provide some general guidance, there are currently no free and accurate services which help hosts price their properties using a wide range of data points.
Paid third party pricing software is available, but generally you are required to put in your own expected average nightly price (‘base price’), and the algorithm will vary the daily price around that base price on each day depending on day of the week, seasonality, how far away the date is, and other factors.
Airbnb pricing is important to get right, particularly in big cities like London where there is lots of competition and even small differences in prices can make a big difference.
It is also a difficult thing to do correctly — price too high and no one will book.
Price too low and you’ll be missing out on a lot of potential income.
This project aims to solve this problem by using machine learning and deep learning to predict the base price for properties in London.
I’ve explored the preparation and cleaning of Airbnb data and conducted some exploratory data analysis in previous posts.
This post is all about the creation of models to predict Airbnb prices.
## The dataset

The dataset used for this project comes from Insideairbnb.com, an anti-Airbnb lobby group that scrapes Airbnb listings, reviews and calendar data from multiple cities around the world.
The dataset was scraped on 9 April 2019 and contains information on all London Airbnb listings that were live on the site on that date (about 80,000).
The data is quite messy, and has some limitations.
The major one is that it only includes the advertised price (sometimes called the ‘sticker’ price).
The sticker price is the overall nightly price that is advertised to potential guests, rather than the actual average amount paid per night by previous guests.
The advertised prices can be set to any amount by the host.
Nevertheless, this dataset can still be used as a proof of concept.
A more accurate version could be built using data on the actual average nightly rates paid by guests, e.g. from sites like AirDNA that scrape and sell higher quality Airbnb data.
After cleaning and dropping collinear columns, the features in the model were:

- The number of people the property accommodates
- The number of bathrooms
- Property type (e.g. apartment) and room type (e.g. entire home)
- Location of the property (at the level of borough, discussed further in a previous post, or in one model at the level of latitude and longitude, discussed further below)
- Security deposit, cleaning fee and extra person fee
- Minimum and maximum nights stay
- Number of days available to book in the next 90 days
- Total number of reviews
- Review ratings for each category (accuracy, cleanliness, check-in, communication, location, value and overall total)
- Amount of time since the first and most recent reviews
- The type of cancellation policy
- Whether the property is instant bookable
- The presence or absence of a wide range of amenities (discussed in further depth in a previous post, but including items like TVs, coffee machines, balconies, internet and parking, whether or not the property is child-friendly, allows self check-in or allows pets, and many others)
- Host response times and rates
- Whether or not a host is a superhost (a mark of quality, requiring various conditions to be met) or has their identity verified (e.g. by verifying government ID, a phone number and an email address)
- How many listings the host is responsible for in total
- How many days the host has been listing on Airbnb

## Building a machine learning model

In the interests of space I’ll skip the data preparation stage here, but all the code for this project can be found in my GitHub repo if you’re interested.
To summarise, after cleaning the data, checking for multi-collinearity and removing collinear features, the data was standardised using sklearn’s StandardScaler() unless otherwise stated.
Categorical features were one-hot encoded using pd.get_dummies(), and a train-test split was performed.
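Sketched concretely, the preprocessing just described (one-hot encoding, a train-test split, then standardisation fitted on the training set only) looks something like this. The column names and toy data here are illustrative, not the real Insideairbnb schema:

```python
# Minimal sketch of the preprocessing pipeline (illustrative data)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'accommodates': [2, 4, 1, 6],
    'property_type': ['apartment', 'house', 'apartment', 'house'],
    'price': [50.0, 120.0, 35.0, 200.0],
})

# One-hot encode the categorical feature
df = pd.get_dummies(df, columns=['property_type'])

X = df.drop(columns='price')
y = df['price']

# Hold out a test set, then standardise using statistics from the
# training set only, to avoid leaking test-set information
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training set and only transforming the test set is the standard way to keep the evaluation honest.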
Although I was keen to experiment with deep learning models for price prediction, I first built a vanilla (non-tuned) XGBoost machine learning model (specifically, xgb.XGBRegressor()).
This was in order to provide a baseline level of accuracy, and also to allow for the measuring of feature importance (something which is notoriously difficult once you enter the realm of deep learning).
XGBoost is likely to provide the best achievable accuracy using machine learning models (other than possible small accuracy increases from hyper-parameter tuning) due to its superior performance and general awesomeness as observed in Kaggle competitions.
Because this is a regression task, the evaluation metric chosen was mean squared error (MSE).
I was also interested in accuracy, so I also had a look at the r squared value for each model produced.
Here’s my code to fit and evaluate the model. Results:

Training MSE: 0.1576
Validation MSE: 0.159
Training r2: 0.7321
Validation r2: 0.7274

Not bad for an un-tuned model.
Now for the feature importances. The top 10 most important features are:

1. How many people the property accommodates
2. The cleaning fee
3. How many other listings the host has (and whether they are a multi-listing host)
4. How many days are available to book out of the next 90
5. The fee per extra person
6. The number of reviews
7. The number of bathrooms
8. The security deposit
9. If the property is in Westminster
10. The minimum nights stay

It is not surprising that the most important feature is how many people the property accommodates, as that’s one of the main things you would use to search for properties in the first place.
It is also not surprising that features related to location and reviews are in the top ten.
It is perhaps more surprising that the third most important feature is related to how many other listings the host manages on Airbnb, rather than the listing itself.
However, this does not mean that managing more properties causes a listing to command higher prices (although the relationship does run in that direction).
Firstly, the data appears to be somewhat skewed by a few very large property managers.
Secondly, the relationship is with the advertised prices set, rather than actual prices achieved, suggesting that, if anything, more experienced hosts tend to set (rather than necessarily achieve) higher prices. And thirdly, we cannot necessarily infer a causal relationship: it could be that more experienced multi-listing hosts tend to take on more expensive properties (which is indeed the case for some, e.g. One Fine Stay).
It is also notable that three other fee types — cleaning, security and extra people — all make the top 10 feature list.
It is likely that when a host sets a higher price for the nightly stay they are also likely to set other prices high, or vice versa.
## Building a deep learning model

Next up, I decided to experiment with neural networks (NNs), to see if I could improve upon the XGBoost model’s score.
I started off with a relatively shallow three-layer NN with densely-connected layers, using a relu activation function for the hidden layers and a linear activation function for the output layer (as the model is being used for a regression task).
The loss function was mean squared error (again, because this is for regression).
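An architecture of this shape can be sketched in Keras roughly as follows; the layer widths (128 and 64) and input size are my assumptions, not necessarily the original ones:

```python
# Sketch of a shallow dense regression network: relu hidden layers,
# a linear output, and MSE loss (layer widths are assumptions)
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='linear'),  # single continuous output
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

model = build_model(20)
model.summary()
```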
Here’s my code, along with the model summary and visualisation.

In order to save time when evaluating multiple models, I built a handy function to print the MSE and r squared results for the train and test sets, as well as produce a line graph of the loss in each epoch and a scatterplot of predicted vs. actual values. Here are the results:

Training MSE: 0.0331
Validation MSE: 0.2163
Training r2: 0.9438
Validation r2: 0.6292

Compared to the XGBoost model, the neural network has performed worse.
Overfitting also seems to be an issue, as seen from the gap between the train and test MSE and r squared results, the divergence between the train and test losses in the line graph, and the fact that the training predictions cluster more closely to the line in the scatterplot than the validation predictions do.
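A helper along the lines of the one described (print MSE and r squared for both sets, plot the per-epoch losses and a predicted-vs-actual scatterplot) can be sketched as follows; the function name and signature are my own, not the original code:

```python
# Sketch of an evaluation helper: prints MSE / r2 for train and
# validation predictions, plots loss curves and predicted vs. actual
import matplotlib
matplotlib.use('Agg')  # draw off-screen so this also runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

def report_model(y_train, train_preds, y_test, test_preds, history):
    scores = {}
    for name, y_true, y_pred in [('Training', y_train, train_preds),
                                 ('Validation', y_test, test_preds)]:
        scores[name] = {'mse': mean_squared_error(y_true, y_pred),
                        'r2': r2_score(y_true, y_pred)}
        print(f"{name} MSE: {scores[name]['mse']:.4f}, "
              f"r2: {scores[name]['r2']:.4f}")

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    # Loss per epoch for train and validation sets
    ax1.plot(history['loss'], label='train loss')
    ax1.plot(history['val_loss'], label='val loss')
    ax1.set_xlabel('epoch')
    ax1.legend()
    # Predicted vs. actual values on the validation set
    ax2.scatter(y_test, test_preds, s=5)
    ax2.set_xlabel('actual')
    ax2.set_ylabel('predicted')
    return scores
```

With a Keras model, `history` would be the `model.fit(...).history` dict.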
I then iterated through various other versions of the model in order to try and remove the overfitting and increase the accuracy.
Overfitting was removed in each other version, although accuracy did vary.
The adjustments that I experimented with were:

- Adding a fourth and fifth layer: a fourth layer improved the accuracy, but a fifth layer didn’t help.
- Using L1 regularization: this proved to be the biggest boost to accuracy.
- Using dropout regularization at 30% and 50% dropout rates: 50% turned out to be a terrible idea and significantly increased the MSE. 30% performed better, but not as well as L1 regularization.
- Using a stochastic gradient descent (SGD) optimiser instead of Adam: this performed slightly worse.
- Changing the batch size: this didn’t make much difference.
- Training for more epochs: this helped a bit for some models, but most models minimised the loss function fairly quickly anyway.
- Removing most of the review rating columns: high review ratings in one category were fairly highly correlated with ratings in the other categories, so I tried removing all except the overall rating. I then used this new truncated dataset to train the previously highest performing model architecture (with L1 regularization and an Adam optimizer). This performed essentially the same, but with 18 fewer columns, so it would be the preferred model to put into production, as it requires less data and is less computationally expensive.
- Using latitude and longitude instead of borough: again, this adjusted dataset was used with the best model architecture so far. It performed slightly worse.
- Using MinMaxScaler() instead of StandardScaler(): this also performed slightly worse.
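In Keras, the L1 and dropout experiments above come down to adding a kernel_regularizer argument to the dense layers and inserting a Dropout layer. A sketch (the layer widths, the 0.001 penalty strength and the 30-feature input are assumptions):

```python
# Sketch of adding L1 weight penalties and dropout to a dense network
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(30,)),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dropout(0.3),  # 30% dropout; 50% hurt accuracy in my tests
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dense(1, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')
```

The L1 penalty is added to the loss during training, pushing many weights towards zero, which is one plausible reason it reduced overfitting here.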
In the end, the best NN was the four-layer model with L1 regularization and an Adam optimizer, with the extra review columns removed (my best performing neural network architecture). Results:

Training MSE: 0.1708
Validation MSE: 0.1689
Training r2: 0.7096
Validation r2: 0.7105

I still haven’t gotten to the bottom of why the NN is unable to predict values for the log-transformed price lower than about 3.1, but I’m pretty sure it’s something to do with the use of regularization. However, even this model architecture did not perform quite as well as the XGBoost model.
Overall, the XGBoost model is the preferred model, as it performs ever so slightly better than the best neural network and is less computationally expensive.
It could possibly be improved even further with hyper-parameter tuning.
## Conclusions

This is one of those situations where deep learning simply isn’t necessary for prediction, and a machine learning model performs just as well.
However, even the best performing model was only able to explain 73% of the variation in price. The remaining 27% is probably attributable to features that were not present in the data.
It is likely that a significant proportion of this unexplained variance is due to variations in the listing photos.
The photos of properties on Airbnb are very important in encouraging guests to book, and so can also be expected to have a significant impact on price — better photos (primarily better quality properties and furnishings, but also better quality photography) equal higher prices.
## Potential directions for future work

- Find a way to incorporate image quality into the model, e.g. by using the output of a convolutional neural network to assess image quality as an input into the pricing model.
- Use better quality/more accurate data which includes the actual average prices paid per night.
- Include a wider geographic area, e.g. the rest of the UK or other major cities around the world.
- Augment the model with natural language processing (NLP) of listing descriptions and/or reviews, e.g. for sentiment analysis or looking for keywords.
- In addition to predicting base prices, a sequence model could be created to calculate daily rates using data on seasonality and occupancy, which would allow the creation of actual pricing software.
- Tailor the model more specifically to new listings, in order to help hosts set prices for new properties, by removing features that would not be known at the time, e.g. other fees, availability and reviews.
If you found this post interesting or helpful, please let me know via the medium of claps and/or comments, and you can follow me to be notified about future posts.

Thanks for reading!