How Machine Learning Can Lower the Search Cost for Finding Better Hikes
Perry Johnson, Jul 9

I recently went on a weekend camping trip in The Enchantments, which is just over a two-hour drive from where I live in Seattle, WA.
To plan for the trip, we relied on Washington Trails Association (WTA) and a few other resources to make sure we had the optimal trail routes and camping spots for each day.
Many of these outdoor adventure resources can help folks plan multi-day camping trips, figure out where to go for a hike with parents, or correctly traverse Aasgard Pass, a sketchy 2,300-foot elevation gain in less than a mile.
But there is still something lacking.
The one thing I really wish all these hiking resources had was personalized hiking suggestions based on hikes I’ve enjoyed in the past (i.e. if I liked hike X, I’ll like hike Y because it has similar hiking trail features) or based on how I’ve rated hikes in the past (i.e. Goodreads personalized book recommendations based on your reviews).
Many of these outdoor adventure apps give users the ability to search for hikes via custom filtering of specific hiking trail attributes (distance, elevation gain, difficulty, etc.) but they lack any intelligent recommendation algorithms that would lower the search cost of finding hikes that users really like.
I set out to ask the following questions:
1. What types of intelligent recommendations would be useful for an avid hiker?
2. Can I build a personalized recommendation engine that leverages hiking reviews?
3. Can I create a Power Ratings formula that blends together total number of reviews and average rating for a given hike?
To answer these questions, I walk through how to build a full-stack machine learning web application, which would provide folks with intelligent hiking recommendations based on hiking trail attributes and user reviews.
This would help avid hikers to find better trails.

Why Not Just Build this Machine Learning Application?
I actually already built and deployed this machine learning application, but it’s a violation of many companies’ Terms of Service to scrape and use their data, so I took it offline along with the corresponding blog and GitHub repository.
I certainly was naive and at the time I didn’t completely grasp the conventions around scraping.
I feel really bad about this as I wasn’t familiar with how scraping data was viewed.
I just got deeply interested in solving this problem and wanted to add an intelligent feature on top of an application I really love, so that my friends and I could find better hikes.
I was hoping this would drive even more traffic to the underlying application as a result.
Unfortunately things didn’t quite work out that way — and the company politely asked me to take down the app I built.
So for the rest of this post, I’ll share a high-level view of how I’d build a hypothetical trail recommendations app.
I’ll avoid going into detail about how to scrape or extract data, and focus on the pipeline instead.

Data Collection and Machine Learning Pipeline
[Figure: Hypothetical full data pipeline]

Data
In a perfect world, I would be pulling down data directly from an internal database.
In a hypothetical world, I’d write a Python web scraping script to grab hiking trail attribute and user hiking review data.
For Washington state, this would leave me with data for ~3,500 hikes and ~200,000 user reviews which I would want to store in a MongoDB database that I hooked up locally.

Hiking Trail Features
These are the hiking trail features that I would want to use to build a similar-hike algorithm and a personalization algorithm based on user ratings.
[Figure: Synthetically Generated Hiking Trail Features]
The numerical features would be total distance (in miles), elevation gain (in feet), and elevation severity.
The remaining features would be binary tags, with a value of 0 for “No” or 1 for “Yes” depending on whether the feature described a given hike.
Most of these features could be created directly from cleaning the raw data stored in MongoDB.
I’d also want to engineer a few additional features.
I would include:
· Elevation Severity: elevation in feet gained per mile of hike distance
· Foot Traffic: I would parse a trail summary paragraph and user reviews for language that described the typical foot traffic for a given trail. These would be categorized as Heavy, Moderate, Light and Unknown.

Recommender Systems
To build the machine learning models, I would leverage Apple’s open-source machine learning library, Turi Create, as it offers a lot of flexibility for developing custom recommendation models.
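Before turning to the models, here is a minimal sketch of how the two engineered features described earlier (elevation severity and foot traffic) might be derived. The keyword lists, function names and sample text are all hypothetical; a real version would need a vocabulary tuned against actual trail summaries and reviews.

```python
import re

# Hypothetical keyword lists; these phrases are illustrative assumptions,
# not taken from any real trail description.
TRAFFIC_KEYWORDS = {
    "Heavy": ["crowded", "very busy", "packed", "heavy traffic"],
    "Moderate": ["moderate traffic", "fairly busy", "some other hikers"],
    "Light": ["quiet", "secluded", "solitude", "few people"],
}


def elevation_severity(elevation_gain_ft, distance_miles):
    """Feet of elevation gained per mile of hike distance."""
    if distance_miles <= 0:
        raise ValueError("distance_miles must be positive")
    return elevation_gain_ft / distance_miles


def foot_traffic(texts):
    """Label foot traffic by counting keyword hits in a summary and reviews."""
    combined = " ".join(texts).lower()
    counts = {
        label: sum(len(re.findall(re.escape(phrase), combined)) for phrase in phrases)
        for label, phrases in TRAFFIC_KEYWORDS.items()
    }
    best_label = max(counts, key=counts.get)
    # If no keyword matched at all, we cannot say anything about the trail.
    return best_label if counts[best_label] > 0 else "Unknown"


# Made-up inputs: a 3,000 ft climb over 8 miles, and two invented reviews.
severity = elevation_severity(3000, 8.0)
traffic = foot_traffic(["Beautiful lake but the trail was crowded and packed by 9am."])
unknown = foot_traffic(["Great views of the valley."])
```

Simple keyword counting is only one option here; it is easy to inspect, but it will miss phrasing that is not in the lists, which is why the Unknown category matters.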

Item Content Similarity
This recommender only takes into account hiking trail attributes.
It looks at each distinct pair of hikes and calculates how similar they are based on the hiking trail features.
This similarity score is calculated by first computing the similarity for each feature and then taking a weighted average of those to get the final similarity.
This is useful because a user can specify a hike they know they like and this recommender will provide hikes that are most similar to that.
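To make the "per-feature similarity, then weighted average" idea concrete, here is a small pure-Python sketch. It is not Turi Create's implementation: the per-feature similarity functions (range-scaled distance for numeric features, exact match for 0/1 tags), the equal weights, and the hike data are all assumptions chosen for illustration.

```python
# Hypothetical hikes with made-up attribute values, for illustration only.
HIKES = {
    "Hike A": {"distance": 8.0, "elevation_gain": 3100, "lake": 1, "urban": 0},
    "Hike B": {"distance": 8.2, "elevation_gain": 2000, "lake": 1, "urban": 0},
    "Hike C": {"distance": 1.0, "elevation_gain": 50, "lake": 0, "urban": 1},
}
NUMERIC = ["distance", "elevation_gain"]
BINARY = ["lake", "urban"]
# Assumed equal weights; a real system would tune these.
WEIGHTS = {"distance": 1.0, "elevation_gain": 1.0, "lake": 1.0, "urban": 1.0}


def feature_ranges(hikes):
    """Range (max - min) of each numeric feature, used to scale differences."""
    ranges = {}
    for feature in NUMERIC:
        values = [attrs[feature] for attrs in hikes.values()]
        ranges[feature] = (max(values) - min(values)) or 1.0
    return ranges


def similarity(a, b, ranges):
    """Weighted average of per-feature similarities, each in [0, 1]."""
    total = 0.0
    for feature in NUMERIC:
        total += WEIGHTS[feature] * (1.0 - abs(a[feature] - b[feature]) / ranges[feature])
    for feature in BINARY:
        total += WEIGHTS[feature] * (1.0 if a[feature] == b[feature] else 0.0)
    return total / sum(WEIGHTS.values())


def most_similar(name, hikes, n=2):
    """Return the n hikes most similar to `name`, best first."""
    ranges = feature_ranges(hikes)
    scores = [
        (other, similarity(hikes[name], attrs, ranges))
        for other, attrs in hikes.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


recommendations = most_similar("Hike A", HIKES)
```

With these made-up numbers, Hike B (similar length, also has a lake) ranks ahead of the short urban Hike C when asking for hikes like Hike A.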
Example: “If you like the Mount Si Trail, here are hikes that have the most similar hiking attributes to the Mount Si Trail.”

Model Efficacy Heuristic
Prior to building the model, I would have a few specific trail examples where I could test the algorithm for the quality of its recommendations.
These are a couple of examples I am familiar with based on my own experience:
Pike Place Market → should probably get recommendations for other urban-like, short-distance, low-elevation-gain walks
Lake Serene Trail → should probably get a recommendation for Colchuck Lake, as they are both high-foot-traffic, challenging hikes of similar distance with an alpine lake
In the above example, both of these test cases check out as they should, with Pike Place Market returning other short urban walks and Lake Serene Trail returning Colchuck Lake as its second most similar hike.

Ranking Factorization
If we had the actual hike ratings given by users, then choosing the optimal model would depend on whether we wanted to predict the rating a user would give for any particular hike, or whether we wanted the model to recommend hikes that it believes the user would rate highly.
We likely care more about ranking performance, as in we’d want to recommend hikes that users would likely rate highly.
The RankingFactorizationRecommender recommends hikes that are both similar to the hikes a user has already reviewed and likely to be rated highly by that user.
The intuition behind this recommender is that there should be some latent features that determine how a user rates a hike.
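The latent-feature intuition can be illustrated with a tiny matrix factorization trained by stochastic gradient descent. This is a simplified sketch, not Turi Create's RankingFactorizationRecommender: it only minimizes squared rating error and does not reproduce any ranking objective. The ratings, user names, hike names and hyper-parameter values are all made up.

```python
import math
import random

# Made-up (user, hike, stars) ratings, purely for illustration.
RATINGS = [
    ("ann", "lake_loop", 5), ("ann", "ridge_walk", 4), ("ann", "city_stroll", 1),
    ("bob", "lake_loop", 4), ("bob", "city_stroll", 2), ("bob", "summit_push", 5),
    ("cat", "city_stroll", 5), ("cat", "ridge_walk", 2), ("cat", "summit_push", 1),
]


def train(ratings, n_factors=2, epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn one small latent vector per user and per hike with SGD."""
    rng = random.Random(seed)
    mean = sum(stars for _, _, stars in ratings) / len(ratings)
    users = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
             for u in sorted({u for u, _, _ in ratings})}
    hikes = {h: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
             for h in sorted({h for _, h, _ in ratings})}
    for _ in range(epochs):
        for user, hike, stars in ratings:
            p, q = users[user], hikes[hike]
            err = stars - (mean + sum(a * b for a, b in zip(p, q)))
            for k in range(n_factors):
                p_k, q_k = p[k], q[k]
                # Gradient step on squared error with L2 regularization.
                p[k] += lr * (err * q_k - reg * p_k)
                q[k] += lr * (err * p_k - reg * q_k)
    return mean, users, hikes


def predict(model, user, hike):
    """Predicted stars: global mean plus the dot product of latent vectors."""
    mean, users, hikes = model
    return mean + sum(a * b for a, b in zip(users[user], hikes[hike]))


def rmse(model, ratings):
    errors = [(stars - predict(model, u, h)) ** 2 for u, h, stars in ratings]
    return math.sqrt(sum(errors) / len(errors))


def recommend(model, ratings, user, n=2):
    """Rank the hikes the user has not rated yet by predicted stars."""
    seen = {h for u, h, _ in ratings if u == user}
    candidates = [h for h in model[2] if h not in seen]
    return sorted(candidates, key=lambda h: predict(model, user, h), reverse=True)[:n]


untrained = train(RATINGS, epochs=0)
model = train(RATINGS)
```

Each user and each hike ends up with a short vector of numbers; nobody tells the model what those numbers mean, but hikes that the same people rate similarly drift toward similar vectors.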
[Figure: Building a Hiking Recommender System with Matrix Factorization]
Example: “For Perry, a user that has rated some hikes, here are the hikes that Perry would likely rate very highly.”

Building This Model
On most outdoor applications, users explicitly rate hikes with a number of stars (0=strong dislike, 5=strong like).
If we had ~200,000 of these ratings (records saying that user A rated hike X with Y stars) from the past, then we could build the Ranking Factorization recommender.
I would use a technique called split validation, where we take only a subset (80%) of these ratings (called the training set) to train the model, and then we ask the model to predict the ratings on the 20% we’ve hidden (the test set).
For example, it may happen that a test user rated some hike with 4 stars, but the model predicts 3.5, hence it has an error of 0.5 on that rating.
Then we just compute the average of the errors from the whole test set using the root mean squared error (RMSE) formula to get a final result.
That’s how to quantify the prediction performance of this recommender system.
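As a rough sketch of the split-validation procedure, the snippet below shuffles a list of ratings, holds out 20% as a test set, and scores predictions with RMSE. The ratings are invented, and the "model" is just a stand-in that predicts the mean training rating for every hike; any real recommender would be plugged in at that step.

```python
import math
import random


def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle the ratings and split them into training and test sets."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings: (user, hike, stars).
ratings = [("user%d" % i, "hike%d" % (i % 4), 1 + i % 5) for i in range(10)]
train, test = split_ratings(ratings)

# Stand-in model (an assumption for this sketch): always predict the mean
# rating seen in the training set.
mean_rating = sum(stars for _, _, stars in train) / len(train)
test_error = rmse([stars for _, _, stars in test], [mean_rating] * len(test))

# The single-rating case from the text: actual 4 stars, predicted 3.5.
single_error = rmse([4.0], [3.5])
```

Because RMSE squares each error before averaging, a few badly wrong predictions hurt the score more than many slightly wrong ones.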
I would iterate through a few different hyper-parameter values for this model to minimize the RMSE on held-out data before implementing it in the application.

Popularity
Popularity-based recommenders are not intelligent, but they are a useful data product and a potential solution for the cold start problem if a user hasn’t hiked in a specific area before (or has never hiked before!).
These are generally fun and a useful baseline when searching for hikes to go on.

Number of Reviews
This would recommend the most popular hikes based on number of reviews.

Average Stars (Specify a Minimum Number of Reviews to Qualify)
This would recommend the most popular hikes based on the ratings.
I’d start by analyzing the distribution of number of reviews to get a feel for where the reviews are.
In this synthetically generated dataset, it’s clear that most 5-star (perfect score) hikes have only a small number of reviews.
Therefore, I’d require a hike to have a minimum of 100 reviews to count towards this recommender.
For example, I would want to ensure that a hike with ~6 reviews that had a 5-star score was not included in this.
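Both popularity recommenders reduce to a filter and a sort. The sketch below assumes each hike is a (name, number of reviews, average stars) tuple; the hikes and their numbers are invented, and only the 100-review cutoff comes from the text above.

```python
# Made-up hikes: (name, number_of_reviews, average_stars).
HIKES = [
    ("Hike A", 6, 5.0),
    ("Hike B", 450, 4.6),
    ("Hike C", 1200, 4.3),
    ("Hike D", 99, 4.9),
    ("Hike E", 100, 4.4),
]


def most_reviewed(hikes, n=3):
    """Top n hikes by number of reviews."""
    return sorted(hikes, key=lambda hike: hike[1], reverse=True)[:n]


def top_rated(hikes, min_reviews=100, n=3):
    """Top n hikes by average stars, among hikes with enough reviews."""
    qualified = [hike for hike in hikes if hike[1] >= min_reviews]
    # Ties on stars are broken in favor of the hike with more reviews.
    return sorted(qualified, key=lambda hike: (hike[2], hike[1]), reverse=True)[:n]


by_reviews = [name for name, _, _ in most_reviewed(HIKES)]
by_stars = [name for name, _, _ in top_rated(HIKES)]
```

Here Hike A (5 stars from 6 reviews) and Hike D (4.9 stars from 99 reviews) never appear in the top-rated list, which is exactly the effect the minimum is meant to have.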
[Figure: Synthetically Generated: 5 Star Ratings dominated by Low Number of Reviewed Hikes]

Power Ratings
I would create a custom Power Ratings score ranging from zero to 100 that blended the number of reviews and average rating into the same score.
A hike rated as 4.9 stars with only 10 reviews should probably not be rated as highly as a hike rated as 4.6 stars with 1000 reviews.
[Figure: Synthetically Generated: Distribution of Hike Ratings before blending in the Number of Reviews]

Step One: The Formula
Power Rating = (Number of Reviews / (Number of Reviews + Number of Reviews in 90% quantile)) * Rating + (Number of Reviews in 90% quantile / (Number of Reviews in 90% quantile + Number of Reviews)) * Average Rating Across All Hikes
[Figure: Synthetically Generated: Distribution of Hike Ratings after applying the formula to blend number of reviews and ratings score]

Step Two: MinMaxScaler
The MinMaxScaler scales and transforms data such that it is in a range between zero and one based on a formula using the minimum and maximum value of the specified data.
I would then multiply each value by 100 to scale the Power Ratings Score to a range between zero and 100.
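The two steps can be sketched end to end in plain Python. A few assumptions are baked in: the 90% quantile is computed with the standard library's inclusive method, "Average Rating Across All Hikes" is taken as the unweighted mean of the per-hike averages, and the min-max step is written out by hand as (x - min) / (max - min), which is the same formula scikit-learn's MinMaxScaler applies with its default range. The hikes and their numbers are invented, apart from the 4.9-star/10-review and 4.6-star/1000-review pair from the text.

```python
import statistics


def power_ratings(hikes):
    """hikes: dict of name -> (number_of_reviews, average_rating).

    Returns a dict of name -> Power Rating scaled to the range 0..100.
    """
    counts = [n for n, _ in hikes.values()]
    # Last of the nine decile cut points is the 90% quantile of review counts.
    q90 = statistics.quantiles(counts, n=10, method="inclusive")[-1]
    overall = statistics.mean(rating for _, rating in hikes.values())

    # Step one: blend each hike's own rating with the overall average,
    # trusting the hike's rating more as its review count grows.
    blended = {
        name: (n / (n + q90)) * rating + (q90 / (q90 + n)) * overall
        for name, (n, rating) in hikes.items()
    }

    # Step two: min-max scale to 0..1, then multiply by 100.
    lo, hi = min(blended.values()), max(blended.values())
    span = (hi - lo) or 1.0
    return {name: 100 * (score - lo) / span for name, score in blended.items()}


# Made-up data, except for the first two entries, which mirror the text.
HIKES = {
    "few_reviews": (10, 4.9),
    "many_reviews": (1000, 4.6),
    "hike_c": (200, 4.0),
    "hike_d": (50, 3.5),
    "hike_e": (400, 4.2),
}
scores = power_ratings(HIKES)
```

In this toy dataset the 4.6-star hike with 1000 reviews ends up with a higher Power Rating than the 4.9-star hike with 10 reviews, because the latter is pulled strongly toward the overall average.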
[Figure: Synthetically Generated: Power Ratings Distribution once scaled to values between 0 and 100]

The Application
Once we have viable machine learning models for hiking recommendations, I’d build out the application using Flask, a web framework written in Python.

Find a Hike Similar to One You’ve Liked
To get hikes similar to ones they’ve enjoyed, a user would simply specify a hike that they liked, and the app would return a list of the n most similar hikes with trail information, along with an embedded link back to each hike’s profile.

Find Personalized Hikes Based on Your Ratings
Ideally, we’d have integration with an outdoor adventure company’s user login and profile data, but if that wasn’t possible, I could create a unique User ID that maps to a user’s full name.
To get personalized recommendations, a user enters the full name associated with their account, in my case: Perry Johnson.
Then a list of the top n hikes that I would likely enjoy based on my reviews would be provided, along with trail information and an embedded link back to the respective hike profile.

Conclusion
By building this novel data product on top of hiking trail attributes and user reviews, we would provide avid hikers with smarter hiking recommendations.
This would be the first set of machine learning algorithms that would personalize the user experience and lower the search cost to find better hiking trails.

Comments or Questions?
Please email me at: perryrjohnson7@gmail.com
You can check out some of my other work:
How Machine Learning Can Help You Charge Your E-Scooters
Reverse Engineering the Walk Score Algorithm