ML-Powered Product categorization for smart shopping options

Enter Machine Learning!

Machine Learning Approach for Hierarchical Product Categorization

Challenges

The machine learning (ML) approach uses text classification methods to assign each product to a category based on its title or description. Before going into the details of the method, let us look at the challenges in implementing it.

Availability of training data: ML algorithms need training data to learn the right answers before they start predicting. In most ML applications, this data comes from historical business processes, but none is available in this case. In such cases, it is common to create training data manually or through a heuristic method. For a text classification problem, every category one is trying to predict needs substantial representation in the training data. The manual approach is time-consuming and error-prone, as discussed earlier. Here is how the training data looks in our case.

Fig.5: Training data with the entire category block as one category

We used a mix of manual effort and rule-based heuristic applications like Feedonomics to come up with a training set of 25 million products.

Large-scale data: The number of product listings for a typical e-commerce aggregator is in the range of 45–50 million. Data at this scale requires large storage and, in many cases, distributed computing power. Text classification also needs text pre-processing methods, which are themselves very computation-intensive.

We used a 32GB RAM machine on AWS for prototyping the model and running experiments.

Hierarchical taxonomy: As is evident in the examples above, the categories we want to predict are hierarchical, with different levels.
The one shown in the example below has 4 levels, with Home & Garden at level 1, Household Supplies at level 2, and so on.

There are two approaches to handle this:

i) All the levels together as one category: In this approach, the whole block shown in the example above is treated as one category. If there are 10 distinct categories across products in level 1, 15 in level 2 and 20 in level 3, there could be up to 3,000 (10 × 15 × 20) such categories. As you can see, the number of combined categories grows multiplicatively with the number of distinct categories at each level. A large number of categories is difficult to handle and leads to lower accuracy in text classification methods.

ii) Different classification for different levels: This method entails a nested, iterative approach. In the first pass, level 1 of the hierarchy is predicted. In the second pass, a separate model is run for each level 1 category to predict the level 2 category. This method gives better accuracy but increases the number of models that need to be trained.

In the second approach, the category column is split into several levels, as shown below.

Fig.6: Training data with categories broken into sub-levels. Each level of a category is predicted separately.

In the first pass of approach 2, Level_1_Cat becomes the variable we train on and predict. There is just 1 model in the first pass. In the next pass, two separate classification models are created, one for products in Apparel & Accessories and one for products in Health & Beauty. Level_2_Cat becomes the variable we train on for each of these models. For the 3rd pass, we take the distinct combinations of Level_1_Cat and Level_2_Cat and build a model for each group, with Level_3_Cat as the variable on which the model is trained.

In approach 1, there would be just 1 model. In contrast, in approach 2, there would be as many models as there are unique groups of products at that pass level. If there are 10 distinct categories across products in level 1, 15 in level 2 and 20 in level 3, there would be 1 model in pass 1, 10 models in pass 2 and up to 150 (10 × 15) models in pass 3.
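To make the bookkeeping for the two approaches concrete, here is a minimal Python sketch. It is not our production code: the " > " separator, the function names and the three sample rows are all assumptions made up for illustration. It splits a full category block into levels, computes the worst-case category and model counts from the worked numbers above, and groups training rows so that each group corresponds to one model in approach 2.

```python
from collections import defaultdict

# Assumption: levels in a category block are separated by " > ".
# The article does not state the actual delimiter.
SEP = " > "


def split_levels(category_path, sep=SEP):
    """Split a full category block into its per-level names."""
    return [part.strip() for part in category_path.split(sep)]


def flat_category_upper_bound(distinct_per_level):
    """Approach 1: worst-case number of combined categories."""
    total = 1
    for n in distinct_per_level:
        total *= n
    return total


def max_models_per_pass(distinct_per_level):
    """Approach 2: worst-case number of models in each pass."""
    counts, models = [], 1
    for n in distinct_per_level:
        counts.append(models)  # one model per parent group in this pass
        models *= n            # each group can fan out into n children
    return counts


def training_groups(rows, depth):
    """Approach 2: group (title, path) rows by parent prefix for each pass.

    Returns one dict per pass. Each key is a parent prefix (a tuple) and
    stands for one model; each value holds that model's (title, label) rows.
    """
    passes = [defaultdict(list) for _ in range(depth)]
    for title, path in rows:
        levels = split_levels(path)
        for i in range(min(depth, len(levels))):
            parent = tuple(levels[:i])  # () in pass 1, (level 1,) in pass 2
            passes[i][parent].append((title, levels[i]))
    return passes


# Hypothetical rows, invented for this example only.
rows = [
    ("Cotton crew socks", "Apparel & Accessories > Clothing > Socks"),
    ("Leather belt", "Apparel & Accessories > Clothing Accessories > Belts"),
    ("Vitamin C serum", "Health & Beauty > Personal Care > Skin Care"),
]

passes = training_groups(rows, depth=3)
models_per_pass = [len(groups) for groups in passes]

print(flat_category_upper_bound([10, 15, 20]))  # worst case for approach 1
print(max_models_per_pass([10, 15, 20]))        # worst case for approach 2
print(models_per_pass)                          # actual groups in the sample
```

In practice the number of models in a pass is the number of parent groups that actually occur in the data, which is why the sample rows need far fewer models than the worst-case product suggests. Each group returned by training_groups would then be handed to whichever text classifier is being used.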
