It is convenient, offers a wide selection of products, and gets us great deals.

Retailers also benefit from the reach the internet provides and get a chance to establish their brand.

As a result, the online fashion industry has seen tremendous growth in recent years.

However, shopping for clothes online is still tricky owing to wide sizing variations across different brands that make it tough for customers to identify proper fitting clothes.

For customers, it leads to a bad shopping experience as they might have to return the products and for retailers, this results in monetary losses.

Thus, automatically providing accurate and personalized fit guidance is critical for improving the online shopping experience and reducing product return rates.

Prerequisites

This post assumes familiarity with the following concepts.

You can follow the associated links to learn about these before moving further.

Latent Factor Model
Stochastic Gradient Descent
Ordinal Regression
Hinge Loss

Goal

We know that any clothing product comes in many different catalog sizes.

Many online retailers nowadays allow customers to provide fit feedback (e.g. Small, Fit or Large) on the purchased product during the product return process or when leaving reviews.

For distinction, we would refer to a product (e.g. a Northface Jacket) as a “parent product” and its different sizes (e.g. Small/Medium sized Northface Jacket) as “child products”.

Assuming we have transactional data where each purchase can be represented as a triple of (customer, child product, fit feedback), we can state our goal as follows:

Learn customers’ fit preferences and child products’ sizing related properties from the feedback customers provide on the purchased child products, so that customers can be served with recommendations of better fitting product sizes.

Datasets

While defining the problem in the previous section, we assumed we have transactional data of the form (customer, child product, fit feedback).

For any online retailer, it is easy to build this kind of dataset; however, that was not the case for me.

After a lot of searching on the web, I came across the ModCloth and RentTheRunWay websites, which provide the data signals needed to address this problem.

These datasets contain self-reported fit feedback from customers as well as other side information like reviews, ratings, product categories, catalog sizes, customers’ measurements etc.

I extracted the datasets using Python’s Selenium package and released them publicly as part of my research.

You can learn more about them on their Kaggle page.

Where it started

The problem we are trying to tackle was named the Product Size Recommendation problem by researchers from Amazon India in 2017.

They focused on creating a model for recommending shoe sizes to customers.

Since the fit of shoes is arguably judged along one or two dimensions (length and/or width), shoes were a good starting point from the modeling perspective.

Let’s understand their proposed model.

A Simple Model

For every customer and child product, we consider a latent variable and say it denotes their true size.

The true size of a child product would be different from its catalog size assigned by the retailer due to sizing variation across different brands.

If we are able to learn the true sizes, the sizing of all the child products would be on the same scale, which in turn would make gauging fit easier.

Let u_c denote the true size of customer c and v_p denote the true size of child product p.

Intuitively, if there’s a transaction (c, p, Fit), then the true sizes u_c and v_p must be close, that is, |u_c−v_p| must be small.

On the other hand, if there’s a transaction (c, p, Small), then the customer’s true size u_c must be much larger than the child product’s true size v_p, or equivalently, u_c−v_p must be large.

Lastly, for (c, p, Large) type of transactions, v_p−u_c must be large.

To quantify the fit of a child product on a customer, a fit score is defined for each transaction t as:

f_w(t) = w (u_c − v_p)    (equation 1)

where w is a non-negative weight parameter.

Since this fit score is on a continuous scale and the end goal is to classify transactions based on this fit score into one of the three fit classes, the researchers used ordinal regression for modeling.

Conceptually, we define two threshold parameters b_1 and b_2 that divide the continuous scale into three sections corresponding to three fit classes.

We can assign meaning to this division such that a fit score greater than b_2 corresponds to Small, a score less than b_1 corresponds to Large, and scores in between b_1 and b_2 correspond to Fit.

For each of the three segments, we can treat scores greater than a threshold as the positive class and scores less than a threshold as the negative class.

Solving these three binary classification problems would then tell us which class a transaction belongs to.
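The thresholding described above can be sketched in a few lines. This is a minimal illustration, assuming the fit score takes the form w (u_c − v_p); the function names `fit_score` and `fit_class` are hypothetical and not taken from the original work.

```python
def fit_score(u_c: float, v_p: float, w: float = 1.0) -> float:
    """Fit score of one transaction, assumed here to be w * (u_c - v_p)."""
    if w < 0:
        raise ValueError("w must be non-negative")
    return w * (u_c - v_p)


def fit_class(score: float, b_1: float, b_2: float) -> str:
    """Map a continuous fit score to Small / Fit / Large using b_1 < b_2."""
    if not b_1 < b_2:
        raise ValueError("thresholds must satisfy b_1 < b_2")
    if score > b_2:
        return "Small"  # customer's true size is well above the product's
    if score < b_1:
        return "Large"  # product's true size is well above the customer's
    return "Fit"
```

For example, with thresholds b_1 = -1 and b_2 = 1, a customer of true size 10 buying a child product of true size 8 gets a score of 2 and falls into the Small segment.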

To learn the latent variables u_c and v_p, we now just need two more things: a loss function to optimize and an optimization technique.

The Amazon authors use the Hinge Loss for each of the binary classification problems in ordinal regression.

Hinge Loss is known to maximize the margin between the classifier’s decision boundaries.

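As a result, the objective function for any transaction t could be written as:

L(Y_t, f_w(t)) =
  max(0, 1 − (f_w(t) − b_2))   if Y_t = Small
  max(0, 1 + (f_w(t) − b_2)) + max(0, 1 − (f_w(t) − b_1))   if Y_t = Fit
  max(0, 1 + (f_w(t) − b_1))   if Y_t = Large
(equation 2)

The overall loss is simply the sum of losses for each transaction and is minimized when, for any transaction t with a fit outcome Y_t, f_w(t) > b_2 for Y_t = Small, f_w(t) < b_1 for Y_t = Large and b_1 < f_w(t) < b_2 for Y_t = Fit.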
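A rough sketch of this per-transaction loss, following the threshold convention stated earlier (scores above b_2 are Small, scores below b_1 are Large). It assumes the standard hinge loss max(0, 1 − y·s) for each binary problem; the helper names are hypothetical.

```python
def hinge(label: int, margin: float) -> float:
    """Standard binary hinge loss for a label in {+1, -1}."""
    return max(0.0, 1.0 - label * margin)


def ordinal_hinge_loss(score: float, outcome: str, b_1: float, b_2: float) -> float:
    """Loss of one transaction given its fit score and reported outcome."""
    if outcome == "Small":
        # The score should lie above the upper threshold b_2.
        return hinge(+1, score - b_2)
    if outcome == "Large":
        # The score should lie below the lower threshold b_1.
        return hinge(-1, score - b_1)
    if outcome == "Fit":
        # The score should lie below b_2 and above b_1.
        return hinge(-1, score - b_2) + hinge(+1, score - b_1)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

The loss is zero only when the score sits on the correct side of the relevant thresholds by a margin of at least 1, which is what pushes the three segments apart.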

The authors use Stochastic Gradient Descent to optimize the objective.
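To make the optimization step concrete, here is a simplified sketch of one SGD pass for the single-latent-variable model. It assumes a score of the form w (u_c − v_p) and hinge-based ordinal loss, and for brevity it updates only the true sizes while keeping w, b_1 and b_2 fixed; that simplification and all names are assumptions for illustration.

```python
import random


def score_subgradient(score, outcome, b_1, b_2):
    """Subgradient of the ordinal hinge loss with respect to the fit score."""
    # Each outcome contributes one or two binary hinge terms (label, threshold).
    terms = {
        "Small": [(+1, b_2)],
        "Fit": [(-1, b_2), (+1, b_1)],
        "Large": [(-1, b_1)],
    }[outcome]
    grad = 0.0
    for label, threshold in terms:
        if 1.0 - label * (score - threshold) > 0.0:  # hinge term is active
            grad -= label
    return grad


def sgd_epoch(transactions, u, v, w=1.0, b_1=-1.0, b_2=1.0, lr=0.1, seed=0):
    """One pass of SGD over (customer, child product, outcome) triples.

    Only the true sizes in the dicts u and v are updated, in place.
    """
    order = list(transactions)
    random.Random(seed).shuffle(order)
    for c, p, outcome in order:
        g = score_subgradient(w * (u[c] - v[p]), outcome, b_1, b_2)
        u[c] -= lr * g * w  # d(score)/d(u_c) = w
        v[p] += lr * g * w  # d(score)/d(v_p) = -w
```

Running a few epochs on a single (customer, product, Small) transaction pushes u_c up and v_p down until the score clears b_2 with margin.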

Predictions and Recommendations

Since recommending products is not possible in an offline evaluation of the model, the authors consider the ability of the model to predict the fit outcome of unseen transactions as a proxy for the model’s recommendation performance.

To that end, they feed the learned latent features of customers and child products into standard classifiers like Logistic Regression Classifier and Random Forest Classifier to produce the fit predictions.

Challenges

Although the aforementioned model would work well on a shoe dataset, it might not be flexible enough to address the following challenges:

Clothing products like dresses and shirts have relatively more dimensions along which the fit is determined.

Furthermore, fit preference might vary across different product categories for each customer, for example, customers might prefer a jacket to be a little loose whereas a wet suit to be more form fitting.

Thus, a single latent feature for each child product and customer might not be enough to capture all the variability in the data.

Customers’ fit feedback is unevenly distributed as most transactions are reported as Fit, so it is difficult to learn when the purchase would not be Fit.

Standard classifiers are not capable of handling the label imbalance issue and produce biased classifications, i.e. in this case, the Small and Large classes will have poor estimation rates.

How can we improve?

In our research work, we tackle the aforementioned challenges.

To address the first challenge, we consider multiple latent features for each customer and child product.

Intuitively, this enables us to capture customers’ fit preferences on various product aspects (like shoulders, waist etc.).

To address the second challenge, we take the help of metric learning techniques with prototyping.

The following diagram presents an overview of the framework:

Let us dive into the methodology.

Learning Semantics of Fit

To model the semantics of fit, we decompose the fit feedback signal using a latent factor model formulation (which ultimately helps us extract more informative features from the data).

To that end, we define the fit score as:

f_w(t) = ⟨w, α ⊕ b_t_c ⊕ b_t_pp ⊕ (u_t_c ⊙ v_t_p)⟩    (equation 3)

where ⟨·,·⟩ denotes the inner product with the weight vector w, subscript pp denotes parent product, u and v are k-dimensional latent features, α is a global bias term, ⊕ denotes concatenation and ⊙ denotes element-wise product.

The bias term b_t_pp captures the notion that certain parent products tend to be reported more unfit because of their inherent features/build whereas the bias term b_t_c captures the notion that certain customers are highly sensitive to fit while others could be more accommodating.
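As a rough illustration of how such a score could be computed, the sketch below concatenates the global bias, the customer bias, the parent product bias and the element-wise product of the two latent vectors, then takes a weighted sum. Treating the score as an inner product with a weight vector w is an assumption based on the description above, and the function name is hypothetical.

```python
def latent_factor_fit_score(w, alpha, b_c, b_pp, u_c, v_p):
    """Weighted sum over [alpha, b_c, b_pp] concatenated with u_c * v_p (element-wise)."""
    if len(u_c) != len(v_p):
        raise ValueError("u_c and v_p must have the same dimension k")
    # Concatenation of the bias terms with the element-wise product.
    features = [alpha, b_c, b_pp] + [a * b for a, b in zip(u_c, v_p)]
    if len(w) != len(features):
        raise ValueError("w must have k + 3 entries")
    return sum(w_i * f_i for w_i, f_i in zip(w, features))
```

With k = 2, u_c = [1, 2] and v_p = [3, 4], the element-wise product is [3, 8], so each latent dimension contributes its own term to the score instead of a single size difference.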

Although we will be able to learn good features from this formulation, one tricky thing is that the order between the fit scores for different catalog sizes of the same parent product is not guaranteed to be consistent.

This could render our model useless.

We want that, if a child product is Small (respectively Large) for a customer, all the smaller (larger) sizes of the corresponding parent product should also be Small (Large).

Looking at equation 3, we see that the fit score for a parent product pp and customer c only varies based on the child product’s latent features v_p.

So to resolve this issue, it would be enough to enforce that all the latent features of a child product p are strictly larger (smaller) than the latent features of the next smaller (larger) catalog product p- (p+) if a smaller (larger) size exists.

Now that we have formulated the fitness of a transaction, we can write our objective function as follows:

minimize Σ_t L(Y_t, f_w(t))   subject to   v_(p-) < v_p < v_(p+)    (equation 4)

This is similar to equation 2, except that we have changed the definition of the fit score and added monotonicity constraints.

We can optimize this objective using Projected Gradient Descent, which is similar to Stochastic Gradient Descent except that the constraints are enforced after every update.
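One simple way to enforce the monotonicity constraints after a gradient update is sketched below. It walks through the child products of one parent from smallest to largest catalog size and clips each latent feature to stay above the previous size. This sequential clipping is an illustrative assumption: it is simpler than an exact Euclidean projection and is not necessarily the procedure used in the paper.

```python
def enforce_monotonicity(child_vectors, eps=1e-6):
    """Return latent vectors that strictly increase with catalog size.

    child_vectors: list of latent vectors for one parent product,
    ordered from the smallest to the largest catalog size.
    """
    if not child_vectors:
        return []
    projected = [list(child_vectors[0])]
    for vec in child_vectors[1:]:
        prev = projected[-1]
        # Each feature must exceed the same feature of the next smaller size.
        projected.append([max(x, lo + eps) for x, lo in zip(vec, prev)])
    return projected
```

Calling this on every parent product after each update keeps the ordering of fit scores across sizes consistent while leaving already-ordered features untouched.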

Handling Label Imbalance

To handle the label imbalance issue (i.e. Fit labels being far more frequent in the data than Small and Large), we resort to a metric learning technique combined with prototyping.

The goal of prototyping techniques is to create a certain number of “prototypes” from the available data such that they are representative of the data.

A prototype could be some key data sample from the dataset or could be a combination of several data samples.

Usually, the number of prototypes created is far smaller than the number of data samples in the dataset.

Briefly, our proposed prototyping technique first alters the training data distribution by re-sampling from different classes, which is shown to be effective in handling label imbalance issues.

Subsequently, we employ the Large Margin Nearest Neighbor (LMNN) metric learning technique that improves the local data neighborhood by moving transactions having the same fit feedback closer and having different fit feedback farther, thus helping the k-NN method classify better.
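The re-sampling idea can be illustrated with a generic class-balancing sketch. This is plain random under- and over-sampling to a fixed number of transactions per class, shown only to convey the intuition; it is not the prototyping heuristic from the paper, and the function name and `per_class` parameter are hypothetical.

```python
import random
from collections import defaultdict


def resample_balanced(transactions, per_class, seed=0):
    """Draw `per_class` transactions from each fit class.

    transactions: iterable of tuples whose last element is the fit label.
    Large classes are sampled without replacement, small ones with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for t in transactions:
        by_class[t[-1]].append(t)
    balanced = []
    for items in by_class.values():
        if len(items) >= per_class:
            balanced.extend(rng.sample(items, per_class))
        else:
            balanced.extend(rng.choices(items, k=per_class))
    return balanced
```

After re-sampling, Small and Large transactions are as frequent as Fit ones, so a neighborhood-based classifier is no longer dominated by the majority class.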

Pictorially, the process could be depicted as:

Metric Learning Technique

The goal of metric learning is to learn a distance metric D such that D(k,l)<D(k,m) for any training instance (k,l,m) where transactions k and l are in the same class and k and m are in different classes.

In this work, we use the LMNN metric learning approach that apart from bringing transactions of the same class closer, also aims at maintaining a margin between transactions of different classes.

This ultimately improves the classification.

Specifically, LMNN does this by:

Identifying the target neighbors for each transaction, where target neighbors are those transactions that are desired to be closest to the transaction under consideration (that is, transactions from the same class).

Learning a linear transformation of the input space such that the resulting nearest neighbors of a transaction in the transformed space are indeed its target neighbors.

The final classification in LMNN is then given by applying k-NN in the transformed (metric) space.

The distance measure D used by LMNN is the Mahalanobis distance.
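To show what classifying in the transformed space looks like, the sketch below computes a Mahalanobis-style distance as the Euclidean distance after applying a linear map, and then runs a plain k-NN vote. The matrix `L` is supplied by hand here purely for illustration; in LMNN it would be learned from the data. All names are hypothetical.

```python
import math
from collections import Counter


def transform(L, x):
    """Apply the linear map L (a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def mahalanobis(L, x, y):
    """Distance between x and y measured after the linear transformation L."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(z * z for z in transform(L, diff)))


def knn_predict(L, train, query, k=3):
    """Majority vote among the k nearest training points in the transformed space.

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: mahalanobis(L, item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A transformation that shrinks an uninformative dimension can change which neighbors are closest, and therefore the predicted fit class, compared to using the plain Euclidean distance.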

Prototyping

One caveat in LMNN is that it fixes the k target neighbors for each transaction before it runs.

This allows constraints to be defined locally.

However, this also makes the method very sensitive to the ability of the Euclidean distance to select relevant target neighbors.

To mitigate this limitation of Euclidean distances and tackle label imbalance issues, we develop a heuristic that provides a good representation for each class by reducing noise from outliers and other non-contributing transactions (like the ones which are too close to the centroid of their respective class or to already selected transactions) by carefully sampling transactions.

You can access the detailed algorithm from our research paper.

Experiments and Results

Experimental Setup

We experimented with and compared the following five methods:

(a) 1-LV-LR: The method proposed by Amazon for shoe size recommendation, as described above.

(b) K-LV-LR: A simple extension of 1-LV-LR where we consider the latent features of each customer and child product to be K-dimensional.

Everything else remains the same.

(c) K-LF-LR: The proposed latent factor variation given in the “Learning Semantics of Fit” section.

We feed the learned factors directly into a Logistic Regression Classifier as features to produce the fit outcome.

(d) K-LV-ML: This method is similar to K-LV-LR, with the difference that it uses the proposed metric learning approach, instead of Logistic Regression, to produce the final fit outcome.

(e) K-LF-ML: This is the method proposed by us.

These methods are designed to evaluate:

The effectiveness of capturing fit semantics over true sizes.

The importance of learning good latent representations.

The effectiveness of the proposed metric learning approach in handling label imbalance issues.

Results

We gauge the performance of all the methods based on the Average AUC metric.

Average AUC is nothing but the average of AUC scores for individual classes.
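As a small worked example of the metric, the sketch below computes a one-vs-rest AUC for each class by comparing every positive/negative pair of scores and then averages the results. Using one-vs-rest per class is an assumption about how the per-class AUC is obtained, and the function names are hypothetical.

```python
def binary_auc(scores, positives):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_auc(class_scores, labels):
    """Average of one-vs-rest AUCs.

    class_scores: dict mapping each class to its per-transaction scores.
    labels: true class of each transaction, in the same order.
    """
    aucs = [
        binary_auc(scores, [y == cls for y in labels])
        for cls, scores in class_scores.items()
    ]
    return sum(aucs) / len(aucs)
```

Because each class contributes equally to the average, a model that only does well on the dominant Fit class cannot reach a high Average AUC.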

From the table, we observe our proposed model hugely improves upon the model proposed for shoe size recommendation (e vs a).

This could be attributed to learning fit semantics over true sizes.

We also observe that the improvements on ModCloth are relatively smaller than improvements on RentTheRunWay.

This could be due to ModCloth having relatively more cold products and customers (products and customers with very few transactions) compared to RentTheRunWay (see statistics on dataset page).

We also notice that metric learning approaches do not significantly improve performance when using representations from the K-LV method (d vs b).

This underscores the importance of learning good representations.

Finally, we see that K-LF-ML substantially outperforms all other methods on both datasets.

Besides learning good representations, good performance of K-LF-ML could also be ascribed to the ability of the proposed metric learning approach in handling label imbalance issues as depicted in the left-hand side graph above.

Furthermore, the right-hand side graph depicts how K-LF-ML performs in cold-start and warm-start scenarios.

For cold products, we notice that K-LF-ML consistently performs better than 1-LV-LR, although both methods perform slightly worse there than they do overall.

As we consider products with more transactions, K-LF-ML improves quickly whereas the performance of 1-LV-LR improves significantly only when sufficiently many samples are given.

Concluding Remarks

Hopefully, this post gives a good overview of the current research going on in the Product Size Recommendation domain.

You can find the implementation of the methods described in this post in this GitHub repo of mine.

This is a very practical problem to solve with good implications and has several open directions.

For example, one direction for further improvement could be to utilize the reviews to improve the interpretability of the model, since it is currently hard to understand what each latent dimension corresponds to.

This is possible by integrating a language model with the latent factor model to assign a specific meaning to each latent dimension (denoted by the corresponding topic) as done in this paper.

Let me know in the comments if you have some other ideas for improvement and we can discuss.
