Data Structure Evaluation to Choose the Optimal Machine Learning Method
A set of examples on how analysis of data interdependencies helps save time while solving a regression task.
Sergey Burukin, Feb 27
Image source: Evaneos
There is no single ML method.
To choose the one that suits your purposes, you as a developer need to understand the nature of the data that will be used in the project.
In this post, I’ll share my experience with machine learning system development, describing the steps of choosing the optimum prediction model.
I was lucky to conduct my research on a live project: Arcbazar, the largest marketplace for designers and architects.
This is a competition platform that lets customers who need a home built or remodeled get the architectural design they want within their budget.
It is cheaper than ordering from a design studio, and customers can also choose from a large pool of design projects submitted by participants from all over the world.
At the same time, Arcbazar gives designers of any professional level a chance to get recognition, show their creativity, and win prize money.
Challenge
The task was to create an AI-driven award suggestion system for the marketplace that helps customers decide on the award for the winning designer.
For each project, a customer is free to set any award, as long as it is not below the minimum required award.
Since all people are different, we should take into account that a customer's decision rests on a mix of subjective motives: budget, mood, time limits, needs, estimations, expectations, etc.
This means that the Machine Learning system has to solve a kind of socio-psychological task.
Data Preparation
The platform has a large database of completed competitions in which designers received their awards.
This database became a source of knowledge for the Machine Learning system.
The database structure mirrors the fields of the form that a customer must fill in before starting a design competition.
This form has seven fields: three drop-down menus, three text fields (string type), and a numeric one (SET DEADLINE).
In the final field of the form (START YOUR COMPETITION), a client sets an award price.
The award amount is a function of all the project feature fields, and it varies in a quasi-continuous way.
In machine learning theory, this type of task is called regression.
The form fields together can be expressed as a linear equation:
Y = w1·a + w2·b + w3·c + w4·d + w5·e + w6·f + w7·g
where:
Y is the award amount;
a, b, c are variables that represent the project's features from the drop-down menus;
d, e, f are variables that represent the text description fields;
g is a variable that represents the number of days;
w1 … w7 are the coefficients, or parameters, of the equation.
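The linear form above can be sketched in a few lines of Python. This is only an illustration: the feature values and the weights below are hypothetical, and in practice the weights would be fitted to the historical data rather than chosen by hand.

```python
def predict_award(features, weights):
    """Linear model: the award is a weighted sum of the seven form features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical encoded form: a, b, c (drop-down indexes), d, e, f (text fields
# reduced to numbers), g (number of days).
features = [2, 0, 1, 120, 45, 300, 14]
# Hypothetical weights w1..w7, made up for this sketch.
weights = [50.0, 30.0, 20.0, 0.5, 0.5, 0.25, 4.0]

award = predict_award(features, weights)
print(award)
```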
Analyzing the data of the form fields, I distinguished three classes:
Structured drop-down menus (the first three plots);
Unstructured description fields;
A numeric field.
Values of drop-down lists usually have indexes.
I replaced the text values with their indexes.
To minimize calculation time in the first stage of development, I replaced the three text fields with a single numeric field holding their total character count.
This simplification lets us keep a bigger dataset.
In hindsight, this additional field had a small positive effect on how well the model fit the dataset.
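The preparation steps above could look roughly like this in Python. The option lists and field names are hypothetical, since the article does not list the actual form values; only the idea (drop-down text replaced by its index, three text fields collapsed into one character count) comes from the text.

```python
# Hypothetical drop-down options; the real form's values are not given here.
PROJECT_TYPES = ["residential", "commercial", "landscape"]
SPACE_TYPES = ["interior", "exterior", "whole building"]
SIZES = ["small", "medium", "large"]


def encode_form(form):
    """Turn one filled-in form into the 5 numeric features."""
    # Collapse the three text fields into a single total character count.
    text_length = sum(len(form[key]) for key in ("title", "description", "notes"))
    return [
        PROJECT_TYPES.index(form["project_type"]),  # drop-down text -> index
        SPACE_TYPES.index(form["space_type"]),
        SIZES.index(form["size"]),
        text_length,
        form["deadline_days"],
    ]


example = {
    "project_type": "residential",
    "space_type": "exterior",
    "size": "large",
    "title": "Attic",
    "description": "Convert attic into a bedroom",
    "notes": "",
    "deadline_days": 21,
}
row = encode_form(example)
print(row)
```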
The transformed dataset can be expressed by a 5-variable formula: the three drop-down indexes (a, b, c), the total character count of the text fields, and the number of days (g).
Step 2. Choosing a Machine Learning Method
In this step, I wanted to find the optimal Machine Learning method through a series of experiments using the Scikit-learn library for the Python programming language.
During the tests, I varied the split ratio for the 5-feature dataset, setting aside from 10% to 50% of it as the testing subset.
All evaluations were also made on both normalized and non-normalized data.
Normalization did not give a perceptible increase in model accuracy.
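An experiment loop of this kind might be sketched as follows with Scikit-learn. The data is a synthetic stand-in (the real dataset is not public), and standardization via `StandardScaler` is an assumption about what "normalized" means here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 5-feature dataset; not the real marketplace data.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(400, 5)).astype(float)
y = X @ np.array([3.0, 2.0, 1.0, 0.5, 0.2]) + rng.normal(0, 1.0, size=400)

results = {}
for test_size in (0.1, 0.2, 0.3, 0.4, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    raw = LinearRegression().fit(X_train, y_train)
    scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
    # .score() returns R-squared on the held-out subset.
    results[test_size] = (raw.score(X_test, y_test), scaled.score(X_test, y_test))

for test_size, (r2_raw, r2_scaled) in results.items():
    print(f"test_size={test_size}: raw={r2_raw:.4f}, scaled={r2_scaled:.4f}")
```

For ordinary least squares, standardizing the features does not change the fitted predictions, so the two scores agree up to rounding. Other estimators, such as neural networks, can react to scaling differently.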
I started with the basic method for prediction of a continuous-valued attribute associated with an object.
This method is called Linear Regression.
However, comparing the predicted awards with the real awards gave a coefficient of determination R-squared = 0.
In regression, this coefficient is a statistical measure of how well the regression predictions approximate the real data points.
An R-squared of 1 indicates that the regression predictions perfectly fit the data.
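For reference, the coefficient of determination can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch with made-up numbers; Scikit-learn provides the same metric as `sklearn.metrics.r2_score`.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


# Made-up award values, for illustration only.
perfect = r_squared([100, 200, 300], [100, 200, 300])   # predictions match exactly
baseline = r_squared([100, 200, 300], [200, 200, 300 - 100])  # always predicts the mean
```

A model that always predicts the mean of the real values scores 0, and a model that does worse than that gets a negative score.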
Lasso Regression, another linear method, gave a distribution very close to that of Linear Regression, with an R-squared of 0.
Stepping aside from linear models, I decided to try an Artificial Neural Network.
A Multi-Layer Perceptron, even with 500 neurons in its hidden layer, showed the weakest result in this series of experiments, with R-squared = 0.
from sklearn.neural_network import MLPRegressor

# One hidden layer of 500 units, trained with the Adam solver.
mlpreg = MLPRegressor(hidden_layer_sizes=(500,), activation='relu', solver='adam',
                      alpha=0.001, batch_size='auto', learning_rate='adaptive',
                      learning_rate_init=0.5, max_iter=1000, shuffle=True, random_state=9,
                      tol=0.0001, verbose=False, warm_start=False, momentum=0.9,
                      nesterovs_momentum=True, early_stopping=False,
                      validation_fraction=0.999, epsilon=1e-08)

However, our dataset has a structure that corresponds to the multi-level drop-down lists.
This visualization can help understand the dataset interconnections.
One client's choice is a decision tree; many choices make a decision forest.
A customer moves from branch to branch (on the illustration: levels 1–2–3) in their "decision tree" while choosing values from the drop-down lists.
That’s why Decision Tree Regression gave a bigger R-squared (0.
The Random Forest Regressor from Scikit-learn's ensemble module, a more sophisticated development of the Decision Tree Regressor, gave the best result, with R-squared = 0.
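The comparison can be reproduced in spirit with a short Scikit-learn sketch. The data below is synthetic and built on an assumption: three index-encoded "drop-down" levels, where each combination has its own base price unrelated to the index order. The scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: three index-encoded drop-down levels (an assumption).
rng = np.random.default_rng(42)
n = 600
level1 = rng.integers(0, 3, size=n)
level2 = rng.integers(0, 4, size=n)
level3 = rng.integers(0, 5, size=n)
X = np.column_stack([level1, level2, level3]).astype(float)

# Each combination of choices gets its own made-up base price, plus noise.
base_price = rng.uniform(100, 1000, size=(3, 4, 5))
y = base_price[level1, level2, level3] + rng.normal(0, 20, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
# Fit each model and record its R-squared on the held-out subset.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: R-squared = {score:.3f}")
```

Tree-based models split on the index values directly, so they can follow branch-specific prices that a single weighted sum of indexes cannot express.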
Conclusion
Understanding the interdependencies in the data helps choose the most fitting algorithm.
The Decision Tree and Random Forest methods share the same basis, which is much closer to the nature of the dataset than the other methods.
The set of data fields in the chosen dataset is not enough to get a better fit.
This suggests the hypothesis that the text description fields contain a customer's hidden motives for setting an award.
In part 2 of the article, I will disclose my insights about upgrading the prediction system using Natural Language Processing techniques.