When Machine Learning Solutions Are Not Possible!
Five Scenarios Every Data Scientist Should Consider before Proposing Machine Learning Solutions.
Rasoul Banaeeyan
Jun 7

Introduction
There is a widespread belief among practitioners that Machine Learning (ML) solutions always lead to business improvement.
Although ML-based approaches have brought unique capabilities to businesses, there are circumstances in which relying on an ML solution might have a negative impact, or might not be possible at all.
The main objective of this article is to discuss different use cases in which employing ML does not fully address the targeted business problem.
It presents five scenarios and, for each one, suggests possible ways to address the problem.

Scenario 1: Insufficient Quantity of Training Data
The most straightforward reason not to use an ML solution is an inadequate quantity of data, which hinders the training of accurate models.
Even worse, in some cases inaccurate models generate essentially random prediction or classification results, which can call the whole business functionality into question.
The number of training instances needed depends on the nature of the business problem; as a rule of thumb, however, the dataset should contain at least 10k samples.
Solution:
A possible mechanism to cope with this problem is to apply data augmentation techniques, which are designed to create additional data from the existing samples.
For text, options include synonym replacement, random insertion, random swap, and random deletion, among others.
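Two of the text techniques named above, random swap and random deletion, can be sketched in a few lines. This is a minimal illustration only: the function names, the whitespace tokenization, and the example sentence are assumptions made here, not part of the original article.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with `n_swaps` random pairs of positions exchanged."""
    out = list(words)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word independently with probability `p`, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so repeated runs give the same variants
sentence = "the delivery arrived two days late".split()  # hypothetical sample
print(" ".join(random_swap(sentence, n_swaps=1, rng=rng)))
print(" ".join(random_deletion(sentence, p=0.2, rng=rng)))
```

Each call produces a new variant of the same sentence, so a small labelled corpus can be expanded while keeping the original labels.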
For images, the alternatives include horizontal/vertical flip, horizontal/vertical shift, random rotation, and random resizing, among others.
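The flip and shift operations can be illustrated on a tiny grayscale image. In this sketch a plain nested list stands in for an image array to keep the example dependency-free; the function names and the zero fill value are assumptions, and a real pipeline would more likely rely on an image library.

```python
def horizontal_flip(img):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in img]


def vertical_flip(img):
    """Mirror the image top to bottom by reversing the row order."""
    return [list(row) for row in img[::-1]]


def horizontal_shift(img, pixels, fill=0):
    """Shift content right by `pixels` (left if negative), padding with `fill`."""
    width = len(img[0]) if img else 0
    k = max(-width, min(width, pixels))  # clamp so oversized shifts blank the row
    shifted = []
    for row in img:
        if k >= 0:
            shifted.append([fill] * k + list(row[:width - k]))
        else:
            shifted.append(list(row[-k:]) + [fill] * (-k))
    return shifted


image = [[1, 2, 3],
         [4, 5, 6]]  # hypothetical 2x3 grayscale image
for variant in (horizontal_flip(image), vertical_flip(image), horizontal_shift(image, 1)):
    print(variant)
```

Every variant keeps the original image's label, which is what makes these transformations useful for enlarging a training set.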

Scenario 2: Lack of Significant Data Points
Given a large dataset, the next consideration before using ML as a solution is to explore the data points (fields, features, or attributes) and decide whether they are significant enough to be used as input to the model.
Here, significant refers to the discriminative power of the data points, as well as their independence and informativeness in defining the desired outcome.
Feature selection is a crucial part since it directly affects the overall performance of the model in terms of accuracy and reliability.
Solution:
A feasible solution for this issue is to generate new features by combining existing features or by categorizing them into predefined, more meaningful sets.
For instance, combining distance and time yields speed, which not only improves efficiency but may also provide a more distinctive data point for training the model.
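Both ideas, combining features and categorizing them, can be sketched as follows. The field names (`distance_km`, `duration_h`), the units, and the bucket thresholds are hypothetical choices made for this illustration.

```python
def add_speed(record):
    """Return a copy of `record` with a derived `speed_kmh` feature."""
    out = dict(record)
    hours = record["duration_h"]
    # Guard against division by zero for trips with no recorded duration.
    out["speed_kmh"] = record["distance_km"] / hours if hours > 0 else 0.0
    return out


def speed_bucket(speed_kmh):
    """Map a continuous speed onto a small set of predefined categories."""
    if speed_kmh < 30:
        return "slow"
    if speed_kmh < 80:
        return "medium"
    return "fast"


trip = {"distance_km": 120.0, "duration_h": 2.0}  # hypothetical record
enriched = add_speed(trip)
enriched["speed_class"] = speed_bucket(enriched["speed_kmh"])
print(enriched)
```

The derived `speed_kmh` column replaces two raw inputs with one that may separate the classes more directly, and `speed_class` shows the categorization variant of the same idea.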

Scenario 3: Impracticality of Updating the ML Model
The third probable obstacle to using an ML solution is the infeasibility of updating the model at frequent intervals.
An accurate initial model might suffice to obtain reliable predictions for a short span of time, but over longer periods its performance will be negatively affected by factors such as events, seasons, and changes in users' demographic or geographic information.
All of these necessitate a scheduled update process that regularly retrains the model on up-to-date data.
The problem is that accessing updated data is not always possible, since the data might be historical and obtained from one or more different sources.
Solution:
The only solution to this problem is to ensure continuous ingestion of data into the database, whether by means of real-time user data or offline user-generated data.
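Once data is ingested continuously, a scheduled retraining job needs two small decisions: whether the current model is old enough to retrain, and which records are recent enough to train on. The sketch below illustrates both under stated assumptions; the function names, the `timestamp` field, and the 30-day and 90-day windows are hypothetical.

```python
from datetime import datetime, timedelta


def needs_retraining(last_trained, now, max_age):
    """Return True when the model is at least `max_age` old."""
    return now - last_trained >= max_age


def recent_records(records, now, window):
    """Keep only records whose timestamp lies within the last `window`."""
    cutoff = now - window
    return [r for r in records if r["timestamp"] >= cutoff]


# Hypothetical ingested records and training date.
now = datetime(2019, 6, 7)
records = [
    {"timestamp": datetime(2019, 1, 1), "value": 1},
    {"timestamp": datetime(2019, 6, 1), "value": 2},
]
if needs_retraining(datetime(2019, 5, 1), now, max_age=timedelta(days=30)):
    fresh = recent_records(records, now, window=timedelta(days=90))
    print(len(fresh), "record(s) available for retraining")
```

If `recent_records` comes back empty or very small, that is a direct signal of the problem described in this scenario: the model cannot be refreshed because no up-to-date data is reaching the database.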

Scenario 4: Lack of Interpretability
Sometimes even the results of a highly accurate ML model are not easily understandable by end-users, which makes its adoption very difficult or impossible.
For instance, a clustering model with a very small within-cluster sum of distances might assign labels of 1, 2, 3, 4, and 5 to different data samples in a large dataset while offering no explanation of what the labels mean.
Or, in the case of decision trees, the resulting sub-groups may not be easy to interpret.
Solution:
Depending on the nature of the selected ML algorithm, different techniques can be used to make the final results easier to interpret.
For instance, in the case of clustering methods, it is recommended to keep the number of data points (features) below four or five so that it remains within human capability to visualize and/or interpret the underlying concept in each output cluster.
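With only a handful of features, one simple way to give numeric cluster labels a meaning is to report the average value of each feature per cluster. The sketch below assumes cluster labels have already been produced by some clustering step; the function name, feature names, and data are hypothetical.

```python
from collections import defaultdict


def cluster_profiles(rows, labels, feature_names):
    """Return {label: {feature: mean value}} so each label can be described."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    profiles = {}
    for label, members in grouped.items():
        profiles[label] = {
            name: sum(m[i] for m in members) / len(members)
            for i, name in enumerate(feature_names)
        }
    return profiles


# Hypothetical customers described by two features, with precomputed labels.
rows = [[25.0, 40.0], [27.0, 60.0], [61.0, 500.0]]
labels = [1, 1, 2]
for label, profile in sorted(cluster_profiles(rows, labels, ["age", "monthly_spend"]).items()):
    print(label, profile)
```

A profile such as "younger customers with low monthly spend" is easier for end-users to act on than the bare label 1, and it stays readable only while the number of features is small.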

Scenario 5: Slowness in Execution
Compared with traditional rule-based approaches, ML-based approaches involve several extra stages: data preprocessing and/or cleaning, feature selection and/or transformation, model selection and design, training and testing, etc.
All of these add to the overall execution time, and the effect is even greater for Deep Learning models because of the millions of parameters in their design.
In some cases, even a delay of a few seconds at prediction time is very undesirable and might immediately discourage end-users from using the application.
For instance, users of most real-time product/service recommendation systems expect a fast response, and this might be one of the most important factors in keeping them motivated to use the application.
Solution:
Firstly, the choice of the ML algorithm itself is the most influential factor contributing to the execution time.
As an example, a binary classifier implemented using a Support Vector Machine is often faster at both training and prediction time than a Multilayer Neural Network, although the gap depends on the dataset size and model configuration.
Secondly, the choice of the ML framework affects the overall speed of the model.
The best framework can normally be determined by experimentation.
Options include TensorFlow, Keras, Torch, Caffe, Theano, and Scikit-Learn, among others.
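Determining the faster option by experimentation can be as simple as timing the prediction call of each candidate on the same input. The harness below is a minimal sketch: `median_latency_ms` is a name introduced here, and the two predictors are hypothetical stand-ins rather than real models from any of the frameworks listed.

```python
import statistics
import time


def median_latency_ms(predict, sample, repeats=50):
    """Call `predict(sample)` `repeats` times and return the median latency in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def rule_based(features):
    """Hypothetical stand-in for a simple rule-based decision."""
    return features[0] > 0.5


def heavier_model(features):
    """Hypothetical stand-in for a costlier model: many multiply-adds."""
    total = 0.0
    for _ in range(1000):
        for value in features:
            total += 0.001 * value
    return total > 0.5


sample = [0.2, 0.9, 0.4]
for name, fn in [("rule_based", rule_based), ("heavier_model", heavier_model)]:
    print(f"{name}: {median_latency_ms(fn, sample):.4f} ms")
```

The median is used instead of the mean so that a single slow call, for example during warm-up, does not dominate the comparison. The same harness can wrap the prediction call of any trained model to compare algorithms or frameworks on the target hardware.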

In Conclusion
Machine Learning (ML) solutions, particularly those based on Deep Learning techniques, have transformed many aspects of business through their promising performance.
However, embedding ML-based features into services/products is not always feasible, mainly for five reasons: insufficient quantity of training data, lack of significant data points, impracticality of updating the ML model, lack of interpretability, and slowness in execution.
In this article, each item in this list is explained with an example and followed by a solution to cope with the challenge.
It is important to consider all of these challenging scenarios before proposing ML-based solutions to business problems, since early consideration saves the organization a significant amount of time and resources.
It is also worth highlighting that ML-based solutions are not always better than rule-based and conventional solutions, especially in cases where efficiency and speed matter more than accuracy and maintainability.