The Iceberg Secret in Machine Learning

Revealing the secret of AI projects in the real world

Isak Bosman, Mar 14

I first came across the Iceberg secret a few years back when reading a post by Joel Spolsky.
Joel does a great job at revealing the secret that many software projects face.
Since then I have come to realize that the same principle or secret translates to machine learning projects.
In a nutshell, the Iceberg secret speaks to the apparent gap between technical and non-technical stakeholders when it comes to evaluating the quality and progress of building an AI-based solution.
Often the solution is judged on the visualization of the data or the scalar output of the predictions (the 10%), and little regard is given to the bulk of the work spent on data preparation (the 90%).

Data and ML engineers understand the outsized value of carefully focusing on understanding and preparing the data, and how important it is to have a mature data pipeline in place before any time is spent on the top 10%.
Managing this process is usually the linchpin of a project’s success.
Often the technical team is instructed: “We don’t need to spend much time on the data. We can throw it into the AI. It’s AI after all. Don’t bias the system. AI will figure it out.”

The reality, on the other hand, is that most of the work is spent on:

- Designing a scalable and flexible data pipeline for adequate data ingestion
- Reviewing the data to ensure you understand its characteristics and how to approach the solution in the context of the overall goal
- Designing and implementing the cleaning, transforming and loading of the data for ML training
- Implementing a process that allows for continuous training, evaluation, and redeployment of the ML models

The secret is not that this challenge exists; the secret is that your client or project manager has no idea that most of what they understand or hear about AI’s value in the media is only as good as the underlying data.
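As a minimal sketch of the cleaning-and-transforming step from the list above, using pandas — the column names, sentinel values, and imputation rules here are purely hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data as it might arrive from ingestion:
# mixed types, missing values, and an impossible age of -1
raw = pd.DataFrame({
    "age": [25, np.nan, 47, 51, -1],
    "income": ["50000", "62000", None, "88000", "41000"],
    "label": [0, 1, 0, 1, 1],
})

def clean(df):
    """Cleaning step: fix types, drop impossible values, impute missing."""
    df = df.copy()
    df["income"] = pd.to_numeric(df["income"], errors="coerce")
    df = df[df["age"].isna() | (df["age"] > 0)]  # drop impossible ages
    df["age"] = df["age"].fillna(df["age"].median())
    df["income"] = df["income"].fillna(df["income"].median())
    return df

def transform(df):
    """Transform step: standardize features for ML training."""
    df = df.copy()
    for col in ["age", "income"]:
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    return df

ready = transform(clean(raw))
```

Keeping each step as a small, composable function like this makes it easier to slot them into a real pipeline later and to re-run them as new data arrives.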
Your most valuable skill (the skill you will get paid the most for) is your ability to manage this misperception effectively.
You are probably wondering: how do I manage the Iceberg effect? Here are some strategies I find go a long way.
Provide visualizations during the data exploration phase

Typically you will need to provide some documentation on your strategy for how you will approach the solution.
This is often presented to either the internal technical team or the client.
The audience is unlikely to understand what sparse data is, or whether it’s a classification, clustering, regression or ranking problem.
However, they do understand charts and graphs.
By visualizing the existing data set in its raw state, you are doing yourself two big favors:

- It makes the data easier to reason about between both technical and non-technical stakeholders
- It allows you to showcase the improvements once you have cleaned and transformed the data for ML training

Put extra effort whenever visualizing the data

Now that you know the top 10% is all that truly matters, it’s essential to make sure it looks great! Whether it’s a screenshot or a demo, put extra effort into the way you visualize the data.

For data scientists and ML engineers, matplotlib is excellent.
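A minimal matplotlib sketch of visualizing a raw data set before cleaning — the column and its sentinel/outlier values are made up for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical raw feature containing a -1 sentinel and a 999 outlier
values = np.array([25, 31, 47, 51, -1, 38, 29, 999, 44, 33])

fig, ax = plt.subplots()
ax.hist(values, bins=20)
ax.set_title("Raw 'age' values before cleaning")
ax.set_xlabel("age")
ax.set_ylabel("count")
fig.savefig("raw_age.png")
```

A plot like this makes the sentinel values and outliers immediately visible to non-technical stakeholders, and gives you a before/after comparison once cleaning is done.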
It’s easy to quickly visualize a confusion matrix, error rates, or a learning curve. Matplotlib is extremely powerful, and the community is mature, so there are plenty of examples available.
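For instance, a confusion matrix can be computed and plotted in a few lines of matplotlib — the labels and predictions below are made up for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical true labels and model predictions (3 classes)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often class t was predicted as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix(y_true, y_pred, 3)

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="Blues")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
for i in range(3):
    for j in range(3):
        ax.text(j, i, cm[i, j], ha="center", va="center")
fig.colorbar(im)
fig.savefig("confusion_matrix.png")
```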
Personally, though, I like to use libraries with more aesthetically pleasing graphs when showcasing to non-technical team members.
Here are some examples of Pygal, Seaborn, and Plotly.
[Example charts: Pygal, Seaborn, Plotly]

I like to use Plotly to display straightforward graphs.
I think it’s easy to make your graphs look great, but know that you are sacrificing some of the power of matplotlib.
Ultimately you should use the tool that you feel the most comfortable with, but that still looks great!

Plan for explainability

Explainability is hard.
It’s even harder when you use deep learning, where the decision-making is harder to interpret.
Still, it’s important to represent what you can visually.
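One common thing you can represent visually, even for an opaque model, is feature importance. A sketch using scikit-learn's RandomForestClassifier — the dataset and model here are illustrative assumptions, not from the original post:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real training set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances: a rough but easy-to-explain signal
importances = model.feature_importances_

fig, ax = plt.subplots()
ax.barh([f"feature_{i}" for i in range(5)], importances)
ax.set_xlabel("Importance")
ax.set_title("Which features drive the model's decisions?")
fig.savefig("importances.png")
```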
Track and visualize progress

What is arguably more critical is visualizing your architecture and neural network design to display changes and improvements as you find better architectures.
Again it makes things easier to reason about without having to explain the more intricate details.
To visualize neural networks, you can use TensorBoard from TensorFlow or the ANN Visualizer library.
Below is a screenshot of what it produces.

Originally published at www. com on March 14, 2019.