But I love Open Source — Why would I need an AI platform?

10 reasons why enterprises invest in AI platforms

Oscar D. Lara Yejas
Jul 8

It has never been easier to build a machine learning (ML) model.
A few lines of R or Python code will suffice for such an endeavor, and there’s a plethora of resources and tutorials online for training even a complex neural network. The best part: this is the case for virtually every AI algorithm you can think of, and almost every new advance in AI research is accompanied by a corresponding open source implementation.
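To make that concrete, here is a minimal sketch of just how few lines it takes — using scikit-learn with its bundled iris dataset (the dataset and the choice of a random forest are purely illustrative):

```python
# A classifier in a handful of lines (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the training split, then score on the held-out split.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

That really is the whole training loop — which is exactly why model building is the easy part of the journey.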
However, as I previously discussed in this article, AI turns out to be much more than building models. The journey, as a matter of fact, runs all the way from data collection through curation, exploration, feature engineering, model training, and evaluation to, finally, deployment.
The Hadoop ecosystem seems to be the go-to open source choice for collecting and aggregating different data sources.
Hive and HBase fit the bill for accessing, mixing, and matching multiple sources of data.
Now, for data preparation (i.e., curation, exploration, and feature engineering), Apache Spark can be really useful for slicing and dicing large datasets through SparkSQL, even leveraging in-memory processing to speed up response times.
Spark also gives you natural language processing (NLP) capabilities and feature extractors such as principal component analysis (PCA).
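Spark ML exposes PCA through the same estimator/transformer pattern scikit-learn popularized; since a Spark session isn’t needed to show the idea, here is a minimal sketch in scikit-learn on synthetic data (the shapes and component count are arbitrary):

```python
# Reducing 10 synthetic features to 3 principal components (illustrative).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 rows, 10 raw features

# fit_transform learns the components and projects the data onto them.
X_reduced = PCA(n_components=3).fit_transform(X)
```

The Spark version differs mainly in operating on a distributed DataFrame column of vectors instead of a NumPy array.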
But probably open source’s strongest suit is its massive variety of AI models.
scikit-learn, R, SparkML, TensorFlow, Keras, and PyTorch, among others, provide everything I — and any of my data scientist fellows — could ever dream of.
Finally, tools like Docker and Plumber ease the deployment of machine learning models as web services consumable through HTTP requests.
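What such tools automate is, at heart, wrapping a predict function in an HTTP endpoint. A standard-library-only sketch of that pattern follows — the predict function here is a hypothetical stand-in for a trained model, not anything from the libraries above:

```python
# Serving a "model" over HTTP with only the Python standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Hypothetical stand-in for a trained model's predict method.
    return {"label": "positive" if sum(features) > 0 else "negative"}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload, run the model, return the prediction.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def serve(port=8080):
    HTTPServer(("", port), PredictHandler).serve_forever()
```

A client would then POST `{"features": [1.0, 2.0]}` and get a JSON prediction back — which is the same consumption model the article describes, minus all the packaging, scaling, and versioning concerns that real deployment tools handle.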
So wait a second… does that mean that one could build an enterprise-ready, end-to-end AI system solely using the open source stack?

Not really.
This may be the case for building a proof of concept (POC).
Back in 2012, as part of my dissertation, I built a Human Activity Recognition system (including this mobile app) purely under the umbrella of open source — thank you Java, Weka, Android, and PostgreSQL! For the enterprise, nevertheless, the story is quite a bit different.
And don’t get me wrong.
I’m not only a big fan but also an avid user of open source myself, and I do realize there are many fantastic tools; but at the same time, there are quite a few gaps.
Let me share some of the reasons why enterprises invest in AI platforms.
And, let me do so by telling you some of my most painful moments as an ML and AI practitioner.
Open source integration, up and running, and version updates

The last time I tried to install TensorFlow on my machine, it broke my Apache Spark configuration.
A few minutes later, when I thought I had fixed Spark, I realized TensorFlow was toast. Again!

Long story short: it took me an entire afternoon to get these two beasts up and running.
Imagine how many things could go wrong when so many other tools need to coexist within a data scientist’s working environment: Jupyter, R, Python, XGBoost, Hadoop, Spark, TensorFlow, Keras, PyTorch, Docker, Plumber, and the list goes on and on.
Now consider that all of these tools have new releases every other month so frequent updates will be needed.
Did someone say conflicts? Ugh! Let a platform handle that for me.
Collaboration

Now I’m working with my team of three data scientists.
We’re using our favorite Jupyter+Python+Spark+TensorFlow environment and decided to create a GitHub repository to share our code and data assets.
But I’m using custom packages on my environment — which took me a while to install and configure — and my colleagues can’t access them.
They will have to go through the same pain I endured and install the packages on their own machines.
Okay, but then, how do we share our deployed models? Creating Docker images or PMML files doesn’t sound sexy at all! And how about sharing our predictions, experiments, and evaluations (confusion matrices, ROC curves, RMS, etc.)?

Ugh! Let the platform handle that for me.
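Those evaluation artifacts are simple enough to compute — the hard part is versioning and sharing them consistently across a team. For reference, here is what one of them, a confusion matrix, amounts to in plain Python (the labels and predictions are made up for illustration):

```python
# A confusion matrix from scratch: rows are true labels, columns predicted.
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    counts = Counter(zip(y_true, y_pred))  # tally (true, predicted) pairs
    return [[counts[(t, p)] for p in labels] for t in labels]

y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(y_true, y_pred, labels=["spam", "ham"])
```

Computing it is trivial; attaching it to the right model version, experiment, and dataset so a colleague can find it is what a platform buys you.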
Data virtualization

One of the most common challenges in the enterprise is that datasets are scattered across multiple systems.
A typical solution is to copy data over into central data stores — there you have your data lake — for running your favorite analytics.
However, this approach is clearly costly and error prone.
Data virtualization allows you to query data across many systems without any replication, simplifying the data collection pipeline. The open source stack doesn’t offer a data virtualization solution.
So again, let’s have the platform handle that for us.
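Real data virtualization engines are far more sophisticated, but the core idea — one query spanning physically separate stores, with no copying — can be illustrated as a toy with SQLite’s ATTACH, using two independent in-memory databases as stand-ins for two enterprise systems:

```python
# Toy illustration: one SQL query joining tables that live in two
# separate databases, without replicating either one.
import sqlite3

conn = sqlite3.connect(":memory:")                 # "orders" system
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # separate "CRM" system

conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 99.5)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Ada')")

# A single query federates both stores.
rows = conn.execute(
    "SELECT c.name, o.total FROM crm.customers c "
    "JOIN orders o ON c.id = o.customer_id"
).fetchall()
```

A virtualization layer does the same thing across heterogeneous systems — Hive, relational databases, object stores — pushing work down to each source instead of pulling everything into a lake first.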
Governance and security

This is a key concern for the enterprise and also an area where open source leaves a huge void. Assets need governance, and I’m not only referring to the data but also to the code, models, predictions, environments, and experiments! Should all data scientists know what criteria are used to approve or deny a loan to a customer? Or what factors contribute to flagging a transaction as potentially fraudulent? Certainly not!