(Robot) data scientists as a service

Automating data science with symbolic regression and probabilistic programming.

Jacopo Tagliabue, Apr 12

How to be lazy (data) scientists and live happily ever after

“Progress isn’t made by early risers. It’s made by lazy men trying to find easier ways to do something.” - R. Heinlein

According to the gospel of Prediction Machines, the Artificial Intelligence revolution of recent years means mostly one thing: the economic cost of prediction is quickly decreasing, as libraries, cloud services and big data become widely available to practitioners, startups and enterprises.
While this story is certainly appealing as a “big picture” of the A.I. adoption landscape, the reality of applied Artificial Intelligence/Machine Learning/Data Science (pick your favorite buzzword) in several industries is very different when seen from the ground.
While digital datasets are indeed today much more easily available than before, real-life use cases still require highly skilled workers to use time and knowledge to interpret data patterns in a meaningful way.
Take the following chart showing revenue data as a function of cloud expenses:

Revenue data on the y-axis, as a function of cloud expenses on the x-axis (data points are machine generated, but they could easily come from any of our clients or many of our startup peers).
If the business needs to plan ahead and finalize the budget, understanding precisely the relation between revenues and expenses is obviously a very important topic.
How can we do it?

Well, one option is to have your data scientist(s) take a look at the data and build an explanatory model; that usually works, but obviously enough these are just two among the many variables the business is interested in, and what about those variables lingering around in the data lake that we do not know about yet?

In other words, data scientists are great (a truly unbiased estimate!) but they don’t necessarily scale: inside enterprises, even producing “exploratory analysis” is still a lot of work.
A second option is to cut all the fancy stuff and just apply one model to all problems: to the man with LinearRegression, everything looks like a slope. This is very convenient and it’s a respectable strategy in use in several AI-based solutions on the market:

“Automated data science” often just means pre-defined models for widgets in your dashboard (excerpt from a real industry white paper).
The problem with this second approach is obvious though: not everything is a straight line, which means that some “insights” may be noise and some patterns will go undetected; in other words, one model does not fit all.
Revenues vs cloud expenditure again. Not everything is a straight line, as the best fit here is a 2nd degree polynomial: the ROI for our (imaginary) infrastructure scales very well!

Can’t we do better?

We think we can: what if we could write one program that writes other programs analyzing those variables for us?

While the full solution used by our clients is well beyond the scope of this post, we will show how to combine ideas in probabilistic programming and symbolic regression to build a powerful meta-program that will write useful code for us (you can run the code written for this post here).
As the wise man said, any sufficiently advanced technology is indistinguishable from Elon Musk’s tweets.
A primer on symbolic regression

“94.7% of all statistics are made up.” - Anonymous Data Scientist

We will spend no more than three minutes introducing the intuition behind symbolic regression with a simple example (the reader familiar with it, or just not interested in the nerdy details, can safely skip to the next section).
Consider the following X-Y plot:

The familiar image of a scatterplot: what is the relation between X and Y?

Looking at the data, we can take out pencil and paper and start making some reasonable guesses on the relation between X and Y (even just limiting ourselves to simple polynomial options):

Y = bX + a (linear)
Y = cX^2 + bX + a (quadratic)

We measure which one is best and use what we have learned to produce even better estimates:

Comparing two hypotheses: R-squared is 0.65 and 0.

It seems that we can try an even higher-degree polynomial to achieve a better fit:

R-squared for a third-degree polynomial is 0.99 (it looks like overfitting but we swear it’s not).
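For readers who prefer code to pencil and paper, the comparison above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data points are made up (a noisy cubic, not the dataset plotted in this post), and `fit_polynomial`, `predict` and `r_squared` are hypothetical helper names introduced here.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) c = A^T y."""
    n = degree + 1
    # Augmented matrix of the normal equations: one row per coefficient.
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    # Coefficients, lowest degree first.
    return [m[i][n] / m[i][i] for i in range(n)]


def predict(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up data: a cubic trend plus Gaussian noise (NOT the article's dataset).
rng = random.Random(0)
xs = [i / 4 for i in range(-20, 21)]
ys = [0.5 * x ** 3 - x + rng.gauss(0, 2.0) for x in xs]

scores = {d: r_squared(xs, ys, fit_polynomial(xs, ys, d)) for d in (1, 2, 3)}
for degree, score in scores.items():
    print(f"degree {degree}: R-squared = {score:.3f}")
```

Because each polynomial family contains the previous one, R-squared on the training data can only go up with the degree, which is exactly why a better fit alone does not settle the question of overfitting.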
It sounds like a reasonable strategy, doesn’t it?

In a nutshell, symbolic regression is the automated version of what we did manually with a few functions and two “generations”. That is:

start with a family of functions that could fit the dataset at hand;
measure how well they are doing;
take the best performing ones and change them to see if you can make them even better;
repeat for N generations until satisfied.
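The four steps above can be sketched as a tiny evolutionary loop. To keep it short, this sketch makes a strong simplifying assumption: candidates are just coefficient triples (a, b, c) for Y = cX^2 + bX + a, so only parameters evolve, whereas a real symbolic regressor would also mutate the structure of the expression. The data, population sizes and mutation scale are all made up for illustration.

```python
import random

rng = random.Random(42)

# Made-up target data from Y = 2X^2 - 3X + 1 (noise-free to keep things simple).
xs = [i / 2 for i in range(-6, 7)]
ys = [2 * x * x - 3 * x + 1 for x in xs]


def error(candidate):
    """Mean squared error of the candidate (a, b, c) for Y = cX^2 + bX + a."""
    a, b, c = candidate
    return sum((c * x * x + b * x + a - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mutate(candidate, scale=0.3):
    """Return a slightly perturbed copy of a candidate."""
    return tuple(v + rng.gauss(0, scale) for v in candidate)


# 1. Start with a (random) family of candidate functions.
population = [tuple(rng.uniform(-5, 5) for _ in range(3)) for _ in range(50)]
initial_best = min(error(c) for c in population)

for generation in range(200):
    # 2. Measure how well they are doing.
    population.sort(key=error)
    # 3. Keep the best performers and change them to look for improvements.
    survivors = population[:10]
    population = survivors + [mutate(rng.choice(survivors)) for _ in range(40)]

# 4. After N generations, report the best candidate found.
best = min(population, key=error)
print("best (a, b, c):", tuple(round(v, 2) for v in best))
print("error:", round(error(best), 4), "vs initial best:", round(initial_best, 4))
```

Since the survivors are always carried over unchanged, the best error can never get worse from one generation to the next; how close it gets to zero depends on the (arbitrary) mutation scale and number of generations chosen here.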
Even with this toy example, it’s clear that fitting data patterns by intelligently exploring the space of possible mathematical functions has interesting upsides:

we don’t have to specify many assumptions to start with, as the process will evolve better and better candidates;
the results are readily interpretable (as we can produce insights such as “an increase of aX will lead to an increase of bY”), which means new knowledge is sharable across all business units.
As a downside, evaluating large populations of mathematical expressions can be time consuming — but that is not a problem for us: our robots can work at night and serve us predictions the next day (that’s what robots are for, right?).
The crucial observation for our purposes is that there is a fundamental trade-off between model expressivity, intelligent exploration and data fitting: the space of mathematical relations that could potentially explain the data is infinite — while complex models are more powerful, they are also prone to overfitting and, as such, should be considered after simpler ones fail.
Since relations are expressed in the language of math, why don’t we exploit the natural compositionality and expressivity of formal grammars to navigate this trade-off (yes, at Tooso we do love languages)?

This is where we combine the intuition of symbolic regression (automatically evolving models to get better explanations) with the generative power of probabilistic programming. Since models can be expressed as domain-specific languages, our regression task can be thought of as a special instance of “Bayesian program synthesis”: how can a general program write specific “programs” (i.e. mathematical expressions) to satisfactorily analyze unseen datasets?

In the next section we will build a minimal formal language to express functions and show how operations on language structures translate to models that efficiently explore the infinite space of mathematical hypotheses (the faithful reader may recall that we solved in a similar fashion the “sequence game” introduced in a previous post).
In other words, it’s now time to build our army of robots.
[ Bonus technical point: symbolic regression is usually done with genetic programming as the main optimization technique; a population of functions is randomly initialized and then algorithmic fitness dictates the evolution of the group towards expressions well suited for the problem at hand.
We picked a probabilistic programming approach for this post as it nicely fits with some recent work on concept learning and lets us share directly in the browser some working code (a thorough comparison is beyond the scope of this article; for more comparisons and colored plots, see the Appendix at the end; while proof reading the article, we also discovered this very recent and pretty interesting “neural-guided” approach).
The non-lazy and Pythonic reader interested in genetic programming will find gplearn delightful: a good starting point is Jan Krepl’s data science-y tutorial.
]

Building a robot scientist

“Besides black art, there is only automation and mechanization.” - F. Lorca

As we have seen in the previous section, the challenge of symbolic regression is the vast space of possibilities we need to consider to make sure we are doing a good job in fitting the target dataset.
The key intuition to build our robot scientist is that we can impose a familiar, “linguistic” structure on this infinite hypothesis space, and let this prior knowledge guide the automated exploration of candidate models.
We first create a small language L for our automated regression tasks, starting from some atomic operations we may support:

unary predicates = [log, round, sqrt]
binary predicates = [add, sub, mul, div]

Assuming we could pick variables (x0, x1, … xn), integers and floats as our “nouns”, L can generate an expression such as:

add(1, mul(x0, 2.5))

fully equivalent to the more familiar:

Y = X * 2.5 + 1

Plotting the familiar mathematical expression “Y = X * 2.5 + 1”

[ We skip over the language generation code as we discussed generative language models elsewhere. For an overview of scientific problems through the lenses of probabilistic programming, start from the fantastic ProbMods site. ]

Since we can’t directly place a prior over an infinite set of hypotheses, we will exploit the language structure to do it for us.
Since fewer (probabilistic) choices are needed to generate the linear expression:

add(1, mul(x0, 2.5))

compared to the quadratic expression:

add(add(1, mul(x0, 2.5)), mul(mul(x0, x0), 1.0))

the first is a more likely hypothesis before observation (i.e. we obtain a prior favoring simplicity in the spirit of Occam’s razor).

A simple WebPPL snippet generating mathematical expressions probabilistically.
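As a rough Python analogue of the idea (not the WebPPL code used for this post), the sketch below samples expressions from a tiny grammar and scores them by the probability of the choices needed to build them. Everything specific here is an assumption made for illustration: the nested-tuple representation, the stopping probability `P_LEAF`, the list of `CONSTANTS`, and the restriction to add/sub/mul (unary operators and div are left out to avoid domain errors).

```python
import math
import random

# Binary operators of the toy language (a subset of the ones listed above).
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
CONSTANTS = [1, 2, 2.5, 4]  # hypothetical numeric "nouns"
P_LEAF = 0.6                # assumed probability of stopping at a leaf


def generate(rng):
    """Sample an expression: a leaf ('x0' or a constant) or (op, left, right)."""
    if rng.random() < P_LEAF:
        return "x0" if rng.random() < 0.5 else rng.choice(CONSTANTS)
    op = rng.choice(sorted(BINARY))
    return (op, generate(rng), generate(rng))


def evaluate(expr, x0):
    """Compute the value of an expression for a given x0."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return BINARY[op](evaluate(left, x0), evaluate(right, x0))
    return x0 if expr == "x0" else expr


def log_prior(expr):
    """Log-probability of the sequence of choices that generates expr."""
    if isinstance(expr, tuple):
        _, left, right = expr
        return (math.log(1 - P_LEAF) + math.log(1 / len(BINARY))
                + log_prior(left) + log_prior(right))
    if expr == "x0":
        return math.log(P_LEAF) + math.log(0.5)
    return math.log(P_LEAF) + math.log(0.5) + math.log(1 / len(CONSTANTS))


# The two expressions discussed in the text.
linear = ("add", 1, ("mul", "x0", 2.5))
quadratic = ("add", linear, ("mul", ("mul", "x0", "x0"), 1.0))

sample = generate(random.Random(0))
print("random expression:", sample)
print("log-prior linear:   ", round(log_prior(linear), 3))
print("log-prior quadratic:", round(log_prior(quadratic), 3))
```

Every extra operator or leaf multiplies in another probability smaller than one, so longer expressions automatically get a lower prior: the preference for simplicity falls out of the grammar rather than being hand-coded.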
The final detail we need is how to measure the performance of our candidate expressions: sure, a linear expression is more likely than a quadratic one before seeing data points, but what do we learn through observation?

Since we framed our task as Bayesian inference, Bayes’ theorem suggests that we need to define a likelihood function that will tell us the probability of obtaining our data points if the underlying hypothesis is true (posterior ∝ prior × likelihood).
As an example, consider the three datasets below:

Three synthetic datasets to test likelihood without informative prior beliefs.

They have been generated by adding noise to the following functions:

f(x) = 4 + 0 * x (constant)
f(x) = x * 2.5 (linear)
f(x) = 2^x (exp)

We can exploit the observe pattern in WebPPL to explore (without informative priors) how likelihood influences inference, knowing in advance which mathematical expression generated the data.

A simple WebPPL snippet to test the likelihood of some generating functions against synthetic data.
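The same experiment can be approximated in plain Python. The sketch below is a simplification, not a port of the WebPPL snippet: it assumes Gaussian noise with a known standard deviation (`NOISE_SD`, a made-up value), keeps the parameters of the three functions fixed instead of inferring them, and uses a uniform prior, so the posterior is just the normalized likelihood.

```python
import math
import random

# The three generating functions listed above, with fixed parameters.
CANDIDATES = {
    "constant": lambda x: 4 + 0 * x,
    "linear": lambda x: x * 2.5,
    "exp": lambda x: 2 ** x,
}
NOISE_SD = 0.5  # assumed (known) standard deviation of the observation noise


def log_likelihood(f, xs, ys):
    """Gaussian log-likelihood of the data if y = f(x) + noise."""
    return sum(
        -0.5 * math.log(2 * math.pi * NOISE_SD ** 2)
        - (y - f(x)) ** 2 / (2 * NOISE_SD ** 2)
        for x, y in zip(xs, ys)
    )


def posterior(xs, ys):
    """Posterior over candidates under a uniform prior."""
    logs = {name: log_likelihood(f, xs, ys) for name, f in CANDIDATES.items()}
    top = max(logs.values())  # subtract the max for numerical stability
    weights = {name: math.exp(value - top) for name, value in logs.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Made-up dataset: 25 noisy points from the linear function.
rng = random.Random(1)
xs = [rng.uniform(0, 5) for _ in range(25)]
ys = [CANDIDATES["linear"](x) + rng.gauss(0, NOISE_SD) for x in xs]

post = posterior(xs, ys)
for name, prob in post.items():
    print(f"{name}: {prob:.4f}")
```

Swapping the generating function in the dataset line for "constant" or "exp" is a quick way to see the probability mass move to a different candidate.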
As is clear from the charts below, with as few as 25 data points the probability distribution over possible mathematical expressions is pretty concentrated on the correct value (also note that the constant parameter is narrowly distributed over the true value, 4, and the same holds true for the exponential example).

Original data (N=25), probability distribution over expressions, parameter estimation for the CONSTANT and EXP example (original data from WebPPL exported and re-plotted with Python).
Our final robot scientist is then assembled combining (language-based) priors with likelihood (if you’re interested in a small-and-hacky program that puts everything together, don’t forget to run the snippets here).
Let’s see now what our robots can do.
Putting our robot scientist to work

“Humans turn me on.” - Anonymous Robot

Now that we can create robot scientists, it’s time to see what they can do on some interesting data patterns. The chart below represents datasets built out of a simple language for mathematical expressions (such as the one described above), showing, for each case:

a scatterplot with the target data points;
the generating mathematical expression (i.e. the truth);
the expression selected by the robot scientist as the most likely to explain the data (please note that when running the code, you may get several entries for different, but extensionally equivalent expressions, such as x * 4 and 4 * x).

Four synthetic datasets (left), the underlying generator function (center, in red), and the best candidate according to the robot scientist (right, in blue).
Results are pretty encouraging, as the robot scientist always made a very reasonable guess on the underlying mathematical function relating X and Y in the test datasets.
As a finishing touch, it just takes a few more lines of code and some labelling to add a nice summary of the findings in plain English, so that the following data analysis:

From data analysis to an English summary: we report model predictions at different percentiles since the underlying function may be (as in this case) non-linear.

gets automatically summarized as:

According to the model '(4 ** x)':
At perc. 25, an increase of 1 in cloud expenditure leads to an increase of 735.6 in revenues
At perc. 5, an increase of 1 in cloud expenditure leads to an increase of 9984.8 in revenues
At perc. 75, an increase of 1 in cloud expenditure leads to an increase of 79410.5 in revenues

Going from model selection to explanations in plain English is fairly straightforward (original here).
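To give a feeling for how little code this takes, here is a minimal sketch of such a summarizer. It is not the original code linked above: the function name `summarize`, the nearest-rank percentile rule, the choice of percentiles and the five made-up x values are all assumptions for illustration, so the numbers it prints differ from the ones in the summary above.

```python
def summarize(model_name, model, xs, x_label, y_label, percentiles=(25, 50, 75)):
    """Describe the effect of a unit increase in x at several percentiles of x."""
    ordered = sorted(xs)
    lines = [f"According to the model '{model_name}':"]
    for p in percentiles:
        # Pick the observed x value closest to the requested percentile.
        idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
        x = ordered[idx]
        # Change in the prediction when x grows by one unit.
        delta = model(x + 1) - model(x)
        lines.append(
            f"At perc. {p}, an increase of 1 in {x_label} "
            f"leads to an increase of {delta:.1f} in {y_label}"
        )
    return "\n".join(lines)


# Made-up expenditure values; the model is the one selected in the example.
xs = [0, 1, 2, 3, 4]
report = summarize("(4 ** x)", lambda x: 4 ** x, xs, "cloud expenditure", "revenues")
print(report)
```

Reporting the effect at several percentiles rather than as a single slope is what keeps the summary honest for non-linear models like this one.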
Not bad, uh?

It seems that our data science team can finally take a break and go on that deserved vacation while the robots work for them!

While the non-lazy reader plays around some more with the code snippets and discovers all sorts of things that can go wrong with these robots v1.0, we shall go back to our enterprise use cases and make some parting notes on how to leverage these tools in the real world.
What’s next: scaling prediction across enterprise data

“The simple truth is that companies can achieve the largest boosts in performance when humans and machines work together as allies, not adversaries, in order to take advantage of each other’s complementary strengths.” - Daugherty

Let’s go back to our prediction problem: we had data on how cloud services impact the revenues of our company and we wanted to learn something useful from it.

Our X-Y chart: what can we learn from it?

Sure, we could try and use a machine learning tool designed for this problem; if we buy into the deep learning hype, that has obvious downsides in terms of integration, generalization to unseen datasets and interpretation.
We could try and deploy internal resources, such as data scientists, with downsides in terms of time-to-ROI and opportunity costs.
Finally, we could try to prioritize speed and run a simple one-fit-all model, sacrificing accuracy and prediction power.
In this post, we outlined a very different path to address the challenge: by mixing statistical tools with probabilistic programming we obtain a tool general enough to produce interpretable and accurate models in a variety of settings.
We get the best out of automated A.I., while keeping the good part of data science done right: explainable results and modeling flexibility.
Science-wise, the above is obviously just a preliminary sketch on how to think outside the box: when moving from a POC to a full-fledged product, a natural extension is to include Gaussian processes in the domain-specific language (and, generally, exploit all the nice things we know about Bayesian program synthesis, in the spirit for example of the excellent Saad et al).
Product-wise, our experience with deploying these solutions with billion dollar companies has been both challenging and rewarding (as enterprise things often are).
Some of them were skeptical at first, after being burned by pre-made solutions heavily marketed today by big cloud providers as “automated AI” —as it turns out, those tools can’t solve anything but the simplest problems and still require non-trivial resources in time/learning/deployment etc.
But in the end, all of them embraced both the process and the results of our “program synthesis”: from automated prediction to data re-structuring, customers love the “interactive” experience of teaching machines and working with them through human-like concepts; results are easily interpretable and massive automation is achieved at scale through serverless micro-services (for our serverless template for WebPPL, see our dedicated post with code).
At Tooso, we do believe the near-term A.I. market belongs to products that enable collaboration between humans and machines, so that each party gets to do what it does best: machines can do the quantitative legwork on data lakes and surface the most promising paths for further analysis; humans can do high-level reasoning on selected problems and give feedback to the algorithms, in a virtuous loop generating increasingly more insights and data awareness.
All in all, as fun as it is to dream of evil robot armies (yes Elon, it’s you again), there is still plenty of future that definitely needs us.
See you, space cowboys

If you have questions or feedback, please share your A.I. story with jacopo.
Don’t forget to follow us on Linkedin, Twitter and Instagram.
Appendix: comparing regression models

Let’s consider a slightly more complex dataset, in which our target variable Y depends in some way on both X and Z:

Y depends on both X and Z: what is the exact relation between them (we plotted the “true surface” as a visual aid)?

To fit the data in a fairly straightforward, non-parametric way, we start with a battle-tested Decision Tree Regressor: decision trees probably won’t get you a NIPS talk, but they are very robust and widely adopted in practice, so it makes sense to start there:

Fitting a decision tree to our dataset: good but not great.

Both qualitatively (the shape of the resulting surface, kind of “blocky”) and quantitatively (R-squared ~= 0.84) results are good but not exceptional.

Fitting symbolic regression to our dataset: much better!

Symbolic regression produced a smoother surface and a superior quantitative fit, with better out-of-sample data prediction; moreover, the output of the system is easily interpretable as it’s a standard expression:

sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))

which is:

Y = (−0.999 + Z) − ((Z−X) * (X+Z))

(for a longer discussion, see the fantastic docs from gplearn).
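If you want to convince yourself that the prefix expression and the infix formula really are the same function, a quick numerical check is enough. The sketch below assumes X0 stands for X and X1 for Z, as in the translation above; the helper names and the random test points are introduced here purely for illustration.

```python
import random


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


def mul(a, b):
    return a * b


def prefix_form(X0, X1):
    """The expression exactly as written in prefix notation."""
    return sub(add(-0.999, X1), mul(sub(X1, X0), add(X0, X1)))


def infix_form(X, Z):
    """The same expression in the familiar infix notation."""
    return (-0.999 + Z) - ((Z - X) * (X + Z))


# Compare the two forms on random (X, Z) points.
rng = random.Random(0)
points = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(100)]
max_gap = max(abs(prefix_form(x, z) - infix_form(x, z)) for x, z in points)
print("largest difference between the two forms:", max_gap)
```

Expanding the product also shows that the model is a simple quadratic surface, Y = X^2 − Z^2 + Z − 0.999, which is the kind of readable result a decision tree cannot give you.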