Random thoughts on my first ML deployment

5 things I didn't know six months ago and that I'd better not forget in the months to come

Massimo Belloni · Mar 3

A little bit of context: I'm currently working for a fast-growing yet still medium-sized company that, after having built a robust and widely used product, has decided to start leveraging the data generated over the years to bring some value to end users.
The Data Science Team wasn't very structured at the beginning of this quest, and it is far from being over-structured now, six months later, even if here and there you can start noticing some changes that are the consequence of a natural trial-and-error process.
It's possible, then, that some of the following points are obvious to more experienced data scientists working in structured companies with a proven track record of models brought to production.
But if I think of the six-months-ago me, I'd probably have been interested in reading a post like this, to avoid wasting a lot of time and CPU power training useless models, or looking for that specific one, trained weeks before and lost under a pile of useless others, mainly due to the chaos living in the folder where I was keeping all my Jupyter Notebooks.
1. Focus on product, not (only) on performance

In the early stages of all the projects developed in these months, I was focusing on getting the highest on-paper performance and on engineering the most complex and best-performing features, just to discover later that those features weren't available in real time, or that the process of retrieving them was so costly that the increase in performance wasn't worth the effort.
The number of interactions a listing receives in the first hours after being published, for example, is a very good indicator of whether it is a scam or not.
However, if the goal of the model is to detect scammers in the first minutes after uploading their listings, this feature is pretty much useless.
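This constraint is easy to make explicit in code. A minimal sketch (feature names and availability delays here are hypothetical examples, not the actual features used in the project): a guard in the serving path that drops any feature not yet available at scoring time, so a feature that only becomes meaningful hours after publication can never leak into a first-minutes prediction.

```python
# Sketch: keep only the features that already exist at prediction time.
# Feature names and delays below are illustrative assumptions.

FEATURE_AVAILABILITY_MINUTES = {
    "listing_price": 0,               # known at upload time
    "description_length": 0,          # known at upload time
    "interactions_first_hours": 180,  # only meaningful hours after publishing
}

def usable_features(features: dict, minutes_since_upload: float) -> dict:
    """Return the subset of features already available at scoring time."""
    return {
        name: value
        for name, value in features.items()
        if FEATURE_AVAILABILITY_MINUTES.get(name, 0) <= minutes_since_upload
    }
```

Scoring a listing 5 minutes after upload would then silently exclude the interaction counter instead of feeding the model a value that doesn't exist yet.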
Another important suggestion is to test the model as soon as possible with live data coming from the scenario in which the product will be eventually used.
It is way better to quickly discover that the developed model isn't reliable than to work on something for months just to find out how useless it is when it's already live.
I've already written about this topic here and there and I won't rewrite the whole story, but it is not obvious that, in a dataset collected over a period of multiple years, all the samples follow the same distribution: dealing with this kind of complexity is what makes working with real data more challenging than with Kaggle-like datasets.
Don't get me wrong: they are cool, and above all they are the only freely available resource for data scientists who want to get their hands dirty with novel techniques.
The only misconception is that they are quite far from being a good representation of real-world datasets, which often require a lot of effort (really, a lot) just to come up with a usable bunch of samples while getting rid of dirty values and outliers.
Some time ago I read about a senior data scientist saying that the more senior he became, the more his main skills turned out to be dataset cleaning and preparation.
I'm starting to understand his words…

2. Having a Computer Science background is the key

A Machine Learning model, however complex and deep it may become, is, in the end, a software artefact that will probably be exposed as a Web API on top of an already existing infrastructure.
Having a Computer Science background or, at least, a solid understanding of these topics is crucial to building a data product that can be quickly deployed after a research phase that lasted months.
As a rule of thumb: the more the model has been developed by people without a CS background, the harder it will be to deploy in production, possibly having to reduce its performance or tweak the features just to fit the final architecture.
At the stage of our business, having been able to sketch up a client-server architecture while training the models has been probably more important than having a ML-focused PhD.
It's a bold sentence, of course, and probably it isn't even valid anymore (new structure, new projects, new challenges); but in the early stages of a Data Team, with the strict requirement to bring some tangible value as soon as possible, having a little bit of CS vision is fundamental to taking models out of Jupyter Notebooks and making them available to users in weeks, not months.
3. Deploying ML is not black magic

When I did my last University exam almost a year ago, I was quite confident about my theoretical DS/ML skills and eager to have some hands-on experience with real-world data, while I was completely unaware and even a little bit scared about building actual data products powered by these complex technologies.

Is there life after Jupyter Notebooks?

After some months I now see that the problem was me thinking that behind a deployed ML model there is who-knows-what kind of technical complexity.
The more I dug into the field once I had something to actually deploy (and again, see point 1), the more I noticed that a) while there are a lot of stories and experiences about that very specific method that allows you to increase accuracy on ImageNet by 0.001%, there exist very few examples and guides on how to actually expose ML models to the world, and b) models, in the end, are just computer functions, often written in Python, and these kinds of technologies have been deployed and used by millions of users for years.
A good lesson learned in our company has been to move data scientists as close as possible to senior engineers who, even if they possibly don't know a lot about Machine Learning, have a lot of experience in building reliable and robust software products.
A Python script is a Python script, no matter what it does.
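To make the "just a function" point concrete, here is a minimal sketch (a toy rule-based stand-in for a real model; names and the threshold rule are invented for illustration) showing that a trained model is just a Python object that can be serialised, shipped, and called like any other code:

```python
import pickle

class ScamScorer:
    """Toy stand-in for a trained model: callable, serialisable, nothing magic."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def predict(self, interactions_per_hour: float) -> bool:
        # A real fitted estimator would sit here; the interface is the same.
        return interactions_per_hour > self.threshold

model = ScamScorer(threshold=10.0)
blob = pickle.dumps(model)      # what you would write to disk or a model store
restored = pickle.loads(blob)   # what a web service would load at startup
```

From the infrastructure's point of view, `restored.predict(...)` is an ordinary function call, which is exactly why the engineers who have been deploying Python services for years can help deploy this too.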
4. Give a structure to Jupyter Notebooks

I won't go into great detail about this now, but after weeks of working on the same project 8 hours per day, the number of trials and sketched-up ideas starts to increase a lot.
Jupyter Notebooks, with all the issues they might inherently generate, are the de-facto standard for the possibly very long research phase that every data science project has.
The longer and more complicated a project gets, the more time is wasted looking for that specific model with that slightly different dataset trained weeks before.
While carefully tracking every single test and experiment is probably a little overkill, and the Jupyter Notebook format itself isn't easy to version control, starting to give a structure to folders and files has saved me a lot of time in the past months.
In my working routine, notebooks can be categorised mainly into 4 types:

- EXPLORATION: preliminary dataset analysis, with aggregate measures, plots and charts;
- CLEANING: the dataset is cleaned, outliers are detected and possibly removed, and some initial complex features are engineered;
- EXPERIMENTAL: where the models (even more than one algorithm per notebook) are trained and tested and where more complex ideas are tried out (not more than one per notebook);
- DEPLOY: the full pipeline, from dataset cleaning to predictions. Ideally it can be run from start to end, providing models and predictions.
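The skeleton of such a DEPLOY notebook can be sketched as a chain of small, named steps (a pure-Python toy with placeholder cleaning and "training" rules; a real project would call into pandas/scikit-learn, and the field names here are invented):

```python
def clean(rows):
    """Drop rows with missing values and obvious outliers (placeholder rules)."""
    return [r for r in rows if r["price"] is not None and 0 < r["price"] < 1_000_000]

def engineer_features(rows):
    """Derive model inputs from the raw fields (placeholder feature)."""
    return [{**r, "price_k": r["price"] / 1000} for r in rows]

def train(rows):
    """Placeholder 'training': learn a mean-price threshold."""
    mean_price = sum(r["price"] for r in rows) / len(rows)
    return lambda r: r["price"] > mean_price

def run_pipeline(raw_rows):
    """End to end: cleaning -> features -> model -> predictions."""
    rows = engineer_features(clean(raw_rows))
    model = train(rows)
    return [model(r) for r in rows]
```

The point is not the toy logic but the shape: one top-to-bottom entry point that any colleague (or future you) can re-run against an updated dataset.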
Every Jupyter Notebook created must contain one of these keywords in the file name, with a brief description of the actual file content.
Once a notebook becomes too complex or starts to go beyond its initial scope, it is reordered a little and a new one is created, possibly starting from some results of the previous one.
As a side note: every dataset's name must contain a date and a brief explanation of the contained features.
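Both naming rules are trivial to check automatically. A tiny helper along these lines (the exact keyword list comes from the convention above; the ISO date format for datasets is my assumption) could even run in CI or a pre-commit hook:

```python
import re

# The four notebook categories from the convention described above.
NOTEBOOK_KEYWORDS = ("EXPLORATION", "CLEANING", "EXPERIMENTAL", "DEPLOY")

def valid_notebook_name(filename: str) -> bool:
    """True if a .ipynb file name contains one of the category keywords."""
    return filename.endswith(".ipynb") and any(k in filename for k in NOTEBOOK_KEYWORDS)

def valid_dataset_name(filename: str) -> bool:
    """True if a dataset name embeds a YYYY-MM-DD date (assumed format)."""
    return re.search(r"\d{4}-\d{2}-\d{2}", filename) is not None
```

Running the check over a project folder immediately surfaces the `untitled3.ipynb`-style files that otherwise pile up.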
Once I started to use this structure (unfortunately months after the first .fit() I ran in the company's office), my efficiency drastically increased, and it has also been very easy to get back to models trained weeks before, or to re-run the same pipelines with updated datasets or new features.
5. Take your time

Finally, and this is mainly advice for the future me, but who knows: doing things takes time, and the sooner you understand this, the better.
A single project requires multiple iterations, and, most importantly, building robust and reliable software is a long run, not a sprint.
Test the model with live data as soon as possible, but then iterate offline and try out different methods.
The most complex solutions don't always (read: almost never) give better results, but building super-complex models might be the only method you have to be sure of this.
If it is true that every project needs structure and deadlines, ML research, even when done with a strong product focus, needs time to evolve and mature: wasting time on useless branches just to discover how useless they are, and to possibly do better in the future, is a fundamental part of the process.