This problem is finally starting to get some attention and I’m hopeful the broader issue of reproducibility in science can lead to new tools or approaches.
A lack of reproducible results leads to a deterioration of trust in data science methods: if you tell your clients one number one day and a completely different number for the same outcome at a later time, why would they rely on your predictions?

Crumbling data science infrastructure: more people than ever before are using open-source data science tools (think Python, numpy, scikit-learn, pandas, matplotlib, scipy, pymc3), but the contributor and maintainer base is not growing accordingly.
Free and Open Source Software is great, but it’s hard for contributors to maintain the tools when they are doing it voluntarily on top of a full-time job (see this report for a great rundown of the problem).
There are many ways to contribute, from submitting bugs to fixing issues to simply donating money (go here right now and become a recurring donor to NumFocus if you want to help), and we need to start supporting the very foundations upon which our field stands.
This critical issue unfortunately receives little attention, which is why I've written about it previously.
In addition to the above, I’m also concerned that formal education systems are not producing effective data scientists.
Students trained exclusively in data science may be able to write a superb Jupyter Notebook, yet have no idea how to put a machine learning model into production.
There is an immense gap between getting a result once on a clean dataset in a Jupyter Notebook and running a model hundreds of times every day on real data with results served to clients.
I don't know whether current university curricula can bridge this gap.
As a final note: at Cortex, we have never been limited by data science itself in what features we could build (the techniques are simple to implement with open-source tools), but rather by the intricacies of getting data, cleaning it, formatting it correctly, running the models, and delivering those predictions (with explanations) to customers.
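To make that gap concrete, here is a minimal sketch of the production loop just described, using only the Python standard library. All the names (`fetch_data`, `clean_data`, and so on) are hypothetical, and the "model" is a stand-in threshold rule rather than a trained estimator; the point is that each step a notebook handles ad hoc becomes an explicit, testable function that runs on a schedule.

```python
def fetch_data():
    # Stand-in for querying a database or API; real data is far messier.
    return [{"feature": 0.2}, {"feature": None}, {"feature": 3.1}]

def clean_data(rows):
    # Drop records with missing values; production cleaning involves much more.
    return [r for r in rows if r["feature"] is not None]

def run_model(rows):
    # Placeholder "model": a threshold rule standing in for a trained estimator.
    return [1 if r["feature"] > 1.0 else 0 for r in rows]

def deliver(predictions):
    # Stand-in for serving results to clients (API response, dashboard, email).
    return {"predictions": predictions, "count": len(predictions)}

def pipeline():
    # The full loop; in production this would run repeatedly via a scheduler.
    return deliver(run_model(clean_data(fetch_data())))
```

Even in this toy form, the structure suggests why software and data engineering skills matter: every stage needs error handling, monitoring, and testing once it runs unattended hundreds of times a day.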
My advice would be to spend some serious time learning software engineering and data engineering alongside traditional data science statistics and programming courses.
Although the role of “data scientist” might not last that long — not because of automation but because of easier-to-use tools — the skills learned in data science are only becoming more relevant.
For better or worse, we live in a data-saturated, technology-driven world, and the capability to make sense of data and manipulate technology to do your bidding (instead of the other way round) is critical.
If you are a current data scientist, keep applying those skills and learning new ones.
For those studying to get into the field, recognize you are preparing not necessarily for the position of “data scientist”, but rather for a world in which the skills of data science will become increasingly valued.
I write articles on data science and occasionally other interesting topics.
You can reach me on Twitter @koehrsen_will.
If helping the environment and the bottom line appeals to you, then maybe reach out to us at Cortex.