The Struggles of a Data Scientist
Paul May
Apr 24

I’ve been a data scientist for around 4 years now, though it really depends on your definition: I always did many parts of the job, but it wasn’t really a defined role at the time.
While I do find data science, along with reliability engineering, exciting, and it is one of the most interesting jobs I’ve had, at times I can see the cracks in the veneer of its label as one of the most satisfying jobs.
If you dig around the internet you will quickly find articles or posts that point to a level of unhappiness and high churn rates.
Photo by JESHOOTS.COM on Unsplash

Is data science a terrible job?

In my opinion that is not true.
The newness of the machine learning field, and the scarcity of people in it, is a big bonus: good employers know your worth (average base pay in the UK is estimated at £42,000 per year) and give you freedoms (like remote working or flexible working hours) that can be a breath of fresh air compared with the more regimented office job.
Additionally, the wealth of problems and uses of machine learning means that your work can be immensely stimulating and rewarding.

So what is wrong?

I think this newness and popularity is part of the cause of the problem.
With the rush to get in on the “artificial intelligence boom”, companies are trying to hire data scientists, but they don’t know what skills they need and therefore which people to hire.
They often fall into the following pitfalls.

They don’t hire data engineers to sort their data

I’ve been at several companies as a data scientist, and the main issue I come across in the early days of a company becoming machine learning enabled is that they don’t know what data they have or how to get at it.
Often data will be held inside each department in different databases (sometimes of completely different types, like a Historian for factory data and SQL for customer data), and there is no sharing of the contents outside those areas. In the modern age of GDPR this reluctance to share data between departments can be heightened by the fear of risking a breach.
Any data scientist joining such a place will find that they have nothing to do, as they sit frustrated trying to get even the most basic data into a place where their scripts can ingest it.
The key point here is that the company needs a data engineer to pipeline their data so it can be easily accessed. This will mean they can find out what they have, start doing some exploratory data analysis on it, and yield benefits long before they think about assigning it to a machine learning problem.
However, companies aren’t likely to hire a data engineer on their own, as they don’t know one is needed, and they’re unlikely to shell out around £70,000 a year in combined salaries (a data scientist at around £42,000 plus a data engineer; in the UK data engineers have an average salary of £28,190 per year) for something they have no idea about.
It’s also worth noting that it will cost the company more than £70,000 a year, as they’ll need to pay labour costs on top of the pure salary costs.
The problem here is that they have not prepped their company data ecosystem to enable a data scientist to thrive and earn their pay.

They don’t know what they want

I have seen several job advertisements that were a long list of every data science tool, programming language and cloud system under the sun.
I have even seen one job advertisement that requested at least five years of TensorFlow experience.
For context, TensorFlow was released in 2015, so hopefully you see the problem there.
Again, any data scientist joining these companies will find that they either have nothing to do or they have to do something for everyone (see below) and in the end never actually do any data science.
Even if you do have data science projects, having too many will mean you never really get any of them done well. Data science is often a puzzle-solving exercise, and you need large blocks of time dedicated to a problem to solve it, without chopping and changing.
Don’t believe me? Try completing a sudoku puzzle where every few minutes you get up and go read a few pages of a data science textbook and then come back again.
It’ll take you far longer to complete and understand both than if you did them one at a time.
The problem here is that there is no plan on what they want and how best your skills can be used to their fullest extent.

They think anything data is your domain

This is a common problem, I think.
Because you have the word “data” in your title, some companies will think that things like data cleaning, data engineering and deployment are your job. Much worse, you may end up being expected to be in charge of all the data the company has (if they suffer from problem 1, this is not fun at all) and to make sure it’s all GDPR compliant and secure.
You may also have to make Excel spreadsheets and calculate daily statistics that management requests.
None of these are your domain. There are specialised people who should be covering all of these, and some of them are significantly cheaper than an experienced data scientist (a data analyst is around £30,000 a year, 71% of the data scientist average).
The problem here is that the company is spending a large amount of money on someone who is overqualified (or not qualified in some areas) to do jobs that involve only a fraction of the skills learned over the eight months it can take a data scientist to be trained.
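The salary arithmetic used in this article can be checked with a few lines of Python. This is only a minimal sketch: the figures are the UK averages quoted above, the function names are made up for illustration, and reading the £70,000 figure as the sum of a data scientist and a data engineer salary is an assumption.

```python
# UK average salaries as quoted in this article (GBP per year).
DATA_SCIENTIST = 42_000
DATA_ENGINEER = 28_190
DATA_ANALYST = 30_000


def combined_salary(*salaries):
    """Total yearly salary bill for a set of roles (salary only, no overheads)."""
    return sum(salaries)


def share_of(salary, reference):
    """Salary as a whole-number percentage of a reference salary."""
    return round(100 * salary / reference)


# Assumption: the "around £70,000" is one data scientist plus one data engineer.
print(combined_salary(DATA_SCIENTIST, DATA_ENGINEER))  # 70190
# A data analyst as a percentage of the data scientist average.
print(share_of(DATA_ANALYST, DATA_SCIENTIST))  # 71
```

Note that this covers salaries only; the real cost to the company is higher once labour costs are added.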

They don’t understand it’s not magical

This, I would say, is part of the sensationalism of data science.
Look on news websites and you will see that it is being used to detect cancer or construct sophisticated facial recognition models.
Read any of these and you could be forgiven for believing that a simple model of whether customers are going to stop buying your products is trivial.
I’ve also had people ask why a model can’t be built without data, so that it just learns once it is deployed and becomes perfect in a week or so.
There is a common pervasive thought that machine learning can magically do anything and it will do it perfectly.
Data science does not work like this.
Data science is incredibly dependent on understanding the problem you need solved and on knowing what data you have (for supervised learning the labels need to be really good if you want the best chance of success).
Without these you cannot build a data science solution to best address the problem.
Indeed, I’ve seen companies demand that neural networks (because it’s sexy) must be used in their project, even though they are completely inappropriate or underperform compared to other algorithms.
Also, even with all the data they have, projects may simply fail if the data doesn’t contain the features you need to make a really good model.
Examples of failure of data would be:
The feature that would improve the model drastically is not there (e.g. a gender column for predicting purchases of feminine products at a retailer)
The data is so full of error that the task is impossible (e.g. predicting the temperature of an industrial oven to within a degree when the training data is only accurate to five degrees)
There is no data or it is too sparse to be used (e.g. aggregated sales data for a takeaway being used to predict hourly customer demand)

Now, none of the examples above definitely rules out the project working (other features may compensate), but it may not work as well as required to meet the success requirements.
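The oven example can be made concrete with a small simulation. The sketch below is illustrative only: it assumes the recorded temperatures are off by a uniformly distributed error of up to five degrees, and the temperature range, sample count and function name are made up for the purpose. It scores a hypothetical model that predicts the true temperature perfectly against those recorded labels.

```python
import random


def label_noise_floor(n_samples=10_000, label_error=5.0, seed=0):
    """Mean absolute error of a *perfect* model scored against noisy labels.

    Assumption: each recorded label differs from the true temperature by an
    error drawn uniformly from [-label_error, +label_error] degrees.
    """
    rng = random.Random(seed)
    total_error = 0.0
    for _ in range(n_samples):
        true_temp = rng.uniform(150.0, 250.0)  # hypothetical oven range
        recorded = true_temp + rng.uniform(-label_error, label_error)
        # The "model" predicts true_temp exactly; only the label is wrong.
        total_error += abs(true_temp - recorded)
    return total_error / n_samples


floor = label_noise_floor()
print(f"Apparent error of a perfect model: {floor:.2f} degrees")
```

Under these assumptions the apparent error comes out at roughly 2.5 degrees, so even a perfect model could not be shown to meet a one-degree requirement with labels of this quality.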
If that is the first time you’ve read the word “requirements” in this article, that is the point: one of the main issues is managing the expectations of your bosses and setting out firm requirements and risks before starting (and not at the end, as this article is doing).

The result?

Photo by Gabriel Matula on Unsplash

Extreme frustration.
In my experience, data science often means spending 80% of your time cleaning and exploring data for the problem set out, with the remaining 20% spent actually designing and building the machine learning solution.
The problems above will mean that 20% shrinks even further, leaving you frustrated that you are doing the same job you left, or a job that anyone could do and that requires no machine learning.
Together these are a perfect storm for a data scientist to quickly become disillusioned and unhappy in their job.
If I had to suffer any of these problems for any length of time I’d be wanting to look for a new job as I just wouldn’t be doing what I wanted to do, which is to solve real world problems with data.
The way you can avoid some of these is by having these sorts of discussions at interview before you start.
Really understand whether they have a plan for how to use you to the best of your abilities, and also that they are set up to use you from the start (e.g. you can start analysing data on day one).
Of course, this does not solve the problem of companies not really knowing how to use data scientists; hopefully that will improve over time as knowledge of data science spreads.

The takeaway

Don’t be disheartened by what I’ve written.
Data science can still be one of the most interesting and fulfilling jobs you can have these days. Just be aware of what you are walking into and try to avoid the companies that won’t appreciate you.