Data Science-ish

False.

Even if those in such positions of power were to read this article, there is only a slim chance it would be enough to touch moral compasses backed by decades of undisputed “experience”.

The presumed conclusions differ from what the data is able to illustrate, and that same data has already been through rigorous cleansing, processing and interpretation.

The workflow of a data science project involves different individuals, each pressured to deliver on that one gut-instinct conclusion at every stage.

However, since the data set completely transforms across this practice, tracing the point(s) of data alchemy is practically a game of Chinese whispers.

Got outliers? Get out, liars.

The key stages in a data science practice include: defining the business problem, data acquisition, data preparation, data modelling and communication.

Therefore, it rests with the individual data scientist to take responsibility for their work at each stage of the practice:

1) Defining the business problem

One of the most underappreciated stages in a data science project is formulating a consensus on the outcomes and business value of the project.

For the data scientist involved, knowledge in the particular field is imperative.

In a fast-paced business scenario, the opportunity to ask questions is limited, so it is crucial to know what kind of answers will be required.

Additionally, this first step of the process is the most difficult to later revisit as it involves external stakeholders.

The answers you receive about the objectives, expectations and available resources of the project will determine the trajectory of the rest of the pipeline.

By not asking the right questions, intentionally or otherwise, you have doomed the project to ill-equipped oversight from its very inception.

2) Data acquisition and warehousing

Once access to data resources has been scoped into the project definition, the data scientist is accountable for gathering, scraping and managing data from many sources.

Web servers, application programming interfaces (APIs), logs, online repositories and other databases are a few of the mediums involved.

The rudimentary extract, transform and load (ETL) processes involved can always be optimized for both speed and accuracy by advanced infrastructure tools such as Alteryx that provide automation and oversight of the entire process.
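As a rough sketch of what a scripted ETL step might look like (the file, table and column names here are purely illustrative, not from any real project or tool):

```python
import sqlite3

import pandas as pd

# Extract: pull raw records from a hypothetical CSV export.
raw = pd.read_csv("sales_export.csv")  # illustrative file name

# Transform: normalize column names, parse dates, drop exact duplicates.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.drop_duplicates()

# Load: write the cleaned table into a local SQLite "warehouse".
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)
```

Because every step lives in code rather than in someone's head, both the speed and the auditability of the process improve.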

Furthermore, this step of the process is where a data scientist’s informed curiosity is best applied to dig deeper into the resources, beyond what is agreed upon.

The integrity of the data in the databases involved is also a responsibility of the data scientist.

This is the step in the pipeline in which doing “enough” is not really enough.

Sure.

3) Data preparation and exploratory data analysis

As addressed in the previous stage, if analytics is only as good as the data used, unused data would put the project at a stark disadvantage.

On the other hand, if data is abundant, there is a greater responsibility (not a Spider-man reference) towards data hygiene and pre-processing.

In cleaning the data, data scientists should be mindful of standard rules such as inconsistent data types, missing/duplicate values and misspelled attributes.

In pre-processing, the necessary techniques to prepare the data for modelling, such as encoding qualitative data, should be effectively applied.

If a data scientist at this point were to “clean” unfavorable data or “polish” the favorable, it is a malpractice that is often difficult to retrace.
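As a hedged illustration of those cleaning and encoding steps, assuming a small, made-up pandas DataFrame (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["34", "41", None, "41"],                   # inconsistent types, missing value
    "city": ["London", "Lndon", "London", "Lndon"],    # misspelled attribute
    "churned": ["yes", "no", "no", "no"],
})

# Fix inconsistent data types and impute missing values explicitly.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].median())

# Correct known misspellings rather than silently dropping rows.
df["city"] = df["city"].replace({"Lndon": "London"})

# Remove exact duplicates only; anything more aggressive should be documented.
df = df.drop_duplicates()

# Encode qualitative data for modelling.
df = pd.get_dummies(df, columns=["city"])
df["churned"] = df["churned"].map({"yes": 1, "no": 0})

print(df)
```

Every one of these operations leaves a trace in code, which is exactly what makes selective “cleaning” easier to catch in review.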

Parallel to this, feature engineering plays a key role at this point.

The selection and encoding of the appropriate variables for the next stages often require contextual understanding and external research, an effort that may go beyond the technical domain of data science.

This determines the accuracy of the model in the next step and is another point at which the project deserves more effort than is expected.
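One sketch of what that extra effort can look like, assuming a hypothetical transactions table (the data and column names below are invented for illustration): derive a domain-informed feature, then check which candidate features actually carry signal before modelling.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical transaction data: a raw timestamp rarely helps a model directly,
# but a derived "is_weekend" flag (a domain-informed feature) often does.
tx = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-02", "2023-01-04", "2023-01-07", "2023-01-08",
        "2023-01-10", "2023-01-14", "2023-01-15", "2023-01-18",
    ]),
    "amount": [120.0, 60.0, 15.5, 300.0, 42.0, 18.0, 250.0, 75.0],
    "fraud": [0, 0, 0, 1, 0, 1, 1, 0],
})
tx["is_weekend"] = (tx["timestamp"].dt.dayofweek >= 5).astype(int)

# Quick check of which candidate features share information with the target,
# before committing them to the modelling stage.
X = tx[["amount", "is_weekend"]]
y = tx["fraud"]
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(X.columns, scores)))
```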

4) Data modeling

The accuracy of algorithms and classifiers such as k-nearest neighbors (kNN), decision trees or Naïve Bayes depends on the project at hand.

It takes a significant effort in training and testing to determine the best performing model for the project.
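A minimal sketch of that training-and-testing effort, using scikit-learn's bundled breast-cancer dataset purely as a stand-in, might compare the candidates under cross-validation rather than a single lucky split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The candidate models named in the text; kNN gets scaling since it is distance-based.
candidates = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}

# 5-fold cross-validation instead of a single train/test split.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds, not just the single best number, is part of the honesty this stage demands.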

The extent to which limited dedication so far in the pipeline can compromise the entire project is best represented through the “London Whale” JP Morgan debacle of 2012.

As reported by JP Morgan in their 129-page report, the $6 billion trading loss can be traced back to the flawed use of Microsoft Excel as the platform for their VaR (Value at Risk) model.

“The model [run by one London-based quant] operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another”.

Both human error and a lack of oversight help explain why the model did not enable the bank to act, even though the decision-makers had the opportunity to choose a better, more efficient modelling environment and minimize this risk.
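The calculation itself does not need a spreadsheet at all. As a hedged sketch (not JP Morgan's actual model), a simple historical-simulation VaR can be scripted, version-controlled and audited end to end, assuming a hypothetical series of daily portfolio returns:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical daily portfolio returns; in practice these would be loaded
# from an audited data store, not pasted between spreadsheets.
returns = rng.normal(loc=0.0005, scale=0.012, size=750)

portfolio_value = 10_000_000  # illustrative portfolio size in dollars

# Historical-simulation VaR: the loss at the chosen percentile of returns.
confidence = 0.95
var_95 = -np.percentile(returns, 100 * (1 - confidence)) * portfolio_value

print(f"1-day 95% VaR: ${var_95:,.0f}")
```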

“Oh, excel-lent.”

5) Communication and beyond

Effective communication to non-technical stakeholders is so crucial that it has developed into a field of its own within data science.

In the field of information design and visualization, the story that your project is trying to put across can be (even accidentally) manipulated.

Presenting too much data, distorting or cluttering the presentation, hiding data or even inaccurately annotating text within the visualizations are all opportunities for unethical practices towards project success.
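A small sketch of how easily the same (made-up) numbers can tell two different stories, simply by truncating the y-axis:

```python
import matplotlib.pyplot as plt

# Made-up quarterly figures, used only to illustrate the effect of axis choice.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [102, 103, 104, 106]

fig, (honest, distorted) = plt.subplots(1, 2, figsize=(8, 3))

# Full axis: the change looks as modest as it is.
honest.bar(quarters, revenue)
honest.set_ylim(0, 120)
honest.set_title("Axis from zero")

# Truncated axis: the same data now suggests dramatic growth.
distorted.bar(quarters, revenue)
distorted.set_ylim(100, 107)
distorted.set_title("Truncated axis")

fig.tight_layout()
plt.show()
```

Both panels are “accurate”; only one of them is honest.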

When the data doesn’t give you the low-down but it gives you the high.

This final phase of the data science project is where a large part of the immoral manipulation occurs.

If they haven’t earlier, it is at this stage where all practitioners realize whether or not their project meets “expectations”.

Often, the lead of the project or an intermediary in client communications is directly involved and reviews the project (with their limited context of the raw data).

If you somehow manage to be the 1% who, at this point in the project, have maintained a fully ethical practice, chances are you will still be told to reverse-engineer all the previous steps to make your results look “better” by going back to step 1 and altering the data a little.

So what now?

As a data scientist, it is your duty to deliver solutions that bring added value to business and society.

Holding back your 100% is as unethical as any data science malpractice.

Spend more time on a data science attitude rather than a data science aptitude.

In their defense, data scientists today are immensely pressured.

The supply of data scientists is increasing, as are standards in both industry and academia.

In industry, with shrinking client budgets, your project has to satisfy the client’s goals or somebody else will, at a lower budget.

In academia, researchers constantly compete for tenured positions, for funding, and simply to sustain their careers — all of which depend on having a satisfying story to tell.

Industry and academia are also becoming more codependent, sharing research and making data insights and statistics even harder to validate.

Little big data.

Hence, in an era where data drives predictive strategy, a looming worldwide data dilemma that calls the truth into question is not far away.

Climate change is inevitable and non-believers have the benefit of seeing its impacts all around them.

If you are now able to notice any of the above flaws from this brief overview, you too have that benefit.
