Programming Best Practices For Data Science

Diogo Ribeiro
Mar 20

It’s data science, not rocket science

Data scientists and chief data officers are the hot hire these days, and government agencies at all levels are working to get more out of their rapidly growing troves of data.
Determining how to approach all that data, however, can be daunting.
“To solve business problems, develop new products and services and optimize processes,” TDWI’s Dave Stodder wrote, “organizations increasingly need analytics insights produced by data science teams with a diverse set of technical skills and business knowledge who are also good communicators.” To make the most of such investments, the report recommends:

1. Identify your key business drivers for data science

Before getting started, an organization must ask what real data science efforts can provide that traditional business intelligence and analytics cannot. If there are gaps, it’s critical to hire personnel with real “knowledge of and curiosity about the business” to help fill them.

2. Create an effective team

It takes more than curiosity, however. And hiring a multitalented superstar, which is “like chasing unicorns” to begin with, can leave an agency with a one-off, artisanal operation whose creator then leaves for greener pastures. “[A] wiser course is to develop a stable team that brings together the talents of multiple experts.”

3. Emphasize communications skills

“Organizations that use data science successfully almost universally point to communication as a key ingredient to their success,” TDWI found. Organizations should “make it a priority as they evaluate candidates for data science teams.”

4. Expand the impact through visualization and storytelling

“Data science thrives in an analytics culture,” Stodder wrote, but “not all personnel… are going to be part of data science teams, nor should they be.” Finding ways to help non-statisticians grasp the insights in the data is critical to getting real value out of the investment.

5. Give the data scientists all the data

While traditional analytics often focus on a carefully defined set of structured data, data science has the potential to draw value from the vast masses of unstructured data that most organizations create. “Data scientists need to work closely with data at every step so they know what they have,” Stodder wrote, and they need to have as much of it as possible.

6. Pave the way for operationalizing the analytics

Descriptive analytics are useful, but predictive analytics are far more valuable, and prescriptive analytics offer the most potential benefit by far. To make this possible, “data science teams can move away from uncoordinated, artisanal model development and toward practices that can include quality feedback sessions to correct flaws.”

7. Improve governance to avoid data science “creepiness”

Both data science teams and top leadership “must be cognizant of the right balance between what they can achieve … and what is tolerable, and ethical, from the public’s perspective.”

Difference between data science and data analysis
Data Science: You have a question, you’re trying to get to an answer and you don’t necessarily know at the beginning if it’s going to work.
“You have a question, you don’t know if you can find an answer.”
Examples of data science could be image recognition tasks or models for prediction.
A question is posed and we have data; the data science task is to experiment toward a possible solution.
Data Analysis: You have a question, which you know is answerable.
You are applying known methods to answer the question.
“You are answering a question.”
Examples of data analysis could be quantifiable metrics based on sales, for example sales from various channels over different time frames.
Or it could be metrics of the quality of a data set, such as missingness or population statistics.
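As a rough illustration of the data analysis side, the sketch below computes a sales metric per channel and time frame, plus two data quality metrics (missingness and summary statistics). The `sales` table, its column names, and its values are made up for this example and do not come from any real data set.

```python
import pandas as pd

# Hypothetical sales records; columns and values are invented for illustration.
sales = pd.DataFrame({
    "channel": ["web", "store", "web", "phone", "store", "web"],
    "month": ["2019-01", "2019-01", "2019-02", "2019-02", "2019-02", "2019-03"],
    "amount": [120.0, 80.0, None, 40.0, 95.0, 150.0],
})

# Quantifiable metric: total sales per channel and month.
sales_by_channel = sales.groupby(["channel", "month"])["amount"].sum()

# Data quality metric: fraction of missing values in each column.
missing_fraction = sales.isna().mean()

# Population statistics for the numeric column (count, mean, std, quartiles).
amount_summary = sales["amount"].describe()

print(sales_by_channel)
print(missing_fraction)
print(amount_summary)
```

Each of these questions is known to be answerable before any code is written, which is what separates this kind of work from experimentation.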
Managing a project

Agile project management (Scrum) is said to be more suited to data analysis than data science.
If you know what the end result should be, then agile is a good practice to implement the solution.
Agile management gives structure to the process of working toward a known solution: because the tasks are known, they can be planned into sprints and given estimates.
When the project is more of an experiment, where the outcome is not known in advance, agile practices might not be best suited.
We could run the experimentation in a sprint-type style, but there are two obstacles:

(1) working out the exact set of all experiments at the beginning would be difficult without some initial results/evaluation
(2) assigning estimates of complexity to these experiments would not be so easy either.
Data Science

In the discipline of data science, it is important to ‘frame the problem’.
This is where a lot of the work should go.
Data science can tell us what to expect or what might happen, but it often cannot tell us why.
To understand why, you have to talk to people.
We need to embrace the importance of human relationships.
They are very important for the data systems we are building.
Selection bias

There should be a section in every project report for selection bias.
Writing it forces you to think critically about this area.
Selection bias can happen in many areas for many reasons.
An interesting example is survivorship bias.
A historical example of this bias was the analysis of bullet holes in returning planes in World War II.
The initial proposal was to reinforce the areas with the most bullet holes.
But the sample of planes analyzed was biased: it contained only planes that returned with damage, not the planes that were shot down and did not return. The areas where returning planes showed few holes were therefore likely the ones where a hit was fatal.
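The effect can be reproduced with a toy simulation. The sketch below is only an illustration under made-up assumptions: the three plane sections, the per-hit loss probabilities, and the number of hits per plane are invented, not historical figures. It counts hits per section across all planes and across returning planes only.

```python
import random

random.seed(0)

# Assumed, invented probabilities that a single hit in a section downs the plane.
LOSS_PROBABILITY = {"engine": 0.6, "fuselage": 0.1, "wings": 0.05}
SECTIONS = list(LOSS_PROBABILITY)


def simulate(n_planes=10_000, hits_per_plane=3):
    """Count hits per section for all planes and for returning planes only."""
    all_hits = {section: 0 for section in SECTIONS}
    returned_hits = {section: 0 for section in SECTIONS}
    for _ in range(n_planes):
        # Hits land uniformly at random across the sections.
        hits = [random.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if every hit fails to bring it down.
        survived = all(random.random() > LOSS_PROBABILITY[s] for s in hits)
        for section in hits:
            all_hits[section] += 1
            if survived:
                returned_hits[section] += 1
    return all_hits, returned_hits


all_hits, returned_hits = simulate()
print("hits on all planes:      ", all_hits)
print("hits on returning planes:", returned_hits)
```

Although hits are spread evenly in the full population, the returning sample shows far fewer engine hits, so an analysis restricted to returning planes would point at the wrong sections.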
Ethics

Ethical considerations are also a very important area which should be reported on.
It is important to critically think about this area and assess the impact of the system's capabilities.
Ethical considerations need to be a part of the product design and planning process.
Data science is an emerging discipline.
It will most likely evolve over time.
Follow interesting problems, people and technologies into the future of what data science will become.
The data science life cycle is generally comprised of the following components:

- data retrieval
- data cleaning
- data exploration and visualization
- statistical or predictive modeling

While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow.
Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in either a Jupyter Notebook or a single messy script.
In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical/predictive modeling.
But there’s a better way! In this post, I’ll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.
Prototype mindset

In the prototype mindset, we’re interested in quickly iterating and trying to understand some properties and truths about the data.
Create a new Jupyter notebook and add a Markdown cell that explains:

- Any research you did on Lending Club to better understand the platform
- Any information on the data set you downloaded

In general, the code in the prototyping mindset should focus on:

- Understandability
  - Markdown cells to describe our observations and assumptions
  - Small pieces of code for the actual logic
  - Lots of visualizations and counts
- Minimal abstractions
  - Most code shouldn’t be in functions (should feel more object-oriented)

Let’s say we spent another hour exploring the data and writing markdown cells that describe the data cleaning we did.
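A prototype-style notebook might contain cells like the sketch below: small, function-free snippets that each answer one question about the data. The `loans` table is a tiny stand-in built by hand, and its column names are assumptions for illustration; in a real notebook the data would be loaded from the downloaded file instead.

```python
import pandas as pd

# Tiny hand-built stand-in for the downloaded data (hypothetical columns).
loans = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000],
    "grade": ["A", "B", "B", None],
    "desc": [None, None, "debt consolidation", None],
})

# Cell 1: how big is the data, and which types did pandas infer?
print(loans.shape)
print(loans.dtypes)

# Cell 2: what share of each column is missing?
missing_share = loans.isna().mean().sort_values(ascending=False)
print(missing_share)

# Cell 3: counts for a categorical column, keeping missing values visible.
grade_counts = loans["grade"].value_counts(dropna=False)
print(grade_counts)
```

In the notebook, each of these would sit next to a Markdown cell recording what was observed and what was assumed.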
We can then switch over to the production mindset and make the code more robust.
Production mindset

In the production mindset, we want to focus on writing code that will generalize to more situations.
In general, the production mindset should focus on:

- Healthy abstractions
  - Code should generalize to be compatible with similar data sources
  - Code shouldn’t be so general that it becomes cumbersome to understand
- Pipeline stability
  - Reliability should match how frequently it’s run (daily? weekly? monthly?)

Switching between mindsets

Let’s say we tried to run the function for all of the data sets from Lending Club and Python returned errors.
Some potential sources for errors:

- Variance in column names in some of the files
- Variance in columns being dropped because of the 50% missing value threshold
- Different column types based on pandas type inference for that file

In those cases, we should actually switch back to our prototype notebook and investigate further.
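One way to investigate such differences back in the prototype notebook is to line up the inferred schema of each file side by side. The sketch below is a minimal example under assumptions: `frames` holds two hand-built tables standing in for two files, and `compare_schemas` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hand-built stand-ins for two downloaded files (hypothetical names and columns).
frames = {
    "file_a": pd.DataFrame({"loan_amnt": [1000, 2000], "int_rate": ["10.5%", "7.2%"]}),
    "file_b": pd.DataFrame({"loan_amount": [1500.0, None], "int_rate": [9.1, 8.4]}),
}


def compare_schemas(frames):
    """Build a table of inferred dtype per column (rows) and per file (columns).

    A missing value marks a column that does not exist in that file.
    """
    return pd.DataFrame({name: df.dtypes.astype(str) for name, df in frames.items()})


schema = compare_schemas(frames)
print(schema)

# Columns that are absent from at least one file (naming variance).
missing_somewhere = schema[schema.isna().any(axis=1)].index.tolist()

# Columns present in every file but inferred with different types.
present_everywhere = schema.dropna()
type_mismatch = present_everywhere[present_everywhere.nunique(axis=1) > 1].index.tolist()

print(missing_somewhere)
print(type_mismatch)
```

Here the renamed column shows up as absent from one file each way, and the interest rate column is flagged because one file stores it as text and the other as a number.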
When we’ve determined that we want our pipeline to be more flexible and account for specific variations in the data, we can re-incorporate that back into the pipeline logic.
Here are a few ways to make the pipeline more flexible, in decreasing priority:

- Use optional, positional, and required arguments
- Use if / then statements along with Boolean input values within the functions
- Use new data structures (dictionaries, lists, etc.) to represent custom actions for specific datasets

This pipeline can scale to all phases of the data science workflow.
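The three techniques listed above could be combined as in the following sketch. Everything named here is hypothetical: the `clean` function, the `COLUMN_RENAMES` dictionary, and the column names are illustrations, and reading the 50% threshold as "drop columns with more than half their values missing" is an assumption.

```python
import pandas as pd

# Data structure holding custom actions per dataset (hypothetical names).
COLUMN_RENAMES = {
    "file_b": {"loan_amount": "loan_amnt"},
}


def clean(df, dataset_name=None, missing_threshold=0.5, drop_sparse_columns=True):
    """Return a cleaned copy of df.

    df is required; the remaining arguments are optional with defaults.
    """
    # Look up dataset-specific renames; unknown datasets get no renames.
    renames = COLUMN_RENAMES.get(dataset_name, {})
    df = df.rename(columns=renames)

    # Boolean input value controlling an if / then branch.
    if drop_sparse_columns:
        # Keep columns whose missing share does not exceed the threshold.
        df = df.loc[:, df.isna().mean() <= missing_threshold]
    return df


raw = pd.DataFrame({
    "loan_amount": [1000, 2000, 3000, 4000],
    "desc": [None, None, None, "car repair"],
})

cleaned = clean(raw, dataset_name="file_b")
untouched = clean(raw, drop_sparse_columns=False)
print(list(cleaned.columns))
print(list(untouched.columns))
```

Keeping the per-dataset differences in a dictionary rather than in extra branches means a new file can often be supported by adding an entry instead of changing the function body.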