- How does the data fare in terms of the six dimensions of data quality?
- What level of cleaning is required?
- What do the various fields mean?
- Are there areas in which bias could be an issue?

Our Take on the Six Dimensions of Data Quality

Understanding the aspects of your data, such as its overall size, can help you decide how to go about your analyses: for smaller data you may wish to do all of your analyses in memory, using tools like Python, Jupyter and Pandas, or R; for larger data you may be better off moving it into an indexed SQL database (for larger data still, Hadoop and/or Apache Spark become options).
What is also particularly fun about this stage is that, if you have a clear line of sight to your goal, then as you gain a better understanding of the data you can determine which aspects of it are most important for the analyses; these are the areas to which most of your effort can be directed first.
This is especially helpful in projects where there are strict time constraints.
Some useful tools/methodologies for the ‘understand’ stage:

- Workshops/brainstorming sessions
- Python
- Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualizations)
- NumPy and Pandas (Python libraries)
- Matplotlib and Seaborn (Python visualization libraries that can help with the viewing of missing data)
- R (a programming language geared towards statistics)

Using Python to Visualize Missing Data with a Heatmap (yellow is indicative of missing data)

Code Snippet for Heatmap (‘df’ stands for ‘DataFrame’, a Pandas data structure)

(The above heatmap was generated with random data)

Process

This ‘process’ stage is all about getting your data into a state that is ready for analysis.
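The heatmap snippet itself appeared as an image in the original article; a minimal sketch of the idea, using Seaborn’s `heatmap` on a randomly generated DataFrame (the column names, sizes and missing-value fraction here are our own assumptions), could look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Build a random DataFrame and knock out roughly 10% of its values
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 8)), columns=list("ABCDEFGH"))
df = df.mask(rng.random(df.shape) < 0.1)

# Plot the boolean missing-data mask; with the 'viridis' colormap,
# missing cells (True) show up as yellow
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.savefig("missing_data_heatmap.png")
```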
The words ‘cleaning’, ‘wrangling’ and ‘munging’ come to mind.
A useful principle to raise here is the Pareto Principle, or ‘80/20 Rule’:

“for many events, roughly 80% of the effects come from 20% of the causes” — Vilfredo Pareto

The Pareto Principle or 80/20 Rule

The ‘process’ stage can often take up the most time; in light of the Pareto Principle, it is important to prioritize which aspects of the data you devote most of your time to. Focus on what you think is most important first, and come back to secondary fields only if necessary and if there is time to do so.
During this stage, we may do any or all of the following:

- Combine all data into a single, indexed database (we use PostgreSQL)
- Identify and remove data that is of no relevance to the defined project goal
- Identify and remove duplicates
- Ensure that important data is consistent in terms of format (dates, times, locations)
- Drop data that is clearly not in line with reality: outliers that are unlikely to be real data
- Fix structural errors (typos, inconsistent capitalization)
- Handle missing data (NaNs and nulls, either by dropping or interpolating, depending on the scenario)

The purpose of this stage is really to make your life easier during the analysis stage; processing data usually takes a long time and can be relatively tedious work, but the results are well worth the effort.
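Several of the steps above can be sketched in a few lines of Pandas; the table, field names and values below are entirely made up for illustration:

```python
import numpy as np
import pandas as pd

# A tiny, made-up table exhibiting the problems listed above
df = pd.DataFrame({
    "city":   ["Cape Town", "cape town", "Cape Town", "Durban"],
    "date":   ["2019-01-01", "01/02/2019", "2019-01-01", "2019-01-03"],
    "temp_c": [25.1, 24.8, 25.1, np.nan],
})

df["city"] = df["city"].str.title()          # fix inconsistent capitalization
df["date"] = df["date"].map(pd.to_datetime)  # normalize mixed date formats (element-wise)
df = df.drop_duplicates()                    # remove exact duplicate rows
df["temp_c"] = df["temp_c"].interpolate()    # fill missing values by interpolation
```

Whether to drop or interpolate missing values depends on the scenario, as noted above; interpolation only makes sense where neighbouring rows are genuinely related.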
Some useful tools/methodologies for the ‘process’ stage:

- MySQL, SQLite or PostgreSQL
- Python
- NumPy and Pandas (Python libraries)
- Matplotlib and Seaborn (Python visualization libraries that can help with the viewing of missing data)
- NLTK (the Natural Language Toolkit, another Python library)

Analyze

This stage is concerned with the actual analysis of the data; it is the process of inspecting, exploring and modelling data to find patterns and relationships that were previously unknown.
In the data value chain, this stage (along with the previous stage) is where the most significant value is added to the data itself.
It is the transformative stage that changes the data into (potentially) usable information.
In this stage you may want to visualize your data quickly, attempting to identify specific relationships between different fields.
You may want to explore how certain fields differ by location, or how they change over time.
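In Pandas, this kind of exploration often starts with a simple group-by; a hypothetical example, with field names and values of our own invention:

```python
import pandas as pd

# Hypothetical records with a location field and a numeric field
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "sales":  [10, 7, 12, 6, 11],
})

# Compare summary statistics per location to see how the field differs by area;
# grouping by a date or period column works the same way for change over time
by_region = df.groupby("region")["sales"].agg(["mean", "std"])
print(by_region)
```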
Ideally, in the identify stage, you would have come up with several questions relating to what you would like to get out of this data, and perhaps have even stated several hypotheses — this is then the stage where you implement models to confirm or reject these hypotheses.
During this stage, we may do any or all of the following:

- If there is time-based data, explore whether there exist trends in certain fields over time, usually using time-based visualization software such as Superset or Grafana
- If there is location-based data, explore the relationships of certain fields by area, usually using mapping software such as Leaflet JS, and spatial querying (we use PostgreSQL with PostGIS)
- Explore correlations (r values) between different fields
- Classify text using natural language processing methods (such as the bag-of-words model)
- Implement various machine learning techniques in order to identify trends between multiple variables/fields; regression analyses can be useful
- If there are many variables/fields, use dimensionality reduction techniques (like Principal Component Analysis) to reduce these to a smaller subset of variables that retains most of the information
- Deep learning and neural networks have much potential, especially for much larger, structured datasets (though we have not yet made substantial use of these)

The analysis stage is really the stage where the rubber meets the road; it also illustrates the more sexy side of data science.
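Two of the bullets above, correlations and Principal Component Analysis, can be sketched with Pandas and scikit-learn. The data here is randomly generated and low-rank by construction, and every name in it is our own assumption:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fabricate ten numeric fields that secretly depend on just two latent factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
X = pd.DataFrame(
    latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.1, size=(200, 10)),
    columns=[f"field_{i}" for i in range(10)],
)

# Pairwise correlations (r values) between fields
corr = X.corr()

# PCA: standardize the fields, then keep the two components
# that explain the most variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(StandardScaler().fit_transform(X))
print(reduced.shape)  # (200, 2)
```

Because the data was built from two latent factors, two components retain most of the information, which is exactly the situation in which PCA pays off.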
Visualizing the Distribution of Two Variables Using Seaborn’s Jointplot

Code Snippet for Jointplot

Some useful tools/methodologies for the ‘analyze’ stage (note that we are leaving the visualization tools for the last section):

- MySQL, SQLite or PostgreSQL (for querying, including spatial querying; for SQLite, see SpatiaLite)
- JetBrains DataGrip (database IDE)
- Datasette (a tool for exploring and publishing data)
- Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualizations)
- SciPy (Python library for advanced calculations)
- NumPy & Pandas (Python data analysis/manipulation libraries)
- Scikit-Learn (Python machine learning library)
- TensorFlow (Python machine learning library generally used for deep learning and neural networks)
- Keras (Python library for fast experimentation with neural networks)

Conclude

This stage is concerned with drawing solid, valuable conclusions from the results of the analysis phase.
This is the phase in which you can formulate clear answers to your questions; it is the phase in which you can either prove or disprove your hypotheses.
It is also the stage in which you can use your conclusions to generate actionable items that aid in the pursuit of the goal (if appropriate).
We usually aim to create a list of conclusions or ‘findings’ that have come out of the analyses and a subsequent list of recommended actions based on these findings.
The actions should be listed with your target audience in mind: they want to know succinctly what was found and what they can do with/about it.
In this phase we may do any or all of the following:

- Cross-check findings with the original questions (‘identify’ phase) and determine what we have answered
- Reject or accept the various hypotheses from the ‘identify’ phase
- Prioritize conclusions/findings: which are most important to communicate to stakeholders, and which are of most significance?
- Attempt to weave conclusions together into some form of story
- Identify follow-up questions
- Identify high-priority areas in which action will yield the most valuable results (Pareto Principle)
- Develop recommendations and/or actions based on conclusions (especially in high-priority areas)

Some useful tools/methodologies for the ‘conclude’ stage:

- Workshops/brainstorming sessions
- Microsoft Office (Excel, PowerPoint, Word)
- The Pareto Principle (80/20 rule)

Communicate

Arguably the most important step in the Data Scientific Method is the ‘communicate’ phase; this is the phase in which you ensure that your client/audience/stakeholders understand the conclusions that you have drawn from their data.
These conclusions should also be presented in such a way that your audience can act on them; if you do not recommend specific actions, present the conclusions so that they stimulate ideas for action.
This is the phase in which you package your findings and conclusions in beautiful, easy-to-understand visualizations, presentations, reports and/or applications.
A Geographic Visualization Using Apache Superset

In this phase we may do any or all of the following:

- If there is time-based data, create sexy time-series visualizations using packages like Grafana or Superset
- If there is spatial data, create sexy map visualizations using packages like Leaflet JS, Plotly or Superset
- Create statistical plots using D3.js, Matplotlib or Seaborn
- Embed various visualizations into dashboards, and ensure these are shareable/portable (whether hosted or built as an application); Superset is a great way to do this within an organization
- Develop interactive visualizations using D3.js or Plotly
- Develop interactive applications or SPAs (Single Page Applications) using web technologies such as Angular or Vue.js
Really though, to realize any benefits of this data, something should be done with the information obtained from it!

Like the Scientific Method, ours is an iterative process, which should incorporate action… So, to modify our diagram slightly:

The Data Scientific Method with Feedback and Action Loop

Oftentimes we may also go through the six stages, and there won’t be time for action before we iterate once more.
We may communicate findings that immediately incite further questions — and we may then dive right into another cycle.
Over the long term, though, action is essential for making the entire exercise a valuable one.
In our organization, each new data science project consists of several of these cycles.
Communication of results often sparks new discussions and opens up new questions and avenues for exploration; and if our conclusions result in actions that yield favorable results? We know we are doing something right.
“Without data, you’re just another person with an opinion.” — W. Edwards Deming

For the original version of this article, click here.
Thanks to Jaco Du Plessis for putting together the original steps for the Data Scientific Method (View on his GitHub account).