For example, an error report in the logs that only becomes accessible five hours after the error happened is not acceptable.
Integrity: This concept speaks to well-designed data and systems.
As an example, in a relational database, it means no orphaned records and no missing links between semantically related data.
Accuracy: This means that what we store should be accurate enough to reflect real-life values.
If we store the wrong birthday for a person, there’s an accuracy problem.
Standardization: This dimension is still subject to discussion, but in my opinion, decisions taken during database design should be consistent and follow standards (either using common standards or your own variations).
In relational models, the normal forms are one such standard.
Oftentimes, you end up consciously denormalizing data (to increase performance, for example), but that decision should rest on objective arguments that can be incorporated into your own standard.
Other details, like date formats, matter here as well.
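To make the integrity dimension concrete, here is a minimal sketch of an orphaned-record check using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns, and the sample rows are all hypothetical; the same LEFT JOIN idea should carry over to most relational databases.

```python
import sqlite3

# Hypothetical schema: every order is supposed to point at an existing customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Luis');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
""")


def find_orphaned_orders(connection):
    """Return the ids of orders whose customer_id matches no customer row."""
    rows = connection.execute("""
        SELECT o.id
        FROM orders AS o
        LEFT JOIN customers AS c ON c.id = o.customer_id
        WHERE c.id IS NULL
        ORDER BY o.id
    """).fetchall()
    return [row[0] for row in rows]


orphans = find_orphaned_orders(conn)
print(orphans)  # order 12 references customer 99, which does not exist
```

Declaring a foreign key constraint would prevent new orphans from appearing, but a query like this is still useful for auditing data that already exists.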
How do we measure these dimensions?

Unfortunately, there’s no single answer or silver bullet to this question.
My recommendation is usually to create an algorithm based on the reality of the project.
The algorithm can take inputs and give a score to each dimension, for example, between 0 and 1.
There are multiple ways to process large chunks of data and calculate these scores, or you may even choose a representative part of the database to process.
Heuristics and common sense are extremely valuable here.
Doing it programmatically ensures that you can run the same tests over and over again and compare the results.
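As a rough illustration of such a scoring algorithm, the sketch below gives two of the dimensions a score between 0 and 1 over a small sample. The records, the field names, and the heuristics themselves (ISO dates as the standard, "birthday is not in the future" as a proxy for accuracy) are assumptions made for the example; a real project would choose rules that reflect its own reality.

```python
from datetime import date, datetime

# Hypothetical sample; in practice this could be a representative slice of a table.
records = [
    {"id": 1, "birthday": "1990-04-12"},
    {"id": 2, "birthday": "12/04/1990"},   # not in the agreed format
    {"id": 3, "birthday": "2190-01-01"},   # well formed, but in the future
    {"id": 4, "birthday": "1985-11-30"},
]


def parse_iso_date(value):
    """Return a date if value parses as year-month-day, otherwise None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def is_plausible_birthday(value, today):
    """Heuristic accuracy check: the date parses and is not in the future."""
    parsed = parse_iso_date(value)
    return parsed is not None and parsed <= today


def ratio(rows, predicate):
    """Share of rows that pass the predicate, as a score between 0 and 1."""
    if not rows:
        return 1.0  # assumption: an empty sample has nothing wrong with it
    return sum(1 for row in rows if predicate(row)) / len(rows)


def score_dimensions(rows, today):
    """Score each dimension covered by this sketch between 0 and 1."""
    return {
        "standardization": ratio(
            rows, lambda r: parse_iso_date(r["birthday"]) is not None
        ),
        "accuracy": ratio(
            rows, lambda r: is_plausible_birthday(r["birthday"], today)
        ),
    }


scores = score_dimensions(records, today=date(2024, 1, 1))
print(scores)
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps repeated runs comparable.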
Ideally, in a professional process, you should come up with an output like the following:

[Figure: Data Score Diagnosis]

After taking this crucial first step, we can start assigning a weight to each dimension.
Weighting is important because it makes data more relevant to our real-world situation: some data quality criteria may be important to achieving our business goals, while others may not matter much to us.
This analysis, done in conjunction with stakeholders, should help highlight the pathway to an informed decision.
It should relate to your priorities and clarify your most important next steps.
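One possible way to apply the weights, sketched below, is a weighted average for the overall score plus a ranking by weighted gap to suggest where to start. The score and weight values are made up for illustration, and the ranking rule is only one reasonable choice; the actual weights should come out of the conversation with stakeholders.

```python
def weighted_quality_score(scores, weights):
    """Combine per-dimension scores (0..1) into one weighted average (0..1)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def rank_priorities(scores, weights):
    """Order dimensions by weighted gap: (1 - score) * weight, largest first."""
    return sorted(
        scores,
        key=lambda name: (1 - scores[name]) * weights[name],
        reverse=True,
    )


# Hypothetical numbers, not measurements from a real project.
scores = {"timeliness": 0.9, "integrity": 0.6, "accuracy": 0.5, "standardization": 0.75}
weights = {"timeliness": 1, "integrity": 3, "accuracy": 4, "standardization": 2}

overall = weighted_quality_score(scores, weights)
priorities = rank_priorities(scores, weights)
print(round(overall, 2), priorities)
```

With these numbers, accuracy comes first: it has both the lowest score and the highest weight.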
What’s next?

After narrowing your focus, you can decide if you have a real problem.
If you do, you can work on each dimension one at a time, employing whatever techniques best suit the situation.
One example of this could be using Python’s pandas library to read and process data from either files or a relational database, apply transformations that improve one of the data quality dimensions, then push the cleaned data back to the original database.
Another approach could be trying to mitigate future data problems by tracking down the problem to an algorithm in your code, or even realizing that your database design itself needs adjustments.
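As a rough illustration of the first approach, here is a small read, transform, and write-back round trip with pandas and an in-memory SQLite database. The `people` table, the two date formats, and the `people_clean` target table are assumptions for the example; writing to a separate table is a cautious choice, since replacing a table through `to_sql` would drop its original constraints.

```python
import sqlite3

import pandas as pd

# Hypothetical source table with birthdays stored in mixed formats.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, birthday TEXT);
    INSERT INTO people VALUES
        (1, '1990-04-12'), (2, '30/11/1985'), (3, 'unknown');
""")


def standardize_birthdays(connection):
    """Read birthdays, normalize them to YYYY-MM-DD, write to people_clean."""
    df = pd.read_sql_query("SELECT id, birthday FROM people ORDER BY id", connection)

    # Try the target format first, then the assumed legacy day/month/year one.
    # Values matching neither format become missing instead of raising.
    iso = pd.to_datetime(df["birthday"], format="%Y-%m-%d", errors="coerce")
    legacy = pd.to_datetime(df["birthday"], format="%d/%m/%Y", errors="coerce")
    df["birthday"] = iso.fillna(legacy).dt.strftime("%Y-%m-%d")

    # Push the cleaned rows back to the same database, in a separate table.
    df.to_sql("people_clean", connection, if_exists="replace", index=False)
    return df


cleaned = standardize_birthdays(conn)
print(conn.execute("SELECT id, birthday FROM people_clean ORDER BY id").fetchall())
```

Unparseable values such as 'unknown' end up as missing values rather than guesses, so they stay visible in the next scoring run.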
We’ll explore some of these techniques in a follow-up article.
But as a conceptual model, this should serve as the basis for any systematized data quality efforts.
Ackoff, “From Data to Wisdom,” Journal of Applied Systems Analysis 16 (1989): 3–9.
Harlan Cleveland, “Information as Resource,” The Futurist, December 1982, 34–39.
Arkady Maydanchik, “Data Quality Assessment,” September 15, 2007.
Bernard Marr, “How much data do we create every day?” Forbes.
Jeff Desjardins, “How much data is generated each day?” WeForum.