A company that doesn’t know what it doesn’t know is doomed, through its own inertia and ignorance, to continue down a sub-optimal path toward becoming a poor performer or, worse, going out of business.
Yes, poor data quality can have such dire consequences, and a well-considered data architecture will help you avoid them.
What is Data Architecture?

“Data architecture is where the rubber meets the sky.”
– Neil Snodgrass, Data Architecture Consultant, The Hackett Group

Even among IT practitioners, there is a general misunderstanding (or perhaps more accurately, a lack of understanding) of what Data Architecture is, and what it provides.
In general, Data Architecture is a master plan of the enterprise data locations, data flows, and data availability.
It is a conceptual infrastructure to support data quality, data stewardship, data integration, data migration, and system collaboration.
This infrastructure embodies a set of guidelines and standards which ensure that the data assets are managed appropriately, and that they conform to sanctioned principles for stewardship and quality.
Data Architecture is the discipline of designing, creating, and maintaining this infrastructure.
It must accommodate the data and information needs of the company and do so in a manner which promotes high reliability and easy data integration among applications and data repositories.
The most visible and tangible product of effective Data Architecture is a reporting environment that:
1. Provides a single version of the corporate “truth”
2. Allows business analysts to discover new insights, and
3. Allows business executives and corporate decision makers to derive corporate strategies and actionable tactics from their data.
Such a reporting environment usually entails one or more data warehouses, and one or more departmental or “competency” data marts.
The architecture describes how data flows from corporate transactions, through the various layers of transformation and integration, through operational data stores, all the way to the decision-support applications that query the data warehouse or some other data structure optimized for reporting and analytics.
It is an infrastructure that, when properly implemented (i.e., when it follows the architecture and conforms to the corporation’s suite of “best practices”), guarantees the three benefits of the reporting environment described above.
As the humorous quote at the beginning of this section indicates, Data Architecture often seems somewhat nebulous as there is no physical manifestation (like an executable program manifests programming code, or like a relational database manifests an entity relationship data model).
Data Architecture has no programmatic instantiation and exists only as standards, policies, and corporate “best practices.” It resides only in the artifacts (text documents and graphic diagrams) which describe it, and in the “tribal knowledge” of the enterprise.
The artifacts which describe it are the blueprint of the architecture, and serve a similar function for building reliable systems as a building architect’s blueprint serves for building a house.
A corporation’s Data Architecture is a mirror of the data and information generated and captured by the enterprise in order to do its business.
It describes the business rules and the concepts which are critical for the enterprise to operate efficiently.
It offers a “seal of approval” on the reliability of the data, and guarantees that corporate decision makers can make well-informed, fact-based decisions on policies and strategies.
It provides for a sanctioned plan for stewardship of the data assets of the corporation, and details how data gets created, how it moves through the enterprise, and how it gets consumed.
Indeed, Data Architecture influences everything in the enterprise which “touches” the data.
It motivates data policies, influences corporate goals, enables strategies for achieving those goals, and validates the tactics which implement those strategies.
It encompasses all systems and programs in which data originates, in which data is transformed and/or cleansed, and to which data is migrated, or with which data is integrated.
By standardizing data definitions, data formats, and the acceptable storage, integration, and usage of the data, the architecture prepares the environment for data management, and it is by invigorating these standards that the powerful benefits of the Data Architecture (high data quality and unquestionable data reliability) are enabled.
Also, by dictating how data gets integrated, migrated, cleansed, and transformed, Data Architecture provides a plug-and-play framework for data warehousing.
[Figure: A Typical Data Architecture Environment]

What are the artifacts and deliverables of Data Architecture?

Since Data Architecture is a conceptual and abstract discipline, it has no simple representation that one can point to and say, “That’s Data Architecture.” Data architecture serves and encompasses everything a company captures and maintains in the realm of data and information (see Figure 1).
Having such a broad scope and impact, and such a high level of abstraction, it requires some seasoned imagination to conceive and understand what it is all about.
The one artifact that comes closest to capturing the essence of Data Architecture is a high-level data-flow diagram (Figure 2).
But data flow is only one aspect of a complete architecture.
There must be rules about how data flows or migrates through the information systems, and there must be a crystal clear understanding throughout the IT realm of which subject areas and concepts are important to the company’s business model.
In addition, there must be enterprise-wide agreement as to the semantics of those concepts in all possible contexts (within the business model).
[Figure 2: Data Flow Diagram]

Since a fundamental goal of the architecture is to have absolutely unquestionable data quality and reliability, semantic clarity is the first step; but disciplined stewardship of the data, the concepts, and the business rules is the only way to move forward, past that first step, to achieve a robust and effective architecture.
In order to complete the picture, and implement the type of data environment which an ideal Data Architecture provides, there must be:
1. Inspired analysis and design of the overall architecture
2. Corporate sanction of the architecture’s goals
3. Enforced compliance with the architecture’s rules

Artifacts of Architecture

The following deliverables and artifacts of the Data Architecture are designed to ensure that these three principles are delivered to the information systems which are destined to utilize the architecture.
This is not a mandatory or an all-inclusive list.
It is simply a recommended methodology, and does not preempt a different approach utilizing other documents and principles to achieve the desired environment.
Business Concept Definitions

Having corporate-sanctioned definitions for the concepts which animate a company’s business model is the single most important element of Data Architecture.
None of the major benefits of the architecture will accrue without them.
Yet business concept definitions are often overlooked (or worse, purposely ignored) because (to many IT practitioners) it seems painfully like “documentation for documentation’s sake”.
Nothing within the realm of enterprise data could be further from the truth.
[Figure: Business Concept Document]

Semantic clarity is mandatory for getting the full utility and all of the collateral benefits of enterprise Data Architecture.
Unless all systems and programs agree on a single definition for each and every critical business concept, there cannot be any reliable data migration, data integration, data cleansing, or data warehousing.
Analysts and executives who query the data warehouse(s) would have little or no reason for confidence in the accuracy of the information which is presented to them.
Data Stewardship Agreements

Stewardship is a vital element of any Data Architecture.
Data stewards ensure the quality, accessibility, and protection of the data, and define the data standards (data definitions, concept definitions, data formats, and data domains).
They are the guardians and maintainers of the Data Architecture.
They ensure that there is a single data store of record (DSOR) for the vertical stripe of data which they are stewarding, and they prohibit non-conforming data silos from participating in the architecture.
Stewardship agreements are corporate documents that grant stewardship responsibilities to a person, initiative, or department, and need the advice and consent of the CIO or a CIO designate.
Stewards are typically positioned at a high or mid-level of corporate responsibility, e.g., Director or Manager.
Data Sharing Agreements

Data sharing agreements are corporate documents that describe the data, where it is located, who protects it, and who can access it.
Most data should be freely available throughout the enterprise.
But some sensitive data needs to be restricted.
The data sharing agreement, signed by all interested parties, describes who can access the restricted data, when it is available, and how the access is accomplished.
Even data that is not sensitive needs to be certified as “sharable.” Entities within the enterprise that want access to the DSOR for a concept need to be certified as conforming to the standards maintained for that concept (see Data Standards, below).
Data Usage Models (Stewardship Matrix)

Anyone who has been in Information Systems very long has heard of, and probably used, a diagram known as a CRUD matrix.
CRUD stands for (C)reate, (R)ead, (U)pdate, and (D)elete, and details the data usage for an application, a system, or an initiative.
The Data Usage Model (sometimes called a Stewardship Matrix) extends the old-fashioned CRUD matrix so that one can, at a glance, not only see how each application interacts with a given concept, but which application data store is the data store of record (DSOR) for each concept.
The system which has the DSOR for a concept inherits the stewardship responsibilities for that business concept, and is obliged to:
1. Get enterprise-wide agreement on a definition for that concept
2. Document all of the business rules that pertain to the concept
3. Determine who (which systems and employee types) can see and use that data (via the Data Sharing Agreements discussed above), and
4. Maintain the integrity of the concept (by setting enterprise-wide data definitions, data formats, and data domains for the concept).
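To make the idea of a Data Usage Model concrete, the following is a minimal Python sketch of a CRUD matrix extended with a DSOR flag. The system names, concept names, and the helper functions `dsor_for` and `dsor_problems` are hypothetical and chosen only for illustration; a real stewardship matrix would normally live in a document or a modeling tool, not in code.

```python
# Hypothetical usage matrix: concept -> system -> (CRUD letters, holds the DSOR?)
EXAMPLE_MATRIX = {
    "Customer": {
        "CRM": ("CRUD", True),        # CRM holds the data store of record
        "Billing": ("R", False),
        "Warehouse": ("R", False),
    },
    "Product": {
        "Catalog": ("CRUD", True),
        "Billing": ("RU", True),      # a second DSOR claim: a conflict to resolve
    },
}


def dsor_for(matrix, concept):
    """Return the single system holding the DSOR for a concept, or None."""
    owners = [system for system, (_, is_dsor) in matrix[concept].items() if is_dsor]
    return owners[0] if len(owners) == 1 else None


def dsor_problems(matrix):
    """List (concept, claimants) pairs that lack exactly one data store of record."""
    problems = []
    for concept, systems in matrix.items():
        owners = sorted(s for s, (_, is_dsor) in systems.items() if is_dsor)
        if len(owners) != 1:
            problems.append((concept, owners))
    return problems


if __name__ == "__main__":
    print(dsor_for(EXAMPLE_MATRIX, "Customer"))
    print(dsor_problems(EXAMPLE_MATRIX))
```

In this sketch, a concept with zero or several DSOR claimants is reported as a problem, which mirrors the requirement that each concept have a single data store of record.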
[Figure: Stewardship Matrix (Data Planning Model)]

Data Standards (Definition, Format, and Domain)

Data definitions are often captured in modeling tools like Erwin, and then propagated to the physical database in the form of comments on tables, columns, and relationships.
They quite frequently can come directly from the Business Concept Definition document (see above).
The DSOR for a concept contains the sanctioned definitions which relate to the concept and its attributes.
Similarly, the DSOR should be considered the sanctioned format for the data attributes for a concept, and for the valid domain values for that concept.
An important criterion in data sharing is that all parties who want to use the data define it in exactly the same way: in entity and attribute definitions, in format, and in domain values.
This is crucial to having certifiably correct reports, and a high level of certifiable data quality.
Where definitions, formats, or domains differ, it is hard to argue that both sides of the data sharing are, indeed, talking about the same concept.
Before a sharing agreement can be executed and sanctioned by the enterprise (with the signatures of the appropriate parties), one side must change to conform to the other, or both sides must change to a negotiated settlement that reconciles the differences.
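The conformance test described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AttributeStandard` class, the `conformance_gaps` function, and the "order status" example values are hypothetical and do not come from any particular tool or from the paper itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeStandard:
    """The three elements of a data standard for one attribute."""
    definition: str
    data_format: str      # e.g. a column type such as "CHAR(1)"
    domain: frozenset     # the set of valid values


def conformance_gaps(dsor, candidate):
    """Name each element where the candidate differs from the DSOR standard."""
    gaps = []
    if candidate.definition != dsor.definition:
        gaps.append("definition")
    if candidate.data_format != dsor.data_format:
        gaps.append("format")
    if candidate.domain != dsor.domain:
        gaps.append("domain")
    return gaps


# Hypothetical standard held by the DSOR for an "order status" attribute.
DSOR_ORDER_STATUS = AttributeStandard(
    definition="Current fulfilment state of a customer order",
    data_format="CHAR(1)",
    domain=frozenset({"O", "S", "C"}),
)

# A hypothetical consuming system that has added its own status code "X".
REPORTING_ORDER_STATUS = AttributeStandard(
    definition="Current fulfilment state of a customer order",
    data_format="CHAR(1)",
    domain=frozenset({"O", "S", "C", "X"}),
)

if __name__ == "__main__":
    print(conformance_gaps(DSOR_ORDER_STATUS, REPORTING_ORDER_STATUS))  # ['domain']
```

An empty result would indicate that the two sides agree on definition, format, and domain; any non-empty result names what has to be reconciled before a sharing agreement is signed.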
Data Flow Diagrams

Many in Information Systems think of data flow diagrams (DFDs) as being equivalent to Data Architecture, as being The Architecture.
DFDs are a vital tool for conveying the scope and boundaries of the architecture, but (as we hope we have demonstrated in this white paper) they are only a tool, and only one of many.
DFDs describe how data flows throughout the enterprise — from creation of the data, through various layers of refinement, cleansing, and transformation, to the consumption of the data on reports, executive dashboards, or display screens.
They are a key to documenting the overall architecture, and are a very useful starting place for the data mapping used by cleansing initiatives or for ETLs which load the data warehouse.
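One way to see why a DFD is a useful starting place for data mapping is to treat it as a directed graph and ask which systems feed a given report. The Python sketch below does this for a small, hypothetical flow; the node names and the functions `build_parents`, `upstream`, and `originating_systems` are assumptions made for illustration only.

```python
# Hypothetical data flows: each source feeds the listed targets.
FLOWS = {
    "order_entry": ["ods"],
    "billing": ["ods"],
    "ods": ["warehouse"],
    "warehouse": ["sales_mart"],
    "sales_mart": ["exec_dashboard"],
}


def build_parents(flows):
    """Invert the flow edges: target -> set of direct sources."""
    parents = {}
    for source, targets in flows.items():
        for target in targets:
            parents.setdefault(target, set()).add(source)
    return parents


def upstream(flows, node):
    """Return every node whose data eventually reaches the given node."""
    parents = build_parents(flows)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def originating_systems(flows, node):
    """Upstream nodes with no sources of their own, i.e. where the data is created."""
    parents = build_parents(flows)
    return {n for n in upstream(flows, node) if n not in parents}


if __name__ == "__main__":
    print(sorted(upstream(FLOWS, "exec_dashboard")))
    print(sorted(originating_systems(FLOWS, "exec_dashboard")))
```

The list of upstream nodes is, in effect, the set of hops an ETL or cleansing initiative would need to map for that report.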
Conceptual Models

Conceptual models are diagrams that summarize all of the critical and interesting concepts which are inherent in the business, and the relationships among them.
A very high-level conceptual model diagrammatically details only the subject areas (e.g., Finance, Human Resources, Products, etc.) of interest, and the relationships between subject areas and concepts.
This type of model is called, naturally, a Subject Area Model.
The next lower level of detail is captured by a concept model (sometimes called a data planning model) which depicts each interesting concept and the relationships among the concepts.
One method of portraying this model is with an un-attributed entity relationship (ER) model.
Indeed, most (if not all) of these business concepts will end up being fully-attributed entities in one or more logical models which support one or more transactional systems.
The relationships between concepts in this type of model conform naturally enough to the concept of relationships in ER modeling.
Another very effective technique for conceptual modeling is a formal modeling notation known as Object-Role Modeling (ORM).
Object-Role modeling was designed for this purpose, and allows useful insights into the concepts and relationships which might be overlooked using the traditional ER modeling notation.
ORM is sometimes eschewed as being too tedious, but this is due mostly to a lack of good graphical tools designed to support the technique.
[Figure: Conceptual Model]

Logical Models

If you have undertaken the discipline of creating conceptual models, you will find that the logical models evolve from the conceptual ones quite naturally.
The major concepts become entities, and many of the minor ones become attributes for those entities.
Physical Models

Physical models are dependent on the choice of DBMS used, and are in the domain of the DBAs.
Whereas the physical representation is definitely an artifact of the architecture, its main purpose is to document where each concept resides (which DBMS and which database), and how the concepts and entities had to be modified (if at all) in order to become columns in tables.
The physical residence of business concepts is an important piece of information for Data Sharing Agreements.
Data Warehouse Artifacts

Data warehouses have many artifacts and deliverables.
All of the artifacts and deliverables mentioned here for Data Architecture will be utilized in building a data warehouse.
Metadata Standards and Maintenance

Metadata is the sum of all of the corporate knowledge about the corporation’s business processes and the data that qualifies and quantifies it.
There are two types of metadata: technical and business.
Technical metadata is used by Information Technology practitioners to standardize, categorize, and define the data structures used to capture information in databases.
Technical metadata describes the physical properties of the data, how it relates to other data, and mappings between sources and destinations of data that is moving through the system(s).
It is invaluable for standardizing the data formats, definitions and domains across systems.
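As a small illustration of technical metadata, the Python sketch below records source-to-destination column mappings together with the transformation applied along the way, and then uses those mappings to convert one row. The column names, the `ColumnMapping` class, and the `apply_mappings` function are hypothetical; real mappings are usually maintained in an ETL or metadata tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ColumnMapping:
    """One piece of technical metadata: where a value comes from and where it goes."""
    source: str                        # hypothetical "system.table.column" name
    target: str
    transform: Callable[[str], str]    # cleansing or standardizing step
    rule: str                          # human-readable description of the step


MAPPINGS = [
    ColumnMapping("crm.customer.cust_nm", "dw.dim_customer.customer_name",
                  str.strip, "trim surrounding whitespace"),
    ColumnMapping("crm.customer.st_cd", "dw.dim_customer.state_code",
                  str.upper, "upper-case the state code"),
]


def apply_mappings(row, mappings):
    """Produce a destination row by applying each mapping to the source row."""
    return {m.target: m.transform(row[m.source]) for m in mappings}


if __name__ == "__main__":
    source_row = {"crm.customer.cust_nm": "  Acme Corp ", "crm.customer.st_cd": "tx"}
    print(apply_mappings(source_row, MAPPINGS))
```

Because each mapping carries a plain-language rule alongside the transformation, the same records can serve as documentation for the people who maintain the data flow.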
Business metadata is used to guide the system users (data consumers) through the data and the problems they are trying to solve with it.
It provides, on a fundamental level, basic description information for the data fields.
At a more robust level, it provides the foundation for understanding the content and source of the information.
The business metadata provides a conceptual context for the technical metadata, but it often goes undocumented and survives only as “tribal knowledge.” Accurately capturing and standardizing business metadata is always an important challenge for Data Architecture.
What can we expect from implementing Data Architecture?

At the very least, Data Architecture provides a high-level map of the data topology for an enterprise.
It describes how the data originates, where it resides, where it migrates, what transformations are applied to it to cleanse and standardize it, and what it means (the semantics).
This information, alone, is “worth its weight in gold” by allowing management as well as technicians to understand the data karma of the enterprise.
At its best, it goes way beyond this simple documentation, and becomes an active principle that lives within the data, energizing and leveraging it in a multitude of ways.
The data becomes an organic corporate asset that invigorates and motivates the enterprise, and provides a clear path to the realization of the corporate vision, goals, and strategies.
To someone who has never experienced a robust and inspired Data Architecture in action, this may sound a little like poetic license or hyperbole.
But it truly is not.
Metaphors aside, corporate personnel who discover the synergistic benefits of Data Architecture for the first time are often amazed that they ever functioned without it.
With a well-considered data architecture, data that once was suspect or needed “tweaking” in order to balance the books becomes as reliable as “Old Faithful.”
Analysts who once complained that the unreliability of the data made their analysis contrived and incomplete become ardent converts, often clamoring for more bandwidth to allow their heuristics to discover all of the exciting possibilities that are contained in their newly conformed data warehouse.
Data warehouse developers who previously spent many hours of overtime trying to shoe-horn data from legacy systems into the warehouse happily discover that ETLs and data maps become self-revealing, and the data warehouse is found to be the software equivalent of “plug-and-play.”
Executives who had struggled to find meaning in their daily, weekly and monthly reports now discover nuggets of information which inspire new visions, and blaze new trails to outsmart and outmaneuver the competition.
Because of guaranteed data reliability and the framework which enables death-defying data transformations, Data Architecture can have a positive impact on virtually every operational function, every department, and every profit center.
The artifacts describe how this should happen: Data Stewards enable semantic clarity and enforce the standards.
Data analysts and planners set the policies and discover the vision.
Program and project managers instantiate the ideals.
Data integrators become empowered to fold all data into a single vocabulary, whether they are dealing with existing disparate systems, new system development, or third-party packaged systems.
And everyone throughout the enterprise finds a new appreciation and respect for the data that pulses through the architecture’s veins.
Optimal Data Architectures are flexible and can be implemented in stages.
The key is to have a high-level plan which accounts for the goals and aspirations of the enterprise.
Once that is in place, the benefits of Data Architecture can be prioritized and implemented in a seamless, phased-in approach that accommodates the specific needs of any organization.
Quoted from an editorial comment by Jan Popkin in SDTimes magazine, May 15, 2002.

Larry P. English is a noted advocate and lecturer on data quality, and has written “The Bible” on the subject, Improving Data Warehouse and Business Information Quality, Wiley.
In this book Mr. English states that “Quality is free. It’s not a gift, but it is free. What costs money are the unquality things – all the actions that involve not doing jobs right the first time.” And, “Every penny you don’t spend on doing things wrong, over, instead, becomes half a penny right on the bottom line. If you concentrate on making quality certain you can probably increase your profit by an amount equal to 5 to 10 percent of your sales.”

A data warehouse can be built without Enterprise Data Architecture, but it is highly inadvisable.
Likewise, a data architecture can exist for an enterprise that is not doing any data warehousing, but it provides the optimal benefit to the corporation when it establishes the blueprint for integrating disparate enterprise data into a data warehouse.

At the time this paper was written, Ralph C. “Rusty” Alderson was a Senior Consultant with Third Coast Software Foundry, Austin, Texas, specializing in Data Architecture and data-related issues.
He is retired now.