Important Topics
- Business Definitions
- Metrics
- Calculation Base
- Aggregations
- Reporting
- Issue Tracking
Data Quality ensures that any decision taken is based on correct data, and that not only decisions but also any regulatory information delivered reflects reality.
Data Quality is represented either by Indicators (DQI) or Analyzers (DQA), each with a business definition describing all relevant details: the Business Terms it relates to, the Metrics measured, Performance Indicators, the Base of Calculation and, depending on the client's industry, market-specific details that ensure the correctness of the data.
When defining a data quality framework we take into account the enterprise architecture, the organizational levels targeted for improvement, and the technologies in use. Each of the topics listed above must be considered in every design.
The Business Definition is always the starting point of any framework. It represents the basis for business or technical specifications defining what we measure, where we measure it, and the criticality of the measurement. Since each industry has its own specifics, the definition must remain adaptable to enhancements that describe or organize the definition or its results.
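As an illustration only, the sketch below shows how such a business definition could be captured as a small structured record. The field names (business_terms, metric, calculation_base, criticality) are assumptions for the example, not part of any specific standard.

```python
from dataclasses import dataclass

# Hypothetical structure for a data quality business definition.
# Field names are illustrative assumptions, not a fixed standard.
@dataclass
class QualityDefinition:
    name: str                   # e.g. "Client email completeness"
    quality_type: str           # "DQI" or "DQA"
    subtype: str                # "Business" or "Technical"
    business_terms: list        # glossary terms the rule relates to
    metric: str                 # e.g. "completeness"
    calculation_base: str       # e.g. "Active Clients"
    criticality: str = "medium" # drives prioritization and reporting

email_completeness = QualityDefinition(
    name="Client email completeness",
    quality_type="DQI",
    subtype="Business",
    business_terms=["Client", "Contact Details"],
    metric="completeness",
    calculation_base="Active Clients",
    criticality="high",
)
```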
As part of standard practice, the Metric answers the question of what we measure. Usually expressed as completeness, consistency, uniqueness, timeliness, and so on, it must never be confused with the quality type (Indicator, Key Performance, Analyzer) or subtype (Business or Technical). This is why we take semantics into account, drawing not only on best-practice naming conventions but also on the client's internal organizational understanding. In the end, a framework is the basis for education and a common ground for discussions between various stakeholders.
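To make two of these metrics concrete, here is a minimal sketch of how they could be computed at record level, assuming a simple list of dictionaries as input; the function names are purely illustrative.

```python
def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records) if records else 1.0

def uniqueness(records, field):
    """Share of non-empty values that are not duplicated."""
    values = [r.get(field) for r in records if r.get(field) not in (None, "")]
    return len(set(values)) / len(values) if values else 1.0

clients = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "a@x.com"},
]
print(completeness(clients, "email"))  # 0.66... (one missing email)
print(uniqueness(clients, "email"))    # 0.5 (duplicate email value)
```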
Quality can be measured across the entire organization at various levels of data state. The Calculation Base answers the question: where do we measure the quality? It must be relevant both to the architectural location and to the definition of the data subset. We recommend that this base of calculation be supported by the model or the glossary of terms in order to ensure reusability and performance of the calculation. As an example, consider the following categories: Clients -- Active Clients -- Active Clients in Default / Active Clients with Exposure. There is a pattern: first you have all clients, then they are enriched with the Active/Inactive status, and later each client carries its exposure and/or Default status next to it.
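That pattern can be expressed as reusable, nested subsets, where each narrower base is derived from the broader one. The sketch below is a simplified illustration of the idea; the field names (status, exposure, in_default) are assumptions.

```python
def clients(records):
    """Broadest base: all clients."""
    return records

def active_clients(records):
    """Narrower base, derived from the broader one."""
    return [r for r in clients(records) if r.get("status") == "active"]

def active_clients_with_exposure(records):
    return [r for r in active_clients(records) if r.get("exposure", 0) > 0]

def active_clients_in_default(records):
    return [r for r in active_clients(records) if r.get("in_default")]

portfolio = [
    {"id": 1, "status": "active", "exposure": 1200, "in_default": False},
    {"id": 2, "status": "inactive", "exposure": 0, "in_default": False},
    {"id": 3, "status": "active", "exposure": 300, "in_default": True},
]
print(len(active_clients(portfolio)))             # 2
print(len(active_clients_in_default(portfolio)))  # 1
```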
Now, suppose we have defined and measured data quality and must read the results. This is often where a good framework makes the difference from a bad one. Knowing what we measure, we have to define how to aggregate it. A record can be erroneous for multiple reasons, or a definition may be measured over multiple bases, so we must always be careful how we aggregate and what granularity we take into account: is it the number of errors or the number of erroneous records?
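The difference between the two granularities can be made explicit in code. The sketch below assumes each check returns the ids of the records that failed it; the check names are invented for the example.

```python
# Hypothetical results: each check maps to the ids of records that failed it.
failed_ids_by_check = {
    "email_completeness": {1, 2},
    "country_code_validity": {2, 3},
}

# Granularity 1: number of errors (a record failing two checks counts twice).
number_of_errors = sum(len(ids) for ids in failed_ids_by_check.values())

# Granularity 2: number of erroneous records (the same record counts once).
erroneous_records = set().union(*failed_ids_by_check.values())

print(number_of_errors)        # 4
print(len(erroneous_records))  # 3 (record 2 failed both checks)
```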
Another important aspect to take into account in the framework definition is the ROI: the record with issues. These records and the aggregation of the results (as described previously) are the basis for reporting, either through dashboards expressing the overall quality status or through drill-downs to specific records. It is important that the framework also supports the reporting phase, so that information is always displayed in the correct context.
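As a minimal sketch, one dashboard row could combine the aggregated score with the keys needed for drill-down to record level; the shape of this report row is an assumption for illustration.

```python
def report_row(definition_name, base_size, failed_ids):
    """One dashboard row: overall status plus the keys needed for drill-down."""
    error_rate = len(failed_ids) / base_size if base_size else 0.0
    return {
        "definition": definition_name,
        "quality_score": round(1 - error_rate, 4),
        "erroneous_records": len(failed_ids),
        "drill_down_ids": sorted(failed_ids),  # links to the record-level view
    }

print(report_row("Client email completeness", 1000, {17, 42, 256}))
# {'definition': 'Client email completeness', 'quality_score': 0.997,
#  'erroneous_records': 3, 'drill_down_ids': [17, 42, 256]}
```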
Often the results are linked to individual issues raised through ticketing systems, and the Data Quality Framework needs to support both the integration and the life-cycle tracking of issues or ROIs. It is important to understand when an issue occurred, whether it is new or has been known for some time, and whether it has propagated across layers and to downstream consumers.
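The life-cycle aspect can be sketched as a simple comparison of today's ROIs against the issues already raised, assuming each issue is keyed by definition and record id; the ticketing integration itself is out of scope here and the structures are illustrative.

```python
from datetime import date

# Hypothetical open issues already raised in the ticketing system,
# keyed by (definition, record_id).
open_issues = {
    ("email_completeness", 2): {"first_seen": date(2023, 1, 10), "status": "known"},
}

def classify(definition, failed_ids, today):
    """Split today's ROIs into new issues and already-known ones."""
    new, known = [], []
    for record_id in failed_ids:
        key = (definition, record_id)
        if key in open_issues:
            known.append(key)
        else:
            open_issues[key] = {"first_seen": today, "status": "new"}
            new.append(key)
    return new, known

new, known = classify("email_completeness", {2, 7}, date.today())
print(new)    # [('email_completeness', 7)]
print(known)  # [('email_completeness', 2)]
```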
Performance - One of the greatest challenges we faced across all projects was the performance of the calculation itself. Most of the time, if the architecture and the client's infrastructure allow it, we can make use of the available database resources by balancing them wisely and optimizing schedulers, but the biggest performance win comes from the reusability of assets and the separation between ROIs and Aggregations.
Reusability - As a performance prerequisite, well-defined and governed assets can be reused for multiple actions at once. Whether it is a Base of Calculation that supports multiple calculations at once or a definition that covers a subset of requirements, reuse ensures wise and effective usage of the available resources.
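One way to realize this in practice is to compute a Base of Calculation once and feed it to several checks, as in the sketch below; the caching approach and the stubbed data are illustrations, not a prescription.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def active_clients_base():
    """Computed once, then reused by every check that needs this base."""
    # In a real setup this would query the model / database; here it is stubbed.
    return (
        {"id": 1, "email": "a@x.com", "country": "DE"},
        {"id": 2, "email": "", "country": ""},
    )

def check_email_completeness():
    base = active_clients_base()   # reused, not recomputed
    return [r["id"] for r in base if not r["email"]]

def check_country_completeness():
    base = active_clients_base()   # same cached base
    return [r["id"] for r in base if not r["country"]]

print(check_email_completeness())    # [2]
print(check_country_completeness())  # [2]
```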
Aggregation - Limiting the number of actions when calculating and identifying records with issues is important to ensure that the lowest level of granularity is reached. All aggregations can then be parallelized and prioritized in scheduling to ensure a good time-to-market for decision-making information or later reporting requirements.
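Once the record-level ROIs exist at the lowest granularity, the aggregations themselves become independent and can run in parallel. The sketch below uses Python's standard library only as an illustration of that idea; the data and base size are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record-level results, already computed at the lowest granularity.
roi_ids = {"email_completeness": {1, 2}, "country_validity": {2, 3, 5}}
base_size = 1000

def aggregate(item):
    """One independent aggregation: error rate per definition."""
    definition, failed = item
    return definition, len(failed) / base_size

# Each aggregation is independent, so it can be parallelized and scheduled
# by priority; threads are used here purely for illustration.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(aggregate, roi_ids.items()))

print(results)  # {'email_completeness': 0.002, 'country_validity': 0.003}
```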
By defining the Data Quality Framework from the beginning based on the principles mentioned above, we were able to give clients information about their current quality, trends showing quality improvements over time, decision-making information on whether to stop or proceed with processing or reporting, and even automation for fixing or improving the quality. Nevertheless, in our experience this kind of solution must be supported by other streams of a complex organization, such as Modelers and Architects, and it was always most effective when good Metadata Management sat next to it.