Data quality in the data warehouse

The quality of the data in the warehouse is an essential prerequisite for obtaining valuable information. Poor quality triggers a negative chain reaction in the long run.
First, trust in the information provided is lost. People begin to use the Business Intelligence applications less, and the potential of those applications goes unclaimed.
As a result, further investment in the analytical project is called into question.

Responsibility for data quality

Improving data quality is critically important in BI projects. However, it is not the responsibility of technical specialists alone.
Data quality is also affected by aspects such as:

Corporate culture

  • Are employees themselves interested in producing quality data?
  • If not, why not? There may be a conflict of interest.
  • Are there corporate rules that define who is responsible for quality?

Processes

  • What data is produced at the end of the process chains?
  • Are the operational (source) systems configured in such a way that users have to resort to workarounds to reflect a real-world situation?
  • Do the operational systems themselves perform data validation and reconciliation?

Everyone in the organization is responsible for the quality of data in reporting systems.

Definition and meaning

Quality is the proven satisfaction of customer expectations.

Data quality, however, has no absolute definition; it always reflects the context of use. The data warehouse and BI system serve different purposes than the operational systems from which the data is taken.

For example, in an operational system the customer attribute might not be a required field, while in the warehouse the same attribute is used as a dimension and is therefore mandatory. This, in turn, introduces the need to fill in default values.
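A minimal sketch of this pattern, using the DIM_PRODUCT table from the example further below (the staging table STG_PRODUCT and its column are hypothetical): the dimension gets a dedicated default member, and the load substitutes it whenever the source attribute is empty.

-- Dedicated default ("not available") member of the dimension
INSERT INTO DIM_PRODUCT (DIM_PRODUCT_ID, ATTRIBUTE)
VALUES (-1, 'n/a');

-- During the load, empty source values fall back to the default
SELECT NVL(PRODUCT_ATTRIBUTE, 'n/a') AS ATTRIBUTE
FROM STG_PRODUCT;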

Requirements for the data warehouse change constantly, and they are usually stricter than those for the operational systems. But it can also be the other way around, when the warehouse does not need to keep the level of detail stored in the operational system.

To make data quality measurable, its standards must be described. The people who use the information and figures in their work should be involved in describing them. The result of this involvement may be a rule that lets one tell at a glance at a table whether there is an error or not. This rule then needs to be written as a script/code for subsequent automated verification.
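For instance, a business rule such as "the attribute must be filled for every real product" can be captured as a simple check query (a sketch based on the DIM_PRODUCT table from the example below; the rule itself is illustrative):

-- Rule: ATTRIBUTE must be filled for every non-default product.
-- A non-zero count signals a violation of the rule.
SELECT COUNT(*) AS VIOLATIONS
FROM DIM_PRODUCT
WHERE DIM_PRODUCT_ID != -1
AND ATTRIBUTE IS NULL;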

Data Quality Improvement

It is not possible to clean up and fix every possible error while loading data into the warehouse. Good data quality can only be achieved through close cooperation of all participants. People who enter data into the operational systems need to learn which of their actions lead to errors.

Data quality is a process. Unfortunately, many organizations have no strategy for continuous improvement. Many limit themselves to simply storing data and do not use the full potential of their analytical systems. As a rule, 70-80% of a data warehouse budget goes into implementing data integration. The quality control and improvement process remains unfinished, if it is addressed at all.

Tools

Software tools can help automate the monitoring and improvement of data quality. For example, they can fully automate technical checks of warehouse structures: field formats, the presence of default values, and the compliance of table and column names with naming conventions.
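Many of these technical checks can be expressed directly against the database catalog. A sketch for Oracle (the DIM_ and _ID naming conventions here are assumptions, not rules from the article):

-- Technical check: surrogate key columns in dimension tables
-- are expected to be NUMBER and NOT NULL
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, NULLABLE
FROM USER_TAB_COLUMNS
WHERE TABLE_NAME LIKE 'DIM\_%' ESCAPE '\'
AND COLUMN_NAME LIKE '%\_ID' ESCAPE '\'
AND (DATA_TYPE != 'NUMBER' OR NULLABLE = 'Y');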

Checking the content is more difficult. As warehouse requirements change, so can the interpretation of the data. The tool itself can grow into a large project that needs its own support.

Tip

Relational databases, on which warehouses are usually built, have a wonderful feature: views. They can be used to quickly check data if you know the specifics of its content. Each discovered error or data problem can be captured as a database query.

In this way a knowledge base about the content is built up. Of course, such queries should be fast. As a rule, maintaining views takes less human time than tools built on top of tables, and a view is always ready to display the result of its check.
For important reports, the view can also contain a column with the recipient to be notified. It makes sense to use the same BI tools to report on the state of data quality in the warehouse.
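The individual check views can then be combined into a single quality report, for example with a UNION ALL over all checks (a sketch; V_QC_ALL and the second check view V_QC_DIM_PRODUCT_02 are hypothetical names built in the same way as the example below):

CREATE OR REPLACE VIEW V_QC_ALL AS
SELECT RESULT, DESCRIPTION, TABLE_NAME, OUTPUT, T_MIN, T_MAX, REPORT
FROM V_QC_DIM_PRODUCT_01
UNION ALL
SELECT RESULT, DESCRIPTION, TABLE_NAME, OUTPUT, T_MIN, T_MAX, REPORT
FROM V_QC_DIM_PRODUCT_02; -- hypothetical second check view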

Example

The view is written for an Oracle database. In this example, the tests return a numeric value that can be interpreted as needed. The T_MIN and T_MAX values can be used to adjust the alarm threshold. The REPORT field was once used as the message text in a commercial ETL product that could not send e-mails properly, so the rpad calls are a workaround.

For a large table you can add, for example, AND ROWNUM <= 10 to the test query: if ten errors are found, that is already enough to raise an alarm.

CREATE OR REPLACE VIEW V_QC_DIM_PRODUCT_01 AS
SELECT
  CASE WHEN OUTPUT>=T_MIN AND OUTPUT<=T_MAX
  THEN 'OK' ELSE 'ERROR' END AS RESULT,
  DESCRIPTION,
  TABLE_NAME, 
  OUTPUT, 
  T_MIN,
  T_MAX,
  rpad(DESCRIPTION,60,' ') || rpad(OUTPUT,8,' ') || rpad(T_MIN,8,' ') || rpad(T_MAX,8,' ') AS REPORT
FROM (-- Test itself
  SELECT
    'DIM_PRODUCT' AS TABLE_NAME,
    'Count of blanks' AS DESCRIPTION,
    COUNT(*) AS OUTPUT,
    0 AS T_MIN,
    10 AS T_MAX
  FROM DIM_PRODUCT
  WHERE DIM_PRODUCT_ID != -1 -- not default value
  AND ATTRIBUTE IS NULL ); -- count blanks
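To check the result, it is enough to query the view, for example (a usage sketch):

-- Show only failed checks; REPORT gives a fixed-width message line
SELECT REPORT
FROM V_QC_DIM_PRODUCT_01
WHERE RESULT = 'ERROR';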

This article draws on material from the book:
Ronald Bachmann, Guido Kemper. Raus aus der BI-Falle: Wie Business Intelligence zum Erfolg wird.


Source: habr.com
