DataGovernance on your own

Hey Habr!

Data is a company's most valuable asset. Almost every company with a digital focus makes this claim, and it's hard to argue with: not a single major IT conference goes by without discussion of approaches to managing, storing and processing data.

Data comes to us from outside, and it is also generated within the company. For a telecom company, that data is a treasure trove of information about the customer: their interests, habits, location. With proper profiling and segmentation, promotional offers hit the mark most effectively. In practice, however, things are not so rosy. The data a company stores may be hopelessly outdated, redundant, duplicated, or known to no one beyond a narrow circle of users. ¯\_(ツ)_/¯

In short, data must be managed effectively: only then does it become an asset that brings the business real benefit and profit. Unfortunately, quite a few difficulties must be overcome to get data management right. They stem both from historical legacy in the form of a “zoo” of systems and from the lack of unified processes and approaches to managing them. But what does it actually mean to "manage data"?

That's what we'll talk about under the cut, along with how an open-source stack helped us.

The concept of strategic data management, Data Governance (DG), is already well known on the Russian market, and the goals businesses achieve by implementing it are clear and plainly declared. Our company was no exception, and we set ourselves the task of implementing the concept of data management.

So where did we start? First, we formulated our key goals:

  1. Ensure the availability of our data.
  2. Ensure transparency of the data life cycle.
  3. Give company users consistent, non-contradictory data.
  4. Give company users verified data.

Today, the software market offers a dozen or so tools of the Data Governance class.

But after a detailed analysis and study of these solutions, we noted a number of critical concerns:

  • Most vendors offer a comprehensive suite of solutions that is redundant for us, duplicates existing functionality, and is expensive in terms of resources and integration into the current IT landscape.
  • The functionality and interfaces are designed for technical specialists, not end business users.
  • Low survival rate of the products and a lack of successful implementations on the Russian market.
  • High cost of the software and of its ongoing maintenance.

The criteria above, together with the software import-substitution recommendations for Russian companies, convinced us to move toward in-house development on an open-source stack. As the platform we chose Django, a free and open-source web framework written in Python. We then identified the key modules that would serve the goals stated above (a sketch of a possible project layout follows the list):

  1. Register of reports.
  2. Business glossary.
  3. Module for describing technical transformations.
  4. Module for describing the data life cycle from source to BI tool.
  5. Data quality control module.

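To give a feel for the shape of the solution, here is a minimal sketch of how this module split might look in a Django project; the app names are purely illustrative assumptions, not our actual codebase.

```python
# settings.py (fragment): one Django app per Data Governance module.
# App names are illustrative assumptions, not the production layout.
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    # Data Governance modules
    "report_registry",   # register of reports
    "glossary",          # business glossary
    "transformations",   # descriptions of technical transformations
    "lineage",           # data life cycle from source to BI
    "data_quality",      # data quality control
]
```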

Register of reports

According to internal research at large companies, employees solving data-related tasks spend 40-80% of their time just searching for the data. So we set ourselves the task of opening up information about existing reports that had previously been available only to the teams that ordered them. This reduces the time needed to build new reports and helps democratize data.

The register of reports has become a single reporting window for internal users from various regions, departments and divisions. It consolidates information about the information services built in the company's several corporate data warehouses, and at Rostelecom there are many of them.

But the registry is not just a dry list of developed reports. For each report we provide the information a user needs to explore it on their own (a model sketch follows the list):

  • brief description of the report;
  • depth of data availability;
  • customer segment;
  • visualization tool;
  • name of the corporate repository;
  • business functional requirements;
  • link to the report;
  • link to the access request form;
  • implementation status.
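
To make the report card concrete, here is a minimal sketch of how it could be modelled in Django. All model and field names below are our assumptions for illustration, not the production schema.

```python
# report_registry/models.py -- illustrative sketch of a report card
from django.db import models


class Report(models.Model):
    STATUS_CHOICES = [
        ("in_progress", "In development"),
        ("implemented", "Implemented"),
        ("archived", "Archived"),
    ]

    title = models.CharField(max_length=255)
    description = models.TextField(help_text="Brief description of the report")
    data_depth = models.CharField(max_length=100, help_text="Depth of data availability, e.g. '24 months'")
    customer_segment = models.CharField(max_length=100)
    visualization_tool = models.CharField(max_length=100)
    repository_name = models.CharField(max_length=100, help_text="Corporate warehouse the report is built on")
    business_requirements = models.TextField(blank=True)
    report_url = models.URLField()
    access_request_url = models.URLField(help_text="Link to the access request form")
    status = models.CharField(max_length=20, choices=STATUS_CHOICES)

    def __str__(self):
        return self.title
```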

Usage analytics is available for every report, and reports rise to the top of the list based on log analytics of unique user counts (a query sketch follows below). And that's not all: in addition to general characteristics, we provide a detailed description of each report's attribute composition, with example values and calculation methods. This level of detail immediately tells the user whether the report will be useful to them.
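
The popularity ranking itself reduces to a simple aggregation over access logs. The sketch below assumes a hypothetical AccessLog model with foreign keys to Report and to the user; it is not our actual implementation.

```python
# report_registry/services.py -- ranking reports by unique users.
# Assumes a hypothetical AccessLog model with FKs 'report' and 'user'.
from django.db.models import Count

from report_registry.models import Report


def top_reports():
    """Order reports by the number of distinct users seen in the access log."""
    return Report.objects.annotate(
        unique_users=Count("accesslog__user", distinct=True)
    ).order_by("-unique_users")
```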

Developing this module was an important step toward data democratization, and it significantly reduced the time needed to find the right information. Along with shorter search times, the number of requests to the support team for help has also dropped. One more useful result of building a single register of reports deserves a mention: it prevents different business units from developing duplicate reports.

Business glossary

You all know that even within a single company, business units speak different languages: they use the same terms but mean completely different things by them. A business glossary is designed to solve this problem.

For us, the business glossary is not just a reference book with term definitions and calculation methodology. It is a full-fledged environment for developing, coordinating and approving terminology, and for linking terms to the company's other information assets. Before a term enters the business glossary, it must pass through all the stages of coordination with business customers and the data quality center. Only then does it become available for use.

As I wrote above, what makes this tool unique is that it lets you link a business term both to the specific user reports in which it is used and down to the level of physical database objects.

This became possible by using glossary term identifiers both in the detailed descriptions of reports in the registry and in the descriptions of physical database objects (a model sketch follows).
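
A minimal sketch of how such a term might be modelled, with its approval workflow and links to reports and physical objects, is shown below; all names are illustrative assumptions that build on the Report sketch above.

```python
# glossary/models.py -- illustrative sketch of a business term
from django.db import models


class Term(models.Model):
    STATUS_CHOICES = [
        ("draft", "Draft"),
        ("business_review", "Coordination with business customers"),
        ("dq_review", "Review by the data quality center"),
        ("approved", "Approved and available for use"),
    ]

    name = models.CharField(max_length=255, unique=True)
    definition = models.TextField()
    calculation_method = models.TextField(blank=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="draft")

    # The term identifier is referenced both from report attribute
    # descriptions and from physical database object descriptions.
    reports = models.ManyToManyField("report_registry.Report", blank=True)
    physical_objects = models.ManyToManyField("lineage.PhysicalObject", blank=True)
```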

More than 4,000 terms have now been defined and agreed in the glossary. Using it simplifies and speeds up the handling of change requests for the company's information systems. If a required indicator has already been implemented in some report, the user immediately sees the set of ready-made reports where it is used and can decide to reuse the existing functionality, or minimally extend it, instead of initiating development of a new report.

Module for describing technical transformations and DataLineage

What are these modules, you ask? It is not enough to implement the report registry and the glossary; every business term also has to be landed on the physical database model. Doing so let us complete the picture of the data life cycle from source systems through all the warehouse layers to BI visualization. In other words, build DataLineage.

We developed an interface based on the format the company had previously used to describe data transformation rules and logic. The same information is entered through the interface as before, but specifying the term identifier from the business glossary has become mandatory. This is how we build the connection between the business and physical layers (a model sketch follows).
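
As a rough illustration, a transformation description record could look like the sketch below, where the glossary term reference is a mandatory field; the models are hypothetical and build on the sketches above.

```python
# transformations/models.py -- illustrative sketch of a transformation
# description record
from django.db import models


class TransformationRule(models.Model):
    target = models.ForeignKey(
        "lineage.PhysicalObject",
        on_delete=models.CASCADE,
        related_name="incoming_transformations",
    )
    sources = models.ManyToManyField(
        "lineage.PhysicalObject",
        related_name="outgoing_transformations",
    )
    logic = models.TextField(help_text="Transformation rules and logic")
    # The mandatory glossary term identifier is what ties the business
    # layer to the physical one.
    term = models.ForeignKey("glossary.Term", on_delete=models.PROTECT)
```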

Who needs this? What was wrong with the old format we had worked with for several years? How much more effort does writing requirements take now? These are the questions we had to field while rolling out the tool. The answers are simple: we all need it, both our company's data office and our users.

The employees did have to adjust, and at first this slightly increased the labor costs of preparing documentation, but we worked through it. Practice, plus identifying and optimizing problem areas, did their job. We achieved the main thing: the quality of the developed requirements improved. Mandatory fields, unified reference books, input masks, built-in checks - all of this significantly raised the quality of transformation descriptions. We moved away from the practice of handing over scripts as development requirements, sharing knowledge that had previously been available only to the development team. The resulting metadata base cuts regression analysis time several-fold and makes it possible to quickly assess the impact of changes on any layer of the IT landscape (reports, data marts, aggregates, sources).

And what about ordinary report users, what do they gain? Thanks to DataLineage, even users far removed from SQL and other programming languages can quickly see which sources and objects a given report is built on (a traversal sketch follows).
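
In metadata terms, building a DataLineage boils down to walking transformation records that map target objects to their sources. A minimal sketch, assuming the hypothetical TransformationRule model above:

```python
# lineage/models.py -- a physical warehouse object (table, view, data mart)
from django.db import models


class PhysicalObject(models.Model):
    schema_name = models.CharField(max_length=100)
    object_name = models.CharField(max_length=255)
    layer = models.CharField(max_length=50)  # e.g. source / staging / core / mart


# lineage/services.py -- upstream lineage by recursively walking
# transformation metadata (see the TransformationRule sketch above)
def upstream_lineage(obj, seen=None):
    """Collect every physical object the given object depends on,
    down to the source systems."""
    if seen is None:
        seen = set()
    for rule in obj.incoming_transformations.all():
        for source in rule.sources.all():
            if source not in seen:
                seen.add(source)
                upstream_lineage(source, seen)
    return seen
```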

Data quality control module

Everything we said above about data transparency matters little without confidence that the data we give users is correct. That is why one of the key modules of our Data Governance concept is the data quality control module.

At the current stage, this is a catalog of checks on selected entities; a sketch of such a catalog appears after the list below. The immediate development goals are to expand the list of checks and to integrate with the register of reports.

What will this give, and to whom? The end user of the registry will see planned and actual report readiness dates, the results of completed checks together with their dynamics, and information about the sources loaded into the report.

For us, the data quality module integrated into workflows means:

  • Quickly setting customer expectations.
  • Making decisions about the further use of data.
  • Obtaining a preliminary set of problem areas early in the work, for developing regular quality checks.
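
As a sketch of what such a catalog of checks might look like in the same Django stack (check types, names and fields below are illustrative assumptions):

```python
# data_quality/models.py -- illustrative sketch of a catalog of checks
from django.db import models


class QualityCheck(models.Model):
    CHECK_TYPES = [
        ("not_null", "Mandatory field is filled"),
        ("uniqueness", "Key uniqueness"),
        ("freshness", "Data loaded on schedule"),
        ("range", "Value within an allowed range"),
    ]

    entity = models.CharField(max_length=255, help_text="Table or data mart under control")
    check_type = models.CharField(max_length=20, choices=CHECK_TYPES)
    condition = models.TextField(help_text="Condition every row must satisfy")


class CheckRun(models.Model):
    quality_check = models.ForeignKey(QualityCheck, on_delete=models.CASCADE)
    run_at = models.DateTimeField(auto_now_add=True)
    passed = models.BooleanField()
    failed_rows = models.IntegerField(default=0)
```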

Of course, these are only the first steps in building a full-fledged data management process. But we are confident that only by pursuing this work purposefully and actively embedding Data Governance tools into workflows will we give our customers rich information, a high level of trust in data, transparency in how it is produced, and faster delivery of new functionality.

DataOffice team
