Big and small data tester: trends, theory, my story

Hello everyone, my name is Alexander, and I'm a Data Quality Engineer: someone who checks data for quality. This article is about how I came to this field and why, in 2020, this area of testing was riding the crest of a wave.

Global trend

Today's world is going through another technological revolution, one aspect of which is that companies of every kind use their accumulated data to spin up their own flywheel of sales, profit and PR. It seems that having good (quality) data, along with the skilled brains that can turn it into money (process it correctly, visualize it, build machine learning models on it, and so on), has become the key to success for many. If 15-20 years ago mostly large companies were busy accumulating data and monetizing it, today this has become the business of practically everyone.

Because of this, a few years ago job search portals around the world began overflowing with Data Scientist vacancies: everyone was sure that by hiring such a specialist they could build a super machine learning model, predict the future and make a "quantum leap" for the company. Over time, people realized that this approach almost never works, because far from all the data that ends up in the hands of such specialists is suitable for training models.

Then the requests from Data Scientists began: "Let's buy more data from these and those…", "We don't have enough data…", "We need more data, preferably high-quality data…". Out of these requests, numerous exchanges grew between companies that own one or another data set. Naturally, this required technical plumbing: connect to a data source, download the data, check that it was loaded in full, and so on. The number of such processes keeps growing, and today we have a huge need for another kind of specialist: Data Quality engineers, people who monitor the data flows in the system (data pipelines) and the quality of the data at the input and output, and draw conclusions about its sufficiency, integrity and other characteristics.

The trend for Data Quality engineers came to us from the USA, where, in the midst of the raging era of capitalism, no one is ready to lose the battle for data. Below are screenshots from two of the most popular job search sites in the US, www.monster.com and www.dice.com, showing the number of posted vacancies as of March 17, 2020 for the keywords Data Quality and Data Scientist.

www.monster.com

Data Scientists – 21416 vacancies
Data Quality – 41104 vacancies


www.dice.com

Data Scientists – 404 vacancies
Data Quality - 2020 vacancies


Obviously, these professions do not compete with each other in any way. With the screenshots, I just wanted to illustrate the current labor market situation in terms of demand for Data Quality engineers, of whom far more are needed right now than Data Scientists.

In June 2019, EPAM, responding to the needs of the modern IT market, singled out Data Quality as a separate practice. In their day-to-day work, Data Quality Engineers manage data, check its behavior in new conditions and systems, and keep track of how up to date, sufficient and relevant it is. With all that, in practice Data Quality engineers really devote little time to classic functional testing, BUT that strongly depends on the project (I will give an example below).

The duties of a Data Quality Engineer are not limited to routine manual/automated checks for "nulls, counts and sums" in database tables; they require a deep understanding of the customer's business needs and, accordingly, the ability to turn the available data into usable business information.

Data Quality Theory


To get the fullest picture of the role of such an engineer, let's figure out what Data Quality is in theory.

Data Quality is one of the stages of Data Management (a whole world of its own, which we will leave to you for independent study) and is responsible for analyzing data against the following criteria:

[Figure: data quality dimensions]

I don't think it's worth deciphering each of the points (in theory they are called "data dimensions"); they are described quite well in the picture. The testing process itself, however, does not mean blindly copying these attributes into test cases and checking them off. In Data Quality, as in any other kind of testing, you have to build first of all on data quality requirements agreed with the project stakeholders who make the business decisions.

Depending on the Data Quality project, an engineer can perform different functions: from an ordinary test automation engineer doing a superficial assessment of data quality to someone who profiles the data in depth against the criteria above.

Data Management, Data Quality and the related processes are described in great detail in the book "DAMA-DMBOK: Data Management Body of Knowledge, 2nd Edition". I highly recommend it as an introduction to the topic (you will find a link to it at the end of the article).

My story

In the IT industry, I have gone from Junior QA in product companies to Lead Data Quality Engineer at EPAM. After about two years as a tester, I was firmly convinced that I had done absolutely every kind of testing: regression, functional, stress, stability, security, UI, and so on. I had tried a large number of testing tools and, along the way, worked in three programming languages: Java, Scala, Python.

Looking back, I understand why my skill set turned out so diverse: I've always been involved in data projects, big and small. This is what brought me into a world of many tools and opportunities for growth.

To appreciate the variety of tools and opportunities for gaining new knowledge and skills, just look at the picture below, which shows the most popular of them in the world of "Data & AI".

[Figure: the Data & AI Landscape]

This kind of illustration is produced annually by Matt Turck, a well-known venture capitalist with a background in software development; his blog and the venture capital firm where he works as a partner are worth looking up.

I grew professionally fastest when I was the only tester on a project, or at least was there at its very start. That is the moment when you have to own the entire testing process and have no way to retreat, only to move forward. At first it was scary, but now all the advantages of such a setup are obvious to me:

  • You start communicating with the whole team like never before, since there is no proxy for communication: neither a test manager nor fellow testers.
  • Immersion in the project becomes incredibly deep, and you know all the components both in general and in detail.
  • Developers stop looking at you as "that test guy who doesn't know what he's doing" and start seeing you as an equal who produces real value for the team with autotests and by anticipating bugs in specific parts of the product.
  • As a result, you are more efficient, more qualified, more in demand.

As a project grew, in 100% of cases I became a mentor for the new testers who joined it, training them and passing on what I had learned myself. At the same time, depending on the project, management did not always give me the most senior test automation specialists, so I either had to train them in automation (for those who wanted it) or build tools they could use in their everyday work (tools for generating data and loading it into the system, a tool for quick load/stability testing, and so on).

Example of a specific project

Unfortunately, non-disclosure obligations prevent me from talking in detail about the projects I worked on, so I will give examples of typical Data Quality Engineer tasks on one of them.

The essence of the project was to build a platform for preparing data used to train machine learning models. The customer was a large pharmaceutical company from the USA. Technically, it was a Kubernetes cluster running on AWS EC2 instances, with several microservices and EPAM's underlying open-source project Legion, adapted to the needs of the particular customer (the project has since been reborn as odahu). ETL processes were organized with Apache Airflow and moved data from the customer's SalesForce systems into AWS S3 buckets. A Docker image of a machine learning model was then deployed to the platform, trained on fresh data and, through a REST API, served predictions that were of interest to the business and solved specific problems.
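
To give a feel for that last step, here is a minimal sketch of how a test might call such a prediction endpoint. The URL, route and payload fields are made up for illustration; they are not the actual customer API.

```python
import requests

# Hypothetical endpoint and payload: the real model API, its routes and
# feature names were project-specific and are not shown here.
MODEL_URL = "https://ml-platform.example.com/api/v1/models/sales-forecast/predict"

payload = {"records": [{"region": "US-East", "month": "2020-03", "promo_budget": 12000.0}]}

response = requests.post(MODEL_URL, json=payload, timeout=30)
response.raise_for_status()

prediction = response.json()
print(prediction)  # e.g. {"predictions": [1234.5]}
```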

Visually, it looked something like this:

[Diagram: platform architecture]

There was plenty of functional testing on this project, and given the pace of feature development and the need to keep up with the release cycle (two-week sprints), we had to think right away about automating tests for the most critical parts of the system. Most of the Kubernetes-based platform itself was covered by autotests implemented with Robot Framework + Python, but they also had to be maintained and extended. In addition, for the customer's convenience, a GUI was built for managing the machine learning models deployed to the cluster and for specifying where data for model training should be taken from and moved to. This sizeable addition entailed an expansion of the automated functional checks, most of which went through REST API calls plus a small number of end-to-end UI tests. Around the midpoint of all this, a manual tester joined us who did a great job with acceptance testing of product versions and with communicating with the customer about accepting the next release. Thanks to the new specialist, we were also able to document our work and add some very important manual checks that were hard to automate right away.

And finally, once the platform and the GUI on top of it had become stable, we started building ETL pipelines as Apache Airflow DAGs. Automated data quality checking was done by writing dedicated Airflow DAGs that validated the data produced by the ETL process. On this project we were lucky: the customer gave us access to anonymized data sets that we could test against. We checked the data row by row for type conformance, the presence of broken records, and total record counts before and after, and we verified the transformations performed by the ETL process via aggregations, column renames, and so on. These checks were also scaled to other data sources: for example, not only SalesForce but also MySQL.
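
A minimal sketch of what such a check DAG might look like is below. The DAG id, helper functions and counts are illustrative placeholders, not the project's actual code, and the import path shown is the Airflow 1.10.x one.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def get_source_count():
    # Placeholder: in a real DAG this would query the source system
    # (e.g. SalesForce) for the number of exported records.
    return 1000


def get_target_count():
    # Placeholder: in a real DAG this would count rows in the files
    # that the ETL process wrote to S3.
    return 1000


def check_row_counts():
    # The simplest "before vs. after" data quality check: fail the task
    # if the ETL process lost or duplicated records.
    source_count = get_source_count()
    target_count = get_target_count()
    if source_count != target_count:
        raise ValueError(
            "Row count mismatch: source=%d, target=%d" % (source_count, target_count)
        )


dag = DAG(
    dag_id="dq_checks_after_etl",
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
)

row_count_check = PythonOperator(
    task_id="row_count_check",
    python_callable=check_row_counts,
    dag=dag,
)
```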

The final data quality checks were carried out at the S3 level, where the data was stored in a ready-to-use state for training machine learning models. To fetch the data from the final CSV file in the S3 bucket and validate it, we wrote code using the boto3 client.
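
A minimal sketch of such a boto3-based check might look like this; the bucket, key and expected columns are invented for illustration.

```python
import csv
import io

import boto3

# Illustrative names: the real bucket, key and schema came from the
# project's requirements.
BUCKET = "example-ml-training-data"
KEY = "final/sales_training_set.csv"
EXPECTED_COLUMNS = ["account_id", "product", "sales_amount", "period"]

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")

reader = csv.DictReader(io.StringIO(body))
assert reader.fieldnames == EXPECTED_COLUMNS, f"Unexpected header: {reader.fieldnames}"

rows = list(reader)
assert rows, "The training file is empty"

# A couple of simple row-level checks: no empty keys, numeric sales amounts.
for i, row in enumerate(rows, start=1):
    assert row["account_id"], f"Row {i}: empty account_id"
    float(row["sales_amount"])  # raises ValueError on broken data
```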

The customer also required part of the data to be stored in one S3 bucket and part in another, which in turn required additional checks verifying the reliability of this routing.

Generalized experience on other projects

A very generalized list of a Data Quality engineer's typical activities looks like this:

  • Prepare test data (valid/invalid/large/small) with an automated tool (a small generator sketch follows this list).
  • Load the prepared data set into the original source and check that it is ready for use.
  • Launch the ETL processes that move the data set from the source storage to the final or intermediate one, with a certain set of settings (if possible, set configurable parameters for the ETL task).
  • Verify that the data processed by the ETL process is of good quality and meets the business requirements.
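
As a sketch of the first item, here is a simple generator of mixed valid/invalid test data; the field names and proportions are purely illustrative.

```python
import csv
import random
import string


def random_record(valid=True):
    """Generate one synthetic record; invalid records get broken fields on purpose."""
    record = {
        "account_id": "".join(random.choices(string.ascii_uppercase + string.digits, k=8)),
        "sales_amount": round(random.uniform(10, 10_000), 2),
        "period": f"2020-{random.randint(1, 12):02d}",
    }
    if not valid:
        record["account_id"] = ""       # broken key
        record["sales_amount"] = "N/A"  # non-numeric value
    return record


def generate_dataset(path, n_valid=1000, n_invalid=50):
    """Write a mixed valid/invalid CSV data set to feed into the pipeline."""
    rows = [random_record(True) for _ in range(n_valid)]
    rows += [random_record(False) for _ in range(n_invalid)]
    random.shuffle(rows)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["account_id", "sales_amount", "period"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    generate_dataset("test_dataset.csv")
```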

At the same time, the main focus of the checks should be not just on the fact that the data flow in the system ran and reached the end (which is part of functional testing), but mostly on checking and validating the data against the expected requirements, spotting anomalies, and so on.

Tools

One technique for such data control is organizing checks at every stage of data processing, the so-called "data chain" in the literature: controlling data from the source to the point of final use. Checks like these are most often implemented as validating SQL queries. Clearly, such queries should be as lightweight as possible and verify individual pieces of data quality (table metadata, blank rows, NULLs, syntax errors and whatever other attributes need checking).
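
As a sketch of what such lightweight validating queries can look like, here is a toy example that runs a few checks over an in-memory SQLite table; the table and its contents are made up, and on a real project the same kind of queries would run against the warehouse.

```python
import sqlite3

# A toy in-memory table stands in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (account_id TEXT, sales_amount REAL, period TEXT);
    INSERT INTO sales VALUES ('A1', 120.5, '2020-03'), ('A2', 99.0, '2020-03');
    """
)

# Each check is a lightweight validating query returning a single number.
checks = {
    "row_count": "SELECT COUNT(*) FROM sales",
    "null_keys": "SELECT COUNT(*) FROM sales WHERE account_id IS NULL",
    "duplicate_keys": """
        SELECT COUNT(*) FROM (
            SELECT account_id FROM sales GROUP BY account_id HAVING COUNT(*) > 1
        )
    """,
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}

assert results["row_count"] > 0, "Table is empty"
assert results["null_keys"] == 0, f"{results['null_keys']} rows with NULL keys"
assert results["duplicate_keys"] == 0, "Duplicate keys found"
```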

For regression testing, which uses ready-made (unchanged or slightly changed) data sets, the autotest code can keep ready-made templates for checking the data against the quality requirements (descriptions of the expected table metadata, sample row objects that can be picked at random during the test, and so on).
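
A minimal sketch of such a stored template check might look like this; the expected schema and sample rows are invented for illustration.

```python
import csv

# A stored "template" of what the regression data set is expected to look like.
EXPECTED_SCHEMA = ["account_id", "sales_amount", "period"]
EXPECTED_SAMPLE_ROWS = {
    "A1": {"sales_amount": "120.5", "period": "2020-03"},
}


def check_against_template(path):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Schema check: column names must match the stored template exactly.
        assert reader.fieldnames == EXPECTED_SCHEMA, (
            f"Schema drift: expected {EXPECTED_SCHEMA}, got {reader.fieldnames}"
        )
        rows_by_key = {row["account_id"]: row for row in reader}

    # Spot-check known rows against the stored sample objects.
    for key, expected in EXPECTED_SAMPLE_ROWS.items():
        actual = rows_by_key.get(key)
        assert actual is not None, f"Expected row {key} is missing"
        for column, value in expected.items():
            assert actual[column] == value, f"{key}.{column}: {actual[column]} != {value}"
```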

During testing, you also end up writing test ETL processes yourself, using frameworks such as Apache Airflow, Apache Spark or even black-box cloud tools like GCP Dataprep, GCP Dataflow and so on. This forces the test engineer to dig into how these tools work, which makes both functional testing (for example, of the ETL processes that already exist on the project) and the data checks themselves more effective. In particular, Apache Airflow has ready-made operators for working with popular analytical databases such as GCP BigQuery. A basic example of how to use them has already been written up elsewhere, so I won't repeat it.
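
For illustration, here is a hedged sketch of how one of these ready-made operators, BigQueryCheckOperator, can be used to run a data quality query; the project, dataset and table names are placeholders, and the import path shown is the Airflow 1.10.x one.

```python
from datetime import datetime

from airflow import DAG
# Airflow 1.10.x import path; in Airflow 2.x the operator lives in
# airflow.providers.google.cloud.operators.bigquery.
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

dag = DAG(
    dag_id="bigquery_dq_checks",
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
)

# BigQueryCheckOperator fails the task if the first row of the query result
# contains a falsy value, so the check is written to return a single boolean.
no_null_keys = BigQueryCheckOperator(
    task_id="no_null_keys",
    sql="SELECT COUNT(*) = 0 FROM `my_project.analytics.sales` WHERE account_id IS NULL",
    use_legacy_sql=False,
    dag=dag,
)
```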

In addition to ready-made solutions, no one forbids you from implementing your own techniques and tools. This benefits not only the project but also the Data Quality Engineer himself, who broadens his technical horizons and coding skills along the way.

How it works on a real project

A good illustration of the last few paragraphs about the "data chain", ETL and ubiquitous checks is the following process from one of the real projects:

[Diagram: data flow with checks at each stage]

Here, various data (prepared by us, naturally) enters the input "funnel" of our system: valid, invalid, mixed, and so on. It is then filtered and lands in intermediate storage, then goes through another series of transformations and is placed into the final storage, which in turn is used for analytics, building data marts and searching for business insights. In such a system, rather than functionally checking how the ETL processes work, we focus on the quality of the data before and after the transformations, as well as on what goes out to analytics.

To sum up the above: regardless of where I have worked, I have always been involved in Data projects that shared the following features:

  • Only automation makes it possible to test some cases and reach a release cycle acceptable to the business.
  • The tester on such a project is one of the most respected members of the team, bringing great value to every participant (faster testing, good data for the Data Scientists, early detection of defects).
  • It doesn't matter whether you work on your own hardware or in the clouds: all resources are abstracted into a cluster such as Hortonworks, Cloudera, Mesos, Kubernetes, etc.
  • Projects are built on a microservice approach; distributed and parallel computing prevail.

I note that when testing in the field of Data Quality, a tester shifts his professional focus to the product code and the tools used.

Distinctive Features of Data Quality Testing

In addition, for myself I have identified the following distinctive features of testing in Data (Big Data) projects and systems compared with other areas (I will say right away that they are VERY generalized and extremely subjective):

[Table: distinctive features of Data Quality testing compared with other areas]

Useful links

  1. Theory: DAMA-DMBOK: Data Management Body of Knowledge: 2nd Edition.
  2. Training center EPAM 
  3. Recommended materials for a beginner Data Quality Engineer:
    1. Free course on Stepik: Introduction to databases
    2. Course on LinkedIn Learning: Data Science Foundations: Data Engineering.
    3. Article:
    4. Video:

Conclusion

Data Quality is a young, promising field, and being part of it means being part of a startup. Once in Data Quality, you will plunge into a large number of modern, in-demand technologies, but most importantly, you will get huge opportunities to generate and implement your own ideas. You will be able to apply the continuous improvement approach not only to the project but also to yourself, constantly developing as a specialist.

Source: habr.com
