When you operate at Sber scale. Using Ab Initio with Hive and GreenPlum

Some time ago, we faced the question of choosing an ETL tool for working with BigData. The Informatica BDM solution we had been using did not suit us because of its limited functionality: its use had been reduced to a framework for launching spark-submit commands. There were not many alternatives on the market that could, in principle, handle the volume of data we deal with every day. In the end, we chose Ab Initio. During pilot demonstrations, the product showed very high data processing speed. There is almost no information about Ab Initio in Russian, so we decided to share our experience on Habré.

Ab Initio has many classic and unusual transformations, whose code can be extended with its own PDL language. For a small business, such a powerful tool is likely to be overkill, and most of its capabilities may prove expensive and go unused. But if your scale is close to Sber's, Ab Initio may be of interest to you.

It helps the business accumulate knowledge globally and develop its ecosystem, and it helps developers improve their ETL skills, sharpen their shell knowledge, master the PDL language, get a visual picture of the loading processes, and develop faster thanks to the abundance of functional components.

In this post, I will describe the capabilities of Ab Initio and compare its performance when working with Hive and GreenPlum.

  • Description of the MDW framework and work on its reconfiguration for GreenPlum
  • Ab Initio Performance Comparison with Hive and GreenPlum
  • Ab Initio working with GreenPlum in Near Real Time mode


The functionality of this product is very broad and takes a lot of time to learn. However, with the right working skills and the right performance settings, the results of data processing are very impressive. For a developer, Ab Initio can be an interesting experience: it is a new take on ETL development, a hybrid between a visual environment and load development in a script-like language.

As businesses develop their ecosystems, this tool is more useful than ever. With Ab Initio, you can accumulate knowledge about your current business and use this knowledge to expand old businesses and open new ones. Among visual development environments, Informatica BDM can be named as an alternative to Ab Initio; among non-visual environments, Apache Spark.

Description of Ab Initio

Ab Initio, like other ETL tools, is a set of products.

Ab Initio GDE (Graphical Development Environment) is the environment in which the developer sets up data transformations and connects them with data flows drawn as arrows. Such a set of transformations is called a graph.

The inputs and outputs of the functional components are connected through ports, which carry the fields computed inside the transformations. Several graphs connected by flows (arrows) in the order of their execution are called a plan.

There are several hundred functional components, which is a lot, and many of them are highly specialized. The capabilities of the classic transformations in Ab Initio are wider than in other ETL tools. For example, Join has multiple outputs: in addition to the result of joining the datasets, you can get the records of the input datasets whose keys could not be joined. You can also get rejects, errors and a log of the transformation's operation, which can be read in the same graph as a text file and processed by other transformations.

Or, for example, you can materialize a data receiver in the form of a table and read data from it in the same graph.

There are also unusual transformations. For example, the Scan transformation offers the same functionality as analytic (window) functions. There are transformations with self-explanatory names: Create Data, Read Excel, Normalize, Sort within Groups, Run Program, Run SQL, Join with DB, etc. Graphs can use run-time parameters, including parameters passed from or to the operating system. Files with a ready-made set of parameters passed to a graph are called parameter sets (psets).
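
To give a feel for Scan: a rough SQL analogue of what it computes is an analytic (window) function, for example a running total per key. The table and column names in this sketch are made up:

-- a running total per key, which is roughly what Scan produces inside a graph
-- (hypothetical table "transactions" with customer_id, op_ts, amount)
select customer_id,
       op_ts,
       amount,
       sum(amount) over (partition by customer_id order by op_ts) as running_total
from transactions;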

As you would expect, Ab Initio GDE has its own repository, called EME (Enterprise Meta Environment). Developers can work with local versions of the code and check their developments into the central repository.

During or after graph execution, you can click on any flow connecting two transformations and look at the data that passed between them.

You can also click on any flow and see the tracking details: how many parallel streams the transformation ran in, and how many rows and bytes were loaded into each of them.

Graph execution can be divided into phases, marking that some transformations must run first (in phase zero), the next ones in phase one, then phase two, and so on.

For each transformation, you can choose its so-called layout (where it will run): without parallelism or in parallel streams, the number of which can be specified. The temporary files that Ab Initio creates while transformations run can be placed either in the server file system or in HDFS.

In each transformation, starting from the default template, you can write your own PDL script, which is a bit like shell.

With PDL, you can extend the functionality of transformations and, in particular, you can dynamically (at runtime) generate arbitrary code snippets depending on runtime parameters.

Ab Initio's integration with the OS through the shell is also well developed; Sberbank specifically uses ksh on Linux. You can exchange variables with the shell and use them as graph parameters, launch Ab Initio graphs from the shell, and administer Ab Initio.

In addition to Ab Initio GDE, the delivery includes many other products. There is the Co>Operating System, which lays claim to being called an operating system. There is Control>Center, where you can schedule and monitor loading flows. And there are products for doing development at a more primitive level than Ab Initio GDE allows.

Description of the MDW framework and work on its reconfiguration for GreenPlum

Along with its other products, the vendor supplies MDW (Metadata Driven Warehouse), a graph configurator designed to help with typical tasks of populating data warehouses and data vaults.

It contains custom (project-specific) metadata parsers and out-of-the-box code generators.

As input, MDW receives a data model, a configuration file with settings for connecting to a database (Oracle, Teradata or Hive), and some other settings. The project-specific part deploys the model to the database, for example. The out-of-the-box part of the product generates graphs and configuration files for them that load data into the model tables. This produces graphs (and psets) for several modes of initializing and incrementally updating the entities.

In the cases of Hive and RDBMS, different graphs are generated for initializing and incremental data updates.

In the case of Hive, the incoming delta is joined by an Ab Initio Join with the data that was in the table before the update. Data loaders in MDW (both for Hive and for RDBMS) not only insert new data from the delta, but also close the validity periods of the rows whose primary keys appear in the delta. In addition, the unchanged part of the data has to be rewritten, because Hive has no delete or update operations.
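
In pure HiveQL terms, the effect of such a load can be pictured roughly as follows. Note that the MDW-generated graph actually performs this join inside Ab Initio rather than in SQL, and the table names, column names and date constants here are assumptions made for the sketch:

-- rebuild the full image of the table into a new copy, which then replaces the target
insert overwrite table target_new
select pk, payload, valid_from_ts, valid_to_ts
from (
    -- the old image: close the validity periods of keys that arrived in the delta
    select t.pk,
           t.payload,
           t.valid_from_ts,
           case when d.pk is not null then current_timestamp() else t.valid_to_ts end as valid_to_ts
    from target t
    left join delta d on d.pk = t.pk
    union all
    -- fresh versions of the keys from the delta
    select d.pk,
           d.payload,
           current_timestamp() as valid_from_ts,
           cast('5999-12-31 00:00:00' as timestamp) as valid_to_ts
    from delta d
) full_image;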

In the case of an RDBMS, the graphs for incremental data updates look more optimal, because an RDBMS has real update capabilities.

The incoming delta is loaded into an intermediate table in the database. The delta is then joined with the data that was in the table before the update, and this is done by SQL itself, using a generated SQL query. Then delete+insert SQL commands insert the new data from the delta into the target table and close the validity periods of the rows whose primary keys appear in the delta.
Unchanged data does not need to be rewritten.
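
A minimal sketch of this delete+insert pattern, with assumed table and column names and an assumed '5999-12-31' open-period marker (the SQL that MDW actually generates is more involved):

-- 1. build the new image of the affected keys: the previously open versions with
--    their validity period closed, plus the fresh versions from the staged delta
create temp table affected as
select t.pk, t.payload, t.valid_from_ts, now() as valid_to_ts
from target t
join delta d on d.pk = t.pk
where t.valid_to_ts = '5999-12-31'::timestamp           -- only the open versions
union all
select d.pk, d.payload, now() as valid_from_ts, '5999-12-31'::timestamp as valid_to_ts
from delta d;

-- 2. delete + insert: only the open versions of the affected keys are replaced;
--    unchanged data and the closed history are not touched
delete from target t
where t.valid_to_ts = '5999-12-31'::timestamp
  and exists (select 1 from delta d where d.pk = t.pk);

insert into target
select * from affected;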

So we concluded that in the case of Hive, MDW goes for rewriting the whole table, because Hive has no update operation and nothing better than a complete rewrite of the data during an update has been invented for this case. In the case of an RDBMS, on the contrary, the product's creators saw fit to entrust the joining and updating of tables to SQL.

For a project at Sberbank, we created a new, reusable implementation of the database loader for GreenPlum. We based it on the version that MDW generates for Teradata. It was Teradata, not Oracle, that fit best and came closest, because Teradata is also an MPP system, and its way of working and its syntax turned out to be close to those of GreenPlum.

Examples of the differences between RDBMS that are critical for MDW are as follows. In GreenPlum, unlike Teradata, creating a table requires the clause

distributed by

In Teradata they write

delete <table> all

, and in GreenPlum they write

delete from <table>

In Oracle, in order to optimize, they write

delete from t where rowid in (<join of t with the delta>)

, and in Teradata and GreenPlum they write

delete from t where exists (select * from delta where delta.pk=t.pk)
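
For illustration, a hypothetical GreenPlum table definition with such a distribution clause (the column names are made up; in Teradata the analogous role is played by the primary index):

-- distributed by tells GreenPlum how to spread the rows across segments
create table target (
    sk            bigint,
    pk            bigint,
    payload       varchar(200),
    valid_from_ts timestamp,
    valid_to_ts   timestamp
)
distributed by (pk);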

We also note that for Ab Initio to work with GreenPlum, the GreenPlum client had to be installed on all nodes of the Ab Initio cluster, because we connected to GreenPlum simultaneously from all nodes of our cluster. And for reads from GreenPlum to be parallel, so that each parallel Ab Initio stream reads its own portion of the data from GreenPlum, we had to put a construction that Ab Initio understands into the "where" clause of the SQL queries

where ABLOCAL()

and define the value of this construction by setting a parameter of the transformation that reads from the database

ablocal_expr=Β«string_concat("mod(t.", string_filter_out("{$TABLE_KEY}","{}"), ",", (decimal(3))(number_of_partitions()),")=", (decimal(3))(this_partition()))Β»

, which compiles to something like

mod(sk,10)=3

, i.e. you have to give GreenPlum an explicit filter for each partition. For other databases (Teradata, Oracle), Ab Initio performs this parallelization automatically.
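
In effect, each of the parallel streams ends up sending GreenPlum its own query of roughly this form (the table name and the key column sk are assumed here):

-- what ABLOCAL() turns into for stream number 3 out of 10 parallel streams
select *
from target t
where mod(t.sk, 10) = 3;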

Ab Initio Performance Comparison with Hive and GreenPlum

An experiment was conducted at Sberbank to compare the performance of MDW-generated graphs against Hive and against GreenPlum. In the experiment, Hive had 5 nodes on the same cluster as Ab Initio, and GreenPlum had 4 nodes on a separate cluster. In other words, Hive had some hardware advantage over GreenPlum.

We took two pairs of graphs performing the same task of updating data, in Hive and in GreenPlum, and ran the graphs generated by the MDW configurator:

  • initializing load + incremental loading of randomly generated data into a Hive table
  • initializing load + incremental loading of the same randomly generated data into an equivalent GreenPlum table

In both cases (Hive and GreenPlum), the loads were run in 10 parallel streams on the same Ab Initio cluster. Ab Initio kept the intermediate data for the calculations in HDFS (in Ab Initio terms, an MFS layout backed by HDFS was used). One row of randomly generated data occupied 200 bytes in both cases (so the largest dataset of 600 000 000 rows is about 120 GB).

The result is this:

Hive:

Initializing load into Hive
  Rows inserted:                                6 000 000 | 60 000 000 | 600 000 000
  Duration of the initializing load, seconds:   41        | 203        | 1 601

Incremental load into Hive
  Rows in the target table at the start:        6 000 000 | 60 000 000 | 600 000 000
  Delta rows applied during the experiment:     6 000 000 | 6 000 000  | 6 000 000
  Duration of the incremental load, seconds:    88        | 299        | 2 541

GreenPlum:

Initializing load into GreenPlum
  Rows inserted:                                6 000 000 | 60 000 000 | 600 000 000
  Duration of the initializing load, seconds:   72        | 360        | 3 631

Incremental load into GreenPlum
  Rows in the target table at the start:        6 000 000 | 60 000 000 | 600 000 000
  Delta rows applied during the experiment:     6 000 000 | 6 000 000  | 6 000 000
  Duration of the incremental load, seconds:    159       | 199        | 321

We can see that the speed of the initializing load, both in Hive and in GreenPlum, depends linearly on the amount of data and, thanks to the better hardware, is somewhat higher for Hive than for GreenPlum.

Incremental loading in Hive also linearly depends on the amount of previously loaded data in the target table and is quite slow as the amount grows. This is due to the need to overwrite the target table completely. This means that applying small changes to huge tables is not a good use case for Hive.

Incremental loading in GreenPlum depends only weakly on the amount of previously loaded data in the target table and is quite fast. This is achieved thanks to SQL joins and the GreenPlum architecture, which allows the delete operation.

So, GreenPlum applies the delta using the delete+insert method, while Hive has no delete or update operations, so during an incremental update the entire data array has to be rewritten. The most telling comparison is the incremental load into a 600 000 000-row target table, since it corresponds to the most frequent kind of resource-intensive load: 2 541 seconds in Hive versus 321 seconds in GreenPlum, i.e. GreenPlum beat Hive in this test by roughly a factor of 8.

Ab Initio working with GreenPlum in Near Real Time mode

In this experiment, we test the ability of Ab Initio to update a GreenPlum table with randomly generated chunks of data in near real time. We will work with the GreenPlum table dev42_1_db_usl.TESTING_SUBJ_org_finval.

We will use three Ab Initio graphs to work with it:

1) Graph Create_test_data.mp – creates data files in HDFS with 6 000 000 rows in 10 parallel streams. The data is random, and its structure is organized for insertion into our table.

2) Graph mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset – the graph generated by MDW for the initializing insertion of data into our table in 10 parallel streams (the test data produced by graph (1) is used).

3) Graph mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset – the graph generated by MDW for incrementally updating our table in 10 parallel streams, using a portion of fresh incoming data (delta) produced by graph (1).

Let's run the following scenario in NRT mode:

  • generate 6 000 000 test rows
  • perform an initializing load that inserts the 6 000 000 test rows into the empty table
  • repeat the incremental load 5 times:
    • generate 6 000 000 test rows
    • perform an incremental insert of the 6 000 000 test rows into the table (the old versions of the rows get a valid_to_ts expiration time, and fresher rows with the same primary keys are inserted)

This scenario emulates the real operation of some business system: a fairly large portion of new data appears in real time and immediately flows into GreenPlum.

Now let's see the log of the script:

Start Create_test_data.input.pset at 2020-06-04 11:49:11
Finish Create_test_data.input.pset at 2020-06-04 11:49:37
Start mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:49:37
Finish mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:50:42
Start Create_test_data.input.pset at 2020-06-04 11:50:42
Finish Create_test_data.input.pset at 2020-06-04 11:51:06
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:51:06
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:53:41
Start Create_test_data.input.pset at 2020-06-04 11:53:41
Finish Create_test_data.input.pset at 2020-06-04 11:54:04
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:54:04
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:56:51
Start Create_test_data.input.pset at 2020-06-04 11:56:51
Finish Create_test_data.input.pset at 2020-06-04 11:57:14
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:57:14
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 11:59:55
Start Create_test_data.input.pset at 2020-06-04 11:59:55
Finish Create_test_data.input.pset at 2020-06-04 12:00:23
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:00:23
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:03:23
Start Create_test_data.input.pset at 2020-06-04 12:03:23
Finish Create_test_data.input.pset at 2020-06-04 12:03:49
Start mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:03:49
Finish mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset at 2020-06-04 12:06:46

This gives the following picture:

Graph                                                                  | Start time          | Finish time         | Duration
Create_test_data.input.pset                                            | 04.06.2020 11:49:11 | 04.06.2020 11:49:37 | 00:00:26
mdw_load.day_one.current.dev42_1_db_usl_testing_subj_org_finval.pset   | 04.06.2020 11:49:37 | 04.06.2020 11:50:42 | 00:01:05
Create_test_data.input.pset                                            | 04.06.2020 11:50:42 | 04.06.2020 11:51:06 | 00:00:24
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset   | 04.06.2020 11:51:06 | 04.06.2020 11:53:41 | 00:02:35
Create_test_data.input.pset                                            | 04.06.2020 11:53:41 | 04.06.2020 11:54:04 | 00:00:23
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset   | 04.06.2020 11:54:04 | 04.06.2020 11:56:51 | 00:02:47
Create_test_data.input.pset                                            | 04.06.2020 11:56:51 | 04.06.2020 11:57:14 | 00:00:23
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset   | 04.06.2020 11:57:14 | 04.06.2020 11:59:55 | 00:02:41
Create_test_data.input.pset                                            | 04.06.2020 11:59:55 | 04.06.2020 12:00:23 | 00:00:28
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset   | 04.06.2020 12:00:23 | 04.06.2020 12:03:23 | 00:03:00
Create_test_data.input.pset                                            | 04.06.2020 12:03:23 | 04.06.2020 12:03:49 | 00:00:26
mdw_load.regular.current.dev42_1_db_usl_testing_subj_org_finval.pset   | 04.06.2020 12:03:49 | 04.06.2020 12:06:46 | 00:02:57

We can see that 6 000 000 increment rows are processed in about 3 minutes, which is quite fast.
The data in the target table is distributed as follows:

select valid_from_ts, valid_to_ts, count(1), min(sk), max(sk) from dev42_1_db_usl.TESTING_SUBJ_org_finval group by valid_from_ts, valid_to_ts order by 1,2;

You can see the correspondence of the inserted data to the moments when the graphs were launched.
This means that incremental loading of data into GreenPlum can be launched in Ab Initio at a very high frequency, with a high speed of inserting this data into GreenPlum. Of course, it will not be possible to launch it once a second, since Ab Initio, like any ETL tool, needs time to "spin up" when it starts.

Conclusion

Ab Initio is now used at Sberbank to build the Unified Semantic Data Layer (ESS). This project involves building a single version of the state of various banking business entities. Information comes from various sources, replicas of which are prepared on Hadoop. Based on the needs of the business, a data model is prepared and the data transformations are described. Ab Initio loads the information into the ESS, and the loaded data is not only of interest to the business in itself, but also serves as a source for building data marts. At the same time, the functionality of the product allows various systems (Hive, GreenPlum, Teradata, Oracle) to be used as a receiver, which makes it possible to prepare data for the business in the various formats it requires without much effort.

The possibilities of Ab Initio are broad; for example, the bundled MDW framework makes it possible to build technical and business data historization out of the box. For developers, Ab Initio makes it possible not to reinvent the wheel, but to use its many existing functional components, which are essentially libraries needed when working with data.

The author is an expert in Sberbank's SberProfi DWH/BigData professional community, which is responsible for developing competencies in areas such as the Hadoop ecosystem, Teradata, Oracle DB, GreenPlum, and BI tools such as Qlik, SAP BO and Tableau.
