How to move, upload and integrate very large data cheaply and quickly? What is pushdown optimization?

Any operation with big data requires large computing power. A typical move of data from a database to Hadoop can take weeks or cost as much as an airplane wing. Don't want to wait and spend? Balance the load across different platforms. One way is pushdown optimization.

I asked Alexey Ananyev, a leading Russian trainer in the development and administration of Informatica products, to talk about the pushdown optimization feature in Informatica Big Data Management (BDM). If you have ever taken training on Informatica products, it was most likely Alexey who taught you the basics of PowerCenter and explained how to build mappings.

Alexey Ananyev, Head of Training at DIS Group

What is pushdown?

Many of you are already familiar with Informatica Big Data Management (BDM). The product can integrate big data from different sources, move it between different systems, provide easy access to it, profile it, and much more.
In the right hands, BDM can work wonders: tasks will be completed quickly and with minimal computing resources.

Do you want that too? Learn how to use the pushdown feature in BDM to distribute the computing load across different platforms. Pushdown technology turns a mapping into a script and lets you choose the environment in which that script will run. This choice allows you to combine the strengths of different platforms and get the maximum performance out of each.

To configure the script execution environment, you need to select the pushdown type. The script can be run entirely on Hadoop or partially distributed between the source and the target. There are four possible pushdown types: native (the mapping is not turned into a script and runs on the Informatica server), source (as much of the mapping as possible is executed on the source), full (the mapping is executed entirely on the source), and none (the mapping is turned into a Hadoop script).

Pushdown optimization

These four types can be combined in different ways to optimize pushdown for the specific needs of the system. For example, it is often more appropriate to extract data from a database using the database's own capabilities, and then transform it with Hadoop so that the database itself is not overloaded.

Let's consider a case where both the source and the target are in the database, and the platform that executes the transformations can be chosen: depending on the settings, it will be Informatica, the database server, or Hadoop. Such an example makes it easiest to understand the technical side of how this mechanism works. Naturally, this situation rarely arises in real life, but it is well suited for demonstrating the functionality.

Let's take a mapping that reads two tables in a single Oracle database and writes the result to a table in the same database. The mapping scheme looks like this:

[Diagram: mapping scheme]

As a mapping in Informatica BDM 10.2.1, it looks like this:

[Screenshot: the mapping in Informatica BDM 10.2.1]

pushdown type - native

If we choose the native pushdown type, the mapping will be performed on the Informatica server: the data will be read from the Oracle server, transferred to the Informatica server, transformed there, and written to the target. In other words, we get a normal ETL process.

pushdown type - source

When we choose the source type, we get the opportunity to distribute our process between the database server and Hadoop. With this setting, queries to fetch data from the tables will be sent to the database, and the remaining steps will be executed on Hadoop.
The execution scheme will look like this:

[Diagram: execution scheme with pushdown type "source"]

Below is an example of setting up the runtime environment.

[Screenshot: runtime environment settings]

In this case, the mapping will be performed in two steps. In its settings, we will see that it has turned into a script that will be sent to the source. Moreover, the joining of tables and data transformation will be performed in the form of an overridden query at the source.
In the picture below, we see the optimized mapping in BDM and the overridden query on the source.

[Screenshot: the optimized mapping in BDM and the overridden query on the source]
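To give a feel for what such an overridden query might contain, here is a rough sketch. The table and column names (CUSTOMERS, ORDERS and so on) are hypothetical, and the actual SQL that BDM generates will look different:

    -- Hypothetical sketch of the overridden query sent to the source
    -- with pushdown type "source": the join and the transformations are
    -- pushed into a single SELECT that Oracle executes, and Hadoop only
    -- receives the result set.
    SELECT c.CUST_ID,
           c.CUST_NAME,
           o.ORDER_ID,
           o.ORDER_AMOUNT
    FROM   CUSTOMERS c
    JOIN   ORDERS    o ON o.CUST_ID = c.CUST_ID;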

In this configuration, the role of Hadoop is reduced to managing the data flow, orchestrating it. The query result will be sent to Hadoop, and after the read is completed, the file will be written from Hadoop to the target.

pushdown type - full

When you select the full type, the mapping is turned entirely into a database query, and the query result is sent to Hadoop. A diagram of this process is shown below.

[Diagram: execution scheme with pushdown type "full"]

An example setup is shown below.

[Screenshot: runtime environment settings]

As a result, we get an optimized mapping similar to the previous one. The only difference is that all the logic is transferred to the target in the form of an overridden insert statement. An example of the optimized mapping is shown below.

[Screenshot: the optimized mapping]
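As a rough illustration, the overridden insert carrying the transformation logic might look something like the sketch below. The names are again hypothetical, and the statement actually generated by BDM will differ:

    -- Hypothetical sketch of the overridden insert with pushdown type "full":
    -- the transformation logic is expressed as part of the INSERT statement
    -- executed on the target database.
    INSERT INTO CUSTOMER_ORDERS (CUST_ID, CUST_NAME, ORDER_ID, ORDER_AMOUNT)
    SELECT c.CUST_ID,
           c.CUST_NAME,
           o.ORDER_ID,
           o.ORDER_AMOUNT
    FROM   CUSTOMERS c
    JOIN   ORDERS    o ON o.CUST_ID = c.CUST_ID;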

Here, as in the previous case, Hadoop plays the role of an orchestrator. But this time the source is read in its entirety, and the data processing logic is executed at the target level.

pushdown type - none

Finally, the last option is the none pushdown type, with which our mapping is turned into a Hadoop script.

The optimized mapping will now look like this:

[Screenshot: the optimized mapping with pushdown type "none"]

Here the data from the sources will first be read by Hadoop. Then Hadoop will combine the two data sets by its own means. After that, the data will be transformed and loaded into the database.
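As a very rough illustration of what happens on the Hadoop side, the join could be expressed in Hive or Spark SQL over staging data pulled from the source. The staging table names below (customers_stg, orders_stg) are hypothetical, and the script that BDM actually generates will look different:

    -- Hypothetical Hive/Spark SQL sketch of the Hadoop-side join with
    -- pushdown type "none"; customers_stg and orders_stg stand for the data
    -- read from the source, and the joined result is then loaded into the
    -- target database.
    SELECT c.cust_id,
           c.cust_name,
           o.order_id,
           o.order_amount
    FROM   customers_stg c
    JOIN   orders_stg    o ON o.cust_id = c.cust_id;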

Once you understand the principles of pushdown optimization, you can organize many big data processes very efficiently. Just recently, for example, one large company used this approach to offload into Hadoop, in only a few weeks, data that had been accumulating in its warehouse for several years.

Source: habr.com
