Extracting data from SAP HCM to non-SAP data warehouses

As you know, SAP offers a full range of software, both for maintaining transactional data and for processing that data in analysis and reporting systems. In particular, the SAP Business Warehouse (SAP BW) platform is a toolkit for data storage and analysis with broad technical capabilities. For all its objective advantages, SAP BW has one significant drawback: the high cost of storing and processing data, which is especially noticeable with cloud-based SAP BW on HANA.

But what if you use a non-SAP, preferably open-source, product as the repository? At X5 Retail Group we chose GreenPlum. That certainly solves the cost issue, but it immediately raises questions that, with SAP BW, were answered almost by default.


In particular: how do you extract data from the source systems, most of which are SAP solutions?

"HR Metrics" was the first project where we had to solve this problem. Our goal was to create a repository of HR data and build analytical reporting around work with employees. The main data source is the SAP HCM transactional system, in which all personnel, organizational and payroll activities are maintained.

Data extraction

SAP BW comes with standard data extractors for SAP systems. These extractors can automatically collect the necessary data, track its integrity, and determine change deltas. Here, for example, is the standard data source for employee attributes, 0EMPLOYEE_ATTR:


The result of extracting data from it for one employee:


If necessary, such an extractor can be modified to suit your own requirements, or you can create your own extractor.

Our first idea was to reuse these extractors. Unfortunately, this turned out to be impossible: most of the logic is implemented on the SAP BW side, and we could not painlessly decouple the extractor on the source system from SAP BW.

It became obvious that it would be necessary to develop our own mechanism for extracting data from SAP systems.

Data storage structure in SAP HCM

To understand the requirements for such a mechanism, we first need to determine what kind of data we need.

Most of the data in SAP HCM is stored in flat SQL tables. Based on this data, SAP applications present organizational structures, employees and other HR information to the user. For example, this is how the organizational structure looks in SAP HCM:


Physically, such a tree is stored in two tables: hrp1000 holds the objects and hrp1001 holds the relationships between them.

Objects "Department 1" and "Department 2":


Relationships between the objects:


There can be a huge number of object types, as well as relationship types between them. There are both standard relationships and custom ones created for specific needs. For example, the standard relationship B012 between an organizational unit and a staff position indicates the head of a department.
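
To make the object and relationship model more tangible, here is a minimal Python sketch that rebuilds a department tree and finds department heads from rows shaped like hrp1000 and hrp1001. The column subset and the A002 ("reports to") / B012 ("is managed by") codes follow the standard model, but the sample values are invented:

```python
# Minimal sketch: rebuild an org tree and find department heads
# from rows shaped like hrp1000 (objects) and hrp1001 (relationships).
# Column subset follows the standard tables; sample data is invented.

hrp1000 = [  # OTYPE, OBJID, STEXT
    {"OTYPE": "O", "OBJID": "10000001", "STEXT": "Department 1"},
    {"OTYPE": "O", "OBJID": "10000002", "STEXT": "Department 2"},
    {"OTYPE": "S", "OBJID": "20000001", "STEXT": "Head of Department 1"},
]

hrp1001 = [  # source object -> relationship -> target object
    # A002: org unit "reports to" a higher-level org unit
    {"OTYPE": "O", "OBJID": "10000002", "RSIGN": "A", "RELAT": "002",
     "SCLAS": "O", "SOBID": "10000001"},
    # B012: org unit "is managed by" a staff position
    {"OTYPE": "O", "OBJID": "10000001", "RSIGN": "B", "RELAT": "012",
     "SCLAS": "S", "SOBID": "20000001"},
]

names = {(r["OTYPE"], r["OBJID"]): r["STEXT"] for r in hrp1000}

# Parent org unit for every org unit (A002)
parents = {(r["OTYPE"], r["OBJID"]): (r["SCLAS"], r["SOBID"])
           for r in hrp1001 if r["RSIGN"] == "A" and r["RELAT"] == "002"}

# Managing position for every org unit (B012)
managers = {(r["OTYPE"], r["OBJID"]): (r["SCLAS"], r["SOBID"])
            for r in hrp1001 if r["RSIGN"] == "B" and r["RELAT"] == "012"}

for key, parent in parents.items():
    print(f"{names[key]} is part of {names[parent]}")
for key, manager in managers.items():
    print(f"{names[key]} is managed by {names[manager]}")
```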

Showing a manager in SAP:


Storage in a DB table:


Employee data is stored in pa* tables. For example, data on personnel actions for an employee is stored in table pa0000:


We decided that GreenPlum would collect "raw" data, i.e. simple copies of the SAP tables. Only then, directly in GreenPlum, would the data be processed and converted into business objects (for example, Department or Employee) and metrics (for example, average headcount).
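
As an illustration of this "copy raw, transform later" approach, here is a hedged Python sketch that derives an average-headcount metric from rows copied from pa0000. The field subset (PERNR, BEGDA, ENDDA, STAT2) mirrors the standard infotype, but the sample values and the rule "STAT2 = 3 means active" are simplifying assumptions, and in the real solution this kind of logic lives in GreenPlum itself:

```python
from datetime import date, timedelta

# Raw rows copied 1:1 from pa0000; field subset and sample values are simplified,
# and STAT2 = '3' is assumed to mean "active employment".
pa0000 = [
    {"PERNR": "00001001", "BEGDA": date(2019, 1, 1),  "ENDDA": date(9999, 12, 31), "STAT2": "3"},
    {"PERNR": "00001002", "BEGDA": date(2019, 3, 15), "ENDDA": date(2019, 3, 31),  "STAT2": "3"},
    {"PERNR": "00001003", "BEGDA": date(2018, 5, 1),  "ENDDA": date(2019, 3, 10),  "STAT2": "3"},
]

def average_headcount(rows, first_day: date, last_day: date) -> float:
    """Average of the daily active-employee counts over [first_day, last_day]."""
    days = (last_day - first_day).days + 1
    total = 0
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        total += sum(1 for r in rows
                     if r["STAT2"] == "3" and r["BEGDA"] <= day <= r["ENDDA"])
    return total / days

print(round(average_headcount(pa0000, date(2019, 3, 1), date(2019, 3, 31)), 2))
```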

About 70 tables were identified whose data had to be transferred to GreenPlum. We then set out to work out a way to transfer this data.

SAP offers quite a few integration mechanisms, but the simplest one, direct access to the database, is off-limits due to licensing restrictions, so all integration flows must be implemented at the application server level.

The next problem was the lack of information about deleted records in the SAP database. When a row is deleted, it is physically removed, so it is impossible to build a change delta based only on the time of change.
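
Since deleted rows leave no trace in the tables themselves, the only generic way to notice them is to compare snapshots. A minimal sketch of that idea (the key sets here are invented):

```python
# Detecting deletions by comparing the key sets of two full snapshots:
# because SAP physically removes rows, a key present yesterday but missing
# today can only be discovered this way (or via explicit change logging).
yesterday_keys = {"QK53216375", "QK53216376", "QK53216377"}
today_keys     = {"QK53216376", "QK53216377", "QK53216378"}

deleted  = yesterday_keys - today_keys   # {"QK53216375"}
inserted = today_keys - yesterday_keys   # {"QK53216378"}
print(f"deleted: {deleted}, inserted: {inserted}")
```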

Of course, SAP HCM does have mechanisms for capturing data changes. For example, for subsequent transfer to receiving systems there are change pointers, which record every change and from which IDocs (objects for transfer to external systems) are generated.

An example of an IDoc for changing infotype 0302 for an employee with personnel number 1251445:


Another option is logging data changes to the DBTABLOG table.

An example of a log for deleting an entry with the key QK53216375 from the hrp1000 table:


But these mechanisms do not cover all the data we need, and processing them at the application server level can consume quite a lot of resources. Enabling logging on every required table could therefore lead to a noticeable degradation of system performance.

The next big problem was cluster tables. In the RDBMS version of SAP HCM, time evaluation and payroll data are stored as a set of logical tables for each employee and each payroll run. These logical tables are stored as binary data in the pcl2 table.

Payroll Cluster:


Data in cluster tables cannot be read with a plain SQL statement; you have to use SAP HCM macros or special function modules, so reading such tables is quite slow. On the other hand, the clusters hold data that is needed only once a month, the final payroll and time evaluation results, so the speed here is not critical.
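
Reading a cluster therefore means calling ABAP code rather than SQL. As a rough sketch, an external reader could go through an RFC-enabled wrapper function module; the module name Z_HR_READ_PAYROLL_RESULT and its parameters below are hypothetical, not a standard SAP interface, and the connection details are placeholders (using the SAP PyRFC library):

```python
from pyrfc import Connection  # Python binding for the SAP NW RFC SDK

# Connection parameters are placeholders.
conn = Connection(ashost="sap-hcm.example.com", sysnr="00",
                  client="100", user="RFC_USER", passwd="secret")

# Z_HR_READ_PAYROLL_RESULT is a hypothetical RFC-enabled wrapper around the
# standard payroll-cluster read logic; parameter and table names are assumptions.
result = conn.call("Z_HR_READ_PAYROLL_RESULT",
                   PERNR="01251445", PERIOD="201903")

for line in result.get("RT", []):   # payroll results returned by the wrapper
    print(line)
```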

While evaluating the options for building a change delta, we also decided to consider a full upload. Shipping gigabytes of unchanged data between systems every day does not look pretty, but it has a number of advantages: there is no need to implement delta calculation on the source side or delta merging on the receiver side, which reduces the cost and implementation time and makes the integration more reliable. At the same time, we found that almost all changes in SAP HR happen within a horizon of three months before the current date. So we settled on a daily upload of SAP HR data for the last N months before the current date plus a monthly full upload. The parameter N depends on the specific table and ranges from 1 to 15.
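
A small sketch of how the per-table extraction window could be derived; the table-to-N mapping here is illustrative, not the actual configuration:

```python
from datetime import date

# Illustrative depth (in months) of the daily re-extraction per table;
# the real values are configured per table and range from 1 to 15.
DAILY_DEPTH_MONTHS = {"pa0000": 3, "pa0008": 15, "hrp1000": 1}

def window_start(table: str, today: date) -> date:
    """First day of the period re-extracted daily for the given table."""
    n = DAILY_DEPTH_MONTHS[table]
    year, month = today.year, today.month - n
    while month <= 0:
        month += 12
        year -= 1
    return date(year, month, 1)

print(window_start("pa0000", date(2019, 4, 10)))   # 2019-01-01
print(window_start("pa0008", date(2019, 4, 10)))   # 2018-01-01
```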

The following scheme was proposed for data extraction:


The external system generates a request and sends it to SAP HCM, where the request is checked for completeness and for permission to access the requested tables. If the check succeeds, SAP HCM runs a program that collects the necessary data and passes it to the Fuse integration platform. Fuse determines the appropriate Kafka topic and passes the data there. From Kafka, the data is then loaded into the GreenPlum stage area.
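
The last hop of the chain, from Kafka into the GreenPlum stage area, could look roughly like the sketch below; the topic name, connection parameters, table and column names are assumptions, and the sketch uses the kafka-python and psycopg2 libraries:

```python
import json

import psycopg2
from kafka import KafkaConsumer

# Topic, connection parameters, table and column names are assumptions.
consumer = KafkaConsumer(
    "sap_hcm.pa0000",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

gp = psycopg2.connect("host=greenplum dbname=dwh user=etl password=secret")
cur = gp.cursor()

for message in consumer:
    row = message.value                       # one raw record from SAP HCM
    cur.execute(
        "INSERT INTO stage.pa0000 (pernr, begda, endda, stat2) "
        "VALUES (%s, %s, %s, %s)",
        (row["PERNR"], row["BEGDA"], row["ENDDA"], row["STAT2"]),
    )
    gp.commit()                               # commit each raw row into the stage area
```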

In this chain, we are interested in the step of extracting data from SAP HCM, so let's look at it in more detail.

SAP HCM-FUSE interaction diagram.


The external system determines the time of the last successful request to SAP. The process can be triggered by a timer or another event, including a timeout while waiting for a response with data from SAP, which initiates a retry. The system then generates a delta request and sends it to SAP.

The request data is passed in the request body in JSON format; the HTTP method is POST.

Request example:

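
A minimal sketch of how the external system might form and send such a request; the URL, JSON field names and filter parameters here are assumptions, as the actual request format is defined by the SAP-side service:

```python
from datetime import datetime

import requests

# URL and JSON field names below are assumptions for illustration.
SAP_SERVICE_URL = "https://sap-hcm.example.com/sap/bc/zhr_extract"

last_success = datetime(2019, 4, 9, 23, 0, 0)    # time of the last successful request

request_body = {
    "table": "pa0000",
    "fields": ["PERNR", "BEGDA", "ENDDA", "STAT2"],
    "filter": {"BEGDA": {"ge": "20190101"}},     # start of the daily extraction window
    "package_size": 2000,
    "requested_at": last_success.isoformat(),
}

response = requests.post(SAP_SERVICE_URL, json=request_body, timeout=300)
response.raise_for_status()
session_id = response.json()["session_id"]       # returned synchronously by SAP
print("registered extraction session", session_id)
```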

The SAP service checks the request for completeness, compliance with the current SAP structure, and permission to access the requested table.

If errors are found, the service returns a response with the appropriate code and description. If the checks pass, it creates a background process for the extraction and synchronously returns a unique session id.

In the event of an error, the external system logs it. On a successful response, it passes on the session id and the name of the table for which the request was made.

The external system registers the current session as open. If there are other open sessions for this table, they are closed and a warning is logged.
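
A sketch of that bookkeeping on the external-system side; the storage and field names are assumptions:

```python
import logging

log = logging.getLogger("sap_extract")

# In-memory stand-in for the session registry; in the real system this is persistent.
open_sessions = {}   # session_id -> {"table": ..., "status": ...}

def register_session(session_id, table):
    """Open a new session and close any other open session on the same table."""
    for sid, info in open_sessions.items():
        if info["table"] == table and info["status"] == "open":
            info["status"] = "closed"
            log.warning("closed stale session %s for table %s", sid, table)
    open_sessions[session_id] = {"table": table, "status": "open"}

register_session("4711-0001", "pa0000")
register_session("4711-0002", "pa0000")   # closes 4711-0001 and logs a warning
```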

The SAP background job opens a cursor with the specified parameters and reads a data package of the specified size. The batch size is the maximum number of records the process reads from the database; the default is 2000. If the selection contains more records than the package size, after the first package is transmitted the next block is formed with the corresponding offset and an incremented package number. Package numbers increase by 1 and the packages are sent strictly sequentially.
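
The ABAP job itself is not shown in the article; purely to illustrate the paging contract (fixed batch size, offset-based selection, strictly sequential package numbers, final flag on the last package), here is a language-neutral sketch in Python:

```python
BATCH_SIZE = 2000   # default maximum number of records per package

def send_in_packages(rows, send_package):
    """Split the selection into packages and send them strictly in order."""
    total = len(rows)
    package_id = 0
    offset = 0
    while offset < total:
        package_id += 1
        chunk = rows[offset:offset + BATCH_SIZE]
        final = offset + BATCH_SIZE >= total        # last package carries a final flag
        send_package(package_id=package_id, rows=chunk, final=final)
        offset += BATCH_SIZE

# Example: 4500 records -> packages 1 and 2 with 2000 rows each, package 3 with 500.
send_in_packages(list(range(4500)),
                 lambda package_id, rows, final:
                     print(package_id, len(rows), "final" if final else ""))
```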

SAP then passes each package to the web service of the external system, which validates the incoming package: a session with the received id must be registered in the system and be in open status, and if the package number is greater than 1, the successful receipt of the previous package (package_id - 1) must have been logged.

If these checks pass, the external system parses and saves the table data.

Additionally, if the package carries the final flag and serialization was successful, the integration module is notified that the session was processed successfully, and the module updates the session status.

In case of a validation or parsing error, the error is logged and all further packages for this session are rejected by the external system.

Similarly, in the opposite direction, if the external system returns an error, it is logged and the transmission of packages stops.
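
Putting the receiving-side checks together, here is a hedged sketch; the session store, field names and the stage-area load stand-in are assumptions:

```python
import logging

log = logging.getLogger("sap_extract")

def save_rows(rows):
    """Stand-in for parsing the payload and loading it into the stage area."""
    pass

def handle_package(sessions, package):
    """Validate and store one incoming package; return True if accepted."""
    session = sessions.get(package["session_id"])

    # The session must be known and still open.
    if session is None or session["status"] != "open":
        log.error("unknown or closed session %s", package.get("session_id"))
        return False

    # Packages arrive strictly in order: package N requires N-1 to be received.
    if package["package_id"] > 1 and session["last_package"] != package["package_id"] - 1:
        log.error("missing package %s", package["package_id"] - 1)
        session["status"] = "failed"          # further packages will be rejected
        return False

    save_rows(package["rows"])                # parse and persist the table data
    session["last_package"] = package["package_id"]

    if package.get("final"):                  # final flag: session completed
        session["status"] = "done"
        log.info("session %s completed", package["session_id"])
    return True

sessions = {"4711-0002": {"status": "open", "last_package": 0}}
handle_package(sessions, {"session_id": "4711-0002", "package_id": 1,
                          "rows": [], "final": True})
```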

To serve data requests on the SAP HCM side, an integration service was implemented on top of the ICF framework (SAP Internet Communication Framework - help.sap.com/viewer/6da7259a6c4b1014b7d5e759cc76fd22/7.01.22/en-US/488d6e0ea6ed72d5e10000000a42189c.html). It allows data to be queried from the SAP HCM system for specific tables; when forming a request, you can specify a list of fields and filter parameters to get exactly the data you need. The service itself contains no business logic: the algorithms for calculating the delta, the query parameters, integrity control and so on are implemented on the side of the external system.

This mechanism lets us collect and transfer all the necessary data in a few hours. That speed is on the edge of acceptable, so we consider this solution a temporary one that covered the project's need for an extraction tool.

In the target architecture, we are working out options for solving the extraction problem with CDC systems such as Oracle GoldenGate or ETL tools such as SAP Data Services.

Source: habr.com
