Application of low-code in analytical platforms

Hello, dear readers!

The task of building IT platforms for data accumulation and analysis sooner or later arises for any company whose business relies on a knowledge-intensive service model or on the creation of technically complex products. Building analytical platforms is a complex and labor-intensive job. However, any task can be simplified. In this article, I want to share my experience of using low-code tools to help create analytical solutions. This experience was gained during a number of projects in Neoflex's Big Data Solutions division, which since 2005 has been building data warehouses and data lakes, optimizing data processing speed, and working on data quality management methodology.


No one will be able to avoid deliberately accumulating weakly and/or strongly structured data - perhaps not even small businesses. Indeed, when scaling a business, a promising entrepreneur will face the need to develop a loyalty program, want to analyze the effectiveness of points of sale, think about targeted advertising, and puzzle over the demand for related products. As a first approximation, these problems can be solved "on the knee". But as the business grows, arriving at an analytical platform is still inevitable.

But in what cases do data analytics tasks grow into tasks of the "Rocket Science" class? Probably at the moment when it comes to really big data.
To make rocket science easier, you can eat the elephant piece by piece.


The more discrete and autonomous your applications / services / microservices are, the easier it will be for you, your colleagues and the whole business to digest the elephant.

Almost all of our clients have arrived at this postulate by rebuilding their landscape around the engineering practices of DevOps teams.

But even with a "separate, elephantine" diet, we have a good chance of "oversaturating" the IT landscape. At this point, it is worth stopping, exhaling, and looking towards low-code engineering platforms.

Many developers are intimidated by the prospect of a career dead end when they move away from writing code directly towards dragging arrows in low-code UIs. But the appearance of machine tools did not make engineers disappear; it raised their work to a new level!

Let's figure out why.

Data analysis in logistics, the telecom industry, media research or the financial sector is always associated with the following issues:

  • The speed of automated analysis;
  • Ability to conduct experiments without affecting the main data production stream;
  • Reliability of the prepared data;
  • Change tracking and versioning;
  • Data provenance, data lineage, CDC;
  • Fast delivery of new features to the production environment;
  • And, notoriously, the cost of development and support.

That is, engineers have a huge number of high-level tasks that can be performed with sufficient efficiency only if their minds are freed from low-level development tasks.

The preconditions for developers moving to a new level are the evolution and digitalization of business. The value of a developer is also changing: there is a significant shortage of developers able to immerse themselves in the concepts of the business being automated.

Let's draw an analogy with low-level and high-level programming languages. The transition from low-level languages towards high-level languages is a transition from writing "directives in the language of the hardware" towards "directives in the language of people". That is, adding a layer of abstraction. In that sense, the transition from high-level programming languages to low-code platforms is a transition from "directives in the language of people" towards "directives in the language of business". If there are developers saddened by this fact, they have probably been saddened since the birth of JavaScript, with its built-in array sorting functions - which, of course, are implemented under the hood by other means of the same high-level programming.

Therefore, low-code is just the appearance of another level of abstraction.

Applied experience using low-code

The topic of low-code is quite broad, but now I would like to talk about the application of "low-code concepts" using the example of one of our projects.

The Big Data Solutions division of Neoflex specializes mainly in the financial sector, building data warehouses and data lakes and automating various kinds of reporting. In this niche, the use of low-code has long been the standard. Familiar low-code tools here include ETL tools such as Informatica PowerCenter, IBM DataStage and Pentaho Data Integration, or Oracle APEX, an environment for rapid development of interfaces for accessing and editing data. However, using low-code development tools does not always mean building narrowly focused applications on a commercial technology stack with a pronounced dependence on the vendor.

With the help of low-code platforms, you can also orchestrate data flows, create data science platforms or, for example, data quality check modules.

One applied example of using low-code development tools is Neoflex's collaboration with Mediascope, one of the leaders in the Russian media research market. One of the tasks of this company's business is the production of data on the basis of which advertisers, Internet sites, TV channels, radio stations, advertising agencies and brands make decisions about buying advertising and plan their marketing communications.


Media research is a technologically loaded business area. Video stream recognition, collecting data from devices that track viewing, measuring activity on web resources - all this implies that the company has a large IT staff and tremendous experience in building analytical solutions. But the exponential growth in the amount of information and in the number and variety of its sources forces the data IT industry to progress constantly. The simplest way to scale Mediascope's already functioning analytical platform would be to increase the IT staff. But a much more effective solution is to speed up the development process. One of the steps in this direction may be the use of low-code platforms.

At the time the project started, the company already had a functioning production solution. However, the implementation on MSSQL could not fully meet the expectations for scaling functionality while maintaining an acceptable cost of enhancement.

The task before us was truly ambitious: Neoflex and Mediascope had to create an industrial solution in less than a year, with the MVP released within the first quarter from the start of work.

The Hadoop technology stack was chosen as the foundation for building the new low-code data platform. HDFS became the data storage standard, using Parquet files. Data in the platform is accessed via Hive, in which all available data marts are presented as external tables. Data loading into the storage was implemented using Kafka and Apache NiFi.
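
As an illustration of what "a data mart presented as an external Hive table over Parquet files" amounts to, here is a minimal PySpark sketch; the database, table and HDFS path names are made up for the example, and in the project this registration is performed by the low-code tool itself.

```python
# Minimal sketch: register a Parquet-backed data mart as an external Hive
# table. Names and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS mart_layer")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS mart_layer.tv_viewing (
        respondent_id BIGINT,
        channel_id    BIGINT,
        view_start    TIMESTAMP,
        view_minutes  DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/mart_layer/tv_viewing'
""")

# Business users can now query the mart through plain SQL
spark.sql("""
    SELECT channel_id, SUM(view_minutes) AS total_minutes
    FROM mart_layer.tv_viewing
    GROUP BY channel_id
""").show()
```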

The low-code tool in this concept was used to optimize the most labor-intensive part of building the analytical platform - the data calculation task.


The Datagram low-code tool was chosen as the main mechanism for data mapping. Neoflex Datagram is a tool for developing transformations and data flows.
With this tool, you can do without writing Scala code by hand: it is generated automatically using the Model Driven Architecture approach.

The obvious advantage of this approach is the acceleration of the development process. However, in addition to speed, there are also the following advantages:

  • Viewing the content and structure of sources / destinations;
  • Tracing the origin of data flow objects to individual fields (lineage);
  • Partial execution of transformations with viewing of intermediate results;
  • Viewing the source code and adjusting it before execution;
  • Automatic validation of transformations;
  • Automatic one-to-one (1:1) data loading.

The entry threshold for low-code transformation-generation solutions is quite low: a developer needs to know SQL and have experience with ETL tools. It is worth mentioning, though, that code-generating transformation tools are not ETL tools in the broad sense of the word. Low-code tools may not have their own code execution environment; the generated code runs on whatever environment already existed on the cluster before the low-code solution was installed. And this is perhaps another plus for low-code karma, since a "classic" team can work in parallel with the low-code team, implementing functionality, for example, in pure Scala code. Promoting the improvements of both teams to production will be simple and seamless.

It is also worth noting that, in addition to low-code, there are no-code solutions, and in essence they are different things. Low-code allows the developer to intervene in the generated code to a greater extent. In the case of Datagram, it is possible to view and edit the generated Scala code; no-code may not provide such an opportunity. This difference is very significant not only in terms of solution flexibility, but also in terms of the comfort and motivation of data engineers in their work.

Solution architecture

Let's try to figure out exactly how the low-code tool helps solve the problem of optimizing the speed of developing data calculation functionality. First, let's look at the functional architecture of the system. In our case, this is the data production model for media research.


The sources of data in our case are very heterogeneous and diverse:

  • People meters (TV meters) are software and hardware devices that capture the viewing behavior of the respondents of the TV panel - who watched which TV channel, and when, in a household participating in the study. The information supplied is a stream of broadcast viewing intervals linked to a media package and a media product. At the stage of loading into the Data Lake, the data can be enriched with demographic attributes, geo-stratum, time zone and other information needed to analyze the TV viewing of a particular media product. The measurements can be used to analyze or plan advertising campaigns, assess audience activity and preferences, and draw up the broadcast grid;
  • Data can come from monitoring systems for streaming television and measuring the viewing of video content on the Internet;
  • Measurement tools in the web environment, including both site-centric and user-centric counters. Data providers for the Data Lake can be a research browser add-on (bar) and a mobile application with a built-in VPN;
  • Data can also come from sites that consolidate the results of filling out online questionnaires and the results of telephone interviews in company surveys;
  • Additional enrichment of the data lake can occur by downloading information from the logs of partner companies.

The as-is loading from source systems into the primary staging layer of raw data can be organized in various ways. If low-code is used for this, loading scripts can be generated automatically from metadata. In this case, there is no need to go down to the level of developing source-to-target mappings. To implement automatic loading, we need to establish a connection to the source and then define, in the loading interface, the list of entities to be loaded. The directory structure in HDFS is created automatically and corresponds to the data storage structure in the source system.
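
Schematically, the idea of metadata-driven loading can be pictured with a short sketch; the connection parameters, entity list and paths below are purely illustrative, and in Datagram the same result is achieved through the loading interface rather than hand-written code.

```python
# Illustrative only: generate one-to-one load jobs from a list of source
# entities, mirroring the source structure in HDFS staging.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

source_entities = [              # normally taken from the tool's metadata
    ("panel", "respondents"),
    ("panel", "households"),
    ("broadcast", "schedule"),
]
jdbc_url = "jdbc:postgresql://source-host:5432/source_db"   # hypothetical source

for schema, table in source_entities:
    # assumes the JDBC driver is available on the cluster
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", f"{schema}.{table}")
          .option("user", "loader")
          .option("password", "***")
          .load())
    # the HDFS directory structure mirrors the source: /stg/<schema>/<table>
    df.write.mode("overwrite").parquet(f"hdfs:///stg/{schema}/{table}")
```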

However, in the context of this project we decided not to use this feature of the low-code platform, because Mediascope had already independently started building a similar service based on the NiFi + Kafka combination.

It should be noted right away that these tools are not interchangeable, but rather complementary. NiFi and Kafka can work both in the direct (NiFi -> Kafka) and in the reverse (Kafka -> NiFi) combination. The media research platform used the first variant.


In our case, NiFi needed to process various types of data from source systems and send them to the Kafka broker. Messages were published to a dedicated Kafka topic using PublishKafka NiFi processors. Orchestration and maintenance of these pipelines is done in a visual interface. The NiFi tool and the NiFi + Kafka combination can also be called a low-code approach to development, one with a low barrier to entry into Big Data technologies that speeds up application development.

The next step in the project was to bring the detailed data to the format of a single semantic layer. If an entity has historical attributes, the calculation is performed in the context of the partition in question. If the entity is not historical, it is optionally possible either to recalculate the entire contents of the object or to refuse to recalculate it at all (because nothing has changed). At this stage, keys are generated for all entities. The keys are stored in HBase directories corresponding to the master objects, which contain the correspondence between the keys in the analytical platform and the keys from the source systems. The consolidation of atomic entities is accompanied by enrichment with the results of the preliminary calculation of analytical data. Spark was the framework for data calculation. The described functionality of bringing the data to a single semantics was also implemented on the basis of mappings of the Datagram low-code tool.
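
As a simplified illustration of the key generation step, the sketch below assigns platform keys to a non-historical entity by joining it with a key directory. In the real platform the directory lives in HBase and the mapping is generated by Datagram; the table and column names here are invented for the example.

```python
# Simplified surrogate key assignment against a key directory.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

incoming = spark.table("stg.respondents")                 # records with source keys
key_dir = spark.table("keys.respondent_key_directory")    # source_key -> platform_key

with_keys = (incoming
             .join(key_dir, incoming.source_id == key_dir.source_key, "left")
             .select(incoming["*"],
                     # reuse the existing platform key, or mint a new one
                     F.coalesce(key_dir.platform_key, F.expr("uuid()"))
                      .alias("platform_key")))

# newly minted keys would then be written back to the key directory
```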

The target architecture needed to provide business users with SQL access to the data. Hive was used for this. Objects are registered in Hive automatically when the "Register Hive Table" option is enabled in the low-code tool.


Calculation flow management

Datagram has an interface for designing workflows. Mappings can be launched using the Oozie scheduler. In the flow developer interface, it is possible to build schemes of parallel, sequential or condition-dependent data transformations. There is support for shell scripts and Java programs. It is also possible to use the Apache Livy server, which is used to run applications directly from the development environment.
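
For reference, a batch launch of a packaged mapping through Livy's REST API looks roughly like this; the Livy host, jar path and class name are illustrative.

```python
# Submit a packaged mapping as a Spark batch job via Apache Livy's REST API.
import json
import requests

livy_url = "http://livy-host:8998/batches"
payload = {
    "file": "hdfs:///apps/datagram/mappings/consolidate_viewing.jar",
    "className": "ru.example.mappings.ConsolidateViewing",   # hypothetical class
    "args": ["--calc_date", "2020-06-01"],
    "conf": {"spark.executor.memory": "4g"},
}
resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
batch_id = resp.json()["id"]

# poll the batch state until Livy reports success or failure
state = requests.get(f"{livy_url}/{batch_id}/state").json()["state"]
```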

If the company already has its own process orchestrator, the REST API can be used to embed mappings into an existing flow. For example, we had quite successful experience embedding Scala mappings into orchestrators written in PL/SQL and Kotlin. The REST API of a low-code tool implies operations such as generating an executable jar based on the mapping design, calling a mapping, calling a sequence of mappings and, of course, passing parameters to the mappings via the URL.
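
To make the idea concrete, here is a purely illustrative sketch of such calls from an external orchestrator; the endpoint paths and parameter names are invented for the example and are not the actual Datagram API.

```python
# Illustrative REST calls from an external orchestrator; endpoints are made up.
import requests

base_url = "http://lowcode-host:8080/api"   # hypothetical

# generate an executable jar from the mapping design
requests.post(f"{base_url}/mappings/consolidate_viewing/build")

# launch the mapping, passing runtime parameters via the URL
requests.post(f"{base_url}/mappings/consolidate_viewing/run",
              params={"calc_date": "2020-06-01", "region": "77"})
```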

Along with Oozie, the calculation flow can be organized using Airflow. I will not dwell on a comparison of Oozie and Airflow; I will simply say that in the context of the media research project the choice fell on Airflow. The main arguments this time were a more active community developing the product and a more developed interface and API.

Airflow is also good in that it uses Python, beloved by many, to describe the calculation processes. And in general, there are not that many open-source workflow management platforms. Launching and monitoring the execution of processes (including via a Gantt chart) only add points to Airflow's karma.

spark-submit became the format for launching the mappings of the low-code solution. This happened for two reasons. First, spark-submit allows you to run a jar file directly from the console. Second, it can contain all the information needed to configure the workflow (which makes it easier to write the scripts that form the DAG).
The most common Airflow workflow element in our case was SparkSubmitOperator.

SparkSubmitOperator allows you to run jars - packaged Datagram mappings - with pre-formed input parameters.

It should be mentioned that each Airflow task runs in a separate thread and knows nothing about other tasks. Therefore, interaction between tasks is carried out using control operators, such as DummyOperator or BranchPythonOperator.
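
A condensed sketch of such a DAG is shown below: a branch operator decides whether an entity needs recalculation, and SparkSubmitOperator launches a packaged Datagram mapping. The jar path, class name and branching rule are illustrative, and the import paths are those of Airflow 1.10 (they differ in Airflow 2.x).

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator


def need_recalc(**context):
    # illustrative rule: recalculate only if the entity is marked as changed;
    # in reality this would be read from metadata or a control table
    entity_changed = True
    return "calc_marts" if entity_changed else "skip_calc"


with DAG("media_research_calc",
         start_date=datetime(2020, 6, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    branch = BranchPythonOperator(task_id="check_changes",
                                  python_callable=need_recalc,
                                  provide_context=True)

    calc_marts = SparkSubmitOperator(
        task_id="calc_marts",
        application="hdfs:///apps/datagram/mappings/calc_marts.jar",  # packaged mapping
        java_class="ru.example.mappings.CalcMarts",                   # hypothetical
        application_args=["--calc_date", "{{ ds }}"],
        conn_id="spark_default",
        conf={"spark.executor.memory": "4g"},
    )

    skip_calc = DummyOperator(task_id="skip_calc")

    branch >> [calc_marts, skip_calc]
```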

Overall, using the Datagram low-code solution together with the unification of configuration files (which form the DAG) led to a significant acceleration and simplification of the development of data loading flows.

Data mart calculation

Perhaps the most intellectually demanding stage in the production of analytical data is building the data marts. In the context of one of the research company's data calculation flows, at this stage the data is converted to the reference broadcast, taking into account the correction for time zones and the link to the broadcast grid. It is also possible to correct for the local broadcast grid (local news and advertising). Among other things, this step breaks down intervals of continuous viewing of media products based on the analysis of viewing intervals. The viewing values are immediately "weighted" based on information about their significance (a correction factor is calculated).


A separate step in preparing the data marts is data validation. The validation algorithm involves a number of mathematical models. However, using a low-code platform allows a complex algorithm to be broken down into a number of separate, visually readable mappings. Each mapping performs a narrow task. As a result, intermediate debugging, logging and visualization of the data preparation stages become possible.

It was decided to split the validation algorithm into the following sub-stages (a rough sketch of the residual calculation follows the list):

  • Building regression dependencies between the viewing of a TV network in a region and the viewing of all networks in that region over 60 days;
  • Calculating studentized residuals (deviations of actual values from those predicted by the regression model) for all regression points and for the calculation day;
  • Selecting anomalous region / TV network pairs for which the studentized residual of the calculation day exceeds the norm (specified by the operation settings);
  • For each anomalous region / TV network pair, recalculating the corrected studentized residual for every respondent who watched the network in that region, determining the contribution of the respondent (the amount of change in the studentized residual) when that respondent's viewing is excluded from the sample;
  • Searching for candidates whose exclusion brings the studentized residual of the calculation day back to normal.
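
The residual step can be illustrated with a small self-contained sketch, assuming a simple univariate regression per region / TV network pair; the variable names, the 60-day window and the threshold are illustrative, and in the platform this logic lives in Datagram mappings running on Spark.

```python
# Internally studentized residuals of a univariate OLS fit y ~ a*x + b.
import numpy as np

def studentized_residuals(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    a, b = np.polyfit(x, y, 1)                 # slope and intercept
    resid = y - (a * x + b)                    # raw residuals
    # leverage of each point for a univariate regression with an intercept
    h = 1.0 / n + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()
    s2 = (resid ** 2).sum() / (n - 2)          # residual variance estimate
    return resid / np.sqrt(s2 * (1.0 - h))

# x: viewing of all networks in the region, y: viewing of one network,
# both over a 60-day window; the last point is the calculation day.
x = np.random.rand(60) * 100
y = 0.3 * x + np.random.randn(60)
r = studentized_residuals(x, y)
is_anomalous = abs(r[-1]) > 3.0                # threshold comes from the operation settings
```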

The example above confirms the hypothesis that a data engineer already has more than enough to keep in mind... And if this is really an "engineer" and not a "coder", then the fear of professional degradation when using low-code tools must finally recede.

What else can low-code do?

The applications of a low-code tool for batch and stream data processing without the need to write Scala code by hand do not end there.

Using low-code in the development of data lakes has already become a standard for us. We can probably say that solutions on the Hadoop stack are repeating the development path of classical RDBMS-based DWHs. Low-code tools on the Hadoop stack can solve both data processing tasks and the tasks of building the final BI interfaces. Moreover, it should be noted that BI can mean not only the presentation of data, but also its editing by business users. We often use this functionality when building analytical platforms for the financial sector.


Among other things, with the help of low-code and, in particular, Datagram, it is possible to track the origin of data flow objects down to individual fields (lineage). To do this, the low-code tool integrates with Apache Atlas and Cloudera Navigator. In essence, the developer needs to register a set of objects in the Atlas dictionaries and refer to the registered objects when building mappings. The mechanism for tracking data origin, or analyzing object dependencies, saves a lot of time when the calculation algorithms need to be modified. For example, when building financial statements, this feature makes it more comfortable to survive a period of legislative changes. After all, the better we capture the inter-report dependencies at the level of detailed-layer objects, the less we will encounter "sudden" defects and the fewer reworks will be needed.
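
For a sense of what registering an object in Atlas involves, here is a rough sketch against the Atlas v2 REST API; the host, credentials, entity type and attribute values are illustrative, and in the project this registration is driven by the low-code tool's integration rather than manual calls.

```python
# Rough sketch: register an HDFS path as an entity in Apache Atlas so that
# mappings can reference it for lineage. All values are illustrative.
import requests

atlas_url = "http://atlas-host:21000/api/atlas/v2/entity"
entity = {
    "entity": {
        "typeName": "hdfs_path",
        "attributes": {
            "qualifiedName": "hdfs:///data/mart_layer/tv_viewing@cluster",
            "name": "tv_viewing",
            "path": "hdfs:///data/mart_layer/tv_viewing",
        },
    }
}
requests.post(atlas_url, json=entity, auth=("admin", "admin"))
```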


Data Quality & Low-code

Another task implemented with the low-code tool on the Mediascope project was a Data Quality task. A particular feature of the data verification pipeline for the research company's project was that it must not affect the performance and speed of the main data calculation flow. The already familiar Apache Airflow was used to orchestrate the independent data validation flows. As each step of data production became ready, a separate part of the DQ pipeline was launched in parallel.

It is good practice to monitor the quality of data from the moment it is born in the analytics platform. Having metadata, we can check compliance with the basic conditions from the moment the information enters the primary layer: not null, constraints, foreign keys. This functionality is implemented on the basis of automatically generated mappings of the data quality family in Datagram. Code generation in this case is also based on model metadata. On the Mediascope project, the integration was with the metadata of the Enterprise Architect product.

By pairing the low-code tool with Enterprise Architect, the following checks were generated automatically (a rough sketch of such checks follows the list):

  • Checking for the presence of "null" values in fields with the "not null" modifier;
  • Checking for duplicates of the primary key;
  • Checking the foreign key of an entity;
  • Checking the uniqueness of a record over a set of fields.
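
Expressed by hand, the generated checks boil down to something like the PySpark sketch below, under the assumption that the not-null fields, key columns and referenced tables come from the model metadata; the table and column names are purely illustrative.

```python
# Minimal PySpark sketch of the four generated check types.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("detail_layer.viewing_facts")          # hypothetical entity

# 1. "null" values in a field declared not null
null_violations = df.filter(F.col("respondent_id").isNull()).count()

# 2. duplicates of the primary key
pk_duplicates = (df.groupBy("viewing_id").count()
                   .filter(F.col("count") > 1).count())

# 3. foreign key: facts referencing a missing dimension row
dim = spark.table("detail_layer.respondents")
fk_orphans = df.join(dim, df.respondent_id == dim.respondent_id, "left_anti").count()

# 4. uniqueness of a record over a set of fields
unique_violations = (df.groupBy("respondent_id", "channel_id", "view_start").count()
                       .filter(F.col("count") > 1).count())
```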

For more complex checks of data availability and validity, a mapping with a Scala Expression was created, which takes as input external Spark SQL check code prepared by analysts in Zeppelin.
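
An analyst-supplied check of this kind might look like the snippet below: a plain Spark SQL query whose result indicates violations. The table, columns and condition are hypothetical; the mapping only receives the query text and records whether any violating rows were returned.

```python
# Illustrative business check passed to the DQ mapping as Spark SQL text.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

check_sql = """
SELECT region_id, SUM(view_minutes) AS total_minutes
FROM mart_layer.tv_viewing
WHERE view_start >= '2020-06-01' AND view_start < '2020-06-02'
GROUP BY region_id
HAVING SUM(view_minutes) <= 0    -- a region with no viewing at all is suspicious
"""
violations = spark.sql(check_sql)
check_passed = violations.count() == 0
```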


Of course, auto-generation of checks should be approached gradually. Within the framework of the described project, this was preceded by the following steps:

  • DQ implemented in Zeppelin notebooks;
  • DQ embedded in mapping;
  • DQ in the form of separate massive mappings containing a whole set of checks for a separate entity;
  • Generic parameterized DQ mappings that accept information about metadata and business checks as input.

Perhaps the main advantage of creating a parameterized checks service is the reduction in the time it takes to deliver the functionality to the production environment. New quality checks can bypass the classic pattern of delivering code indirectly through development and test environments:

  • All metadata checks are automatically generated when the model is changed in EA;
  • Data availability checks (determining the presence of any data at a point in time) can be generated based on a directory that stores the expected timing of the appearance of the next portion of data in the context of objects;
  • Business data validations are created by analysts in Zeppelin notebooks, from where they are sent straight to the configuration tables of the DQ module in the production environment.

There are no real risks in shipping scripts straight to production. Even with a syntax error, the worst that can happen is that one check fails to run, because the data calculation flow and the flow that launches the quality checks are decoupled from each other.

In fact, the DQ service runs permanently in the production environment and is ready to start working the moment the next portion of data appears.

Instead of a conclusion

The advantage of using low-code is obvious. Developers do not need to build an application from scratch. A programmer freed from auxiliary tasks delivers results faster. Speed, in turn, frees up time to deal with optimization issues. Therefore, in this case, you can count on a better and faster solution.

Of course, low-code is not a panacea, and magic will not happen by itself:

  • The low-code industry is still maturing, and so far there are no uniform industry standards;
  • Many low-code solutions are not free, and acquiring them should be a conscious step taken with full confidence in the financial benefit of using them;
  • Many low-code solutions do not always play well with Git/SVN, or are inconvenient to use if they hide the generated code;
  • When expanding the architecture, it may be necessary to customize the low-code solution, which in turn provokes an effect of "attachment and dependence" on the vendor of the low-code solution;
  • A proper level of security is possible, but it is very labor-intensive and difficult to achieve in low-code engines. Low-code platforms should be chosen not only on the principle of seeking benefits from their use. When choosing, it is worth asking about the availability of access control functionality and the delegation / escalation of identity data to the level of the organization's entire IT landscape.


However, if all the drawbacks of the chosen system are known to you, and the benefits of using it nevertheless prevail, then move on to low-code without fear. Moreover, the transition to it is inevitable, just as any evolution is inevitable.

If one developer on a low-code platform can do the job faster than two developers without low-code, this gives the company a head start in every way. The entry threshold for low-code solutions is lower than for "traditional" technologies, which has a positive effect on the problem of staff shortage. When using low-code tools, it is possible to speed up the interaction between functional teams and make faster decisions about whether the chosen path of data science research is correct. Low-code platforms can drive an organization's digital transformation, since the solutions they produce can be understood by non-technical specialists (in particular, business users).

If you have tight deadlines, heavy business logic, a lack of technological expertise, and you need to speed up time to market, then low-code is one way to meet your needs.

The importance of traditional development tools cannot be denied; however, in many cases the use of low-code solutions is the best way to increase the efficiency of solving the tasks at hand.

Source: habr.com
