Sber.DS is a platform that lets you build and deploy models even without writing code

Ideas about which processes could be automated come up every day in businesses of every size. But beyond the time spent building a model, you also need time to evaluate it and verify that its results are not random. And after deployment, any model must be monitored and periodically re-checked.

These are stages every company has to go through, regardless of its size. At the scale, and with the legacy, of Sberbank, the amount of fine-tuning grows dramatically. By the end of 2019, Sberbank was already using more than 2,000 models. Developing a model is not enough: it has to be integrated with production systems, data marts have to be built for it, and its operation on the cluster has to be kept under control.


Our team is developing the Sber.DS platform. It lets you solve machine-learning problems, speeds up hypothesis testing, simplifies model development and validation in general, and also monitors a model's results in production (PROM).

To set expectations up front: this post is an introduction, and below we describe what is, in essence, under the hood of the Sber.DS platform. The story of a model's life cycle from creation to deployment deserves a separate post.

Sber.DS consists of several components, the key ones being the library, development system, and model execution system.


The library manages a model's life cycle from the moment the idea to develop it appears, through deployment to production (PROM), monitoring, and decommissioning. Many of the library's features are dictated by regulatory requirements, for example reporting and the storage of training and validation samples. In essence, it is a registry of all our models.

The development system is intended for visual development of models and validation methods. Developed models undergo initial validation and are delivered to the execution system, where they perform their business functions. In the execution system, a model can also be placed under monitoring so that its validation methods are run periodically to control its operation.

There are several types of nodes in the system. Some connect to various data sources; others transform and enrich the source data (markup). There are many nodes for building models and nodes for validating them. A developer can load data from any source, transform it, filter it, visualize intermediate results, and split it into parts.
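To make the node pipeline above concrete, here is a toy sketch in plain Python: load, filter, transform, then split into training and validation parts. Each "node" is just a function here, and all column names and numbers are invented for illustration; the real platform wires visual nodes, not Python calls.

```python
import random

def load():
    # "Source" node: fabricate a small tabular dataset.
    return [{"age": a, "income": 1000 + 50 * a} for a in range(18, 68)]

def filter_rows(rows):
    # "Filter" node: keep only a subset of rows.
    return [r for r in rows if r["age"] < 60]

def transform(rows):
    # "Enrichment" node: derive a new feature from existing columns.
    for r in rows:
        r["income_per_year_of_age"] = r["income"] / r["age"]
    return rows

def split(rows, train_frac=0.8, seed=42):
    # "Split" node: shuffle deterministically and cut into two parts.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, valid = split(transform(filter_rows(load())))
print(len(train), len(valid))
```

Chaining plain functions like this mirrors how the visual graph executes: each node consumes the previous node's output, so intermediate results can be inspected at any stage.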

The platform also includes ready-made modules that can be dragged onto the project canvas. All actions are performed through a visual interface; in practice, you can solve a problem without writing a single line of code.

If the built-in capabilities are not enough, the system makes it possible to quickly create your own modules. For those who build new modules from scratch, we have made an integrated development mode based on Jupyter Kernel Gateway.
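As a rough illustration of the Kernel Gateway approach: in its notebook-http mode, a cell annotated with an HTTP-verb comment becomes an endpoint, and the gateway injects the incoming request as a JSON string named `REQUEST`. The endpoint path and payload below are invented for illustration, and we set `REQUEST` ourselves so the sketch runs standalone outside the gateway.

```python
import json

# Simulate the request the gateway would inject for this cell.
REQUEST = json.dumps({"body": {"values": [1.0, 2.0, 3.0]}})

# POST /score
req = json.loads(REQUEST)
values = req["body"]["values"]
# A toy "module": respond with the mean of the input vector.
result = {"mean": sum(values) / len(values)}
print(json.dumps(result))
```

Whatever the cell prints becomes the HTTP response body, which is what makes a notebook cell usable as a lightweight module endpoint.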


The Sber.DS architecture is built on microservices. Opinions differ on what a microservice is: some believe it is enough to split monolithic code into parts that still share one database. In our platform, a microservice may communicate with another microservice only via its REST API, with no back doors for accessing its database directly.

We try to keep services from growing large and sluggish: a single instance should consume no more than 4-8 GB of RAM and should scale horizontally under load by launching new instances. Each service talks to the others only via its REST API (OpenAPI). The team that owns a service is required to keep its API backward compatible until the last client using it has moved off.
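A minimal sketch of the "REST only, no shared database" rule, in Python for the sake of a self-contained example (the actual services are Java/Spring): one service exposes model metadata over HTTP, and another consumes it through the API instead of reading the first service's tables. The endpoint path and payload are invented for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

class RegistryHandler(BaseHTTPRequestHandler):
    """A toy 'model registry' service with one versioned REST endpoint."""

    def do_GET(self):
        if self.path == "/api/v1/models/42":
            body = json.dumps({"id": 42, "status": "PROM"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example's output quiet

# Start the "registry" service on a free local port.
server = ThreadingHTTPServer(("127.0.0.1", 0), RegistryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# A second service fetches model metadata via the API, never via the DB.
with urlopen(f"http://127.0.0.1:{port}/api/v1/models/42") as resp:
    model = json.loads(resp.read())
server.shutdown()
print(model)
```

Versioning the path (`/api/v1/...`) is one common way to honor the backward-compatibility obligation: a breaking change ships as `/api/v2/...` while v1 keeps serving existing clients.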

The application core is written in Java using the Spring Framework. The solution was originally designed for rapid deployment to cloud infrastructure, so the application is built for the Red Hat OpenShift (Kubernetes) container platform. The platform is constantly evolving, both in business functionality (new connectors and AutoML are being added) and in technological efficiency.

One of the highlights of our platform is that code developed in the visual interface can run on any Sberbank model execution system. There are already two of them: one on Hadoop, the other on OpenShift (Docker). We are not stopping there and are building integration modules to run code on any infrastructure, on-premise or in the cloud. For effective integration into the Sberbank ecosystem, we also plan to support existing runtime environments. In the future, the solution should integrate "out of the box" into the landscape of any organization.

Anyone who has ever tried to keep a solution running Python on Hadoop in production (PROM) knows that preparing and delivering a custom Python environment to every data node is only the beginning. The huge number of C/C++ machine-learning libraries that Python modules depend on will give you no peace. Packages must be updated whenever new libraries or servers are added, all while remaining backward compatible with model code already deployed.

There are several approaches to this. For example, frequently used libraries can be prepared in advance and rolled out to production; Cloudera's Hadoop distribution typically uses parcels for this. Hadoop can now also run Docker containers. In simple cases, the code can be delivered along with the job as Python eggs.

The bank takes the security of running third-party code very seriously, so we make full use of newer Linux kernel features: a process running in an isolated Linux namespace can be restricted, for example, in its access to the network and local disk, which greatly limits what malicious code can do. Each department's data areas are protected and available only to the owners of that data. The platform guarantees that data can move from one domain to another only through a data publishing process, with control at every stage, from accessing the sources to landing the data in the target mart.


This year we plan to complete an MVP for running models written in Python/R/Java on Hadoop. We have set ourselves the ambitious goal of running any user environment on Hadoop, so as not to limit our platform's users in any way.

In addition, as it turns out, many data scientists are excellent at mathematics and statistics and build great models, but are less versed in big-data transformations and need our data engineers' help to prepare training samples. We decided to help our colleagues by creating convenient modules for typical transformations and feature preparation on the Spark engine. That way, more time can be spent developing models instead of waiting for data engineers to prepare a new dataset.
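To show the kind of "typical transformation" such modules would automate, here is a sketch in plain Python standing in for the Spark DataFrame API: aggregating raw transaction events into per-client features. All field names and values are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Raw events, as they might arrive from a source node.
transactions = [
    {"client_id": 1, "amount": 100.0},
    {"client_id": 1, "amount": 50.0},
    {"client_id": 2, "amount": 10.0},
    {"client_id": 2, "amount": 20.0},
]

# Group amounts by client (the equivalent of a Spark groupBy).
by_client = defaultdict(list)
for txn in transactions:
    by_client[txn["client_id"]].append(txn["amount"])

# Aggregate each group into a feature row (count / sum / mean).
features = {
    cid: {"txn_count": len(a), "txn_sum": sum(a), "txn_mean": mean(a)}
    for cid, a in by_client.items()
}
print(features)
```

In Spark this whole step collapses to a `groupBy(...).agg(...)` call; packaging such aggregations as ready-made visual modules is what spares the data scientist from writing the transformation code by hand.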

Our team includes people with expertise in different areas: Linux and DevOps, Hadoop and Spark, Java and Spring, Scala and Akka, OpenShift and Kubernetes. Next time we will talk about the model library: how a model moves through its life cycle within the company, and how validation and deployment take place.

Source: habr.com
