Backend, machine learning and serverless - the most interesting from the July conference of Habr

A conference is not a debut for Habr. We previously held fairly large Toaster events for 300-400 people, but now we have decided that small thematic meetups are more relevant - and you can help set their direction, for example, in the comments. The first conference in this format was held in July and was dedicated to back-end development. Participants listened to talks on moving from back-end development to machine learning and on the design of the Kvadrupel service on the Gosuslugi portal, and also took part in a round table on Serverless. For those who could not attend in person, this post recounts the most interesting parts.


From back-end development to machine learning

What do data engineers do in ML? How are the tasks of a backend developer and an ML engineer similar and different? What is the path you need to go to change the first profession to the second? This was told by Alexander Parinov, who went into machine learning after 10 years of backend.

Alexander Parinov

Today Alexander works as a computer vision architect at X5 Retail Group and contributes to open-source projects related to computer vision and deep learning (github.com/creafz). He is ranked in the top 100 of the Kaggle Master worldwide rating (kaggle.com/creafz) on the most popular platform for machine learning competitions.

Why Switch to Machine Learning

A year and a half ago, Jeff Dean, head of Google Brain, Google's deep learning research project, described how half a million lines of code in Google Translate were replaced by a TensorFlow neural network of just 500 lines. After training the network, the quality improved and the infrastructure became simpler. It would seem this is our bright future: we no longer have to write code, it is enough to build neural networks and shower them with data. But in practice, everything is much more complicated.

ML infrastructure at Google

Neural networks are only a small part of the infrastructure (the small black box in the picture above). Many more auxiliary systems are needed to receive data, process it, store it and check its quality, plus infrastructure for training, for deploying machine learning code to production, and for testing that code. All of these tasks are very similar to what back-end developers do.

Machine learning process

What is the difference between ML and backend

In classical programming, we write code, and that dictates the program's behavior. In ML, we have a small amount of model code and a lot of data that we feed into the model. Data in ML is very important: the same model trained on different data can show completely different results. The problem is that the data is almost always scattered across different systems (relational databases, NoSQL databases, logs, files).

Data versioning

ML requires versioning not only the code, as in classical development, but also the data: it is necessary to know exactly what the model was trained on. For this, you can use the popular Data Version Control tool (dvc.org).

Data markup

The next task is data labeling: for example, marking all objects in a picture, or saying which class it belongs to. This is done by special services like Yandex.Toloka, and working with them is greatly simplified by the presence of an API. Difficulties arise from the "human factor": you can improve data quality and minimize errors by assigning the same task to several performers.
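The "several performers per task" idea above usually ends with aggregating their answers, for example by majority vote. A minimal sketch in Python (the function name and data layout are illustrative, not any particular labeling service's API):

import collections

def majority_label(answers):
    """Aggregate labels from several annotators by majority vote.

    Returns the winning label and its share of the votes, so that
    low-agreement items can be sent back for re-labeling.
    """
    counts = collections.Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

# Three annotators labeled the same image; two of them said "cat".
label, agreement = majority_label(["cat", "cat", "dog"])

Items where the agreement share is low are exactly the ones worth reviewing manually.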

Visualization in TensorBoard

Experiment logging is necessary to compare results and choose the best model by some metric. There is a large set of visualization tools - for example, TensorBoard. But there is no ideal way to store experiments: small companies often get by with an Excel spreadsheet, while large ones use special platforms for storing results in a database.
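The "spreadsheet" end of the spectrum mentioned above can be as simple as a plain CSV file that records hyperparameters and metrics for each run. A minimal sketch using only the standard library (the file name and column set are illustrative):

import csv
import datetime
import os

LOG_FILE = "experiments.csv"
FIELDS = ["timestamp", "model", "lr", "epochs", "val_accuracy"]

def log_experiment(model, lr, epochs, val_accuracy):
    """Append one experiment run to a CSV log, writing a header on first use."""
    new_file = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.datetime.now().isoformat(),
            "model": model, "lr": lr,
            "epochs": epochs, "val_accuracy": val_accuracy,
        })

log_experiment("resnet50", 1e-3, 10, 0.92)

Such a log is trivially diffable and versionable, which is often all a small team needs before moving to a dedicated platform.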

There are many platforms for machine learning, but none of them covers even 70% of the needs

The first problem we face when bringing a trained model to production is related to the data scientists' favorite tool, Jupyter Notebook. There is no modularity in it: the output is one monolithic sheet of code that is not divided into logical pieces - modules. Everything is mixed together: classes, functions, configurations, and so on. Such code is difficult to version and test.

How to deal with this? You can accept it, like Netflix, and build your own platform that lets you run these notebooks right in production, feed data to their input and get results. You can force the developers who roll the model into production to rewrite the code properly and split it into modules - but with this approach it is easy to make a mistake, and the model will not work as intended. So the ideal option is to prohibit using Jupyter Notebook for model code. If, of course, the data scientists agree to this.

Model as a black box

The easiest way to take a model to production is to use it as a black box. You have some model class and the model's weights (the parameters of the trained network's neurons); you initialize the class, call the predict method, feed it an image, and get a prediction at the output. What happens inside does not matter.
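The black-box contract described above can be expressed as a thin wrapper class: it takes weights on initialization and exposes only a predict method. The class and the toy "forward pass" below are purely illustrative, not a real framework's API:

class BlackBoxModel:
    """A model used as a black box: load weights, call predict, ignore internals."""

    def __init__(self, weights):
        # In a real system the weights would be loaded from a file
        # produced by training; here it is just a threshold for demonstration.
        self.weights = weights

    def predict(self, image):
        """Return a prediction for one input; internals do not matter to the caller."""
        score = sum(image) / len(image)          # stand-in for a forward pass
        return "cat" if score > self.weights["threshold"] else "dog"

model = BlackBoxModel({"threshold": 0.5})
prediction = model.predict([0.9, 0.8, 0.7])      # some preprocessed input

The important point is the narrow interface: the serving code only depends on the constructor and predict, so the model internals can change without touching the backend.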

Separate server process with model

You can also run the model in a separate server process and send it the input data (images or other raw data) through an RPC queue; at the output, we receive predictions.

An example of using a model in Flask:

import flask

app = flask.Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    image = flask.request.files["image"].read()
    image = preprocess_image(image)        # preprocessing helper defined elsewhere
    predictions = model.predict(image)     # model is loaded once at startup
    return jsonify_prediction(predictions)

The problem with this approach is the performance ceiling. Say we have Python code written by data scientists that is slow, and we want to squeeze out maximum performance. To do this, you can use tools that convert the code to native code, or convert the model to another framework tailored for production. Such tools exist for every framework, but there are no ideal ones; you will have to finish the job yourself.

The infrastructure in ML is the same as in a regular backend: there are Docker and Kubernetes, except that for Docker you need the runtime from NVIDIA, which lets processes inside the container access the host's GPUs, and Kubernetes needs a plugin so that it can manage servers with GPUs.


Unlike classical programming, in ML there are many different moving parts in the infrastructure that need to be checked and tested - for example, the data-processing code, the model-training pipeline and production (see the diagram above). It is especially important to test the code that connects the different pieces of the pipelines: there are many pieces, and problems very often arise at module boundaries.
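Problems at module boundaries can be caught with small contract tests: each test checks that the output of one pipeline stage matches what the next stage expects. A minimal sketch (the stage functions and schemas are illustrative):

def preprocess(raw_rows):
    """Data-processing stage: drop rows with missing values, normalize types."""
    return [{"x": float(r["x"]), "y": int(r["y"])}
            for r in raw_rows if r.get("x") is not None]

def make_batches(rows, batch_size):
    """Training-pipeline stage: expects the cleaned schema from preprocess()."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def test_preprocess_output_matches_batching_input():
    rows = preprocess([{"x": "1.5", "y": "2"}, {"x": None, "y": "3"}])
    # Boundary contract: every row has exactly the keys the next stage expects.
    assert all(set(r) == {"x", "y"} for r in rows)
    batches = make_batches(rows, batch_size=1)
    assert len(batches) == 1

test_preprocess_output_matches_batching_input()

Such tests are cheap but catch the most common failure mode: one team changes a stage's output schema without telling the team that consumes it.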

How AutoML Works

AutoML services promise to select the best model for your purposes and train it. But you need to understand: data is crucial in ML, and the result depends on its preparation. Labeling is done by people, which is fraught with errors; without strict control the result can be garbage, and this process cannot be automated yet - verification by specialists, data scientists, is needed. This is where AutoML breaks down. But it can be useful for architecture selection, when the data is already prepared and you want to run a series of experiments to find the best model.

How to get into machine learning

Getting into ML is easiest if you develop in Python, which is used in all deep learning frameworks (and in ordinary frameworks too). This language is almost mandatory for the field. C++ is used for some computer vision tasks - for example, in control systems for self-driving cars. JavaScript and shell are used for visualization and odd things like running neural networks in the browser. Java and Scala are used when working with big data and for machine learning. R and Julia are loved by people who do mathematical statistics.

The most convenient way to get hands-on experience at first is Kaggle; participating in one of the platform's competitions gives more than a year of studying theory. On the platform you can take someone's published and commented code and try to improve it, optimizing it for your own goals. A bonus: your Kaggle rank affects your salary.

Another option is to join an ML team as a back-end developer. There are many machine learning startups now where you can gain experience by helping colleagues solve their problems. Finally, you can join one of the data science communities, such as Open Data Science (ods.ai).

Additional information on the topic was posted by the speaker at the link https://bit.ly/backend-to-ml

"Kvadrupel" - a service of targeted notifications of the portal "Gosuslugi"

Evgeny Smirnov

The next speaker was Evgeny Smirnov, head of the e-government infrastructure development department, who talked about Kvadrupel - the targeted notification service of the Gosuslugi portal (gosuslugi.ru), the most visited government resource on the Runet. Its daily audience is 2.6 million; in total, 90 million users are registered on the site, 60 million of them with confirmed accounts. The load on the portal's API reaches 30 thousand RPS.

Technologies used in the Gosuslugi backend

"Kvadrupel" is a targeted notification service, with the help of which the user receives a service offer at the most suitable moment for him by setting up special informing rules. The main requirements in the development of the service were flexible settings and adequate time for mailings.

How Kvadrupel works


The diagram above shows one of Kvadrupel's rules, using the example of a driver's license that needs replacing. First, the service finds users whose license expires within a month. They are shown a banner offering the corresponding service, and an e-mail message is sent. For users whose license has already expired, the banner and e-mail change. After successfully exchanging the license, the user receives other notifications, suggesting that they update the data in their identity documents.

From a technical point of view, the rules are Groovy scripts. Input: data; output: true/false, matched/not matched. There are more than 50 rules in total - from determining the user's birthday (the current date equals the user's date of birth) to complex situations. Every day these rules identify about a million matches - people who need to be notified.
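The service implements its rules as Groovy scripts; purely for illustration, the "license expires within a month" rule from the example above can be sketched in Python as a predicate - data in, true/false out (the field names are hypothetical):

import datetime

def license_expiring_rule(user, today=None):
    """Return True if the user's driver's license expires within 30 days.

    `user` is a dict with a hypothetical `license_expires` date field;
    the real service evaluates such rules as Groovy scripts.
    """
    today = today or datetime.date.today()
    expires = user.get("license_expires")
    if expires is None:
        return False
    days_left = (expires - today).days
    return 0 <= days_left <= 30

user = {"license_expires": datetime.date(2019, 8, 1)}
matched = license_expiring_rule(user, today=datetime.date(2019, 7, 15))

Each rule being a pure predicate over user data is what lets the service evaluate 50+ rules against millions of users independently.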

Kvadrupel notification channels

Under the hood, Kvadrupel consists of a database that stores user data, and three applications:

  • Worker - updates the data.
  • REST API - fetches and serves the banners to the portal and the mobile application.
  • Scheduler - launches banner recalculation or mass mailing jobs.


To update the data, the backend is event-driven, with two interfaces: REST and JMS. There are a lot of events, so before saving and processing they are aggregated to avoid unnecessary requests. The database itself - the table where the data is stored - looks like a key-value store: the user is the key, and the value is a set of flags indicating the presence or absence of relevant documents, their validity periods, aggregated statistics on the services ordered by this user, and so on.
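The event aggregation described above can be sketched as folding a stream of events into a single update per user key, so that each user's record in the key-value table is written only once (the event shape and field names are illustrative):

def aggregate_events(events):
    """Fold a stream of (user_id, field, value) events into one update per user.

    Later events for the same field overwrite earlier ones, so only a
    single write per user reaches the key-value table.
    """
    updates = {}
    for user_id, field, value in events:
        updates.setdefault(user_id, {})[field] = value
    return updates

events = [
    (42, "has_passport", True),
    (42, "license_valid_until", "2020-01-01"),
    (42, "has_passport", False),   # a newer event wins
]
updates = aggregate_events(events)

Three incoming events collapse into one write for user 42, which is exactly the point of aggregating before saving.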


After the data is saved, a task is placed in JMS so that the banners are recalculated immediately - this has to be reflected on the web right away. The full recalculation starts at night: tasks with ranges of users whose rules need recalculating are placed into JMS and picked up by processors. The processing results then go to the next queue, which either saves the banners to the database or sends tasks to the service that notifies the user. The process takes 5-7 hours and scales easily: you can always add processors or spin up instances with new processors.
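The nightly scheme above - ranges of users go into a queue, processors pick them up, and scaling means adding processors - can be sketched with a standard-library queue and worker threads (queue.Queue stands in for JMS purely for illustration):

import queue
import threading

def recalculate(user_range):
    """Stand-in for rule recalculation over a range of user ids."""
    return [uid for uid in user_range if uid % 2 == 0]  # toy "matched" rule

def processor(tasks, results):
    """Worker: take ranges of users off the queue until a None sentinel arrives."""
    while True:
        user_range = tasks.get()
        if user_range is None:
            break
        results.put(recalculate(user_range))

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=processor, args=(tasks, results))
           for _ in range(2)]              # scaling = starting more processors
for w in workers:
    w.start()
for start in (0, 50):                      # enqueue two ranges of user ids
    tasks.put(range(start, start + 50))
for _ in workers:                          # one sentinel per worker
    tasks.put(None)
for w in workers:
    w.join()

matched = []                               # drain the per-range results
while not results.empty():
    matched.extend(results.get())
matched.sort()

Because each task carries an independent range of users, adding more processors (or more instances of them) shortens the total run without any coordination between workers.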


The service works well enough, but the amount of data keeps growing as users are added. This increases the load on the database - even though the REST API reads from a replica. The second pain point is JMS, which turned out to be poorly suited because of its high memory consumption: if JMS crashes and processing stops, there is a high risk of queue overflow, and after that JMS cannot be brought back up without clearing its logs.


The plan is to solve these problems with sharding, which will balance the load on the database. There are also plans to change the data storage scheme and to replace JMS with Kafka - a more fault-tolerant solution that will fix the memory problems.

Backend-as-a-Service vs. Serverless

From left to right: Alexander Borgart, Andrey Tomilenko, Nikolai Markov, Ara Israelyan

Backend as a Service or a Serverless solution? This topical question was discussed at a round table by:

  • Ara Israelyan, CTO and founder of Scorocode.
  • Nikolai Markov, Senior Data Engineer at Aligned Research Group.
  • Andrey Tomilenko, head of the RUVDS development department. 

The conversation was moderated by senior developer Alexander Borgart. We present the debate, in which the audience also participated, in abbreviated form.

β€” What is Serverless in your understanding?

Andrei: It is a computing model: a Lambda function that processes data so that the result depends only on that data. The term came either from Google or from Amazon and its AWS Lambda service. It is easier for a provider to process such functions by allocating a pool of capacity for them; different users' functions can be computed independently on the same servers.
Nicholas: Simply put, we are moving part of our IT infrastructure and business logic to the cloud, outsourcing it.
Ara: From the developers' side, a good attempt to save resources; from the marketers' side, a way to earn more money.

- Is Serverless the same as microservices?

Nicholas: No, Serverless is more a way of organizing architecture. A microservice is an atomic unit of some logic; Serverless is an approach, not a "separate entity".
Ara: A Serverless function can be packaged into a microservice, but then it stops being Serverless, stops being a Lambda function. In Serverless, a function starts running only when it is requested.
Andrei: They differ in lifetime. We launch a Lambda function and forget about it. It runs for a couple of seconds, and the next client's request can be processed on another physical machine.

- What scales better?

Ara: When scaled out, Lambda functions behave exactly like microservices.
Nicholas: However many replicas you set, that is how many you get - scaling is not a problem in Serverless. I made a replica set in Kubernetes, launched 20 instances "somewhere", and got back 20 anonymous links. Forward!

- Is it possible to write a backend on Serverless?

Andrei: Theoretically yes, but it makes no sense. The Lambda functions will all run up against a single storage - we need to guarantee consistency. For example, if a user has made a transaction, then on the next access he should see that the transaction went through and the funds were credited. All Lambda functions will block on this call. In practice, a bunch of Serverless functions turns into a single service with one bottleneck - the database.

- In what situations does it make sense to use a serverless architecture?

Andrei: Tasks that do not require shared storage - mining, blockchain, anything where you need to compute a lot. If you have a lot of computing power, you can define a function like "calculate the hash of such-and-such...". You can also solve the data storage problem by taking, for example, Lambda functions from Amazon together with its distributed storage. But then you end up writing a regular service: the Lambda functions access the storage and return some response to the user.
Nicholas: Containers that run in Serverless are extremely limited in resources - there is little memory, and everything else is limited too. But if your entire infrastructure is deployed in some cloud - Google, Amazon - you have a permanent contract with them and a budget for all of it, then you can use Serverless containers for some tasks. You have to be inside that infrastructure, because everything is tailored for use in a specific environment. In other words, if you are ready to tie everything to the cloud infrastructure, you can experiment. The advantage is that you do not have to manage that infrastructure.
Ara: Believing that Serverless frees you from managing Kubernetes, Docker, installing Kafka and so on is self-deception. The same Amazon and Google manage and install it for you. The difference is that you have an SLA. You might as well outsource everything instead of programming it yourself.
Andrei: Serverless itself is inexpensive, but you have to pay a lot for other Amazon services - for the database, for example. People have already sued them over being charged big money for the API gateway.
Ara: If we talk about money, you need to keep one point in mind: to move all the code to Serverless you would have to turn the company's entire development methodology 180 degrees, and that takes a lot of time and money.

- Are there any worthy alternatives to the paid Serverless Amazon and Google?

Nicholas: In Kubernetes you run some kind of Job, it runs and dies - that is quite Serverless in terms of architecture. If you want to build really interesting business logic, with queues and databases, you need to think it through a bit more. But all of that can be solved without leaving Kubernetes. I would not drag in an additional implementation.

β€” How important is it to monitor what is happening in Serverless?

Ara: It depends on the system architecture and business requirements. In essence, the provider must supply reporting that helps the devops team understand possible problems.
Nicholas: Amazon has CloudWatch, where all the logs are streamed, including those from Lambda. Integrate log forwarding and use some separate tool for viewing, alerting, and so on. In the containers that you start, you can cram agents.


- Let's summarize.

Andrei: Thinking in terms of Lambda functions is useful. If you build a quick service - not a microservice, but one that accepts a request, accesses the database and sends a response - the Lambda function solves a number of problems: multithreading, scalability, and so on. If your logic is built this way, in the future you will be able to turn these Lambdas into microservices or use third-party services like Amazon's. The technology is useful and the idea is interesting; how justified it is for business remains an open question.
Nikolai: Serverless is better used for operational tasks than for computing business logic. I always think of it as event processing. If you have it in Amazon - fine; if you are in Kubernetes - fine. Otherwise you will have to put in quite a lot of effort to get Serverless running on your own. You need to look at the specific business case. For example, one of my current tasks: when files in a certain format appear on disk, I need to upload them to Kafka. I can use a WatchDog or a Lambda. Logically, both options work, but the Serverless implementation is more complicated, and I prefer the simpler way, without Lambda.
Ara: Serverless is an interesting, applicable and technically very beautiful idea. Sooner or later the technology will reach the point where any function spins up in under 100 milliseconds, and then the question of whether the wait is critical for the user will essentially disappear. At the same time, as colleagues have already said, the applicability of Serverless depends entirely on the business task.

We thank our sponsors who helped us a lot:

  • IT conference space «Spring» for the conference venue.
  • IT events calendar «Runet ID» and the publication «Internet in Numbers» for information support and news.
  • «Acronis» for gifts.
  • Avito for co-creation.
  • The Russian Association for Electronic Communications (RAEC) for involvement and experience.
  • And the main sponsor RUVDS - for everything!


Source: habr.com