Why a bank needs AIOps and umbrella monitoring, or what client relationships are built on

In earlier posts on Habr, I wrote about my experience of building partnerships within my team (one of them covers how to draw up a partnership agreement when starting a new business so that the business does not fall apart). Now I would like to talk about building partnerships with clients, because without clients there will be nothing to fall apart. I hope this article will be useful for startups that are beginning to sell their product to large businesses.

I currently head one such startup, MONQ Digital lab, where my team and I are developing a product that automates the support and operation of corporate IT. Entering the market is not easy, so we started with some homework: we talked to market experts and our partners and segmented the market. The main question was: “whose pains can we heal best?”

Banks made it into the top three segments, and of course Tinkoff and Sberbank were first on the list. When we went to banking market experts, they said: get your product in there, and the road to the banking market will open. We tried both, but at Sberbank we failed, while the guys from Tinkoff turned out to be an order of magnitude more open to productive communication with Russian startups (maybe because Sber had by then bought almost a billion's worth of products from our Western competitors). A month later, we started a pilot project. Read on to find out how it went.

We have been dealing with operations and monitoring for many years. Our product is now being implemented in the public sector, insurance, banks and telecom companies, and we have had one implementation at an airline (before that project we had no idea that aviation was such an IT-dependent industry, and now we really hope that, despite COVID, the company will pull through and take off).

The product we make is corporate software in the AIOps (Artificial Intelligence for IT Operations, or ITOps) segment. The main goals of implementing such systems, in order of increasing process maturity in the company, are:

  1. Extinguish fires: identify failures, filter the noise out of the alert stream, assign tasks and incidents to those responsible;
  2. Increase the efficiency of the IT service: reduce the time to resolve incidents, point out the causes of failures, increase the transparency of the state of IT;
  3. Increase business efficiency: reduce the amount of manual labor, reduce risks, increase customer loyalty.

In our experience, banks share the following monitoring “pains” with all large IT infrastructures:

  • “everyone does their own thing”: there are many technical departments, almost every one has at least one monitoring system, and most have more than one;
  • a “mosquito swarm” of alerts: each system generates hundreds of them and bombards everyone responsible (sometimes across departments too). It is hard to stay focused on every notification, and their urgency and importance get lost in the sheer volume;
  • big banks, the leaders of the sector, want not only to monitor their systems constantly and know where the failures are, but also the real magic of AI: systems that monitor, predict and correct themselves.

When we came to the first meeting at Tinkoff, we were told right away that they had no problems with monitoring and nothing hurt, so the main question was: “What can we offer to those who are already doing well?”

The conversation was long. We discussed how their microservices are built, how the divisions work, which infrastructure issues are more sensitive for users and which are less, where the “blank spots” are, and what their goals and SLAs are.

By the way, the bank's SLAs are really impressive. For example, a first-priority incident related to network availability must be resolved within just a few minutes. The cost of errors and downtime here is, of course, very high.

As a result, we identified the following areas of cooperation:

  1. the first stage is umbrella monitoring, to speed up incident resolution;
  2. the second stage is process automation, to reduce risks and the cost of scaling the IT department.

Several “blank spots” could be painted in the bright colors of alerts only by processing information from several monitoring systems, since the metrics could not be collected directly. Data from the different monitoring systems also had to be brought together on a “single screen” to show the overall picture of what was happening. “Umbrellas” are suited to this task, and at that point we met these requirements.

A very important thing in customer relationships, in our opinion, is honesty. After the first conversation and a calculation of the license cost, the client said that since the cost was so low, it might be worth buying a license right away (compared to Dynatrace Key-Astrom from the article above about the green bank, our license costs not a third of a billion but 12 thousand rubles a month per gigabyte; for Sberbank it would have been several times cheaper). But we immediately told them what we had and what we did not. A salesperson from a major integrator might have said “yes, of course we can do everything, just buy our license”, but we decided to put all our cards on the table. At the time of launch, our boxed product had no integration with Prometheus, and a new version with an automation subsystem was about to be released but had not yet been shipped to customers.

The pilot project began: its scope was defined and we were given 2 months. The main tasks were:

  • prepare a new version of the platform and deploy it in the bank's infrastructure;
  • connect 2 monitoring systems (Zabbix and Prometheus);
  • send notifications to those responsible in Slack and via SMS;
  • run autohealing scripts.

The first month of the pilot went into preparing a new version of the platform at top speed for the needs of the pilot. The new version included integration with Prometheus and autohealing right away. Thanks go to our development team: they went without sleep for several nights, but they delivered what was promised without missing the deadlines for commitments made earlier.

While setting up the pilot, we ran into a new problem that could have closed the project ahead of schedule: to send notifications to messengers and via SMS, we needed incoming and outgoing connections to Microsoft Azure servers (at that time we used this platform to send notifications to Slack) and to an external SMS sending service. But security was a special focus in this project. Under the bank's policy, such “holes” could not be opened under any circumstances. Everything had to work inside a closed perimeter. We were offered the APIs of the bank's own internal services that send notifications to Slack and via SMS, but we had no way to connect such services out of the box.

An evening of debate with the development team ended with a solution. Digging through the backlog, we found a task that had never had enough time or priority: building a plugin system so that implementation teams or clients themselves could write add-ons that extend the platform's capabilities.

But we had exactly one month left, during which we had to install and configure everything and deploy the automation.

According to Sergey, our chief architect, implementing a plugin system takes at least a month.

We were not going to make it...

There was only one solution: go to the client, tell them everything as it was, and discuss a schedule shift together. And it worked. We were given an extra 2 weeks. They had their own deadlines and internal obligations to show results, but they did have 2 weeks to spare. Now everything was on the line, and there was no room for failure. Honesty and a partnership approach paid off again.

The pilot produced several important technical results and conclusions:

Tested the new alert processing functionality

The deployed system began to receive alerts from Prometheus correctly and to group them. Alerts about a problem arrived from the client's Prometheus every 30 seconds (time-based grouping was not enabled there), and we wondered whether they could be grouped in the “umbrella” itself. It turned out they could: alert processing in the platform is configured with a script, which makes it possible to implement almost any processing logic. We have already implemented standard logic in the platform as templates, so if you do not want to come up with your own, you can use a ready-made one.


Synthetic trigger interface. Configuring the processing of alerts from connected monitoring systems
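
The time-based grouping described above can be sketched roughly as follows. This is a minimal illustration in Python, not the platform's actual processing script: the names `Alert`, `AlertGroup` and `group_alerts`, the choice of grouping key (source, alert name, host) and the `window` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]


@dataclass
class Alert:
    source: str       # hypothetical field: originating system, e.g. "prometheus"
    name: str         # hypothetical field: alert name, e.g. "HighCPU"
    host: str         # hypothetical field: affected host
    timestamp: float  # arrival time in seconds


@dataclass
class AlertGroup:
    key: Key
    first_seen: float
    last_seen: float
    count: int = 1


def group_alerts(alerts: List[Alert], window: float = 300.0) -> List[AlertGroup]:
    """Merge repeats of the same alert that arrive within `window` seconds
    of the previous repeat; a longer gap starts a new group."""
    groups: List[AlertGroup] = []
    latest: Dict[Key, AlertGroup] = {}  # most recent group per key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.name, alert.host)
        current = latest.get(key)
        if current is not None and alert.timestamp - current.last_seen <= window:
            current.last_seen = alert.timestamp
            current.count += 1
        else:
            current = AlertGroup(key, alert.timestamp, alert.timestamp)
            latest[key] = current
            groups.append(current)
    return groups


# An alert repeated every 30 seconds, then again after a long pause.
alerts = [Alert("prometheus", "HighCPU", "web-1", t) for t in (0, 30, 60, 90)]
alerts.append(Alert("prometheus", "HighCPU", "web-1", 1000))
groups = group_alerts(alerts, window=120)
print([(g.key[2], g.count) for g in groups])  # [('web-1', 4), ('web-1', 1)]
```

With a 30-second repeat interval, any window longer than 30 seconds collapses the stream into one group per ongoing problem, which is the effect the grouping in the “umbrella” is meant to achieve.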

Built the “health” state of the system

Based on the alerts, monitoring events were created that affect the health of configuration items (CIs). We implement a resource-service model (RSM) that can either use an internal CMDB or connect to an external one; in the pilot project, the client did not connect its own CMDB.


Interface for working with the resource-service model. Pilot RSM.
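
How events on individual configuration items can roll up into the health of a service may be sketched as below. This is only an illustration under stated assumptions: the rule “a CI is as unhealthy as its worst dependency” is a common simplification and not necessarily the one the platform uses, the dependency graph is assumed to be acyclic, and `compute_health` and the CI names are hypothetical.

```python
from typing import Dict, List

# Severity ranks (assumed for the example); a higher number is worse.
OK, WARNING, CRITICAL = 0, 1, 2


def compute_health(
    children: Dict[str, List[str]],
    own_status: Dict[str, int],
    root: str,
) -> Dict[str, int]:
    """Return the health of every CI reachable from `root`.

    `children` maps a CI to the CIs it depends on (assumed acyclic);
    `own_status` holds the status set directly by monitoring events.
    """
    health: Dict[str, int] = {}

    def visit(ci: str) -> int:
        if ci in health:  # already computed via another parent
            return health[ci]
        worst = own_status.get(ci, OK)
        for child in children.get(ci, []):
            worst = max(worst, visit(child))
        health[ci] = worst
        return worst

    visit(root)
    return health


# A service that depends on an API tier (two hosts) and a database.
children = {"internet-bank": ["api", "db"], "api": ["web-1", "web-2"]}
own_status = {"web-2": CRITICAL, "db": WARNING}
health = compute_health(children, own_status, "internet-bank")
print(health["internet-bank"], health["db"], health["web-1"])  # 2 1 0
```

In practice a model like this usually needs weights or redundancy rules (one failed host out of two rarely makes a service critical), so the worst-of rule here should be read as the simplest possible starting point.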

And, in fact, the client finally got a single monitoring screen showing events from different systems. At the moment, two systems are connected to the “umbrella”, Zabbix and Prometheus, plus the internal monitoring system of the platform itself.


Analytics interface. Single monitoring screen.

Launched process automation

Monitoring events triggered preconfigured actions: sending alerts, running scripts, registering or enriching incidents. The last of these was not tried with this particular client, because the pilot project had no integration with the service desk.


Action setting interface. Sending notifications to Slack and restarting the server.
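
The link between monitoring events and preconfigured actions can be pictured as a small rule table. The sketch below is an assumption-laden illustration, not the platform's API: `Event`, `Rule` and `dispatch` are hypothetical names, and the two actions are stubs that only return a description instead of really calling Slack or SSH.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    ci: str        # configuration item the event refers to
    severity: str  # e.g. "warning" or "critical" (assumed values)
    labels: Dict[str, str] = field(default_factory=dict)


Action = Callable[[Event], str]


@dataclass
class Rule:
    matches: Callable[[Event], bool]  # condition on the event
    actions: List[Action]             # actions to run, in order


def notify_slack(event: Event) -> str:
    # Stub: a real action would call the notification service here.
    return f"slack: {event.severity} on {event.ci}"


def restart_server(event: Event) -> str:
    # Stub: a real action would run an autohealing script here.
    return f"restart: {event.ci}"


def dispatch(event: Event, rules: List[Rule]) -> List[str]:
    """Run the actions of every rule whose condition matches the event."""
    results: List[str] = []
    for rule in rules:
        if rule.matches(event):
            results.extend(action(event) for action in rule.actions)
    return results


rules = [Rule(lambda e: e.severity == "critical", [notify_slack, restart_server])]
print(dispatch(Event("web-1", "critical"), rules))
# ['slack: critical on web-1', 'restart: web-1']
```

Keeping the conditions and the actions as separate, composable pieces is what lets one event fan out to a notification and an autohealing script at the same time.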

Expanded product functionality

When discussing automation scripts, the client asked for bash support and an interface in which the scripts could be configured conveniently. The new version does a little more: it adds the ability to write full-fledged logic in Lua with support for cURL, SSH and SNMP, and functionality for managing the script life cycle (create, edit, version, delete and archive).


Interface for working with autohealing scripts. Server reboot script via SSH.

Main conclusions

During the pilot, user stories also took shape that improve the current functionality and increase value for the client. Here are some of them:

  • implement the ability to forward variables directly from the alert to the autohealing script;
  • add authorization on the platform through Active Directory.

We also received more global challenges, to “build up” the product with other features:

  • auto-building a resource-service model based on ML, rather than rules and agents (probably the main challenge now);
  • support for additional scripting and logic languages (this will be JavaScript).

In my opinion, this pilot shows two most important things:

  1. Partnership with the client is the key to efficiency: when communication is built on honesty and openness, the client becomes part of a team that achieves significant results in a short time.
  2. Under no circumstances should you “customize” and build “crutches”; build only systemic solutions. It is better to spend a little more time and make a systemic solution that other clients will use too. And that is what happened: the plugin system and dropping the dependence on Azure gave additional value to other customers (hello, Federal Law No. 152).

Source: habr.com
