In my publications on Habr, I have already written about my experience of building partnerships within my team (that article discussed how to draft a partnership agreement when starting a new business so that it doesn't fall apart). Now I'd like to talk about how to build partnerships with clients, because without clients there is no business at all. I hope this article will be useful for startups beginning to sell their product to large enterprises.
I'm currently leading such a startup, MONQ Digital Lab, where my team and I are developing a product to automate corporate IT support and operations processes. Entering the market is no easy task, so we started with a little homework, interviewing market experts and our partners, and conducting market segmentation. The key question was, "Whose pain points can we best address?"
Banks made it into the top three segments. And, of course, Tinkoff and Sberbank topped the list. When we approached banking market experts, they said, "Introduce your product there, and the path to the banking market will be clear." We tried to enter both. Sberbank was a failure, while the folks at Tinkoff turned out to be much more open to productive communication with Russian startups (perhaps because Sberbank was in the middle of an almost-billion-ruble deal with our Western competitors). Just a month later, we launched a pilot project. Read on to see how it all went.
We've been working on operations and monitoring for many years, and we're currently implementing our product in the public sector, insurance, banking, and telecom companies. We've also implemented one with an airline (before the project, we had no idea that aviation was such an IT-dependent industry, and we're very hopeful, despite COVID, that the company will thrive and take off).
The product we develop falls under the AIOps (Artificial Intelligence for IT Operations) umbrella of enterprise software. The primary goals of implementing such systems, as a company's processes mature, are:
- Put out fires: identify failures, clear the alert stream of garbage, assign tasks and incidents to those responsible;
- Improve the efficiency of the IT service: reduce the time it takes to resolve incidents, identify the causes of failures, and increase the transparency of the IT status;
- Improve business efficiency: reduce manual labor, mitigate risks, and increase customer loyalty.
In our experience, banks share the same monitoring pain points as all large IT infrastructures:
- “Everyone does their own thing”: there are many technical departments, almost each has at least one monitoring system, and most have more than one;
- A "mosquito swarm" of alerts: each system generates hundreds of them and bombards all responsible parties (sometimes even across departments). Maintaining constant focus on each notification is difficult, as their urgency and importance are negated by the sheer volume;
- Large banks—the sector's leaders—want not only to continuously monitor their systems and know where failures are, but also to harness the true magic of AI—to make systems self-monitoring, self-predicting, and self-correcting.
When we arrived at our first meeting at Tinkoff, we were immediately told that they had no problems with monitoring and that nothing was bothering them. The main question was, "What can we offer for those who are already doing well?"
The conversation was lengthy, and we discussed how their microservices are built, how their departments operate, which infrastructure issues are more sensitive and which are less sensitive for users, where the blind spots are, and what their goals and SLAs are.
Incidentally, the bank's SLAs are truly impressive. For example, only a few minutes are allotted to resolve a priority incident related to network availability. The cost of error and downtime here is certainly significant.
As a result, we identified several areas of cooperation:
- The first stage is umbrella monitoring to increase the speed of incident resolution.
- The second stage is automation of processes to reduce risks and reduce the costs of scaling the IT department.
Several "blind spots" could only be covered by processing information from multiple monitoring systems, since capturing the metrics directly was impossible. Data from the various monitoring systems also needed to be brought together on a single screen to give an overall picture. Umbrella systems are built for exactly this task, and we met these requirements at the time.
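To make the umbrella idea concrete, here is a minimal Python sketch of how events from different monitoring systems can be normalized into one common schema. The field names and the Zabbix priority mapping are illustrative assumptions, not our platform's actual data model:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class UmbrellaEvent:
    source: str      # originating monitoring system
    host: str
    severity: str    # normalized: "critical" | "warning" | "info"
    message: str
    timestamp: datetime

# Each connected system gets its own adapter that maps
# system-specific fields onto the common event schema.
def from_zabbix(raw: dict) -> UmbrellaEvent:
    # Zabbix uses numeric severities 0..5; 4+ is treated as critical here
    sev = {5: "critical", 4: "critical", 3: "warning"}.get(raw["priority"], "info")
    return UmbrellaEvent("zabbix", raw["hostname"], sev, raw["trigger_name"],
                         datetime.fromtimestamp(raw["clock"], tz=timezone.utc))

def from_prometheus(raw: dict) -> UmbrellaEvent:
    # Prometheus alerts carry labels/annotations; severity is a label by convention
    labels = raw["labels"]
    return UmbrellaEvent("prometheus", labels.get("instance", "unknown"),
                         labels.get("severity", "info"),
                         raw["annotations"].get("summary", ""),
                         datetime.now(tz=timezone.utc))
```

Once every source is reduced to `UmbrellaEvent`, deduplication, grouping, and a single-screen view become ordinary operations over one stream.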
We believe honesty is crucial in our relationships with clients. After the initial conversation and the license price calculation, they suggested that since the price was so low, perhaps they should just buy it right away (unlike the Dynatrace Key-Astrom deal from the article above about the green bank, our license doesn't cost a third of a billion rubles, but 12,000 rubles per month per gigabyte; for Sber it would have come out several times cheaper). But we told them up front what we had and what we didn't. A sales representative from a large integrator might have said, "Yes, of course, we can do everything, buy our license," but we decided to lay all our cards on the table: our product had no Prometheus integration at the time, and a new version with an automation subsystem was about to be released but hadn't yet shipped to clients.
The pilot project began, its scope was defined, and we were given two months to complete it. The main objectives were:
- prepare a new version of the platform and deploy it in the bank's infrastructure;
- connect 2 monitoring systems (Zabbix and Prometheus);
- send notifications to those responsible via Slack and SMS;
- run autohealing scripts.
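For the Prometheus side of the scope, alerts typically arrive as an Alertmanager webhook. A minimal sketch of the intake step (the payload fields `alerts`, `status`, `labels` follow Alertmanager's webhook format; everything else here is simplified):

```python
import json

def firing_alerts(body: bytes) -> list[dict]:
    """Return only the 'firing' alerts from an Alertmanager webhook payload,
    ignoring resolved ones. A real receiver would also validate the payload."""
    payload = json.loads(body)
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]
```

In the pilot, events extracted this way fed the same pipeline as Zabbix events, and notifications and autohealing hung off that pipeline.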
We spent the first month of the pilot preparing a new version of the platform at breakneck speed for the pilot's needs. The new version immediately included Prometheus integration and autohealing. Thanks to our development team, who worked nights, we delivered what we promised without missing any other deadlines.
While setting up the pilot, we ran into a new issue that could have shut the project down prematurely: to send notifications to messengers and via SMS, we needed incoming and outgoing connections to Microsoft Azure servers (at the time we used this platform to deliver notifications to Slack) and to an external SMS service. But security was a particular focus of this project, and under the bank's policy such external connections could not be opened under any circumstances: everything had to run within a closed loop. We were offered the option of using the APIs of the bank's own internal services that send notifications to Slack and via SMS, but we had no out-of-the-box way to integrate with them.
An evening of debate with the development team resulted in a successful solution. After digging through the backlog, we identified one task that we'd never had the time or priority for: creating a plugin system so that implementation teams or clients could write add-ons themselves, expanding the platform's capabilities.
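The core of such a plugin system can be sketched in a few lines of Python. This is only an illustration of the idea (the registry API and the internal-gateway call are invented for the example), not our platform's actual interface:

```python
class PluginRegistry:
    """Minimal plugin registry: each notification channel registers a callable,
    so clients can plug in their own internal gateways (SMS, Slack, ...)."""

    def __init__(self):
        self._channels = {}

    def register(self, name: str, send_fn):
        """A plugin calls this at load time to expose a channel."""
        self._channels[name] = send_fn

    def notify(self, channel: str, message: str):
        if channel not in self._channels:
            raise KeyError(f"no plugin registered for channel '{channel}'")
        return self._channels[channel](message)

registry = PluginRegistry()

# A client-written plugin wrapping the bank's internal SMS gateway
# (the endpoint name is hypothetical):
registry.register("sms", lambda msg: f"POST /internal/sms {msg}")
```

The platform core stays unaware of where notifications actually go; the closed-loop requirement is satisfied because the client's own plugin talks only to internal services.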
But we had exactly one month left in which to install, configure, and deploy automation.
According to Sergey, our chief architect, implementing the plugin system would take at least a month.
We didn't have time...
There was only one solution: go to the client and explain everything as it was. Together we discussed extending the deadline, and it worked: they gave us an extra two weeks. They had their own deadlines and internal commitments to deliver results, but they had two weeks of slack. By then we had put everything on the line and simply could not screw up. Honesty and a collaborative approach paid off once again.
The pilot produced several important technical results and conclusions:
We tested new alert processing functionality
The deployed system began correctly receiving alerts from Prometheus and grouping them. Alerts from the Prometheus client were arriving every 30 seconds (time grouping wasn't enabled on their side), and we wondered whether they could be grouped within the umbrella itself. It turned out they could: the platform's alert handling is configured with a script, which allows virtually any processing logic. We've already shipped the standard logic as templates in the platform, so if you don't want to invent your own, you can use a pre-built one.

Synthetic trigger interface. Configuring alert processing from connected monitoring systems
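The grouping logic itself is simple in spirit. Here is a Python sketch of time-window grouping of repeated alerts (our platform scripts this logic in Lua; the window size and the data shapes here are arbitrary choices for the example):

```python
def group_alerts(alerts, window=300):
    """Collapse repeated alerts with the same key that arrive within
    `window` seconds of the previous occurrence into a single group.
    `alerts` is an iterable of (timestamp, key) pairs."""
    grouped = {}
    for ts, key in sorted(alerts):
        entry = grouped.get(key)
        if entry is None or ts - entry["last_seen"] > window:
            # first occurrence, or the previous window has expired
            grouped[key] = {"first_seen": ts, "last_seen": ts, "count": 1}
        else:
            entry["last_seen"] = ts
            entry["count"] += 1
    return grouped
```

Applied to an alert repeating every 30 seconds, this turns a stream of identical notifications into one group with a counter, which is exactly what keeps the "mosquito swarm" off the duty engineer's screen.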
We built the system's health status
Alerts generated monitoring events that affected the health of configuration items (CIs). The platform implements a resource-service model (RSM), which can use either its internal CMDB or connect to an external one (during the pilot, the client did not connect their CMDB).

Interface for working with the resource-service model. Pilot RSM.
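To illustrate how health rolls up through an RSM, here is a toy Python sketch. The CI names, the 0-100 health scale, and the "worst dependency wins" rule are simplifying assumptions for the example, not our platform's exact algorithm:

```python
def ci_health(ci, children, events):
    """Health of a configuration item: its own event-derived health
    combined with its children's. A CI is as healthy as its worst dependency."""
    own = events.get(ci, 100)  # 100 = healthy unless an event says otherwise
    child_health = [ci_health(c, children, events) for c in children.get(ci, [])]
    return min([own] + child_health)

# Toy model: an internet-bank service depends on an app tier and a database;
# the app tier runs on two VMs.
tree = {"internet-bank": ["app", "db"], "app": ["vm-1", "vm-2"]}
events = {"vm-2": 40}  # a monitoring event degraded vm-2 to 40%
```

With this model, a single degraded VM immediately shows up as degraded health of the business service that depends on it, which is the whole point of tying alerts to an RSM.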
And so, the client finally has a unified monitoring screen, displaying events from various systems. Currently, two systems are connected to the "umbrella"—Zabbix and Prometheus—as well as the platform's internal monitoring system.

Analytics interface. Single monitoring screen.
We launched process automation
Monitoring events triggered preconfigured actions—sending alerts, running scripts, and logging/enriching incidents—but the latter was not tested with this particular client because the pilot project did not include integration with the service desk.

Action settings interface. Sending notifications to Slack and server restart.
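The "event triggers action" mechanism can be sketched as a rule dispatcher. This is a hedged illustration (the rule format and action names are invented), not the platform's configuration syntax:

```python
def run_actions(event, rules):
    """Return every action whose conditions all match the incoming event.
    `rules` is a list of {"when": {...}, "action": name} dicts."""
    fired = []
    for rule in rules:
        if all(event.get(k) == v for k, v in rule["when"].items()):
            fired.append(rule["action"])
    return fired

rules = [
    {"when": {"severity": "critical"}, "action": "notify-slack"},
    {"when": {"severity": "critical", "service": "web"}, "action": "restart-server"},
]
```

A critical event on the web service would fire both a notification and a restart, while a mere warning fires nothing; incident logging and enrichment would be just additional actions in the same list.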
We expanded the product's functionality
When discussing automation scripts, the client requested bash support and an interface that would allow for easy configuration of these scripts. The new version adds a bit more (the ability to write full-fledged logic constructs in Lua with support for cURL, SSH, and SNMP) and implements functionality for managing the script lifecycle (creation, editing, versioning, deletion, and archiving).

Autohealing script interface. Server reboot script via SSH.
Main conclusions
During the pilot, user stories were also created that improve current functionality and increase customer value. Here are some of them:
- implement the ability to pass variables directly from the alert to the auto-healing script;
- add Active Directory authorization to the platform.
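The first user story above amounts to template substitution: fields of the alert become variables in the autohealing script. A minimal Python sketch (the `$host`/`$service` placeholder names and the command template are illustrative, not our platform's syntax):

```python
from string import Template

def render_script(template: str, alert: dict) -> str:
    """Substitute alert fields into an autohealing script template.
    safe_substitute leaves unknown placeholders untouched instead of raising."""
    return Template(template).safe_substitute(alert)

# Example: restart the failed service on the host named in the alert
script = render_script("ssh ops@$host 'systemctl restart $service'",
                       {"host": "vm-2", "service": "nginx"})
```

With this in place, one generic remediation template serves every host and service, instead of a separate hard-coded script per CI.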
We also faced more global challenges, ways to "grow" the product with new capabilities:
- automatic construction of a resource-service model based on ML, rather than rules and agents (probably the main challenge now);
- support for additional scripting and logic languages (and this will be JavaScript).
In my opinion, this pilot demonstrates two important things:
- Partnership with the client is the key to effectiveness: when communication is built on honesty and openness, the client becomes part of the team, and that team achieves significant results in a short time.
- Under no circumstances should you bolt on one-off "customizations" or quick hacks; build only systemic solutions. It's better to spend a little more time and create a systemic solution that other clients will use too. Incidentally, that's exactly what happened: the plugin system and the removal of the Azure dependency brought additional value to other clients as well (hello, Federal Law 152).
Source: habr.com
