Five Challenges in Operating and Maintaining Highload IT Systems

Hey Habr! I have been supporting highload IT systems for ten years. This article is not about tuning nginx to handle 1000+ RPS or other technical details. Instead, I will share my observations about the process problems that arise when supporting and operating such systems.

Monitoring

Technical support should not wait for a ticket that reads “What? Why ... is the site not working again?”. Within a minute of the site going down, support should already see the problem and be working on it. But the site is only the tip of the iceberg. Its availability is simply one of the first things to be monitored.

What do you do when stock levels for the online store stop arriving from the ERP system? Or when the CRM system that calculates customer discounts stops responding? The site seems to be working. Zabbix, or whatever you use, gets its 200 response. The on-duty shift has received no notifications from monitoring and is happily watching the first episode of the new season of Game of Thrones.

Monitoring is often limited to measuring disk space, RAM, and CPU load on the servers. But for the business, it matters far more that product availability actually reaches the site. If, say, one virtual machine in a cluster goes down, traffic will simply stop going to it and the load on the other servers will increase. The company will not lose money.

Therefore, in addition to monitoring the technical parameters of the operating systems on your servers, you need to set up business metrics, that is, metrics that directly affect money. These include interactions with external systems (CRM, ERP, and others), the number of orders over a given period, successful and failed customer logins, and so on.
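A minimal sketch of what such business checks might look like, assuming the monitoring job already has the order timestamps and the time of the last ERP stock sync at hand. The names `check_order_rate` and `check_stock_freshness` and all thresholds are hypothetical and would need tuning for a real store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BusinessCheck:
    name: str
    ok: bool
    detail: str


def check_order_rate(order_times, now, window=timedelta(minutes=15), min_orders=1):
    """Fail if fewer than `min_orders` orders arrived within the last `window`."""
    recent = [t for t in order_times if now - window <= t <= now]
    ok = len(recent) >= min_orders
    return BusinessCheck("orders", ok, f"{len(recent)} orders in last {window}")


def check_stock_freshness(last_sync, now, max_age=timedelta(minutes=30)):
    """Fail if the ERP stock feed has been silent for longer than `max_age`."""
    age = now - last_sync
    return BusinessCheck("erp_stock_sync", age <= max_age, f"last sync {age} ago")


def failing(checks):
    """Return the names of the checks that should alert the on-duty shift."""
    return [c.name for c in checks if not c.ok]


# Example: orders are still coming in, but the ERP feed stopped two hours ago.
now = datetime(2019, 4, 15, 21, 0)
checks = [
    check_order_rate([now - timedelta(minutes=3)], now),
    check_stock_freshness(now - timedelta(hours=2), now),
]
print(failing(checks))  # ['erp_stock_sync']
```

In this example the site would still answer with a 200, yet the stock check fails, which is exactly the situation that plain availability monitoring misses.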

Interaction with external systems

Any site or mobile application with an annual turnover of more than a billion rubles interacts with external systems. These range from the aforementioned CRM and ERP to an external Big Data system that receives sales data, analyzes purchases, and offers the customer a product he will definitely buy (not really). Each such system has its own support team, and communicating with these teams is often painful, especially when the problem is global and has to be investigated across several systems.

Some systems give you a phone number or a Telegram contact for their admins. For others you have to email managers or go to the external system's bug tracker. Even within one large company, different systems often use different ticket trackers. Sometimes it becomes impossible to track the status of a ticket. You receive a ticket in one Jira. Then, in the comments of that first ticket, you put a link to a task in another Jira. In the second Jira, someone comments that you need to call an admin, let's call him Andrey, to resolve the issue. And so on.

The best solution to this problem would be to create a single communication space, for example in Slack, and invite everyone involved in operating the external systems. You also need a single tracker, so that tickets are not duplicated. Tickets should be tracked in one place, from the monitoring alert to the release of the bug fix to production. You will say that this is unrealistic, that historically you work in one tracker and they work in another, because different systems appeared over time and each had its own autonomous IT team. I agree, and that is exactly why the problem has to be solved from the top, at the level of the CIO or product owner.

Every system you interact with should provide support as a service, with clear SLAs for prioritizing problem resolution, rather than whenever admin Andrey happens to have a minute for you.
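As an illustration of what a "clear SLA" can mean in practice, here is a small sketch that maps ticket priority to a maximum response time and checks whether a ticket has breached it. The priorities and time limits are hypothetical and not taken from any real agreement.

```python
from datetime import datetime, timedelta

# Hypothetical SLA table: ticket priority -> maximum time to first response.
SLA_RESPONSE = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "normal": timedelta(hours=8),
}


def response_deadline(priority, opened_at):
    """Time by which the external team must respond to the ticket."""
    return opened_at + SLA_RESPONSE[priority]


def is_breached(priority, opened_at, now):
    """True if the ticket has waited longer than its SLA allows."""
    return now > response_deadline(priority, opened_at)


# Example: a ticket opened at 20:00 and still unanswered at 20:20.
opened = datetime(2019, 4, 15, 20, 0)
now = datetime(2019, 4, 15, 20, 20)
print(is_breached("critical", opened, now))  # True
print(is_breached("normal", opened, now))    # False
```

With a table like this agreed in advance, escalation depends on the clock and the priority, not on who happens to be reachable.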

The bottleneck person

Does every project (or product) have a person whose vacation sends management into convulsions? It could be a DevOps engineer, an analyst, or a developer. After all, only the DevOps engineer knows which containers run on which servers and how to restart a container when something goes wrong, and no complex problem can be solved without him. The analyst is the only one who knows how your complex mechanism works: which data flows go where, and which request parameters sent to which services produce which responses.
Who will quickly figure out why there are errors in the logs and quickly fix a critical bug in production? That same developer, of course. There are others, but for some reason only he understands how the different modules of the system fit together.

The root of this problem is the lack of documentation. If all the services of your system were documented, you could deal with the problem without the analyst. If the DevOps engineer took a couple of days out of his busy schedule and described all the servers, services, and procedures for solving typical problems, then problems could be resolved in his absence. Nobody would have to gulp down a beer on the beach and hunt for Wi-Fi to fix the problem.

Competence and responsibility of support staff

On large projects, companies do not skimp on developer salaries. They hunt for expensive middle or senior developers from similar projects. With support it is a little different. Companies try to cut these costs in every possible way. They hire inexpensive people who only yesterday were "enikey" office IT helpers, and boldly go into battle. That strategy can work if we are talking about the brochure website of some factory in Zelenograd.

If we are talking about a large online store, then every hour of downtime costs more than an administrator's monthly salary. Let's take 1 billion rubles of annual turnover as a starting point. This is the minimum turnover of any online store in the TOP-100 ranking for 2018. Divide this amount by the number of hours in a year and you get more than 100 thousand rubles of losses per hour. And if you do not count the night hours, you can safely double that amount.
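The arithmetic can be checked in a few lines. This sketch assumes, as the text does, that turnover is spread evenly over the year, and treats "not counting the night hours" as a simple doubling.

```python
ANNUAL_TURNOVER_RUB = 1_000_000_000
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

# Turnover lost per hour of downtime if sales are spread evenly over the year.
loss_per_hour = ANNUAL_TURNOVER_RUB / HOURS_PER_YEAR

# Rough adjustment from the text: ignore night hours by doubling the figure.
loss_per_daytime_hour = 2 * loss_per_hour

print(round(loss_per_hour))          # 114155
print(round(loss_per_daytime_hour))  # 228311
```

That is roughly 114 thousand rubles per hour on average, or about 228 thousand rubles per daytime hour.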

But money isn't everything, is it? (No, of course it is the main thing.) There are also reputational losses. An hour of downtime at a well-known online store can trigger both a wave of posts on social networks and articles in the trade media. And kitchen conversations between friends along the lines of “Don’t buy anything there, their site is constantly down” cannot be measured at all.

Now to responsibility. In my practice, there was a case when the administrator on duty did not respond in time to a monitoring alert about the site being unavailable. On a pleasant summer Friday evening, the site of a well-known Moscow online store went down and quietly stayed down. On Saturday morning, the product owner could not understand why the site would not open, and why there was silence in the support chats and in the urgent alerts in Slack. That mistake cost us a six-figure sum, and cost the administrator on duty his job.

Responsibility is a hard skill to develop. Either a person has it or they don't. So in interviews I try to detect it with questions that indirectly show whether a person is used to taking responsibility. If a candidate says that he chose his university because his parents told him to, or that he is changing jobs because his wife says he does not earn enough, it is better not to get involved with such people.

Interaction with the development team

When users run into simple problems with the product, support solves them on its own: it tries to reproduce the problem, analyzes the logs, and so on. But what do you do when a bug surfaces in production? In that case, support creates a task for the developers, and this is where the fun begins.

Developers are constantly swamped. They are building new features. Fixing production bugs is, to put it mildly, not the most interesting work. The deadline for the next sprint is looming. And then unpleasant people from support show up and say: "Drop everything right now, we have problems." Such tasks get the lowest priority, especially when the problem is not critical, the main functionality of the site works, and the release manager is not running around wide-eyed writing: “This task must urgently go into the next release or hotfix.”

Tasks with normal or low priority drift from release to release. Ask "When will the task be done?" and you will get answers along the lines of: “Sorry, there are a lot of tasks right now, ask the team leads or the release manager.”

Production issues should take priority over building new features. Bad reviews will not be long in coming if users keep stumbling over bugs. A damaged reputation is hard to repair.

DevOps addresses the interaction between development and support. The term is often used to mean a specific person who helps create test environments for development, builds CI/CD pipelines, and quickly brings tested code to production. In fact, DevOps is an approach to software development in which all participants in the process work closely together and help create and update software products and services faster. By participants I mean analysts, developers, testers, and support.

In this approach, support and development are not separate departments with their own goals and objectives. Development is involved in operations and vice versa. The famous phrase of distributed teams, “The problem is not on my side”, shows up in chats less often, and end users become a little happier.

Source: habr.com
