Monitoring Sportmaster: how and with what

We started thinking about a monitoring system back when the product teams were being formed. It became clear that our business - operations - doesn't fit into any of those teams. Why is that?

The thing is, all our teams are built around individual information systems, microservices and front ends, so no single team sees the health of the whole system end to end. For example, they may not know how some small piece deep in the backend affects the front end. Their interests are limited to the systems their own system integrates with. If a team and its service A have almost nothing to do with service B, then service B is practically invisible to that team.


Our team, in turn, works with systems that are very tightly integrated with each other: there are many connections between them, and the infrastructure is very large. The operation of the online store depends on all of these systems (and we have a huge number of them).

So our department doesn't belong to any one team; it stands slightly apart. Throughout this whole story, our job is to understand, as a whole, how the information systems work - their functionality, integrations, software, network and hardware - and how all of it is interconnected.

The platform on which our online stores operate looks like this:

  • front
  • middle-office
  • back-office

As much as we would like it to, it never happens that all systems run smoothly and flawlessly. The reason, again, is the number of systems and integrations: with as many as we have, some incidents are inevitable despite the quality of testing, both within individual systems and at the level of their integrations. And you need to monitor the state of the whole platform as a whole, not just some separate part of it.

Ideally, monitoring the health of the entire platform should be automated, and we came to monitoring as an inevitable part of that. Initially it was built only for the front-end part, while network engineers and software and hardware administrators had (and still have) their own monitoring systems for their layers. Each of them watched the monitoring only at their own level; nobody had the complete picture either.

For example, if a virtual machine crashes, in most cases only the administrator responsible for the hardware and the virtual machine knows about it. In such cases the front team saw the fact that the application had crashed, but had no data about the virtual machine failure. The administrator, in turn, may know who the customer is and roughly what is running on that virtual machine - provided it is some large project; he probably doesn't know about the small ones. Either way, the administrator has to go to the owner and ask what was on the machine, what needs to be restored and what needs to be changed. And if something really serious broke, the running around in circles began - because nobody saw the system as a whole.

Ultimately, these disparate stories affect the entire front end, the users, and our core business function - online sales. Since we are not part of the product teams but operate all the e-commerce applications that make up the online store, we took on the task of creating an integrated monitoring system for the e-commerce platform.

System structure and stack

We started by identifying several monitoring layers for our systems, within which we need to collect metrics. All of this then had to be combined, which is what we did in the first stage. Now we are refining the collection of the highest-quality metrics across all of our layers in order to build correlations and understand how the systems affect each other.

The lack of comprehensive monitoring in the early stages of the applications' life (we started building it when most of the systems were already in production) left us with significant technical debt in setting up monitoring for the entire platform. We could not afford to focus on one information system and work out its monitoring in detail, because the remaining systems would have been left without monitoring in the meantime. To solve this, we identified a list of the most essential metrics for assessing the state of an information system, layer by layer, and began implementing it.

In other words, we decided to eat the elephant one bite at a time.

Our system is made up of:

  • hardware;
  • operating system;
  • development;
  • UI parts in the monitoring application;
  • business metrics;
  • integration applications;
  • information security;
  • network;
  • traffic balancer.


At the heart of all this is monitoring itself. To understand the state of the entire system, you need to know what is happening with the applications on all of these layers and across the entire set of applications.

So, about the stack.


We use open source software. At the center is Zabbix, which we use primarily as an alerting system. Everyone knows it is great for infrastructure monitoring. What does that mean here? Exactly the low-level metrics that every company running its own data center has (and Sportmaster has its own data centers): server temperature, memory status, RAID state, network device metrics.

We have integrated Zabbix with Telegram and Microsoft Teams, which the teams actively use. Zabbix covers the network layer, the hardware and part of the software, but it is not a panacea, so we enrich this data from other services. For example, for hardware we connect directly to our virtualization system via its API and collect data from there.
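To give an idea of how such an integration can look, here is a minimal sketch of a custom Zabbix alert script that forwards notifications to Telegram. Zabbix passes the recipient, subject and message to alert scripts as positional arguments; the bot token and chat id here are placeholders, not our real configuration.

```python
#!/usr/bin/env python3
"""Sketch of a Zabbix custom alert script that forwards alerts to a Telegram chat.
Zabbix passes the recipient (chat id), subject and message as positional arguments."""
import sys
import requests

BOT_TOKEN = "123456:ABC..."  # placeholder Telegram bot token


def send(chat_id: str, subject: str, body: str) -> None:
    # Telegram Bot API: sendMessage delivers a plain-text alert to a chat or group
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": chat_id, "text": f"{subject}\n\n{body}"},
        timeout=10,
    ).raise_for_status()


if __name__ == "__main__":
    send(sys.argv[1], sys.argv[2], sys.argv[3])
```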

What else? In addition to Zabbix, we use Prometheus, which lets us monitor application metrics in a dynamic environment. That is, we can receive application metrics over an HTTP endpoint and not worry in advance about which metrics to load into it and which not. This data can then be used for analytical queries.
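A minimal sketch of what "metrics over an HTTP endpoint" means in practice, using the prometheus_client library; the metric names are illustrative, not our actual ones.

```python
"""Expose application metrics over HTTP for Prometheus to scrape."""
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

ORDERS = Counter("shop_orders_total", "Orders placed since process start")
QUEUE_DEPTH = Gauge("shop_integration_queue_depth", "Messages waiting in the integration queue")

if __name__ == "__main__":
    start_http_server(8000)  # metrics become available at http://localhost:8000/metrics
    while True:
        ORDERS.inc(random.randint(0, 3))        # simulate incoming orders
        QUEUE_DEPTH.set(random.randint(0, 50))  # simulate queue depth
        time.sleep(5)
```

Prometheus then scrapes this endpoint on its own schedule, so the application does not need to know anything about the monitoring system.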

The data sources for the other layers, such as business metrics, fall into three parts.

First, there are external business systems such as Google Analytics, plus metrics we collect from logs. From these we get data on active users, conversions and everything else related to the business. Second, there is the UI monitoring system, which deserves a more detailed description.

Once upon a time we started with manual testing, which grew into automated tests of functionality and integrations. We turned those into monitoring, keeping only the core functionality and relying on markers that are as stable as possible and do not change often over time.

With the new team structure, all application work is concentrated in the product teams, so we stopped doing pure testing. Instead, we turned the tests into UI monitoring, written in Java with Selenium, with Jenkins used to run the checks and generate reports.

We had many tests, but in the end we decided to stick to the main road - the top-level metrics. If we keep a lot of specific tests, it becomes hard to keep them up to date: every new release would noticeably break the whole suite and we would do nothing but repair it. So we settled on very fundamental things that rarely change, and monitor only those. A sketch of such a check follows below.
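The sketch below shows the idea of a check built on a stable page marker. Our real checks are written in Java with Selenium and launched from Jenkins; this version is in Python purely for illustration, and the URL and CSS selector are made up.

```python
"""A UI monitoring check that relies on one stable page marker."""
import sys

from selenium import webdriver
from selenium.webdriver.common.by import By


def check_storefront(url: str = "https://example-shop.local") -> bool:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without a display, as on a CI agent
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Rely only on a marker that rarely changes between releases,
        # e.g. the "add to cart" button, not on volatile layout details.
        driver.find_element(By.CSS_SELECTOR, "[data-marker='add-to-cart']")
        return True
    except Exception:
        return False
    finally:
        driver.quit()


if __name__ == "__main__":
    sys.exit(0 if check_storefront() else 1)  # non-zero exit lets the CI job mark the check as failed
```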

Finally, the third data source is the centralized logging system. We use the Elastic Stack for logs, and from there we can pull data into our business-metrics monitoring. On top of all this runs our own Monitoring API service, written in Python, which queries arbitrary services over their APIs and feeds the data into Zabbix.
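A sketch in the spirit of that Monitoring API service: poll a service over HTTP and push the value into Zabbix as a trapper item via the standard zabbix_sender utility. The URL, host name and item key are placeholders, and the real service is of course more elaborate.

```python
"""Poll a service API and push the value into Zabbix via zabbix_sender."""
import subprocess

import requests

ZABBIX_SERVER = "zabbix.local"          # assumed Zabbix server address
MONITORED_HOST = "ecom-middle-office"   # host name as registered in Zabbix
ITEM_KEY = "orders.queue.depth"         # trapper item key defined on that host


def poll_and_push() -> None:
    # Pull a metric from some service API (any JSON endpoint will do for the sketch)
    value = requests.get("https://service.local/api/queue/depth", timeout=10).json()["depth"]
    # Hand the value over to Zabbix; the item must be of type "Zabbix trapper"
    subprocess.run(
        ["zabbix_sender", "-z", ZABBIX_SERVER, "-s", MONITORED_HOST,
         "-k", ITEM_KEY, "-o", str(value)],
        check=True,
    )


if __name__ == "__main__":
    poll_and_push()
```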

Another indispensable part of monitoring is visualization, which we build on Grafana. It stands out among visualization systems because a single dashboard can show metrics from different data sources. We can put high-level online store metrics, such as the number of orders placed in the last hour, from the DBMS, performance metrics of the OS this store runs on from Zabbix, and instance metrics of the application from Prometheus - all on one dashboard. Clear and accessible.
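As a rough illustration of such a mixed dashboard, here is a sketch that publishes a two-panel dashboard via the Grafana HTTP API, with each panel pointing at a different data source by name. The panel contents, data source names and token are simplified placeholders; exact dashboard JSON details vary between Grafana versions.

```python
"""Publish a dashboard with panels from two different data sources via the Grafana HTTP API."""
import requests

GRAFANA_URL = "https://grafana.local"
GRAFANA_TOKEN = "glsa_..."  # placeholder API token

dashboard = {
    "dashboard": {
        "title": "Online store - health overview (sketch)",
        "panels": [
            {   # OS / infrastructure metrics come from the Zabbix data source
                "title": "App server CPU load (Zabbix)",
                "type": "timeseries",
                "datasource": "Zabbix",
                "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
            },
            {   # Application metrics come from Prometheus
                "title": "Orders per minute (Prometheus)",
                "type": "timeseries",
                "datasource": "Prometheus",
                "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
            },
        ],
    },
    "overwrite": True,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
    json=dashboard,
    timeout=10,
)
resp.raise_for_status()
```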

A note about security: we are now finishing a system that we will later integrate with the global monitoring system. In my opinion, the main information security problems facing e-commerce are bots, scrapers and brute force. We need to keep an eye on them, because they can critically affect both the operation of our applications and our reputation from a business point of view. The chosen stack covers these tasks well.

Another important point: the application layer is collected by Prometheus, which is itself also integrated with Zabbix. We also have sitespeed, a service that lets us monitor parameters such as page loading speed, bottlenecks, page rendering and script loading; it is integrated via API as well. So our metrics end up in Zabbix, and we alert from there too. For now all alerts go out through the main channels (email and Telegram; MS Teams was connected recently). There are plans to evolve alerting to the point where smart bots work as a service and deliver monitoring information to all interested product teams.

For us, it is important to have not only the metrics of individual information systems, but also general metrics for the entire infrastructure the applications use: clusters of physical servers hosting the virtual machines, traffic balancers, network load balancers, the network itself, and the utilization of communication channels. Plus metrics for our own data centers (we have several, and the infrastructure is quite substantial).


The advantage of our monitoring system is that it lets us see the health of all systems and assess how they affect each other and the shared resources. Ultimately it enables resource planning, which is also our area of responsibility. We manage the server resource pool within e-commerce: we put equipment into and out of service, buy new hardware, audit resource utilization, and so on. Every year the teams plan new projects and develop their systems, and it is important for us to provide them with resources.

Metrics show us the trend of resource consumption by our information systems, and we can plan based on it. At the virtualization level we collect data and see the amount of available resources per data center. Inside a data center you can see both the utilization and the actual distribution and consumption of resources - for standalone servers as well as for virtual machines and the clusters of physical servers on which all those virtual machines are busily spinning.

Prospects

The core of the system is now ready, but there is plenty left to work on. At a minimum there is the information security layer, but it is also important to reach the network layer, develop alerting and solve the correlation problem. We have many layers and systems, and each layer has many more metrics. It is a matryoshka, nested several dolls deep.

Our task is to eventually produce the right alerts. For example: there is a hardware problem, again with a virtual machine, an important application was running on it, and the service had no redundancy. We learn that the virtual machine has died. Then the business metrics fire: users have disappeared somewhere, there is no conversion, the UI is unavailable, software and services have died too.

In that scenario we get a flood of alerts, and that no longer fits the idea of a proper monitoring system. This is where correlation comes in. Ideally, our monitoring system should say: "Guys, your physical machine has died, and with it this application and these metrics," in one alert instead of furiously bombarding us with hundreds of them. It should report the main thing - the cause - because localizing the cause is what makes eliminating the problem fast.
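A toy sketch of the kind of correlation we are aiming for: when a root-cause alert (say, a dead hypervisor) is active, dependent alerts for the VMs and applications running on it are collapsed under it instead of being sent separately. The dependency map and alert names here are invented purely for illustration.

```python
"""Collapse dependent alerts under their highest firing ancestor (the probable root cause)."""
from collections import defaultdict

# child -> parent: "this object lives on / depends on that object"
DEPENDS_ON = {
    "vm-web-01": "hypervisor-03",
    "app-storefront": "vm-web-01",
    "metric-orders-per-min": "app-storefront",
}


def correlate(active_alerts: set[str]) -> dict[str, list[str]]:
    grouped = defaultdict(list)
    for alert in active_alerts:
        root = alert
        node = alert
        while node in DEPENDS_ON:
            node = DEPENDS_ON[node]
            if node in active_alerts:
                root = node  # keep climbing while the ancestor is also alerting
        if root != alert:
            grouped[root].append(alert)
        else:
            grouped.setdefault(root, [])
    return grouped


if __name__ == "__main__":
    alerts = {"hypervisor-03", "vm-web-01", "app-storefront", "metric-orders-per-min"}
    # -> one notification about hypervisor-03, with the other three attached as consequences
    print(correlate(alerts))
```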

Our alert system and alert processing are built around a 24/7 hotline service. All alerts that we consider must-haves and that are on the checklist are handed over to it. Every alert must have a description: what happened, what it actually means, and what it affects. Plus a link to the dashboard and instructions on what to do in that case.

That is all there is to the requirements for an alert. From there, the situation develops in one of two directions: either there is a real problem that needs to be solved, or the monitoring system itself misfired. Either way, someone has to go and investigate.

On average we now get about a hundred alerts per day, and that is with alert correlation not yet properly configured. When we need to carry out maintenance and forcibly switch something off, their number grows many times over.

In addition to monitoring the systems we operate and collecting the metrics we consider important, the monitoring system lets us collect data for the product teams. They can influence the set of metrics collected within the information systems we monitor.

A colleague can come and ask us to add a metric that will be useful both for us and for their team. Or a team may need something beyond the basic metrics we already have, to track something specific. In Grafana we create a space for each team and grant them admin rights. And if a team needs dashboards but cannot or does not know how to build them, we help.

Since we are outside the teams' value-creation, release and planning stream, we are gradually arriving at a point where releases of all systems are seamless and can be rolled out daily without being coordinated with us. It is important for us to keep track of these releases, because they can potentially affect the operation of an application and break something, and that is critical. To manage releases we use Bamboo, from which we pull data via the API: we can see which releases went out in which information systems, their status and, most importantly, when. We put release markers on the main critical metrics, which is visually very telling when problems occur.
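A hedged sketch of how such release markers can be produced: read the latest build of a plan from the Bamboo REST API and create a Grafana annotation at that moment. The URLs, plan key and token are placeholders, and the exact response fields may differ between Bamboo versions.

```python
"""Read the latest Bamboo build result and draw a release marker as a Grafana annotation."""
import requests

BAMBOO_URL = "https://bamboo.local"
GRAFANA_URL = "https://grafana.local"
GRAFANA_TOKEN = "glsa_..."   # placeholder Grafana API token
PLAN_KEY = "SHOP-FRONT"      # placeholder Bamboo plan key


def mark_latest_release() -> None:
    # Latest result of a build plan (Bamboo REST API, JSON flavour)
    build = requests.get(
        f"{BAMBOO_URL}/rest/api/latest/result/{PLAN_KEY}/latest.json",
        timeout=10,
    ).json()

    # Create an annotation; without an explicit "time" Grafana stamps it with "now"
    requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={
            "tags": ["release", PLAN_KEY],
            "text": f"{build.get('buildResultKey')} - {build.get('state')}",
        },
        timeout=10,
    ).raise_for_status()


if __name__ == "__main__":
    mark_latest_release()
```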

This way we can see the correlation between new releases and emerging problems. The main idea is to understand how the system behaves at all layers, localize a problem quickly and fix it just as quickly. After all, it often happens that most of the time goes not into solving the problem but into finding the cause.

Going forward, we want to focus on proactivity. Ideally, we would like to know about an approaching problem in advance, not after the fact, so that we can prevent it rather than fix it. Sometimes the monitoring system produces false positives, both due to human error and due to changes in applications. We are working on that and debugging it, and we try to warn the people who use the monitoring system with us before any manipulations on it, or schedule such work in a maintenance window.

So, the system was launched at the beginning of spring and has been working successfully ever since... and it brings very real benefits. Of course, this is not its final version; we will add many more useful things. But right now, with this many integrations and applications, monitoring automation is truly indispensable.

If you also monitor large projects with a serious number of integrations, tell us in the comments what silver bullet you found for it.

Source: habr.com
