Free software for LMS: how free software helps to administer critical business systems at VTB

The documentation support system at our bank is constantly evolving and scaling, while the requirements for its speed and fault tolerance keep growing. At some point it became too risky to run the system (LMS below) without effective centralized monitoring. To protect business processes at VTB and simplify the administrators' work, we implemented a solution based on an open technology stack. With it, we can respond to incidents proactively and head off potential problems. Below is the story of our experience using free software to monitor large-scale business systems.

Why monitor the document management system

Since 2005, the CompanyMedia system has been "managing" documentation support at VTB Bank. More than 60 users work in LMS, creating more than a million new documents every month. Our servers must run around the clock: at almost any moment there are 2,500-3,000 people in the system, connecting from across the country, from Petropavlovsk-Kamchatsky to Kaliningrad. Every second of LMS operation brings 10-15 changes.

For the system to perform its tasks reliably, we deployed a fault-tolerant infrastructure with proxy servers, request balancing, information protection, full-text search, integration routes and backup. Supporting and administering a project of this scale takes enormous resources. Administrators monitor basic server information around the clock: RAM load, CPU time, the I/O subsystem, and so on. Beyond that, we need finer-grained analytics:

  • calculation of the time spent on the execution of business scenarios;
  • monitoring the dynamics of system performance and load on it;
  • search for deviations in system components from approved non-functional requirements.

Eleven years after the LMS was introduced, the question of responding proactively to all kinds of errors became especially pressing. The bank's management realized that operating without monitoring and without a console showing the system's health is playing with fire: the slightest failure in a business system of this level can mean millions in losses.

In 2016, we began implementing tools to quickly identify problems in the functioning of the LMS, including real-time monitoring of the parameters we care about. Before that, the application monitoring system had been deployed and tested within the infrastructure of the InterTrust company.

How it all began

Today, the centralized system for applied monitoring of VTB LMS, based on open source software products, helps to prevent most of the errors associated with document management, quickly and accurately classify problems, and promptly respond to any incidents. It includes two subsystems:

  • to monitor the IT infrastructure of the system services;
  • to monitor the occurrence of errors in the work of the LMS.

It all started with a single free monitoring application. After going through several options, we settled on Zabbix, free software that was originally written for monitoring banking services and equipment. This system, with a PHP web interface and support for storing data in MySQL, PostgreSQL, SQLite or Oracle Database, was a perfect fit for our needs.

Zabbix runs its agents on each server and collects the metrics of interest in real time into a single database. With it, it is convenient to collect data on CPU and RAM load, network usage and other components, check the availability and response of standard services (SMTP or HTTP), run external programs, and monitor devices via SNMP.

Once we deployed Zabbix, we configured the default hardware metrics, and that was enough at first. But the VTB LMS kept developing and growing: in 2016 the number of servers increased markedly, migration processes appeared, and the Bank of Moscow, VTB Capital and VTB24 joined the system. The standard metrics were no longer enough, so we taught Zabbix to track the queues on each of the volumes connected to a server (out of the box, Zabbix only reflects the total disk queue), as well as the time it takes to run a particular procedure.
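
The article does not say how the per-volume queue metric was collected, or on which operating system. As a minimal sketch, assuming Linux servers, the number of I/O requests currently in progress per block device can be read from text in the `/proc/diskstats` format. The function name and the sample values below are hypothetical; the result could then be fed to the monitoring system as a custom item.

```python
def parse_inflight_io(diskstats_text):
    """Return {device_name: I/O requests currently in progress}.

    Assumes the Linux /proc/diskstats layout: major, minor, device name,
    then the statistics fields, the ninth of which is "I/Os in progress".
    """
    queues = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        device = fields[2]
        queues[device] = int(fields[11])  # 3 id fields + 8 preceding stats
    return queues


# Hypothetical sample in /proc/diskstats format (values are made up).
SAMPLE = """\
   8       0 sda 1200 30 98000 450 800 20 64000 900 3 1100 1350
   8      16 sdb 500 10 40000 200 300 5 24000 400 0 500 600
"""

if __name__ == "__main__":
    for device, depth in parse_inflight_io(SAMPLE).items():
        print(f"{device}: {depth} request(s) in flight")
```

On a real Linux host the same function could be called with the contents of `/proc/diskstats` instead of the sample string.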

In addition, we equipped the system with multiple triggers: conditions under which a notification is sent to the administrator (a Telegram message, an SMS or an email). Triggers can be configured for any set of parameters. For example, you can specify a threshold percentage of free disk space, and the system will notify the administrator when that threshold is reached, or report that a background procedure is taking longer than usual.
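
The two example conditions can be sketched in plain Python. This is only an illustration of the logic, not Zabbix trigger syntax, and the function names, the 10% threshold and the 1.5x tolerance are assumptions rather than values from our configuration.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Share of the volume that is still free, in percent."""
    return 100.0 * free_bytes / total_bytes


def disk_trigger_fired(total_bytes, free_bytes, min_free_percent=10.0):
    """Fire when free space has dropped to the threshold or below."""
    return free_percent(total_bytes, free_bytes) <= min_free_percent


def duration_trigger_fired(duration_s, usual_s, tolerance=1.5):
    """Fire when a procedure runs longer than `tolerance` times its usual time."""
    return duration_s > usual_s * tolerance


if __name__ == "__main__":
    usage = shutil.disk_usage(".")  # volume holding the current directory
    if disk_trigger_fired(usage.total, usage.free):
        print("ALERT: free disk space is at or below the threshold")
    if duration_trigger_fired(duration_s=100, usual_s=60):
        print("ALERT: background procedure is slower than usual")
```

Keeping the conditions as small pure functions makes the thresholds easy to check separately from the code that gathers the measurements.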

Java connection and data visualization

We significantly expanded the range of data analyzed, but soon even that was not enough for effective monitoring. Since CompanyMedia's LMS is a Java application, we connected to the Java Virtual Machine through the JMX interface and began reading Java metrics directly: not only standard JVM health parameters, such as garbage collection intensity or heap consumption, but also specific measurements tied directly to the application's executable code.

In 2017, about a year after the monitoring system was introduced, it became clear that the colossal volume of data collected in Zabbix could not be worked with properly without visualization in the form of composite screens. The best solution again turned out to be free software: Grafana, a convenient dashboard for metrics that brings all the data together on one screen.

The Grafana interface is interactive and reminiscent of an OLAP system. It displays the data collected by Zabbix on a single screen as graphs and charts that are convenient to analyze. Administrators can easily configure the slices they need.

Monitoring and preventive elimination of errors in the LMS system

The free ELK platform helps to filter and analyze the information gathered during monitoring. This open source product consists of three powerful tools for collecting, storing and analyzing data: Elasticsearch, Logstash and Kibana. With this subsystem we can see, in real time, how many errors have occurred in the system, on which servers, and whether they recur.
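
In the setup described here, parsing and aggregation are done by Logstash and Elasticsearch. The sketch below only illustrates the kind of summary involved (errors per server, and which errors repeat) in plain Python. The log line layout `timestamp server level message`, the server names and the messages are all hypothetical.

```python
from collections import Counter


def summarize_errors(lines):
    """Return (errors per server, messages that occurred more than once)."""
    per_server = Counter()
    per_message = Counter()
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # not in the assumed "timestamp server level message" form
        _timestamp, server, level, message = parts
        if level != "ERROR":
            continue
        per_server[server] += 1
        per_message[message] += 1
    repeated = {msg: n for msg, n in per_message.items() if n > 1}
    return per_server, repeated


# Hypothetical log excerpt.
SAMPLE_LOG = [
    "2019-03-01T10:00:01 app01 ERROR Index lookup timed out",
    "2019-03-01T10:00:02 app02 INFO Document saved",
    "2019-03-01T10:00:05 app01 ERROR Index lookup timed out",
    "2019-03-01T10:00:09 app03 ERROR Route delivery failed",
]

if __name__ == "__main__":
    per_server, repeated = summarize_errors(SAMPLE_LOG)
    print("Errors per server:", dict(per_server))
    print("Repeated errors:", repeated)
```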

Now the administrator can detect the problem at an early stage, even before the user encounters it. Such proactive monitoring helps to prevent system malfunctions by eliminating errors in a timely manner. In addition, we can understand how the behavior of the system has changed after the update, as well as detect new problems if they occur.

Monitoring of business operations

In addition to the basic functions of monitoring resource consumption, the system has the ability to analyze and control business operations.

Monitoring the overall time to complete business operations allows you to identify new factors and understand how they affect the operation of the system.

Monitoring the execution time of requests for each business service makes it possible to detect operations that deviate from the norm.

The screenshot above is an example of monitoring a background task in terms of its deviation from the norm.
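
How "the norm" is defined is not spelled out in this article. One common approach, shown here purely as an assumption, is to compare the latest execution time with the mean of recent runs and flag it when it lies more than a few standard deviations away. The function name, the three-sigma limit and the sample durations are hypothetical.

```python
from statistics import mean, stdev


def deviates_from_norm(history_s, latest_s, max_sigmas=3.0):
    """Return True if latest_s is unusually far from the recent average."""
    if len(history_s) < 2:
        return False  # not enough data to estimate a norm
    baseline = mean(history_s)
    spread = stdev(history_s)
    if spread == 0:
        return latest_s != baseline
    return abs(latest_s - baseline) > max_sigmas * spread


if __name__ == "__main__":
    recent_runs = [58, 61, 60, 62, 59]  # seconds, made-up values
    for latest in (61, 95):
        flag = deviates_from_norm(recent_runs, latest)
        print(f"run of {latest}s deviates from the norm: {flag}")
```

A fixed limit from the non-functional requirements could replace the statistical baseline where such a limit has been approved.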

The list of monitored tasks in terms of their activity on a particular server allows you to identify errors - including duplication of task execution - across all servers.
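
Duplicate execution can be detected from a snapshot of which tasks are active on which servers. The sketch below assumes that each background task is meant to run on only one server at a time; that rule, the function name and the sample data are illustrative assumptions.

```python
from collections import defaultdict


def find_duplicate_tasks(active_tasks):
    """Return {task: [servers]} for tasks active on more than one server.

    active_tasks is an iterable of (server, task_name) pairs.
    """
    servers_by_task = defaultdict(set)
    for server, task in active_tasks:
        servers_by_task[task].add(server)
    return {
        task: sorted(servers)
        for task, servers in servers_by_task.items()
        if len(servers) > 1
    }


if __name__ == "__main__":
    snapshot = [  # hypothetical snapshot of active tasks
        ("app01", "reindex"),
        ("app02", "reindex"),
        ("app03", "archive"),
    ]
    print(find_duplicate_tasks(snapshot))
```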

The system also tracks trends in the execution time of background procedures.

The system grows, develops and helps to cope with problems

With the introduction of the described system, monitoring the operation of the LMS servers became much simpler. Nevertheless, various conflicts still arose from time to time, slowing the workflow and causing complaints from users. We realized that we had to monitor the behavior of the application itself, not just the servers.

To solve this problem, we connected the balancer that works with the cluster of application servers to the monitoring system via its API. Thanks to this, the administrator can see how long the server takes to respond to each user request.

Once server response times became available for analysis, we could link LMS slowdowns to processes running on the server. One interesting case came to light: a server was responding slowly even though it was not under load at the time. Investigating, we found anomalies in the behavior of the Java garbage collector, and it turned out that its incorrect operation was the cause. By bringing the garbage collector under control, we eliminated the problem completely.
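
One way to spot this kind of situation is to look at the share of wall-clock time the JVM spends in garbage collection, computed from two samples of the cumulative collection time that the JVM exposes over JMX. The sketch below is an illustration under assumptions, not the check we actually configured: the function names, thresholds and sample numbers are hypothetical.

```python
def gc_overhead(prev, curr):
    """Fraction of wall-clock time spent in GC between two samples.

    Each sample is (wall_clock_ms, cumulative_gc_ms).
    """
    elapsed_ms = curr[0] - prev[0]
    if elapsed_ms <= 0:
        raise ValueError("samples must be in chronological order")
    return (curr[1] - prev[1]) / elapsed_ms


def looks_like_gc_problem(cpu_load_percent, overhead,
                          max_cpu_percent=30.0, max_overhead=0.2):
    """Flag a server that is lightly loaded yet spends a lot of time in GC."""
    return cpu_load_percent < max_cpu_percent and overhead > max_overhead


if __name__ == "__main__":
    earlier = (0, 1_000)       # made-up sample: 1 s of GC so far
    later = (60_000, 19_000)   # one minute later: 19 s of GC in total
    share = gc_overhead(earlier, later)
    print(f"GC share of wall-clock time: {share:.0%}")
    print("Suspicious:", looks_like_gc_problem(cpu_load_percent=12.0, overhead=share))
```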

This is how free software helps a document management system in the banking sector develop and grow. We have touched only on the main aspects of monitoring the VTB LMS. If you are interested in the details, ask in the comments and we will be happy to share our experience.

Source: habr.com
