Monitoring + load testing = predictability and no failures

VTB's IT department has dealt with several emergencies in which the load on its systems grew many times over. It therefore became necessary to build and test a model that would predict peak load on the critical systems. To do this, the bank's IT specialists set up monitoring, analyzed the data, and automated the forecasts. In this short article we describe which tools helped us predict the load and whether they allowed us to optimize operations.


Problems with highly loaded services arise in almost every industry, but in the financial sector they are critical. When the crunch comes, every system has to be ready, so we needed to know in advance what might happen: on which day the load would rise and which systems it would hit. Failures have to be prevented, not just handled, so the need for a predictive analytics system was never in question. The systems had to be upgraded based on monitoring data.

Back-of-the-envelope analytics

The payroll project is one of the most sensitive to failures and the easiest to forecast, so we decided to start with it. Because of the tight coupling at peak times, other subsystems could suffer as well, including remote banking: customers, pleased by the SMS about incoming money, immediately started using it actively. At such moments the load could jump by more than an order of magnitude.

The first prediction model was built by hand. We took the load data for the previous year and worked out on which days the maximum peaks were to be expected: the 1st, the 15th and the 25th, for example, as well as the last days of the month. The model took a lot of manual effort and did not give an accurate forecast. Nevertheless, it identified the bottlenecks where hardware had to be added, and it allowed us to optimize the money-transfer process by agreeing schedules with anchor clients: to avoid paying out salaries in one gulp, transactions from different regions were spread out in time. Now we process them in portions that the bank's IT infrastructure can digest without failures.
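For illustration, here is a minimal sketch of what that manual model amounts to, assuming a year of daily load counts in a CSV file; the file and column names are hypothetical.

```python
import pandas as pd

# A year of daily load data: columns "date" and "transactions" (hypothetical names).
df = pd.read_csv("daily_load.csv", parse_dates=["date"])

def is_expected_peak(d: pd.Timestamp) -> bool:
    # Payroll peaks: the 1st, 15th and 25th, plus the last two days of the month.
    return d.day in (1, 15, 25) or d.day >= d.days_in_month - 1

df["expected_peak"] = df["date"].apply(is_expected_peak)

# Compare the load on expected peak days with ordinary days to find bottlenecks.
print(df.groupby("expected_peak")["transactions"].describe())
```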

Having received the first positive result, we moved on to automating forecasting. A dozen more critical areas were waiting for their turn.

A comprehensive approach

VTB runs a monitoring system from MicroFocus, and we reused it for forecasting: its data collection, its storage and its reporting. Monitoring was already in place; all that remained was to add metrics, a prediction module and new reports. The solution is supported by an external contractor, Technoserv, so most of the implementation work fell on its specialists, but we built the model ourselves. The forecasting system is based on Prophet, an open-source library developed by Facebook. It is easy to use and integrates readily with our monitoring tools and Vertica. Roughly speaking, the system analyzes the load curve and extrapolates it using Fourier series. It is also possible to add coefficients for the specific days taken from our manual model. Metrics are collected without human intervention, the forecast is recalculated automatically once a week, and the new reports are sent out to the recipients.
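The article does not show the pipeline itself, but a minimal Prophet sketch of the approach described above might look as follows; the input file, the forecast horizon and the payday dates are illustrative, and the per-day coefficients are passed in through Prophet's holidays mechanism.

```python
import pandas as pd
from prophet import Prophet  # open-source library, formerly published as fbprophet

# Daily business-transaction counts exported from the monitoring storage,
# with the two columns Prophet expects: ds (date) and y (value).
history = pd.read_csv("load_history.csv", parse_dates=["ds"])

# Peak days from the manual model, attached as extra per-date coefficients.
paydays = pd.DataFrame({
    "holiday": "payday",
    "ds": pd.to_datetime(["2019-10-01", "2019-10-15", "2019-10-25"]),
    "lower_window": 0,
    "upper_window": 1,
})

model = Prophet(holidays=paydays, yearly_seasonality=True, weekly_seasonality=True)
model.fit(history)

# Recalculated once a week: extend the forecast and pass it to the reporting system.
future = model.make_future_dataframe(periods=120, freq="D")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```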

This approach reveals the main cycles: annual, monthly, quarterly and weekly. Salaries and advance payments, vacation periods, holidays and sales all affect the number of calls to the systems. It turned out, for example, that some cycles overlap, and that the bulk of the load (75%) comes from the Central Federal District. Legal entities and individuals behave differently: while the load from retail customers is spread fairly evenly across the days of the week (lots of small transactions), corporate traffic falls 99.9% within working hours, and their transactions can be short or can take several minutes or even hours to process.
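Prophet fits yearly and weekly cycles out of the box; the monthly and quarterly cycles mentioned above can be added as extra Fourier terms, as in this sketch (the periods and orders are the usual illustrative values, not the bank's actual settings).

```python
from prophet import Prophet

model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
model.add_seasonality(name="monthly", period=30.5, fourier_order=5)
model.add_seasonality(name="quarterly", period=91.25, fourier_order=3)
# After model.fit(history) and model.predict(future), model.plot_components(forecast)
# shows how much each cycle contributes to the overall load curve.
```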


Long-term trends are derived from the same data. The new system showed that people are switching en masse to remote banking. Everyone knows this, but we did not expect such a scale and at first did not believe the numbers: the number of visits to the bank's branches is falling extremely fast, while the number of remote transactions is growing by exactly the same amount. Accordingly, the load on the systems is growing and will keep growing. We are currently forecasting the load through February 2020. Ordinary days can be predicted with an error of about 3%, peak days with an error of about 10%. That is a good result.
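The 3% and 10% figures can be checked with a simple mean absolute percentage error computed separately for ordinary and peak days; the file and column names below are hypothetical.

```python
import pandas as pd

def mape(actual: pd.Series, predicted: pd.Series) -> float:
    """Mean absolute percentage error, in percent."""
    return (abs(actual - predicted) / actual).mean() * 100

# Columns: actual, predicted, expected_peak (bool) — hypothetical names.
df = pd.read_csv("forecast_vs_actual.csv")
peaks, normal = df[df["expected_peak"]], df[~df["expected_peak"]]
print(f"normal days: {mape(normal['actual'], normal['predicted']):.1f}%")
print(f"peak days:   {mape(peaks['actual'], peaks['predicted']):.1f}%")
```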

Pitfalls

As usual, it did not go without difficulties. The Fourier-series extrapolation does not handle values near zero well: we know that legal entities generate few transactions at the weekend, yet the predictor produces values that are far from zero. We could have corrected them by force, but hacks are not our style. We also had to solve the problem of extracting data from the source systems painlessly. Regular collection of this information requires serious computing resources, so we built fast caches using replication and now read the business data from replicas. Putting no additional load on the master systems is a hard requirement in such cases.
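The article does not say which fix was finally adopted; two common ways of keeping a Prophet forecast from straying far from zero, sketched below, are logistic growth with an explicit floor and fitting the model on a log scale. File and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Corporate (legal-entity) transaction counts per day: columns ds, y.
history = pd.read_csv("corporate_load.csv", parse_dates=["ds"])

# Option 1: logistic growth bounded between a floor of 0 and a saturating cap.
bounded = history.copy()
bounded["floor"] = 0
bounded["cap"] = bounded["y"].max() * 1.5
m1 = Prophet(growth="logistic")
m1.fit(bounded)

# Option 2: model log1p(y) and invert afterwards, so quiet weekends stay near zero.
logged = history.copy()
logged["y"] = np.log1p(logged["y"])
m2 = Prophet()
m2.fit(logged)
future = m2.make_future_dataframe(periods=60)
forecast = m2.predict(future)
forecast["yhat_transactions"] = np.expm1(forecast["yhat"])
```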

New Challenges

The immediate task of forecasting peaks has been solved: there have been no overload-related failures at the bank since May of this year, and the new forecasting system played an important part in that. Still, that is not enough, and now the bank wants to understand exactly how dangerous the peaks are for it. We need forecasts that use metrics from load testing; for about 30% of the critical systems this already works, and the rest are being brought on board. At the next stage we are going to forecast the load not in business transactions but in terms of the IT infrastructure, that is, go one layer down. We also want to fully automate the collection of metrics and the construction of forecasts from them, so that no one has to deal with manual data exports. There is nothing exceptional in this: we are simply combining monitoring and load testing in line with industry best practice.
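A hedged sketch of that next step: a business-transaction forecast can be translated into infrastructure terms using the per-transaction resource cost measured during load testing. All numbers and names here are illustrative assumptions, not the bank's actual figures.

```python
# Forecast for the peak day, in transactions per second (illustrative value).
forecast_tps = 450.0

# Per-transaction resource cost measured in a load-testing run (illustrative values).
cpu_cores_per_tps = 0.08   # CPU cores consumed per transaction/second
iops_per_tps = 12.0        # storage IOPS generated per transaction/second

required_cores = forecast_tps * cpu_cores_per_tps
required_iops = forecast_tps * iops_per_tps
print(f"Peak-day estimate: {required_cores:.0f} CPU cores, {required_iops:.0f} IOPS")
```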

Source: habr.com
