Monitoring + stress testing = forecasting and no failures

VTB's IT department has several times had to deal with emergencies in which the load on its systems increased many times over. That made it necessary to develop and test a model that predicts peak load on critical systems. To do this, the bank's IT specialists set up monitoring, analyzed the data, and learned to automate forecasts. In this short article we describe which tools helped predict the load and whether they also helped optimize operations.







Problems with highly loaded services arise in almost every sector, but for the financial sector they are critical. At hour X every production system must be ready, so we needed to know in advance what could happen, down to the day when the load would spike and which systems it would hit. Failures have to be both handled and prevented, so the need for a predictive analytics system was never in question. The systems had to be upgraded on the basis of monitoring data.



Back-of-the-envelope analytics



The payroll project is one of the most sensitive to failures. It is also the easiest to forecast, so we decided to start with it. Because the systems are tightly coupled, other subsystems, including remote banking services (RBS), could run into problems at peak times. For example, customers who were pleased to get an SMS saying their money had arrived immediately started using it. The load could jump by more than an order of magnitude.



The first forecast model was built by hand. We took a data export for the previous year and worked out on which days the highest peaks were to be expected: for example, the 1st, 15th and 25th, as well as the last days of the month. This model took a lot of effort and did not give an accurate forecast. Nevertheless, it revealed bottlenecks where hardware had to be added, and it let us optimize the money transfer process by agreeing with anchor clients not to pay out salaries all at once: transactions from different regions were spread out over time. Now we process them in batches that the bank's IT infrastructure can digest without failures.
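A manual calculation of this kind can be sketched roughly as follows. This is only an illustration: the export format (a mapping from date to transaction count), the function name peak_days_of_month and the sample numbers are assumptions, not the bank's actual data or code.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def peak_days_of_month(daily_load, top_n=3):
    """Return the top_n days of the month ranked by average load.

    daily_load: mapping of datetime.date -> transaction count
    (a hypothetical export format, assumed for this sketch).
    """
    by_day = defaultdict(list)
    for day, count in daily_load.items():
        by_day[day.day].append(count)
    averages = {d: mean(counts) for d, counts in by_day.items()}
    return sorted(averages, key=averages.get, reverse=True)[:top_n]


# Synthetic data for illustration only: a flat baseline
# with spikes on the 1st, 15th and 25th of every month.
sample = {}
for month in range(1, 13):
    for day_of_month in range(1, 29):
        spike = {1: 8000, 15: 12000, 25: 10000}.get(day_of_month, 0)
        sample[date(2018, month, day_of_month)] = 1000 + spike

print(peak_days_of_month(sample))
```

Averaging by day of the month is the simplest possible grouping; it ignores weekends, holidays and month length, which is one reason a hand-built model like this cannot be very accurate.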



Having obtained the first positive result, we moved on to automating the forecasts. About a dozen more critical areas were waiting their turn.



A comprehensive approach



VTB had deployed a Micro Focus monitoring system. From it we took the data collection for forecasting, the storage system and the reporting system. In effect, monitoring was already in place; all that remained was to add metrics and a prediction module and to create new reports. The solution is supported by an external contractor, Technoserv, so its specialists carried out most of the work on the project, but we built the model ourselves. The forecasting system is based on Prophet, an open source product developed at Facebook. It is easy to use and integrates easily with our comprehensive monitoring tools and with Vertica. Roughly speaking, the system analyzes the load curve and extrapolates it using a Fourier series. Per-day coefficients taken from our manual model can also be added. Metrics are collected without human intervention, the forecast is recalculated automatically once a week, and new reports are sent to the recipients.
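The idea of extrapolating a periodic load curve with a Fourier series can be illustrated with a small standalone sketch. This is not Prophet and not the bank's model: Prophet also fits a trend and holiday effects, while the code below only fits a fixed number of harmonics for one known period. The function names, the weekly period and the synthetic load are assumptions made for the example.

```python
import math


def fit_fourier(values, period, order):
    """Estimate the mean and Fourier coefficients of a periodic series.

    Assumes len(values) is a whole number of periods and order < period / 2,
    so that the simple projection formulas below apply.
    """
    n = len(values)
    mean = sum(values) / n
    coeffs = []
    for k in range(1, order + 1):
        w = 2 * math.pi * k / period
        a = 2 / n * sum(v * math.cos(w * t) for t, v in enumerate(values))
        b = 2 / n * sum(v * math.sin(w * t) for t, v in enumerate(values))
        coeffs.append((a, b))
    return mean, coeffs


def predict(model, period, t):
    """Evaluate the fitted series at time step t (may lie beyond the history)."""
    mean, coeffs = model
    total = mean
    for k, (a, b) in enumerate(coeffs, start=1):
        w = 2 * math.pi * k / period
        total += a * math.cos(w * t) + b * math.sin(w * t)
    return total


def synthetic_load(t):
    """Made-up daily load with a weekly cycle, for illustration only."""
    return (1000
            + 300 * math.cos(2 * math.pi * t / 7)
            + 100 * math.sin(4 * math.pi * t / 7))


history = [synthetic_load(t) for t in range(56)]  # 8 full weeks of history
model = fit_fourier(history, period=7, order=2)
next_week = [predict(model, 7, t) for t in range(56, 63)]
print([round(x) for x in next_week])
```

Because the synthetic load is itself a sum of two weekly harmonics, the sketch reproduces it almost exactly; real load data would also need a trend term and several overlapping periods.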



This approach reveals the main cycles: annual, quarterly, monthly and weekly. Salary and advance payments, vacation periods, holidays and sales all affect the number of requests to the systems. It turned out, for example, that some cycles overlap, and that the Central Federal District generates most of the load on the systems (75%). Legal entities and individuals behave differently. The load from individuals is spread fairly evenly across the days of the week (many small transactions), whereas for companies 99.9% of the load falls within working hours; moreover, their transactions may be short or may take several minutes or even hours to process.







Long-term trends are determined from the data obtained. The new system showed that customers are moving to remote banking en masse. Everyone knows this, but we did not expect such a scale and at first did not believe the figures: the number of visits to the bank's offices is falling extremely fast, and the number of remote transactions is growing by the same amount. Accordingly, the load on the systems is growing and will keep growing. We now forecast the load up to February 2020. Normal days can be predicted with an error of 3%, and peak days with an error of 10%. This is a good result.
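The article does not say which error measure stands behind these percentages. One common choice is the mean absolute percentage error, computed separately for normal and peak days; the sketch below assumes that measure, and all numbers in it are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes both series have the same length and no actual value is zero.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100 * total / len(actual)


# Made-up numbers for illustration only.
normal_actual, normal_forecast = [1000, 1100, 900], [1030, 1067, 927]
peak_actual, peak_forecast = [10000, 12000], [9000, 13200]

print(round(mape(normal_actual, normal_forecast), 1))
print(round(mape(peak_actual, peak_forecast), 1))
```

A percentage error is undefined when the actual value is zero, so days with almost no traffic would need separate handling under this measure.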



Pitfalls



As usual, there were some difficulties. Extrapolation with Fourier series handles values near zero poorly: we know that legal entities generate few transactions on weekends, but the prediction module produces values that are far from zero. We could have forced them to the right values, but hacks are not our method. We also had to find a painless way to collect data from the source systems. Regular collection of information requires serious computing resources, so we built fast caches using replication and now get the business data from replicas. Putting no additional load on the master systems is a hard requirement in such cases.



New challenges



The head-on task of predicting peaks has been solved: the bank has had no overload-related failures since May of this year, and the new forecasting system played an important part in that. This turned out not to be enough, though, and now the bank wants to understand how dangerous the peaks are for it. For that we need forecasts that use metrics from load testing. This already works for about 30% of critical systems; for the rest, forecasts are still being set up. At the next stage we are going to predict the load not in business transactions but in terms of IT infrastructure, that is, we will go one layer down. In addition, we need to fully automate the collection of metrics and the forecasts built on them, so that we no longer have to deal with manual exports. There is nothing extraordinary in this: we are simply crossing monitoring with stress testing, in line with global best practice.
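One simple way to combine the two sources is to compare forecast peaks with the capacity measured in a load test. The sketch below is an assumption about how such a check might look, not a description of the bank's system: the function name peaks_at_risk, the 20% safety margin and all the numbers are hypothetical.

```python
def peaks_at_risk(forecast, capacity, safety_margin=0.2):
    """Return (day, load, utilization) for days above the safe threshold.

    forecast: mapping of day label -> forecast peak load
    capacity: maximum sustained load measured in a load test (same units)
    safety_margin: fraction of capacity kept in reserve (hypothetical default)
    """
    threshold = capacity * (1 - safety_margin)
    return [
        (day, load, load / capacity)
        for day, load in forecast.items()
        if load > threshold
    ]


# Made-up numbers for illustration only.
forecast = {"day 1": 500, "day 15": 850, "day 25": 1200}
for day, load, utilization in peaks_at_risk(forecast, capacity=1000):
    print(f"{day}: {load} ({utilization:.0%} of tested capacity)")
```

A check like this turns a forecast from "when will the peak come" into "which peaks are actually dangerous", which is the question the bank wants answered.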


