HighLoad++, Evgeny Kuzovlev (EcommPay IT): what to do when a minute of downtime costs $100,000

Everyone talks about development and testing processes, staff training, and raising motivation, but those processes are not enough when a minute of service downtime costs astronomical amounts of money. What do you do when you conduct financial transactions under a strict SLA? How do you increase the reliability and fault tolerance of your systems, leaving development and testing aside?


This talk was given at HighLoad++ Moscow 2018 (November 9, 18:00, Delhi + Kolkata hall). Abstracts and presentation.

Evgeny Kuzovlev (hereinafter EC): – Friends, hello! My name is Evgeny Kuzovlev. I am from EcommPay, or more precisely from EcommPay IT, the IT division of the group of companies. Today we will talk about downtime: how to avoid it, and how to minimize its consequences when it cannot be avoided. The topic is stated as "What to do when a minute of downtime costs $100,000?" Looking ahead: our numbers really are comparable.

What does EcommPay IT do?

Who are we? Why am I standing here in front of you? Why do I have the right to tell you something here? And what will we talk about here in more detail?


The EcommPay group of companies is an international acquirer. We process payments all over the world: in Russia, Europe, Southeast Asia. We have 9 offices and 500 employees in total, and a little less than half of them are IT specialists. Everything we do, everything we earn money on, we built ourselves.

We wrote all of our products ourselves (and we have quite a lot of them: the line of large IT products includes about 16 different components); we write them ourselves, we develop them ourselves. At the moment we do about a million transactions a day (millions, it would probably be more correct to say). We are a fairly young company: we are only about six years old.

Six years ago it was a startup, when the guys came along with the business. They were united by an idea (there was nothing else but the idea), and we ran. Like any startup, we ran as fast as we could: for us, speed was more important than quality.

At some point we stopped: we realized that we could no longer live with that speed and that quality, and that we needed to put quality first. At that moment we decided to write a new platform: correct, scalable, reliable. We began writing this platform (investing, building up development and testing), but at some point we realized that development and testing alone would not get us to a new level of service quality.

You make a new product, you put it into production, but still something goes wrong somewhere. Today we will talk about how to reach a new level of quality (how we did it, our experience), leaving development and testing aside; we will talk about what is available to operations: what operations can do by itself, and what it can offer testing in order to influence quality.

Downtime. The commandments of operations

The cornerstone of everything we will talk about today is downtime. A terrible word: if we have downtime, everything is bad for us. We rush to bring things back up, the admins hold the server upright so that, God forbid, it does not fall, as the song goes. That is what we will talk about today.


When we began to change our approaches, we formulated four commandments. They are quite simple:

  • Identify the problem quickly.
  • Get rid of it even faster.
  • Help understand the cause (later, for the developers).
  • Standardize approaches.

Let me draw your attention to point number 2. We get rid of the problem rather than solve it. Solving it is secondary. What is primary for us is that the user is shielded from the problem. The problem will continue to exist in some isolated environment, but that environment will not touch the user in any way. We will go through these four groups of problems (some in more detail, some less), and I will tell you what we use and what experience we have with solutions.

Problems: when do they happen and what do we do about them?

But we will start out of order, with point number 2: how do you get rid of a problem quickly? There is a problem, and we need to fix it. "What do we do with it?" is the main question. When we started thinking about how to fix problems, we developed requirements for ourselves that our troubleshooting should satisfy.


To formulate these requirements, we decided to ask ourselves: "When do we have problems?" As it turned out, problems occur in four cases:


  • Hardware failure.
  • Failure of external services.
  • Change of software version (the same deployment).
  • Explosive load growth.

We will not talk much about the first two. A hardware failure is solved quite simply: you must have everything duplicated. If it is disks, the disks must be assembled into a RAID; if it is a server, the server must be duplicated; if it is network infrastructure, you must deploy a second copy of it. In short, you duplicate, and if something fails, you switch to the reserve capacity. It is hard to say more here.

The second is the failure of external services. For most systems this is not a problem at all, but not for us. Since we process payments, we are an aggregator standing between the user (who enters his card data) and the banks and payment systems (Visa, MasterCard, Mir). Our external services (payment systems, banks) tend to fail. Neither we nor you (if you have such services) can influence that.

What to do then? There are two options. First, if you can, duplicate the service in some way. For example, if we can, we transfer traffic from one service to another: say, we process cards through Sberbank; Sberbank has problems, and we transfer traffic [conditionally] to Raiffeisen. The second thing we can do is notice the failure of external services very quickly; we will talk about response speed in the next part of the talk.

In fact, of these four, the one we can specifically influence is the change of software versions: we can take actions that improve the situation around deployments and around explosive load growth. And that is exactly what we did. Here, again, a small remark...

Of these four problems, several are solved straight away if you have a cloud. If you are in the Microsoft Azure or Amazon clouds, or use clouds from Yandex or Mail.ru, then at least hardware failure becomes their problem, and everything immediately becomes fine as far as hardware failures are concerned.

We are a slightly non-standard company. Everyone here talks about Kubernetes and clouds; we have neither Kubernetes nor clouds. Instead, we have racks of hardware in many data centers, and we have to live on that hardware and be responsible for all of it. So that is the context we will be talking in. Now, about the problems: the first two have been bracketed off.

Changing software versions: the foundations

Our developers do not have access to production. Why? Because we are PCI DSS certified, and our developers simply do not have the right to touch "prod". Full stop. At all. Therefore, the responsibility of development ends exactly at the moment when development hands over a build for release.


Our second foundation, which also helps us a lot, is the absence of unique undocumented knowledge. I hope you do the same, because if not, you are in trouble. Problems arise when this unique, undocumented knowledge is not present at the right time in the right place. Say you have one person who knows how to deploy a specific component; that person is not around, on vacation or ill, and that's it, you have problems.

And the third foundation we arrived at. We came to it through pain, blood and tears: we concluded that any of our builds contains errors, even if it has none. We decided this for ourselves: when we deploy something, when we roll something into production, we have a build with errors. From this we formed the requirements that our system must satisfy.

Software upgrade requirements

There are three such requirements:


  • We must be able to roll back a deployment quickly.
  • We must minimize the impact of an unsuccessful deployment.
  • And we must be able to deploy quickly and in parallel.

Exactly in that order! Why? Because, first of all, when deploying a new version, speed is not what matters; what matters, if something goes wrong, is rolling back quickly and keeping the impact minimal. But if you have a set of versions in production that turn out to contain an error (out of the blue: there was no deployment, but the error is there), then the speed of the subsequent deployment matters. What have we done to meet these requirements? We resorted to the following methodology:


It is quite well known; we did not invent it: it is Blue/Green deployment. What is it? For each group of servers on which your applications run, you must have a copy. The copy is "warm": there is no traffic on it, but at any moment traffic can be sent to it. This copy runs the previous version. At deployment time you roll out the code to the inactive copy, then switch part of the traffic (or all of it) to the new version. Thus, to redirect the traffic flow from the old version to the new one, you need only one action: change the upstream in the balancer, redirecting it from one upstream to the other. This is very convenient, and it solves the problem of fast switching and fast rollback.

This also solves the second problem, minimization: you can send only part of your traffic to the new line, the line with the new code (say, 2%). And those 2% are not 100%! If you lose 100% of your traffic due to an unsuccessful deployment, that is scary; if you lose 2%, it is unpleasant, but not scary. Moreover, users will most likely not even notice, because in some cases (not all) the same user, by pressing F5, will land on the other, working version.
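A minimal sketch of what this can look like in plain nginx, using the stock split_clients directive; the addresses and the 2% share are illustrative, not EcommPay's actual config:

```nginx
# Two identical lines; the inactive one receives the new build.
upstream blue  { server 10.0.0.10:8080; }
upstream green { server 10.0.0.20:8080; }

# Send 2% of clients to the new line; rolling back means setting
# this back to 0% (or swapping the names): one change in one place.
split_clients "${remote_addr}" $line {
    2%  green;
    *   blue;
}

server {
    listen 80;
    location / {
        proxy_pass http://$line;
    }
}
```

This is the naive, client-address-based kind of split; the routing section below explains why it is not always enough.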

Blue/Green deploy. Routing

At the same time, Blue/Green deploy is not all that simple... All our components can be divided into three groups:

  • the front end (the payment pages our customers see);
  • the processing core;
  • adapters for working with payment systems (banks, MasterCard, Visa...).

And here there is a nuance: the routing between the lines. If you simply switch 100% of the traffic, you do not have these problems. But if you want to switch 2%, you start asking: "How do I do that?" The simplest, head-on approach: you can configure a random, Round Robin-style split in nginx, and you get 2% to the left, 98% to the right. But that does not always work.

In our case, for example, a user interacts with the system through more than one request. That is normal: 2, 3, 4, 5 requests; your systems may be the same. And if it is important to you that all of a user's requests land on the same line as their first request, or (a second case) that all of a user's requests move to the new line after the switch (the user could have started working with the system earlier, before the switch), then this random distribution does not suit you. Then you have the following options:


The first option, the simplest, is based on basic client parameters (IP Hash). You have an IP, and you split by IP address, some to the right, some to the left. Then the second case I described will work for you: after a deployment, a user who had already started working with your system will have all requests go to the new line (to the same line, say).

If for some reason that does not suit you and you need requests to go to the line that handled the user's initial request, then you have further options...
The first of them: you can take the paid nginx Plus. It has a Sticky sessions mechanism which, on the user's initial request, assigns the user a session and binds it to a particular upstream. All subsequent user requests within the session lifetime will go to the same upstream where the session was set.
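For reference, this commercial mechanism comes down to one directive in the upstream block; a sketch, not our configuration:

```nginx
upstream payment_line {
    server 10.0.0.10:8080;
    server 10.0.0.20:8080;
    # nginx Plus only: pin each client to the upstream that served its
    # first request, for as long as the srv_id cookie is valid
    sticky cookie srv_id expires=1h path=/;
}
```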

That did not suit us, because we already had regular nginx. Switching to nginx Plus is not that expensive; it was just somewhat painful for us and not quite right. Sticky sessions, for example, did not work for us for the simple reason that sticky sessions do not allow routing on an "either-or" basis. There you can specify what to base sticky sessions on, for example an IP address, or an IP address and a cookie, or a POST parameter, but "either-or" is more difficult there.

Therefore, we came to the fourth option. We took nginx on steroids (that is OpenResty): the same nginx, which additionally supports embedding Lua scripts. You can write a Lua script, feed it to OpenResty, and that Lua script will be executed when a user request comes in.

And we wrote, in fact, exactly such a script, set up OpenResty, and in this script we go through 6 different parameters, concatenated with "or". Depending on the presence of one parameter or another, we know that the user came to one page or another, to one line or the other.
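A sketch of that idea in OpenResty; the parameter names and the 2% threshold are invented, and a plain hash stands in here for the real six-parameter routing logic:

```nginx
upstream blue  { server 10.0.0.10:8080; }
upstream green { server 10.0.0.20:8080; }

server {
    listen 80;
    location / {
        set $target "blue";

        rewrite_by_lua_block {
            -- "either-or" chain: take the first routing key present
            local args = ngx.req.get_uri_args()
            local key = args.transaction_id
                     or args.session_id
                     or ngx.var.cookie_route
                     or ngx.var.remote_addr

            -- a stable hash of the key decides which line the request
            -- joins, so repeat requests with the same key stay together
            if key and (ngx.crc32_long(key) % 100) < 2 then
                ngx.var.target = "green"
            end
        }

        proxy_pass http://$target;
    }
}
```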

Blue/Green deploy. Advantages and disadvantages

Of course, it probably could have been done a little simpler (with the same sticky sessions), but we have one more nuance: it is not only the user who interacts with us within the processing of a single transaction... The payment systems also interact with us: after we process the transaction (by sending a request to the payment system), we get a callback.
And even supposing that inside our own circuit we could pass the user's IP address in all requests and split users based on it, we still cannot say to the same Visa: "Dudes, we are such a retro company, we are kind of international (on the site and in Russia)... Please send us the user's IP address in an additional field, your protocol is standardized!" Of course, they will not agree.


So that did not suit us, and we went with OpenResty. With routing, then, this is what we ended up with:

Blue/Green deploy thus has the advantages I mentioned, as well as disadvantages.

There are two drawbacks:

  • you have to bother with routing;
  • the second main drawback is the cost.

You need twice as many servers, twice the operational resources, and twice the effort to maintain this whole zoo.

By the way, among the advantages there is one more thing I have not mentioned: you have a reserve in case of load growth. If the load grows explosively and a large number of users descend on you, you simply switch the second line into a 50/50 distribution, and you instantly have twice the servers in your cluster until you solve the problem of getting more servers.

How to make a quick deployment?

We talked about how to solve minimization and quick rollback, but the question remains: "How do you deploy quickly?"


This is short and simple.

  • You must have a CD system (Continuous Delivery); without it, you get nowhere. If you have one server, you can deploy by hand. We have about one and a half thousand servers, and deploying to one and a half thousand servers by hand would, of course, require a department the size of this room just for deployment.
  • The deployment must be parallel. If deployment is sequential, everything is bad: one server is fine, but you will be deploying to one and a half thousand servers all day.
  • And one more thing for speed, though it is probably optional. Deployment usually builds the project. You have a web project, there is a front-end part (you run webpack there, npm, something like that), and this process is, in principle, short: 5 minutes, but those 5 minutes can be critical. Therefore, for example, we do not do it: we removed those 5 minutes, we deploy artifacts.

What is an artifact? An artifact is an assembled build in which the entire build step has already been completed. We store this artifact in an artifact repository. We have used two such repositories: first Nexus, and now JFrog Artifactory. We initially used Nexus because we started practicing this approach with Java applications (it was well suited to that). Then we put some of our applications written in PHP there; Nexus no longer suited us, so we chose JFrog Artifactory, which can treat almost anything as an artifact. We even got to the point where we store our own binary packages, which we build for the servers, in this artifact repository.

Explosive load growth

We have talked about changing software versions. The next thing is explosive load growth. Here, by explosive load growth, I probably do not mean quite the usual thing...

We wrote our new system to be service-oriented, fashionably beautiful: workers everywhere, queues everywhere, asynchrony everywhere. In such systems, data can travel along different flows. The first transaction might involve workers 1, 3 and 10; the second, workers 2, 4 and 5. And, say, in the morning you have a data stream that uses the first three workers, and in the evening it changes dramatically and everything uses the other three.

And here it turns out that you need to somehow scale the workers, somehow scale your services, while not letting resources bloat.


We set out our requirements. They are quite simple: service discovery, parameterization, everything standard for building such scalable systems, except for one point: not wasting resources. We said we are not prepared to overprovision so that servers just heat the air. We took Consul and we took Nomad, which manages our workers.

Why is this a problem for us? Let me step back a little. There are now about 70 payment systems behind us. In the morning, traffic goes through Sberbank; then Sberbank falls over, say, and we switch it to another payment system. We had 100 workers in front of Sberbank, and now we need to quickly spin up 100 workers for another payment system. And it is desirable that all of this happens without human participation. Because if humans are involved, an engineer has to sit there 24/7 doing only this: such failures, when 70 systems stand behind you, occur regularly.

Therefore, we looked at Nomad, which has a public API, and wrote our own scale-Nomad thing, ScaleNo, which does roughly this: it monitors queue growth and increases or decreases the number of workers depending on the queue dynamics. When we did it, we thought: "Maybe we should open-source it?" Then we looked at it: it is as simple as two kopecks.
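ScaleNo itself has not been published, so the following PHP sketch is hypothetical: all names, URLs and thresholds are invented. It assumes RabbitMQ for the queues (the talk only says "queues") and uses Nomad's job-scale HTTP endpoint, which appeared in Nomad 0.11, after this talk was given:

```php
<?php
// Minimal ScaleNo-style control loop: watch a queue, scale a Nomad group.

function httpJson(string $url, ?array $post = null): array
{
    $opts = ['http' => ['header' => "Content-Type: application/json\r\n"]];
    if ($post !== null) {
        $opts['http']['method']  = 'POST';
        $opts['http']['content'] = json_encode($post);
    }
    $raw = file_get_contents($url, false, stream_context_create($opts));
    return $raw === false ? [] : (json_decode($raw, true) ?? []);
}

// queue depth from the RabbitMQ management API (%2F is the default vhost)
function queueDepth(string $queue): int
{
    $stats = httpJson("http://guest:guest@rabbitmq:15672/api/queues/%2F/$queue");
    return $stats['messages'] ?? 0;
}

// scale one task group of a Nomad job to the desired worker count
function scaleWorkers(string $job, string $group, int $count): void
{
    httpJson("http://nomad:4646/v1/job/$job/scale",
             ['Count' => $count, 'Target' => ['Group' => $group]]);
}

$min = 10;   // warm minimum, so a switched-over flow is not stranded
$max = 200;  // hard cap, so a runaway queue cannot eat the whole cluster

while (true) {
    $depth = queueDepth('payments.provider-x');
    // naive policy: one extra worker per 100 queued messages
    $want = max($min, min($max, $min + intdiv($depth, 100)));
    scaleWorkers('payment-workers', 'provider-x', $want);
    sleep(10);
}
```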

So far we have not open-sourced it, but if, after this talk, you realize that you need such a thing, my contacts are on the last slide: please write to me. If there are at least 3-5 people, we will open-source it.


How does it work? Take a look! Looking ahead: on the left side is a piece of our monitoring. This is one line: at the top, event processing time; in the middle, the number of transactions; at the bottom, the number of workers.

If you look, there is a glitch in this picture. On the top chart, one of the lines crashed within 45 seconds: one of the payment systems went down. Traffic was switched over within 2 minutes, and the queue began to grow on another payment system, where there were no workers (we were not wasting resources; on the contrary, we were utilizing them correctly). We did not want to heat the air: there was a minimum number, about 5-10 workers, and they could not cope.

On the bottom chart you can see a "hump", which means exactly that ScaleNo doubled the number of workers. And then, when the chart went down a bit, it reduced the number a bit: the number of workers was changed automatically. That is how this thing works. So much for point number 2, "How to get rid of the problem quickly".

Monitoring. How to quickly identify the problem?

Now the first point: "How do you identify the problem quickly?" Monitoring! We have to understand certain things quickly. Which things?


Three things!

  • We must quickly understand the health of our own resources.
  • We must quickly detect failures and monitor the performance of systems that are external to us.
  • The third point is identifying logical errors: when the system is up and all indicators look fine, but something is going wrong.

Here I probably will not say anything particularly cool; I will be Captain Obvious. We looked at what is on the market. We have a fun zoo. This is the zoo we have now:


We use Zabbix to monitor hardware and the main server indicators. We use Okmeter for databases. All other indicators that did not fit the first two go to Grafana and Prometheus; some of them are Grafana with Prometheus, some are Grafana with InfluxDB and Telegraf.

A year ago we wanted to use New Relic. A cool thing, it can do everything. But as much as it can do everything, it is just as expensive. When we grew to 1,500 servers, a vendor came to us and said: "Let's conclude an agreement for the next year." We looked at the price and said no, we will not do that. Now we are giving up New Relic; we have about 15 servers left under New Relic monitoring. The price turned out to be absolutely outrageous.

And there is one tool we implemented ourselves: Debugger. At first we called it "Bugger", but then our English teacher walked by, laughed wildly, and renamed it "Debugger". What is it? It is a tool which, within 15-30 seconds, runs black-box tests against each component, checking the component's overall operability.

For example, if it is an external page (a payment page), it simply opens it and checks that it looks the way it should. If it is processing, it fires a test "transaction" and checks that the "transaction" gets through. If it is a connection to payment systems, we fire a test request where we can, and check that everything is fine on our side.

What metrics are important to monitor?

What do we mainly monitor? What metrics are important to us?


  • Response time / RPS on the front ends: a very important indicator. It immediately tells you that something is wrong.
  • The number of processed messages in all queues.
  • The number of workers.
  • Basic correctness metrics.

The last point is a "business" metric. If you want to monitor in the same way, you need to define one or two metrics that are the main indicators for you. We have such a metric: throughput (the ratio of the number of successful transactions to the total transaction flow). If it changes dramatically over a 5-10-15 minute interval, then we have problems.
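With the Prometheus part of the stack described above, such a metric fits in a single PromQL ratio; the metric names below are invented:

```promql
# share of successful transactions over the last 10 minutes
sum(rate(transactions_success_total[10m]))
  /
sum(rate(transactions_total[10m]))
```

An alert on a sharp drop in this ratio could then be what pages the on-call engineer.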

This is how it looks for us, using one of our boards as an example:


On the left side there are 6 graphs, by lines: the number of workers and the number of messages in the queues. On the right side: RPS, RTS. Below is the same "business" metric. And on the "business" metric we can immediately see that something went wrong on the two middle charts... this is just another system behind us falling over.

The second thing we had to do was monitor the failures of external payment systems. Here we took OpenTracing: a mechanism, a standard, a paradigm that lets you trace distributed systems; and we changed it a bit. The standard OpenTracing paradigm says that you build a trace for each individual request. We did not need that, so we wrapped it into a summary, aggregated trace. We made a tool that allows us to track the speed of the systems behind us.


The graph shows us that one of the payment systems started responding in 3 seconds: we have problems. Moreover, this thing reacts within 20-30 seconds of the problems starting.

And the third class of error monitoring is logical monitoring.

To be honest, I did not know what to draw on this slide, because we searched the market for a long time for something that would suit us and could not find anything, so we had to do it ourselves.


What do I mean by logical monitoring? Well, imagine: you build yourself a system (for example, a Tinder clone); you made it, launched it. Successful manager Vasya Pupkin installs it on his phone, sees a girl there, likes her... and the like goes not to the girl but to the security guard Mikhalych from the same business center. The manager goes downstairs and then wonders: "Why is the security guard Mikhalych smiling at him so pleasantly?"

In such situations... For us, this situation sounds a little different, because (as I wrote) that is the kind of reputational loss that indirectly leads to financial losses. Our situation is the opposite: we can incur direct financial losses, for example, if we recorded a transaction as successful when it actually failed (or vice versa). I had to write our own tool that tracks the number of successful transactions over a time interval, in dynamics, based on business indicators. We found nothing on the market! That is exactly what I wanted to convey: there is nothing on the market to solve such problems.

That was about how to identify the problem quickly.

How to determine the causes of problems

The third group of tasks we solve: after we have identified the problem and got rid of it, it would be good to understand the cause, for development and for testing, and do something about it. Accordingly, we need to investigate; we need to pull up the logs.


When it comes to logs (the main source is logs), the bulk of our logs is in the ELK stack; almost everyone has that. Some may not have ELK, but if you write logs in gigabytes, sooner or later you will come to ELK. We write them in terabytes.


There is a problem here. We fixed the error for the user, started digging into what had happened, opened Kibana, entered the transaction id, and got a mile-long wall of text (shows a lot of output). And in that wall of text absolutely nothing is clear. Why? Because it is unclear which part belongs to which worker, which part belongs to which component. At that moment we realized that we needed tracing: the same OpenTracing I was talking about.

We thought about it a year ago and turned to the market; there were two tools there: Zipkin and Jaeger. Jaeger is, in fact, the ideological heir, the ideological successor of Zipkin. In Zipkin everything is fine, except that it does not know how to aggregate and does not know how to include logs in a trace, only the time trace. And Jaeger supported those things.

We looked at Jaeger: you can instrument applications, you can write to the API (the OpenTracing API standard for PHP, however, was not yet approved at the time; that was a year ago, and by now it has been approved), but there was no client at all. "Okay," we thought, and wrote our own client. What did we get? It looks roughly like this:


In Jaeger, spans are created for each message. That is, when a user opens the system, they see one or two blocks per incoming request (1-2-3: as many incoming requests from the user as there were, that many blocks). To make things easier for users, we added tags for logs and for the time trace. Accordingly, in case of an error, our application marks the log with the appropriate Error tag. You can filter by the Error tag, and only the spans containing a block with an error will be displayed. Here is what it looks like if we expand a span:


Inside the span there is a set of traces. In this case, these are three test traces, and the third trace tells us that an error occurred. At the same time we see the time trace here: we have a timeline on top, and we can see at which moment this or that log was recorded.

Accordingly, it worked out well for us. We wrote our own extension, and we open-sourced it. If you want to work with tracing, if you want to work with Jaeger in PHP, there is our extension; welcome to use it, as they say:


We have this extension: a client for the OpenTracing API, made as a PHP extension; that is, you will need to build it and install it into the system. A year ago there was nothing else. Now there are other clients that ship as components. It is up to you: either you pull in a component via Composer, or you use the extension.
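To give a feel for the API, here is a sketch against the generic opentracing/opentracing Composer package rather than EcommPay's extension; the span name, tag and log fields are illustrative:

```php
<?php
require 'vendor/autoload.php';

use OpenTracing\GlobalTracer;

function handleTransaction(): void
{
    // placeholder for real worker logic
}

$tracer = GlobalTracer::get();
$scope  = $tracer->startActiveSpan('process_payment');
$span   = $scope->getSpan();

try {
    $span->log(['event' => 'worker.start', 'queue' => 'payments']);
    handleTransaction();
} catch (Throwable $e) {
    // mark the span so it can be filtered by the "error" tag in Jaeger
    $span->setTag('error', true);
    $span->log(['event' => 'error', 'message' => $e->getMessage()]);
} finally {
    $scope->close();
    $tracer->flush();
}
```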

Corporate standards

We have talked about three of the commandments. The fourth commandment is to standardize approaches. What is this about? It is about this:


Why the word "corporate"? Not because we are a big or bureaucratic company, no! I use the word "corporate" here in the sense that each company, each product, must have its own standards, including yours. What standards do we have?


  • We have deployment regulations. We cannot go anywhere without them. We deploy about 60 times a week, that is, almost constantly. At the same time, the regulations, for example, place a taboo on Friday deployments: on principle, we do not deploy.
  • We require documentation. Not a single new component gets into production without documentation, even if it was born under the pen of our R&D people. We require from them deployment instructions, a monitoring map, and an approximate description (well, as well as programmers can write) of how the component works and how to troubleshoot it.
  • We do not solve the cause of the problem; we get rid of the problem, as I have already said. It is important for us to shield the user from problems.
  • We have tolerances. For example, we do not count it as downtime if we lost 2% of traffic for two minutes; in principle, that does not go into our statistics. If it is more, in percentage or in duration, then we count it.
  • And we always write postmortems. Whatever happens to us, any situation when production behaved abnormally will be reflected in a postmortem. A postmortem is a document in which you write down what happened to you, detailed timing, what you did to fix it, and (this is a mandatory section!) what you will do to prevent it from happening in the future. This is essential for later analysis.

What is considered downtime?


What did all of this lead to?

It led to this: we used to have stability problems, which suited neither the clients nor us, and over the last 6 months our stability indicator has been 99.97%. You could say that is not that much; yes, we have something to strive for. About half of that indicator is accounted for by the stability not of us, as it were, but of our web application firewall, which stands in front of us and is consumed as a service; customers simply do not care about that.

We have learned to sleep at night. Finally! Six months ago we could not. And on this note, with the results, I want to make one remark. Last night there was a wonderful talk about the control system of a nuclear reactor. If the people who wrote that system can hear me, please forget what I said about "2% is not downtime". For you, 2% is downtime, even if it lasts two minutes!

That's all! Your questions.


About balancers and database migration

Question from the audience (hereinafter Q): – Good evening. Thank you very much for such an admin-oriented talk! A short question about your balancers. You mentioned that you have a WAF; as I understand it, that means you use some external one as a balancer...

EC: – No, we use our own services as a balancer. In this case, the WAF is exclusively a DDoS protection tool for us.

Q: – Can you say a few words about the balancers?

EC: – As I said, it is a group of servers running OpenResty. We now have 5 redundant groups that do nothing else... that is, servers with only OpenResty installed, which only proxy traffic. Accordingly, to understand how much we hold: our regular traffic flow is now several hundred megabits. They cope; they are fine, they do not even strain.

Q: – Also a simple question. Here is Blue/Green deployment. What do you do, for example, with database migrations?

EC: – Good question! Look: in Blue/Green deployment we have separate queues for each line. That is, if we are talking about event queues passed from worker to worker, there are separate queues for the blue line and for the green line. As for the database itself, we deliberately narrowed it down as much as we could and shifted practically everything into queues; in the database we store only the transaction stack. And our transaction stack is the same for all lines. With the database in this context: we do not split it into blue and green, because both versions of the code must know what is happening with the transaction.

Friends, I have another small prize to spur you on: a book. And it goes to the best question.

Q: – Hello. Thanks for the talk. A question. You monitor payments, you monitor the services you communicate with... But how do you monitor that a person came to your payment page, made a payment, and the project credited them with money? That is, how do you monitor that the merchant is available and has accepted your callback?

    EC: - "Merchant" for us in this case is exactly the same external service as the payment system. We monitor the speed of the merchant's response.

About database encryption

Q: – Hello. I have a small question. You have PCI DSS-sensitive data. I wanted to know how you store PANs in the queues you need to pass them through. What kind of encryption do you use? And from there, a second question: under PCI DSS, the database must be periodically re-encrypted in case of changes (dismissal of administrators, and so on). What happens to availability in that case?


EC: – A great question! First, we do not store PANs in queues. We do not have the right to store a PAN anywhere in the clear, in principle, so we use a special service (we call it "Kademon"): a service that does only one thing: it takes a message as input and returns an encrypted message. And we store everything as that encrypted message. Accordingly, our key length is just under a kilobyte, so that it is seriously, properly strong.

Q: – And do you need 2 kilobytes now?

EC: – It seems like just yesterday it was 256... Where else can it go?!

Accordingly, that is the first part. Second, the existing solution supports the re-encryption procedure: there are two pairs of KEKs (keys) that produce DEKs, which do the encrypting (KEKs are key-encryption keys; DEKs are the derived data keys that encrypt). And when the procedure is initiated (it happens regularly, every 3 months, plus or minus), we upload a new pair of KEKs, and the data gets re-encrypted. We have separate services that pull out all the data and encrypt it anew; next to the data we store the identifier of the key it was encrypted with. Accordingly, as soon as we have encrypted the data with new keys, we delete the old key.
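What is described here is classic envelope encryption. Below is a toy PHP sketch of the record format, with libsodium's secretbox standing in for whatever cipher "Kademon" actually uses (both keys here are 32-byte secretbox keys):

```php
<?php
function encryptMessage(string $plaintext, string $kekId, string $kek): array
{
    // fresh data-encryption key (DEK) for this record
    $dek   = random_bytes(SODIUM_CRYPTO_SECRETBOX_KEYBYTES);
    $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $payload = sodium_crypto_secretbox($plaintext, $nonce, $dek);

    // wrap the DEK with the key-encryption key (KEK) and store the KEK id
    // next to the record, so rotation can re-encrypt record by record and
    // then drop the old key
    $wrapNonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $wrapped   = sodium_crypto_secretbox($dek, $wrapNonce, $kek);

    return [
        'kek_id'      => $kekId,
        'wrapped_dek' => base64_encode($wrapNonce . $wrapped),
        'payload'     => base64_encode($nonce . $payload),
    ];
}
```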

Sometimes payments need to be made manually...

Q: – That is, if a refund comes in for some operation, you will for now decrypt it with the old key?

EC: – Yes.

Q: – Then one more small question. When some kind of failure, crash, or incident occurs, the transaction has to be pushed through in manual mode. Such situations happen.

EC: – Yes, sometimes.

Q: – Where do you get that data from? Or do you walk into that vault by hand yourselves?

EC: – No, of course not: we have a kind of back-office system that contains an interface for our support. If we do not know what status a transaction is in (for example, the payment system did not respond before the timeout), we do not know a priori; we assign a final status only with full confidence. In such a case, we move the transaction into a special status for manual processing. The next morning, as soon as support receives information that such-and-such transactions are still sitting in the payment system, they process them manually in this interface.


Q: – I have a couple of questions. One continues the PCI DSS topic: how do you take logs out of that circuit? I ask because a developer could put anything into the logs! Second question: how do you roll out hotfixes? Hand edits in the database are one option, but there may also be code hotfixes: what is the procedure there? And the third question is probably about RTO and RPO. Your availability was 99.97, almost four nines, but as I understand it, you have a second data center, and a third data center, and a fifth data center... How do you synchronize them, replicate, and everything else?

EC: – Let's start with the first one, about the logs. When logs are written, we have a layer that masks all sensitive data: it looks at masks and at additional fields. Accordingly, our logs leave the PCI DSS circuit with the data already masked. This is one of the regular duties assigned to the testing department: they are required to check every task, including the logs it writes, and it is one of the regular checks during code review, to make sure the developer did not log something he should not. Verification after that is carried out regularly by the information security department, about once a week: logs for the previous day are sampled and run through a special scanner-analyzer from the test servers to check all of this.
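As an illustration only, a PAN-masking pass in such a layer might look like this; the regex is deliberately simplified, and a real masker would also check field names and Luhn validity:

```php
<?php
function maskPans(string $line): string
{
    return preg_replace_callback(
        '/\b(\d{6})(\d{3,9})(\d{4})\b/',
        // PCI DSS allows showing at most the first six and last four digits
        fn (array $m) => $m[1] . str_repeat('*', strlen($m[2])) . $m[3],
        $line
    );
}

echo maskPans('pan=4276123456789010 amount=10.00');
// pan=427612******9010 amount=10.00
```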
About hotfixes: this is part of our deployment regulations, which have a separate item about hotfixes. We consider that we deploy hotfixes around the clock, whenever needed. As soon as the version is built, as soon as it has been run through, as soon as we have an artifact, a system administrator on duty from support deploys it at the moment it is needed.

    About "four nines". The figure that we now have was really achieved, and we were striving for it at one more data center. Now we have a second data center, and we are starting to route between them, and the issue of cross data center replication is really a non-trivial issue. We tried to solve it at one time by various means: we tried to use the same Tarantula - it didn’t work for us, I say right away. Therefore, we came to the fact that we make the order of "sense" manually. We have each application, in fact, in the asynchronous mode of the necessary synchronization "change - done" drives between data centers.

Q: – If you have a second one, why has a third not appeared? Because nobody has solved split-brain yet...

EC: – But we do not have a split-brain. Because every application runs multi-master, it does not matter to us which data center a request arrives at. We are prepared for the case when one of our data centers crashes (we count on that) and a user's request switches to the second data center mid-flight; we are ready to lose that user, truly, but these will be single users, absolute units.

Q: – Good evening. Thanks for the talk. You talked about your Debugger, which runs test transactions in production. Tell us about the test transactions! How deep do they go?

EC: – They go through the full cycle of the entire component. For a component, there is no difference between a test transaction and a live one. From the point of view of the logic, it is simply a separate project in the system, through which only test transactions run.

Q: – Where do you cut them off? The core sent...

EC: – In this case, for test transactions, it happens behind the core... We have such a thing as routing: the core knows which payment system to send to, and we send the test transactions to a fake payment system that simply returns an HTTP acknowledgment, and that's it.

Q: – Tell me, please: is your application written as one huge monolith, or did you cut it into services or even microservices?

EC: – We do not have a monolith, of course; we have a service-oriented application. We joke that our services are monoliths: each one really is big enough. Calling them microservices is more than the tongue can manage, but they are the services inside which the workers on distributed machines run.

If a service on a server is compromised...

Q: – Then my next question. Even if it were a monolith, you still said that you have many of these instance servers, and they all process data in principle, so the question is: if one of the instance servers or applications is compromised, is there any access control between the individual links? Which of them can do what? Whom can each contact, and for what information?


EC: – Yes, definitely. The security requirements are quite serious. First, data traffic is open only on those ports over which we expect traffic. If a component talks to the database (say, MySQL) over port 5-4-3-2, then only 5-4-3-2 is open to it, and other ports and other traffic directions are unavailable. Beyond that, you have to understand that in our production there are about 10 different security circuits. And even if an application is compromised somehow, God forbid, the attacker will not get access to the server management console, because that is a different network security zone.

Q: – In this context, I am more interested in the contracts between services: what they can do, through which "actions" they can contact each other... In a normal flow, a specific service requests some specific list of "actions" from another. In a normal situation they do not turn to the others, and they have different areas of responsibility. If one of them is compromised, will it be able to pull the "actions" of that service?

EC: – I understand. If, in a normal situation, communication with the other server was allowed at all, then yes. By the SLA contract, we do not check that only the first 3 "actions" are allowed to you and the 4th is not. That would probably be redundant for us, because in principle we already have a 4-level protection system for the circuits. We prefer to defend at the perimeter of the circuits rather than at the level of the internals.

How Visa, MasterCard and Sberbank work

Q: – I want to clarify the point about switching a user from one data center to another. As far as I know, Visa and MasterCard work over the binary synchronous protocol 8583, with "mixes" involved. And I wanted to know, when you say switching, do you mean switching directly at the Visa and MasterCard level, or at the level of payment systems and processing?

EC: – Up to the "mixes". We have the "mixes" in one data center.

Q: – Roughly speaking, you have a single connection point?

    EC: - "Visa" and "MasterCard" - yes. Just because "Visa" and "MasterCard" require quite serious investments in infrastructure to conclude separate contracts for the second pair of mixes, for example. They are reserved within the same data center, but if, God forbid, our data center dies, where there are mixes for connecting to Visa and MasterCard, then communication with Visa and MasterCard will be lost ...

Q: – How can they be reserved at all? I know that Visa allows keeping only one connection in principle!

EC: – They supply the equipment themselves. In any case, the equipment we received is redundant at the hardware level internally.

Q: – So the stand is from their Connects Orange...?

EC: – Yes.

Q: – But what happens in that case: if your data center disappears, how can you continue working? Or does the traffic simply stop?

EC: – No. In that case, we simply switch traffic to another channel, which, of course, will be more expensive for us and for the customers. But the traffic will go not through our direct connection to Visa and MasterCard, but through a conditional Sberbank (very much exaggerated).

I apologize profusely if I have offended any Sberbank employees. But according to our statistics, among Russian banks, Sberbank falls over most often. Not a month goes by without something falling off at Sberbank.



Source: habr.com
