How Badoo achieved the ability to render 200k photos per second

The modern web is almost unthinkable without media content: almost every grandmother has a smartphone, everyone is on social networks, and downtime is costly for companies. What follows is a transcript of Badoo's story about how the company organized photo delivery using a hardware solution, what performance problems it ran into along the way, what caused them, and how they were solved with a software solution based on Nginx, while ensuring fault tolerance at every level (video). We thank the authors, Oleg "Sannis" Efimov and Alexander Dymov, who shared their experience at the Uptime day 4 conference.

Let's start with a short introduction to how we store and cache photos. We have a layer where we store them and a layer where we cache them. At the same time, if we want to achieve a high hit rate and reduce the load on the storage, it is important that each individual user's photos live on one caching server. Otherwise we would need as many times more disks as we have caching servers. Our hit rate is around 99%, that is, we reduce the load on our storage by a factor of 100, and to achieve this 10 years ago, when all this was being built, we had 50 servers. Accordingly, to serve these photos we needed, in fact, 50 external domains served by these servers.

Naturally, the question immediately arose: if one of our servers goes down and becomes unavailable, what part of the traffic do we lose? We looked at what was on the market and decided to buy a piece of hardware that would solve all our problems. The choice fell on a solution from F5 Networks (which, by the way, acquired NGINX, Inc. not so long ago): the BIG-IP Local Traffic Manager.

What this appliance (LTM) does: it is a hardware router that provides hardware redundancy of its external ports, lets you route traffic based on the network topology and various settings, and performs health checks. What mattered to us is that this appliance can be programmed, so we could describe the logic of how a given user's photos are served from a particular cache. What does it look like? There is a piece of hardware that faces the Internet on one domain and one IP, does SSL offload, parses HTTP requests, selects the cache number from an iRule to decide where to go, and sends the traffic there. At the same time it performs health checks, and if a machine is unavailable, we arranged it so that traffic went to a single backup server. Configuration-wise there are, of course, some nuances, but in general everything is quite simple: we declare a map where a certain number corresponds to an IP on our network, we say that we will listen on ports 80 and 443, we say that if a server is unavailable, traffic should go to the backup, number 35 in this case, and we describe a bunch of logic for how this architecture should work. The only problem is that the appliance is programmed in Tcl. If anyone remembers it at all... this language is more write-only than a language convenient for programming:
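A hypothetical sketch of what such an iRule might look like in Tcl; the URI scheme, pool names and fallback rule here are purely illustrative, not Badoo's actual code:

```tcl
# Route each photo request to the pool matching its cache number;
# fall back to the shared backup pool if the cache has no healthy members
when HTTP_REQUEST {
    # extract the cache number encoded in the request URI, e.g. /pc7/...
    if { [HTTP::uri] matches_regex {^/pc(\d+)/} } {
        regexp {^/pc(\d+)/} [HTTP::uri] -> cache_num
        set target "pool_photocache$cache_num"
        if { [active_members $target] > 0 } {
            pool $target
        } else {
            # no healthy members: send traffic to the backup server's pool
            pool pool_photocache_backup
        }
    }
}
```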

What did we get? We got a piece of hardware that ensures high availability of our infrastructure, routes all our traffic, performs health checks and just works. Moreover, it worked for quite a long time: over the past 10 years there were no complaints about it. By the beginning of 2018 we were already serving about 80k photos per second, which is roughly 80 gigabits of traffic from our two data centers.

But…

At the beginning of 2018 we saw an ugly picture on the charts: the time to serve photos had clearly increased, and it no longer suited us. The tricky part is that this behavior was visible only at the very peak of traffic - for our company that is the night from Sunday to Monday. The rest of the time the system behaved as usual, with no sign of breakage.

Nevertheless, the problem had to be solved. We identified possible bottlenecks and began eliminating them. First of all, of course, we expanded the external uplinks and carried out a complete audit of the internal uplinks, finding all possible bottlenecks. But none of this gave an obvious result: the problem did not go away.

Another possible bottleneck was the performance of the photo caches themselves, and we decided that perhaps the problem rested on them. So we expanded their capacity - mainly the network ports on the photo caches. But again, no clear improvement was seen. In the end, we paid close attention to the performance of the LTM itself, and here we saw a sad picture on the graphs: the load on all CPUs grows smoothly, then abruptly hits a ceiling. At the same time, LTM stops responding adequately to health checks on the uplinks and starts turning them off randomly, which leads to serious performance degradation.

So, we had found the source of the problem and identified the bottleneck. It remained to decide what to do.

The first obvious thing we could do is somehow upgrade the LTM itself. But there are nuances here, because this hardware is quite unique: you can't just walk into the nearest supermarket and buy it. It means a separate contract and a separate license agreement, and that takes a lot of time. The second option is to start thinking for yourself and build your own solution from your own components, preferably using open-source software. It only remained to decide what exactly to choose for this and how much time we would spend solving the problem, because users were not receiving photos. All of this therefore had to be done very, very quickly - yesterday, one might say.

Since the task sounded like "do something as quickly as possible, using the hardware we already have", the first thing we thought of was simply to take a few not-the-most-powerful machines off the front, put Nginx on them, which we know how to work with, and try to implement the same logic the appliance used to perform. That is, in fact, we kept our appliance, set up 4 more servers that we had to configure, and created external domains for them by analogy with how it was done 10 years ago... We lost a little availability if these machines went down, but nonetheless we solved our users' problem locally.

Accordingly, the logic stays the same: we install Nginx, it can do SSL offload, we can program the routing logic and health checks in the configs, and simply duplicate the logic we had before.

We sit down to write configs. At first it seemed that everything was very simple, but, unfortunately, it is very hard to find a manual for every task. So we do not advise simply googling "how to configure Nginx for photos": it is better to refer to the official documentation, which shows which settings to touch, and to choose the specific parameters yourself. Well, then everything is simple: we describe our servers, we describe the certificates... But the most interesting part is the routing logic itself.

At first it seemed that we simply describe our location, match the photo cache number in it, describe by hand or with a generator how many upstreams we need, specify in each upstream the server the traffic should go to, and a backup server in case the main server is unavailable:
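A minimal sketch of what such a config might look like; the server names, addresses and URI scheme are illustrative, not Badoo's actual values:

```nginx
# One upstream per photo cache; a single shared backup server takes over
# if the main cache is down (all names and addresses are hypothetical)
upstream photocache7 {
    server 192.168.0.17:80;
    server 192.168.0.35:80 backup;  # the dedicated backup machine
}

server {
    listen 443 ssl;
    # ... certificates here; SSL offload happens at this layer ...

    # requests like /pc7/... are matched to cache number 7
    location /pc7/ {
        proxy_pass http://photocache7;
    }
}
```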

But, probably, if everything were that simple, we would just go home and have nothing to tell. Unfortunately, with Nginx's default settings, which were shaped over many years of development and not quite for this use case, the behavior is as follows: if a request to an upstream server fails with an error or a timeout, Nginx always switches traffic to the next server. In addition, after the first failure the server is also taken out of rotation for 10 seconds, whether it failed with an error or with a timeout - and this cannot be configured at all. That is, even if we remove or reset the timeout option in the upstream directive, Nginx will not process the request and will respond with a not-so-nice error, but the server will still be taken out of rotation.

To avoid this, we did two things:

a) we manually forbade Nginx from doing this - and unfortunately, the only way to do that is simply to tune the max_fails setting.

b) we remembered that in other projects we use a module that allows background health checks - accordingly, we made the health checks quite frequent so that downtime in the event of an accident would be minimal.
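Assuming the background checks come from a third-party upstream check module (such as nginx_upstream_check_module), the combination of both steps might look roughly like this; all values are illustrative:

```nginx
upstream photocache7 {
    # (a) max_fails=0 disables Nginx's built-in failure accounting,
    # so a single error or timeout no longer ejects the server
    server 192.168.0.17:80 max_fails=0;
    server 192.168.0.35:80 max_fails=0 backup;

    # (b) frequent background health checks instead
    # (directive syntax of the upstream check module)
    check interval=1000 rise=2 fall=2 timeout=500 type=tcp;
}
```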

Unfortunately, even that is not all, because literally the first two weeks of running this scheme showed that a TCP health check is also an unreliable thing: Nginx may not be running on the upstream server at all, or may be stuck there in the D state, and in that case the kernel will still accept the connection, the health check will pass, but nothing will actually work. Therefore we immediately replaced it with an HTTP health check against a specific location: if it returns 200, then everything in that code path works. You can add extra logic - for example, on caching servers, check that the file system is correctly mounted:
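One way to build such an endpoint (the location and path here are hypothetical) is to serve a small marker file that lives on the cached file system, so the check fails whenever the mount is broken:

```nginx
# /mnt/photos/.healthcheck is a tiny file on the photo file system;
# if the mount is gone, the read fails and Nginx does not return 200
location = /hc {
    alias /mnt/photos/.healthcheck;
}
```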

And that would have suited us, except that at this point the scheme exactly repeated what the appliance did. But we wanted to do better. Previously we had one backup server, and that is probably not very good: if you have a hundred servers, then when several go down at once, a single backup server is unlikely to cope with the load. Therefore we decided to distribute the reserve across all servers: we simply made another, separate upstream, listed all the servers there with parameters matching the load each can serve, and added the same health checks we had before:
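Such a shared backup upstream might look like this, again assuming the third-party upstream check module; every address and weight is illustrative:

```nginx
# Every cache participates in the reserve, weighted by the load
# it can take on top of its own traffic
upstream photocache_backup {
    server 192.168.0.11:80 weight=10 max_fails=0;
    server 192.168.0.12:80 weight=10 max_fails=0;
    server 192.168.0.13:80 weight=5  max_fails=0;
    # ... the rest of the caches ...

    # same background checks as on the main upstreams,
    # now over HTTP against the health location
    check interval=1000 rise=2 fall=2 timeout=500 type=http;
    check_http_send "GET /hc HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx;
}
```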

Since you can't jump to another upstream from inside an upstream, we had to make sure that if the main upstream - the one where we simply wrote the correct, required photo cache - was unavailable, we went through error_page to a fallback, and from there to the backup upstream:
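The wiring between the main upstream and the backup upstream could be sketched like this (names are the same hypothetical ones as above):

```nginx
location /pc7/ {
    proxy_pass http://photocache7;
    # if the main upstream fails (bad gateway / gateway timeout),
    # hand the request over to the named fallback location
    error_page 502 504 = @fallback;
}

location @fallback {
    # retry the request against the shared backup upstream
    proxy_pass http://photocache_backup;
}
```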

And literally by adding four servers we got this: we moved part of the load off LTM onto these servers, implemented the same logic there using standard hardware and software, and immediately got a bonus in that these servers can be scaled - you can simply add as many as you need. Well, the only downside is that we lost high availability for external users. But at that moment we had to sacrifice it, because the problem had to be solved immediately. So, we moved off part of the load, about 40% at the time, LTM got better, and literally two weeks after the problem began we were serving not 45k requests per second but 55k. In fact, we grew by 20% - clearly traffic that we had not been delivering to users. After that we started thinking about how to solve the remaining problem: ensuring high external availability.

There was a pause during which we discussed which solution to use for this. There were proposals to ensure reliability using DNS, with the help of self-written scripts, or dynamic routing protocols... there were many options, but it had already become clear that truly reliable photo delivery requires introducing another layer that will monitor all this. We called these machines photo directors. The software we relied on was Keepalived:

To begin with, what does Keepalived consist of? The first part is the VRRP protocol, widely known to network engineers, which runs on network equipment and provides fault tolerance for the external IP address that clients connect to. The second part is IPVS, the IP Virtual Server, for balancing across the photo routers and providing fault tolerance at that level. And the third is health checks.

Let's start with the first part, VRRP. What does it look like? There is a certain virtual IP, which has an entry in DNS as badoocdn.com, and that is where clients connect. At any point in time the IP address lives on one server. Keepalived packets run between the servers using the VRRP protocol, and if the master disappears from the radar - the server rebooted, or something else happened - the backup server automatically brings up this IP address; no manual actions are needed. Master and backup differ mainly in priority: the higher it is, the more likely the machine is to become the master. A very big advantage is that you do not need to configure IP addresses on the server itself: it is enough to describe them in the config, and if the IP addresses need custom routing rules, that is also described directly in the config, in the same familiar syntax. You will not meet anything unfamiliar.
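A minimal sketch of the VRRP part of keepalived.conf; every address, interface name and priority here is illustrative, not Badoo's real configuration:

```conf
# Both photo directors carry this block; only the priority differs,
# and VRRP elects the master from it at runtime
vrrp_instance photo_director {
    state BACKUP              # role is decided by the election, not the config
    interface eth0
    virtual_router_id 42
    priority 150              # higher priority wins the master election
    advert_int 1              # advertisement interval, seconds

    virtual_ipaddress {
        192.0.2.10/24 dev eth0    # the VIP clients resolve via badoocdn.com
    }
}
```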

What does it look like in practice? What happens if one of the servers goes down? As soon as the master disappears, our backup stops receiving advertisements and automatically becomes the master. After some time we repaired the master, rebooted it, brought Keepalived up - advertisements with a higher priority than the backup's arrive, and the backup automatically steps down and removes the IP addresses from itself; no manual actions are needed.

Thus we have ensured fault tolerance of the external IP address. The next part is to somehow balance traffic from the external IP address to the photo routers, which already terminate it. With the balancing protocols everything is clear enough: it is either a simple round-robin, or slightly more complex things like wrr, least connections and so on. This is basically covered in the documentation; there is nothing special about it. But the delivery method is worth dwelling on in more detail - why we chose one of them. The options are NAT, Direct Routing and TUN. The thing is, we planned from the start for 100 gigabits of outgoing traffic from the sites. If you do the math, that takes ten 10-gigabit cards, right? Ten 10-gigabit cards in one server is already beyond, at least, our idea of "standard equipment". And then we remembered that we don't just give away some traffic, we give away photos.

What is special about that? A huge difference between incoming and outgoing traffic. Incoming traffic is very small, outgoing traffic is very large:

If you look at these graphs, you can see that at the moment about 200 megabits per second arrives at the director; this is the most typical day. We send back 4,500 megabits per second, a ratio of roughly 1 to 22. It is already clear that one server accepting connections is enough for us to fully serve the outgoing traffic of 22 working servers. This is where the direct routing delivery method comes to our aid.

What does it look like? Our photo director transfers connections to the photo routers according to its table. But the photo routers send the return traffic directly to the Internet, to the client; it does not go back through the photo director, so with a minimal number of machines we get full fault tolerance and can pump all the traffic. In the configs it looks like this: we specify the algorithm, in our case simple rr, we specify the direct routing method, and then we list all the real servers, however many we have, that will handle this traffic. If one or two more servers appear and the need arises, we simply add them to this section of the config and don't worry too much. On the real server side, on the photo router side, this method requires the most minimal configuration; it is perfectly described in the documentation, and there are no pitfalls.
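The IPVS part of keepalived.conf described above might be sketched like this; all addresses and ports are illustrative:

```conf
# Round-robin balancing with direct-routing delivery: the director
# only sees incoming packets, replies go straight to the client
virtual_server 192.0.2.10 443 {
    delay_loop 3          # seconds between health-check rounds
    lb_algo rr            # simple round-robin
    lb_kind DR            # direct routing delivery method
    protocol TCP

    real_server 192.0.2.21 443 {
        weight 1
    }
    real_server 192.0.2.22 443 {
        weight 1
    }
    # adding a new photo router is just one more real_server block
}
```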

What is especially pleasant is that such a solution does not require a radical rebuild of the local network; this was important for us, since we had to solve the problem at minimal cost. If you look at the output of the ipvsadm admin command, you can see what it looks like. Here we have a certain virtual server listening on port 443 and accepting connections, all the working servers are listed, and you can see that the connections are distributed plus or minus evenly. If we look at the statistics for the same virtual server, we have incoming packets and incoming connections, but absolutely no outgoing ones: outgoing connections go directly to the client. So, we managed to balance the load. Now, what happens if one of our photo routers fails? After all, hardware is hardware. It may go into a kernel panic, it may break, the power supply may burn out - anything. This is what health checks are for. They can be as simple as checking that a port is open, or more complex, up to self-written scripts that even check the business logic.

We stopped somewhere in the middle: we make an HTTPS request to a specific location, a script is called, and if it responds with a 200, we believe that everything is fine with this server, that it is alive and can easily be put back into rotation.
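Inside keepalived.conf, such a check could be attached to each real_server roughly like this; the path and timeouts are illustrative:

```conf
real_server 192.0.2.21 443 {
    weight 1
    SSL_GET {                     # health check over HTTPS
        url {
            path /hc
            status_code 200       # anything else counts as a failure
        }
        connect_timeout 2
        nb_get_retry 3            # marked "failed" after three attempts
        delay_before_retry 1
    }
}
```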

How does this look in practice, again? We turned off a server, say, for maintenance - flashing the BIOS, for example. In the logs we immediately get a timeout, we see the first line, then after three attempts the server is marked as "failed" and is simply removed from the list.

A second behavior is also possible, when the weight is simply set to zero instead, but for photo delivery that does not work well. The server comes back up, Nginx starts there, the health checks immediately see that the connection works and that everything is fine, the server appears in our list, and load automatically starts flowing to it. No manual actions are required from the on-duty administrator. A server rebooted at night - the monitoring department does not call us about it at night. They just let us know what happened; everything is fine.

So, in a fairly simple way, with the help of a small number of servers, we solved the problem of external fault tolerance.

It remains to say that all of this, of course, needs to be monitored. Separately, it should be noted that Keepalived, as software written a very long time ago, has a bunch of ways to be monitored: via checks over DBus, SMTP, SNMP, and standard Zabbix. Plus, it can send an email on almost every sneeze, and to be honest, at some point we even considered turning that off, because it sends a lot of mail on every traffic switchover, every activation, for every IP address and so on. Of course, with many servers you can flood yourself with these letters. We monitor Nginx on the photo routers using standard methods, and hardware monitoring has not gone anywhere either. We would advise two more things: first, external health and availability checks, because even if everything seems to work, users may still not be receiving photos due to problems with external providers or something more complex. It is always worth keeping a separate machine somewhere on another network, in Amazon or elsewhere, that can ping your servers from the outside. Second, it is worth using either anomaly detection, for those who are good at tricky machine learning, or simple monitoring, at least to track whether requests have dropped sharply or, on the contrary, grown. That is also useful.

To summarize: we replaced the hardware solution, which at some point ceased to suit us, with a fairly simple system that does all the same things, that is, it terminates HTTPS traffic and does smart routing with the necessary health checks. We have increased the stability of this system: we still have high availability at each layer, plus we got the bonus that it is all quite easy to scale at each layer, because it is standard hardware with standard software; by doing so, we have also simplified diagnosing possible problems.

What did we end up with? We had a problem over the January 2018 holidays. During the first six months, while we put this scheme into production and ramped it up to all traffic in order to take everything off LTM, traffic in one data center alone grew from 40 gigabits to 60 gigabits, and over the whole of 2018 we were able to serve almost three times more photos per second.

Source: habr.com
