A network that heals itself: the magic of the Flow Label and a detective story around the Linux kernel. A Yandex report

A modern data center has hundreds of active devices covered by different kinds of monitoring. But even a perfect engineer with perfect monitoring in hand will need at least a few minutes to respond properly to a network failure. At the Next Hop 2020 conference I presented a data center network design methodology with one unique feature: the data center heals itself in milliseconds. More precisely, the engineer calmly fixes the problem, while the services simply never notice it.

- To begin with, I will give a fairly detailed introduction for those who may not be familiar with the structure of a modern DC.

For many network engineers, the data center network begins, of course, with the ToR, the switch in the rack. A ToR usually has two types of links. Some go down to the servers; the others — there are N times more of them — go toward the first-level spines, that is, they are its uplinks. The uplinks are usually treated as equal, and traffic between them is balanced on a 5-tuple hash that includes proto, src_ip, dst_ip, src_port and dst_port. There are no surprises here.
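
To make the balancing step concrete, here is a minimal sketch of the idea (my own illustration with a toy hash; a real switch ASIC uses its own hash function): the 5-tuple is hashed, and the result picks one of the uplinks, so all packets of one flow follow the same path.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative FNV-1a-style mixing; real switches use their own (undisclosed) hashes. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 16777619u;
    return h;
}

struct five_tuple {
    uint8_t  proto;
    uint32_t src_ip, dst_ip;     /* shortened to 32-bit fields for brevity */
    uint16_t src_port, dst_port;
};

/* Every packet of a flow hashes to the same uplink, so packets within one
 * flow are never reordered; different flows spread across the uplinks. */
static unsigned pick_uplink(const struct five_tuple *t, unsigned n_uplinks)
{
    uint32_t h = 2166136261u;
    h = mix(h, t->proto);
    h = mix(h, t->src_ip);
    h = mix(h, t->dst_ip);
    h = mix(h, t->src_port);
    h = mix(h, t->dst_port);
    return h % n_uplinks;
}

int main(void)
{
    struct five_tuple t = { 6, 0x0a000001, 0x0a000002, 43512, 443 };
    printf("flow goes out via uplink %u of 4\n", pick_uplink(&t, 4));
    return 0;
}
```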

Next, what does the plane architecture look like? The first-level spines are not connected to each other; instead, they are connected through superspines. The letter X will stand for the superspines — it is almost like a cross-connect.

And, clearly, on the other side the ToRs are connected to all the first-level spines. What matters in this picture? If the communication stays inside the rack, it of course goes through the ToR. If it stays inside the module, it goes through the first-level spines. If it is inter-module — as here, between ToR 1 and ToR 2 — then it goes through the spines of both the first and the second level.

Theoretically, such an architecture scales easily. If we have port capacity, a reserve of space in the data center and pre-laid fiber, the number of planes can always be increased, thereby increasing the overall capacity of the system. On paper this is very easy to do. If only it were like that in real life as well. But today's story is not about that.

I want the right conclusions to be drawn here. We have many paths inside the data center, and they are conditionally independent. There is only a single path when traffic stays inside a ToR. Inside a module we have as many paths as there are planes. The number of paths between modules equals the number of planes multiplied by the number of superspines in each plane. To make the scale more tangible, I will give the numbers that hold for one of the Yandex data centers.

There are eight planes, and each plane has 32 superspines. As a result, there are eight paths inside a module, and for inter-module traffic there are already 256 of them.

So, if we are writing a cookbook on how to build fault-tolerant data centers that heal themselves, the plane-based architecture is the right choice. It solves the scaling problem and, in theory, it is easy; there are many independent paths. The question remains: how does such an architecture survive failures? Failures come in different kinds, and that is what we will discuss now.

Let one of our superspines fall ill. Here I have gone back to the two-plane architecture; we will stick with it as the example because it is simply easier to see what is going on with fewer moving parts. So, let X11 fall ill. How will this affect the services living inside the data center? A lot depends on what the failure actually looks like.

If it is a "good" failure, one that is caught by automation at the level of, say, BFD, then the automation happily shuts down the problem links and isolates the fault, and everything is fine. We have many paths, traffic is instantly rerouted to alternative routes, and the services do not notice anything. This is the good scenario.

The bad scenario is when we have continuous packet loss and the automation does not notice the problem. To understand how this affects applications, we will have to spend a little time on how TCP works.

I hope I will not shock anyone with this: TCP is an acknowledgment-based protocol. In the simplest case, the sender sends two packets and receives a cumulative ACK for them: "I have received two packets."

After that it will send two more packets and the situation repeats. I apologize in advance for the simplification: this is what happens if the window (the number of packets in flight) is two. In general, of course, that is not necessarily the case, but the window size does not change anything for the packet-forwarding story.

What happens if we lose packet 3? In that case the receiver gets packets 1, 2 and 4, and it explicitly tells the sender using the SACK option: "You know, three packets arrived, but the one in the middle was lost." It says "ACK 2, SACK 4".

The sender then simply retransmits exactly the packet that was lost, without any problems.

But if the last packet in the window is lost, the situation will look very different.

The receiver gets the first three packets and, first of all, starts waiting. Because of optimizations in the Linux kernel's TCP stack, it waits for a paired packet unless the flags explicitly indicate that this is the last packet or something of that kind. It waits until the delayed-ACK timeout expires and only then sends an acknowledgment for the first three packets. Now it is the sender's turn to wait. It does not know whether the fourth packet has been lost or is just about to arrive, and in order not to overload the network it waits for either an explicit indication that the packet is lost or for the RTO timer to expire.

What is the RTO timeout? It is the maximum of the RTT estimate calculated by the TCP stack and a certain constant. What that constant is, we will discuss right now.

What is important is that if we are unlucky again and the fourth packet is lost once more, the RTO doubles. Every unsuccessful attempt doubles the timeout.

Now let us see what this base value is. By default the minimum RTO is 200 ms — that is the minimum RTO for data packets. For SYN packets it is different: 1 second. As you can see, even the first attempt to retransmit a packet takes at least a hundred times longer than the RTT inside a data center.
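
To feel the scale, here is a tiny back-of-the-envelope calculation (assuming the RTT-derived term is negligible, which it is at roughly 1 ms RTT inside a data center): with the 200 ms floor and exponential backoff, a handful of unlucky retransmissions already stalls a flow for seconds.

```c
#include <stdio.h>

int main(void)
{
    double rto_ms = 200.0;   /* default minimum RTO for data packets */
    double total_ms = 0.0;

    for (int attempt = 1; attempt <= 5; attempt++) {
        total_ms += rto_ms;
        printf("retransmit #%d fires after %4.0f ms (total stall so far: %5.0f ms)\n",
               attempt, rto_ms, total_ms);
        rto_ms *= 2.0;       /* every unsuccessful attempt doubles the timeout */
    }

    /* For SYN packets the starting value is 1000 ms, so the same five attempts
     * would stall connection setup for 1 + 2 + 4 + 8 + 16 = 31 seconds. */
    return 0;
}
```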

Now back to our scenario. What is happening to the service? The service starts losing packets. Suppose the service is lucky at first and loses something in the middle of the window; then it receives a SACK and retransmits the lost packets.

But if the bad luck repeats, we get an RTO. What matters here? Yes, we have many paths in the network, but the traffic of one particular TCP connection will keep going over the same broken path. Packet loss — as long as our magic X11 does not go down on its own — does not make traffic flow around the problem area onto healthy paths; we keep trying to deliver the packet over the same broken path. This leads to a cascading failure: a data center is a set of interacting applications, and some TCP connections of all those applications start degrading, because a sick superspine affects every application inside the data center. As in the proverb: for want of a nail the shoe was lost, for want of a shoe the horse was lost, for want of a horse the message was lost, and then the battle and the war. Only here the clock runs in seconds from the moment the problem appears to the point where services feel the degradation. Which means that somewhere users may not get what they asked for.

There are two classic solutions that complement each other. The first is that services try to cushion the blow and solve the problem on their own: "Let's tweak something in the TCP stack, let's add application-level timeouts or long-lived TCP sessions with internal health checks." The problem is that such solutions (a) do not scale at all and (b) are very poorly tested. That is, even if a service accidentally tunes its TCP stack in a way that makes things better, first, this is unlikely to apply to all applications and all data centers, and second, most likely it will not understand what was done right and what was not. So it works, but it works poorly and does not scale. And if there is a network problem, who is to blame? The NOC, of course. And what does the NOC do?

Many services believe that work at the NOC goes something like this. But, to be honest, they are not the only ones who think so.

In the classical scheme, the NOC develops a lot of monitoring — both black-box and white-box. Alexander Klimenko talked about an example of black-box monitoring of spines at the previous Next Hop. By the way, this monitoring works. But even perfect monitoring has a time lag, usually a few minutes. After it fires, the on-duty engineers need time to double-check it, localize the problem, and then shut down the problem area. That is, in the best case handling the problem takes 5 minutes, and in the worst case 20 minutes, if it is not immediately obvious where the losses are. It is clear that all this time — 5 or 20 minutes — our services keep hurting, which is probably not good.

What would we like to get? We have so many paths, and problems arise precisely because unlucky TCP flows keep using the same route. We need something that lets a single TCP connection use multiple routes. It would seem there is a ready-made solution: there is a TCP that is literally called that — Multipath TCP, TCP for many paths. True, it was designed for a completely different task — for smartphones that have several network interfaces. To maximize throughput, or to get a primary/backup mode, a mechanism was developed that transparently creates several subflows (sessions) for the application and lets you switch between them on failure — or, as I said, maximize the bandwidth.

But there is a nuance. To understand it, we have to look at how the subflows are established.

Subflows are established sequentially. The first subflow is established first; the subsequent ones are then set up using the keys already negotiated within that first subflow. And here lies the problem.

The problem is that if the first subflow does not get established, the second and third subflows will never come up. That is, Multipath TCP does not solve the problem of losing the SYN of the first subflow, and if the SYN is lost, Multipath TCP degenerates into ordinary TCP. So in a data center environment it will not help us deal with losses in the fabric and learn to use multiple paths on failure.

What can help us? Some of you have already guessed from the title that the key field in the rest of this story is the IPv6 flow label header field. It is a field that exists in v6 and not in v4, it is 20 bits long, and its use has been debated for a long time. It is very interesting: there were disputes, something was pinned down in RFCs, and in parallel an implementation appeared in the Linux kernel that was never documented anywhere.
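
For reference, this is where those 20 bits live: they are the low 20 bits of the first 32-bit word of the IPv6 header, right after the 4-bit version and the 8-bit traffic class. A small sketch of packing and unpacking that word (host byte order, purely illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define FLOWLABEL_MASK 0x000FFFFFu   /* the flow label is 20 bits wide */

/* First 32-bit word of the IPv6 header: version (4 bits) | traffic class (8) | flow label (20). */
static uint32_t make_first_word(uint8_t traffic_class, uint32_t flowlabel)
{
    return (6u << 28) | ((uint32_t)traffic_class << 20) | (flowlabel & FLOWLABEL_MASK);
}

static uint32_t get_flowlabel(uint32_t first_word)
{
    return first_word & FLOWLABEL_MASK;
}

int main(void)
{
    uint32_t w = make_first_word(0, 0xABCDE);
    printf("first word = 0x%08X, flow label = 0x%05X\n", w, get_flowlabel(w));
    return 0;
}
```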

I suggest you join me on a little investigation. Let's take a look at what's been happening in the Linux kernel over the past few years.

The year is 2014. An engineer from a large and reputable company adds to the Linux kernel a dependency of the flow label value on the socket hash. What were they trying to fix? This is related to RFC 6438, which discussed the following problem. Inside a data center, IPv4 is often encapsulated in IPv6 packets, because the fabric itself is IPv6 while IPv4 still has to be delivered somehow. For a long time there were problems with switches that could not look beneath two IP headers to reach TCP or UDP and find src_port and dst_port there. As a result, if you only looked at the outer IP headers, the hash turned out to be almost constant. To avoid this, so that balancing of this encapsulated traffic works correctly, it was proposed to add a hash of the encapsulated packet's 5-tuple into the flow label field. Roughly the same was done for other encapsulation schemes, for UDP and for GRE; in the latter case the GRE Key field was used. One way or another, the goals here are clear, and at least at that point in time they were useful.

In 2015 a new patch arrives from the same respected engineer. It is a very interesting one. It says the following: we will randomize the hash after a negative routing event. What is a negative routing event? It is the RTO we discussed earlier: losing the tail of the window is an event that really is negative. True, it is relatively hard to guess that this is what is meant.

2016, another respected company, also a big one. It finishes off the remaining rough edges and makes it so that the hash we previously randomized is now changed on every SYN retransmit and after every RTO timeout. And in that email, for the first and last time, the end goal is spelled out: to give traffic, in case of losses or channel congestion, the ability to be softly rerouted over multiple paths. Of course, after that there were plenty of publications; you can easily find them.

Although no, you can't, because there has not been a single publication on this topic. But we know!

And if you do not fully understand what was done, I will tell you now.

So what was done, what functionality was added to the Linux kernel? txhash changes to a random value after every RTO event — that same negative routing result. The skb hash depends on this txhash, and the flow label depends on the skb hash. There is some function plumbing in between that does not fit on one slide; if anyone is curious, you can walk through the kernel code and check.
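
Here is a toy model of that chain (my simplification for illustration, not the kernel code; in the real kernel the plumbing goes through the socket's txhash and the skb hash, as the slide says):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy model; the names are mine, not the kernel's. Per-connection state: a
 * transmit hash that normally stays constant for the whole life of the
 * connection, so every packet carries the same flow label and therefore
 * follows the same ECMP path. */
struct conn {
    uint32_t txhash;
};

static uint32_t flowlabel_of(const struct conn *c)
{
    return c->txhash & 0x000FFFFFu;   /* only 20 bits fit into the flow label */
}

/* Negative routing event (RTO, SYN retransmit): re-roll the transmit hash.
 * That yields a new flow label and, on a fabric that hashes on the flow
 * label, a new path through the network. */
static void on_negative_routing_event(struct conn *c)
{
    c->txhash = (uint32_t)rand();
}

int main(void)
{
    srand((unsigned)time(NULL));
    struct conn c = { .txhash = (uint32_t)rand() };

    printf("flow label before RTO: 0x%05X\n", flowlabel_of(&c));
    on_negative_routing_event(&c);
    printf("flow label after  RTO: 0x%05X\n", flowlabel_of(&c));
    return 0;
}
```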

What is important here? The value of the flow label field changes to a random number after each RTO. How does this affect our unlucky TCP stream?

In the SACK case nothing changes, because we are retransmitting a packet we know has been lost. So far, so good.

But in the RTO case, provided we have added the flow label to the hash function on the ToR, the traffic can take a different route. And the more planes there are, the more likely it is to find a path unaffected by the failure of a particular device.
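
A rough way to quantify "more likely" (my own back-of-the-envelope estimate, assuming exactly one of K equal-cost paths is broken and a fresh random flow label lands on each path with equal probability): every re-roll hits the broken path again with probability 1/K, so the chance of still being stuck after m re-rolls is (1/K)^m.

```c
#include <stdio.h>

int main(void)
{
    /* Path counts from earlier in the talk: 8 inside a module, 256 between modules. */
    unsigned path_counts[] = { 8, 256 };

    for (int i = 0; i < 2; i++) {
        unsigned k = path_counts[i];
        double p_stuck = 1.0 / k;   /* a re-rolled label maps to the broken path again */
        printf("K = %3u paths: P(still broken after one re-roll) = %.4f, after two = %.6f\n",
               k, p_stuck, p_stuck * p_stuck);
    }
    return 0;
}
```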

One problem remains: the RTO itself. Another route is found, of course, but a lot of time is wasted on it. 200 ms is a lot; a second is simply outrageous. Earlier I mentioned the timeouts that services configure. Well, one second is the kind of timeout a service usually sets at the application level, and the service is even relatively justified in that. Moreover, I repeat, the real RTT inside a modern data center is around 1 millisecond.

What can be done about the RTO timeouts? The timeout responsible for RTO on data packet loss is relatively easy to configure from user space: there is the ip utility, and one of its route parameters is that same rto_min. Given that you should, of course, tune the RTO not globally but only for specific prefixes, this mechanism looks quite workable.
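
For example, something along these lines (a hedged illustration: the prefix and next hop are placeholders, and iproute2 syntax can differ slightly between versions): `ip -6 route replace 2001:db8::/32 via fe80::1 dev eth0 rto_min 10ms`. This lowers the minimum RTO only for traffic toward that prefix, which is exactly the "not globally, but for given prefixes" approach mentioned above.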

With SYN_RTO, though, things are somewhat worse. It is nailed down by nature: the value is hard-coded in the kernel — 1 second — and that's it. You cannot reach it from user space. There is only one way.

eBPF comes to the rescue. Simply put, these are small C programs that can be attached to hooks at various points in the kernel and the TCP stack, and with them you can change a very large number of settings. In general, eBPF is a long-term trend: instead of carving out dozens of new sysctl parameters and extending the ip utility, the movement is toward eBPF and extending its functionality. With eBPF you can dynamically change congestion control and various other TCP settings.

What matters to us is that it lets you tweak the SYN_RTO value. And there is a publicly available example: https://elixir.bootlin.com/linux/latest/source/samples/bpf/tcp_synrto_kern.c. What does it do? The example works, but it is quite rough in itself. It assumes that inside the data center we compare the first 44 bits of the addresses; if they match, we consider ourselves inside the DC, and in that case we change the SYN_RTO timeout to 4 ms. The same task could be done much more gracefully, but this simple example shows that it is (a) possible and (b) relatively easy.
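
Roughly, the shape of such a program is shown below. This is a from-memory paraphrase of the linked sample, not a verbatim copy, so treat the details (headers, the units of the returned value) as approximate and check the sample itself before relying on it. It is a sock_ops program: on the BPF_SOCK_OPS_TIMEOUT_INIT hook it compares the first 44 bits of the local and remote IPv6 addresses and, if they match ("we are inside the DC"), returns a much smaller initial SYN-RTO through skops->reply.

```c
/* Sketch of a sock_ops program in the spirit of samples/bpf/tcp_synrto_kern.c. */
#include <uapi/linux/bpf.h>
#include <linux/socket.h>      /* AF_INET6 */
#include "bpf_helpers.h"
#include "bpf_endian.h"

SEC("sockops")
int set_dc_syn_rto(struct bpf_sock_ops *skops)
{
    int rv = -1;               /* -1 means "keep the kernel default" */

    if (skops->op == BPF_SOCK_OPS_TIMEOUT_INIT &&
        skops->family == AF_INET6 &&
        skops->local_ip6[0] == skops->remote_ip6[0] &&            /* first 32 bits match */
        (bpf_ntohl(skops->local_ip6[1]) & 0xFFF00000) ==
        (bpf_ntohl(skops->remote_ip6[1]) & 0xFFF00000))           /* ...and the next 12 */
        rv = 10;               /* much smaller initial SYN-RTO; the units follow the
                                  kernel's internal timer granularity, see the sample */

    skops->reply = rv;
    return 1;
}

char _license[] SEC("license") = "GPL";
```

Sock_ops programs of this kind are attached to a cgroup (the BPF_CGROUP_SOCK_OPS attach type), which is what the accompanying user-space loader in the samples directory does.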

What do we know so far? The plane architecture scales, and it turns out to be extremely useful to us once we enable the flow label in the hash on the ToR and gain the ability to flow around problem areas. The best way to lower the RTO and SYN-RTO values is eBPF programs. The remaining question is: is it safe to use the flow label for balancing? And there is a nuance here.

Suppose you have a service on the network that lives behind anycast. Unfortunately I do not have time to go into anycast in detail, but it is a distributed service in which different physical servers are reachable at the same IP address. And here is the possible problem: an RTO event can occur not only when traffic goes through the fabric. It can also occur at the level of ToR buffers — during an incast event — and it can even occur on the host, when the host itself drops something. An RTO event occurs and changes the flow label, and in that case the traffic may land on a different anycast instance. Suppose it is a stateful anycast that keeps connection state — an L3 balancer or some other service. Then a problem arises, because after the RTO the TCP connection arrives at a server that knows nothing about this TCP connection. And if there is no state sharing between the anycast servers, such traffic is dropped and the TCP connection breaks.

What can be done here? In your controlled environment, where you enable balancing on the flow label, you must pin the flow label value when talking to anycast servers. The easiest way is to do it with the same eBPF program. But here is a very important point: what if you do not run a data center network but are a telecom operator? This concerns you too: starting from certain software versions, Juniper and Arista include the flow label in the hash function by default — to be honest, for reasons I do not understand. This can cause TCP connections of users passing through your network to be torn down. So I strongly recommend checking your router settings in this spot.

One way or another, it seems to me that we are ready to move on to experiments.

Once we had turned on the flow label on the ToRs and prepared the eBPF agent that now lives on the hosts, we decided not to wait for the next big failure but to run controlled explosions. We took a ToR with four uplinks and set up packet drops on one of them. We wrote a rule saying: you now drop every packet. As you can see on the left, our per-packet monitoring dipped to 75%, meaning 25% of packets were lost. On the right are the graphs of the services living behind this ToR — in fact, traffic graphs of the links to the servers inside the rack. As you can see, they sank even lower. Why did they sink not by 25% but, in some cases, by a factor of 3-4? If a TCP connection is unlucky, it keeps trying to get through the broken interface. This is made worse by the typical behavior of a service inside the DC: one user request generates N requests to internal services, and the response goes back to the user either when all the data sources have replied or when an application-level timeout fires — which someone still has to configure. In other words, everything is very, very bad.

Now the same experiment, but with the flow label mechanism enabled. As you can see on the left, our per-packet monitoring still sank by the same 25%. That is absolutely correct: it knows nothing about retransmits; it just sends packets and counts the ratio of delivered to lost packets.

And on the right is the graph of the services. You will not find any effect of the problem link here. In those same milliseconds, traffic flowed away from the problem area onto the three remaining uplinks, which were not affected. We got a network that heals itself.

This is my last slide; time to take stock. Now, I hope, you know how to build a self-healing data center network. You will not need to dig through the Linux kernel archives looking for those patches; you know that in this case the flow label solves the problem, but you have to approach the mechanism carefully. And I stress once more: if you are a carrier, you should not use the flow label in your hash function, otherwise you will break your users' sessions.

For network engineers, a conceptual shift has to happen: the network starts not with the ToR, not with a network device, but with the host. A fairly striking example is how we use eBPF both to change the RTO and to pin the flow label toward anycast services.

The flow label mechanism is certainly suitable for other uses within a controlled administrative domain. It can be applied to traffic between data centers, or you can use the same mechanics in a special way to steer outgoing traffic. But I hope to talk about that next time. Thank you very much for your attention.

Source: habr.com