How to scale data centers. Yandex report

We have developed a data center network design that allows deploying computing clusters of more than 100,000 servers with a peak bisection bandwidth of over one petabyte per second.

From Dmitry Afanasiev's report you will learn about the basic principles of the new design, how the topology scales, the problems that arise along the way and options for solving them, and about the specifics of routing and scaling the forwarding plane functions of modern network devices in "dense" (densely connected) topologies with a large number of ECMP routes. In addition, Dmitry briefly covers the organization of external connectivity, the physical layer, the cabling system and ways to further increase capacity.

- Good afternoon everyone! My name is Dmitry Afanasiev, I'm a Yandex network architect and mainly design data center networks.

My talk is about the updated network of Yandex data centers. It is largely an evolution of the design we already had, but there are some new elements as well. This is an overview presentation, because there was a lot of information to fit into a small amount of time. We will start with the choice of a logical topology, then go over control plane and data plane scalability issues and the choice of what happens at the physical level, and look at some features of the devices. We will also touch briefly on what is happening with MPLS in the data center, which we talked about some time ago.

So, what is Yandex in terms of workloads and services? Yandex is a typical hyperscaler. On the user-facing side we primarily process user requests, plus various streaming services and data delivery, since we also have storage services. Closer to the backend, infrastructure workloads and services appear: distributed object storage, data replication and, of course, persistent queues. One of the main classes of workloads is MapReduce and similar systems, stream processing, machine learning, and so on.

What does the infrastructure on top of which all this runs look like? Again, we are quite a typical hyperscaler, although perhaps a little closer to the smaller end of the hyperscaler spectrum. But we have all the attributes: we use commodity hardware and horizontal scaling wherever possible, and we do resource pooling at full scale. We do not work with individual machines or individual racks; we combine them into a large pool of interchangeable resources, with additional services that handle planning and allocation, and we work with that entire pool.

So there is a next level, the operating system of the computing cluster. It is very important that we have full control over the technology stack we use: we control the endpoints (hosts), the network and the software stack.

We have several large data centers in Russia and abroad, united by a backbone that uses MPLS. Our internal infrastructure is built almost entirely on IPv6, but since we need to serve external traffic, which still arrives mostly over IPv4, we have to somehow deliver IPv4 requests to the front-end servers and also go out a little to the external IPv4 Internet, for example for indexing.

The last few design iterations of our data center networks use layered Clos topologies and only L3. We left L2 behind a while ago and breathed a sigh of relief. Finally, our infrastructure includes hundreds of thousands of compute (server) instances. The maximum cluster size some time ago was about 10 thousand servers, largely determined by how far the cluster-level operating systems, schedulers, resource allocation and so on could go. Since there has been progress on the infrastructure software side, the target is now about 100 thousand servers in one computing cluster, and our task was to be able to build network fabrics that allow efficient pooling of resources in a cluster of that size.

What do we want from the data center network? First of all, a lot of cheap and fairly uniformly distributed bandwidth, because the network is the substrate that lets us pool resources. The new target size is about 100 thousand servers in one cluster.

Of course, we also want a scalable and stable control plane, because on an infrastructure this large plenty of headaches arise just from random events, and we do not want the control plane to add to them. At the same time, we want to minimize the state in it: the less state, the better and more stably everything works, and the easier it is to diagnose.

Of course, we need automation, because managing such an infrastructure manually is impossible, and has been impossible for some time already. We need operational automation wherever possible and CI/CD support to the extent possible.

With data centers and clusters of this size, supporting incremental deployment and expansion without service interruption has become quite a pressing task. Clusters of a thousand, maybe up to ten thousand machines could still be rolled out as a single operation: we plan an expansion, and several thousand machines are added in one go. But a cluster of a hundred thousand machines does not appear all at once; it is built up over time. And it is desirable that, all that time, whatever infrastructure has already been deployed remains available.

And one requirement that we used to have and have since dropped: support for multitenancy, that is, virtualization or network segmentation. We no longer need to do this at the network fabric level, because segmentation has moved to the hosts, and this made scaling much easier for us. Thanks to IPv6 and its large address space, we did not need duplicate addresses in the internal infrastructure; all addressing was already unique. And because we moved filtering and network segmentation to the hosts, we do not need to create any virtual network entities in the data center networks.

A very important thing is what we do not need. If some functions can be removed from the network, it greatly simplifies life, usually widens the choice of available hardware and software, and makes diagnostics much easier.

So, what do we not need? What were we able to give up, not always with joy at the moment it happened, but with great relief once the process was completed?

First of all, giving up L2. We do not need L2, either real or emulated. It goes unused largely because we control the application stack. Our applications scale horizontally and work with L3 addressing; they do not particularly care if an individual instance dies, a new one is simply rolled out, and it does not have to come up at the old address, because there is a separate layer of service discovery and monitoring of the machines in the cluster. We do not push this task onto the network. The job of the network is to deliver packets from point A to point B.

We also do not have situations where addresses move around the network and have to be tracked. In many designs this is needed to support VM mobility. We do not use virtual machine mobility in the internal infrastructure of big Yandex, and besides, we believe that even when it is done, it should not happen with network support. If it really has to be done, it should be done at the host level, with migratable addresses pushed into overlays, so as not to touch the routing system of the underlay (the transport network) or make too many dynamic changes to it.

Another technology we do not use is multicast. I can tell you why, if you want. Not having it greatly simplifies life: anyone who has dealt with it and looked closely at what the multicast control plane is like knows that in anything but the simplest installations it is a big headache. And on top of that, it is hard to find a well-working open source implementation, for example.

Finally, we design our networks so that they don't change too much. We can count on the fact that the flow of external events in the routing system is small.

What problems arise and what constraints have to be taken into account when we design a data center network? Cost, of course. Scalability: how far we want to grow. The need to expand without interrupting service. Bandwidth, availability. Visibility into what is happening in the network, for monitoring systems and for the operations teams. Support for automation, again as much as possible, since different tasks can be solved at different levels, including by introducing additional layers. And, where possible, independence from vendors. Although in different periods, depending on which slice you look at, that independence was easier or harder to achieve. If we take the slice of network device chips, then until recently talking about vendor independence was rather notional if we also wanted high-bandwidth chips.

What logical topology will we build our network on? A multi-level Clos. There are really no practical alternatives at the moment. And the Clos topology is good enough, even compared to various advanced topologies that for now are mostly of academic interest, provided we have switches with a large radix.

Roughly how does a multi-level Clos network work, and what are its elements called? First, a compass rose to get oriented: where north, south, east and west are. Networks of this type are usually built by those with very heavy east-west traffic. As for the rest of the elements: at the top is a virtual switch assembled from smaller switches. This is the basic idea of recursively building Clos networks: we take elements with some radix and connect them so that the result can be treated as a switch with a larger radix. If even more is needed, the procedure can be repeated.

When, as in a two-level Clos for example, vertical components can be clearly distinguished in my diagram, they are usually called planes. If we built a Clos with three levels of spine switches (everything that is not a border device and not a ToR switch, used only for transit), the planes would look more complicated; with two levels they look exactly like this. A block of ToR (or leaf) switches together with their associated first-level spine switches we call a Pod. The spine-1 switches at the top of a Pod are the Top of Pod. The switches at the very top of the whole fabric are the top layer of the fabric, the Top of Fabric.
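
To make the naming concrete, here is a minimal sketch of how a Pod's size works out when the fabric is assembled from identical radix-r switches. The 50/50 split of ports between downlinks and uplinks and the function name are illustrative assumptions, not the actual Yandex parameters.

```python
# Illustrative sketch: Pod sizing in a folded Clos built from radix-r switches.
# Assumes every switch splits its ports 50/50 between "down" and "up";
# real designs may differ (oversubscription, number of planes, etc.).

def pod_size(radix: int) -> dict:
    down = radix // 2                      # ports facing down at each switch
    tors_per_pod = down                    # limited by spine-1 downlinks
    spine1_per_pod = down                  # limited by ToR uplinks
    server_ports_per_pod = tors_per_pod * down
    return {
        "tors_per_pod": tors_per_pod,
        "spine1_per_pod": spine1_per_pod,
        "server_ports_per_pod": server_ports_per_pod,
    }

print(pod_size(128))
# {'tors_per_pod': 64, 'spine1_per_pod': 64, 'server_ports_per_pod': 4096}
```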

Of course, the question arises: Clos networks have been built for a while now, and the idea itself goes back to classical telephony and TDM networks. Maybe something better has appeared, maybe something better can be done? Yes and no: theoretically yes, in practice, in the near future, definitely not. There are a number of interesting topologies, and some are even used in production; for example, Dragonfly is used in HPC. There are also interesting topologies like Xpander, FatClique and Jellyfish. If you look at recent papers at conferences like SIGCOMM or NSDI, you can find quite a few works on alternative topologies with properties that are better (in one way or another) than Clos.

But all these topologies have one interesting property that prevents their adoption in data center networks, which we are trying to build from commodity hardware at quite reasonable cost. In all of these alternative topologies, most of the bandwidth is unfortunately not reachable over shortest paths. We therefore immediately lose the ability to use a traditional control plane.

Theoretically, solutions to this problem are known, for example link state modifications using k-shortest paths, but, again, there are no such protocols implemented in production and widely available on equipment.

Moreover, since most of the capacity is not reachable over shortest paths, modifying the control plane to select all those paths is not enough (and, by the way, that means much more state in the control plane). We also need to modify the forwarding plane, and usually at least two additional features are required. One is the ability to make the entire packet forwarding decision in one place, for example on the host; this is essentially source routing, sometimes called all-at-once forwarding decisions in the interconnection network literature. The other is adaptive routing, a function we need on the network elements, which boils down, for example, to choosing the next hop based on the lowest queue occupancy. Other options are possible as well.
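
As a rough illustration of the adaptive routing idea just mentioned (choosing a next hop by queue occupancy), here is a minimal sketch. In reality this logic lives in the switch ASIC; the port names and queue units are made up for the example.

```python
# Toy model of adaptive routing: among equally valid next hops, pick the one
# whose egress queue is currently the least loaded. Purely illustrative.

def pick_next_hop(queue_depth_by_port: dict) -> str:
    """queue_depth_by_port maps egress port name -> current queue depth."""
    return min(queue_depth_by_port, key=queue_depth_by_port.get)

queues = {"eth1": 120, "eth2": 35, "eth3": 410}   # hypothetical snapshot
print(pick_next_hop(queues))                       # -> "eth2"
```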

Thus, the direction is interesting, but, alas, we cannot apply it right now.

Okay, we have settled on the Clos logical topology. How are we going to scale it? Let's see how it works and what can be done.

In a Clos network there are two main parameters we can vary to get different results: the radix of the elements and the number of levels in the network. Here is a schematic of how both affect the size. Ideally, we combine the two.

You can see that the total width of the Clos network is the product of the southern (downward) radix over all levels of spine switches: how many links go down at each level and how the tree branches. This is how we scale the size of the network.
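
A worked version of this statement, assuming for illustration that half of each switch's ports face down at every spine level; the helper name is made up.

```python
# Width of a Clos fabric = product of the "southern" (downward) radix over all
# spine levels. Assumes half the ports face down at every level; if the
# Top-of-Fabric switches point all their ports down, the top factor doubles.

from math import prod

def fabric_width(south_radix_per_level: list) -> int:
    return prod(south_radix_per_level)

# Two spine levels of radix-128 switches, 64 downlinks each:
print(fabric_width([64, 64]))      # 4096 ToR / leaf switches
# Three spine levels of radix-32 switches, 16 downlinks each:
print(fabric_width([16, 16, 16]))  # 4096 ToR / leaf switches
```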

As for capacity, especially at the ToR switches, there are two options for scaling: either keep the overall topology and use faster links, or add more planes.

If you look at the expanded version of the Clos network (in the lower right corner) and go back to this picture with the Clos network at the bottom ...

… then this is exactly the same topology, only drawn more compactly on this slide, with the fabric planes superimposed on each other. It is the same thing.

What does Clos network scaling look like in numbers? Here I have data on the maximum width a network can reach, that is, the maximum number of racks, ToR switches, or leaf switches if they are not in racks, depending on the radix of the switches we use at the spine levels and on how many levels we use.

Here is how many racks we can have, how many servers, and approximately how much this all can consume at the rate of 20 kW per rack. A little earlier, I mentioned that we are aiming for a cluster size of about 100 thousand servers.

In this whole construction, two and a half options are of interest. There is the option with two levels of spines and 64-port switches, which falls a bit short. Then there are options that fit perfectly: 128-port (radix 128) spine switches with two levels, or radix-32 switches with three levels. In all the cases with a larger radix and more levels you can build a very large network, but if you look at the expected power consumption, it is typically in the gigawatts. Laying the cable is possible, but we are unlikely to get that much electricity at a single site. If you look at the statistics, at public data on data centers, you will find very few data centers with a design capacity above 150 MW. Anything larger is usually a data center campus: several large data centers located close to each other.
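
A back-of-the-envelope version of this comparison. The 20 kW per rack figure comes from the talk; the servers-per-rack number, the assumption that every level points half its ports down, and the function name are illustrative, so the exact values will differ from the slide.

```python
# Rough cluster size and power estimate for a non-oversubscribed Clos.
# ASSUMPTIONS (not from the slide): 25 servers per rack, one rack per ToR,
# half of each switch's ports face down at every spine level.

def cluster_estimate(radix, spine_levels, servers_per_rack=25, kw_per_rack=20.0):
    down = radix // 2
    max_racks = down ** spine_levels          # one ToR switch per rack
    servers = max_racks * servers_per_rack
    power_mw = max_racks * kw_per_rack / 1000
    return max_racks, servers, power_mw

for radix, levels in [(64, 2), (128, 2), (32, 3)]:
    racks, servers, mw = cluster_estimate(radix, levels)
    print(f"radix {radix:3d}, {levels} spine levels: "
          f"{racks:4d} racks, ~{servers} servers, ~{mw:.0f} MW")
```

With these assumptions the radix-128 two-level and radix-32 three-level variants land near the 100-thousand-server target at well under 150 MW, while the 64-port two-level variant falls short, matching the qualitative picture above.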

There is another important parameter. If you look at the left column, usable bandwidth is listed there. It is easy to see that in a Clos network a significant share of the ports goes into connecting switches to each other. Usable bandwidth is what can actually be handed out towards the servers. Naturally, I am talking about notional ports and specifically about bandwidth: links inside the network are usually faster than the links towards servers, but for every unit of bandwidth we can hand out to our server equipment there is some additional bandwidth inside the network itself. And the more levels we build, the higher the unit cost of delivering that bandwidth to the servers.

Moreover, this additional bandwidth is not all the same. As long as the spans are short, we can use something like DAC (direct attach copper, i.e. twinax cables) or multimode optics, which still cost more or less reasonable money. As soon as we move to longer spans, it is generally single mode optics, and the cost of that additional bandwidth rises noticeably.

And again, returning to the previous slide: if we build a Clos network without oversubscription, it is easy to see from the diagram how the network is constructed. With every additional level of spine switches we replicate the entire bandwidth that was below it. One more level means the same bandwidth all over again: the same number of switch ports as at the previous level, the same number of transceivers. So it is highly desirable to minimize the number of spine levels.
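
A quick count behind this point, under the assumption of no oversubscription; the number of server-facing ports below is a placeholder, not a real figure.

```python
# In a non-oversubscribed Clos, every spine level adds as many links as there
# are server-facing ports below it, and each link consumes two switch ports
# (and two transceivers, unless it is a DAC cable).

def fabric_internal_ports(server_ports: int, spine_levels: int) -> int:
    links_per_level = server_ports
    return 2 * links_per_level * spine_levels

for levels in (1, 2, 3):
    print(f"{levels} spine level(s): "
          f"{fabric_internal_ports(100_000, levels):,} fabric-internal ports "
          f"for 100,000 server-facing ports")
```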

Based on this picture, it is clear that we really want to build on something like switches with a 128 radix.

Here, in essence, everything is the same as what I have just said; this slide is more for studying later.

What are our options for such switches? The very pleasant news for us is that such networks can now finally be built on single-chip switches. And that is great: they have a lot of nice properties. For example, they have almost no internal structure, which means they fail more simply. They do fail, there is no avoiding that, but fortunately they fail completely. Modular devices have a large class of very unpleasant failure modes where, from the point of view of the neighbors and the control plane, the device appears to work, but, say, part of its internal fabric is gone and it is not running at full capacity. Meanwhile traffic is balanced towards it on the assumption that it is fully functional, and we can end up with an overload.

Or, for example, there are problems with the backplane: inside a modular device there are also high-speed SerDes, and it is genuinely complicated in there. Or the tables between the forwarding elements do or do not stay in sync. In general, any high-capacity modular device built from a large number of elements usually contains the same Clos network inside itself, only one that is very hard to diagnose. Often it is hard even for the vendor itself to diagnose.

And a modular device has many failure scenarios in which it degrades but does not drop out of the topology completely. Since our network is large, balancing across identical elements is used heavily, and the network is very regular (one healthy path is no different from another), it is better for us to simply lose some devices from the topology than to end up in a situation where some of them only appear to be working.

The next nice property of single-chip devices is that they evolve better and faster. They also tend to have better density: compared with the large assembled units we have around, the capacity per rack unit for ports of the same speed is almost twice that of modular devices. Devices built around a single chip are also noticeably cheaper than modular ones and consume less power.

But of course there are drawbacks too. First, the radix is almost always smaller than that of modular devices. While a device built around a single chip gives us 128 ports, a modular device with several hundred ports can be had today without much trouble.

They also have noticeably smaller forwarding tables and, generally, everything related to data plane scalability, plus shallow buffers and, as a rule, fairly limited functionality. But it turns out that if you know these limitations and take care in advance to work around them, or simply account for them, it is not so scary. The smaller radix is no longer a problem: with the radix-128 devices that have recently appeared we can build with two levels of spines, and with fewer than two levels it is impossible to reach the sizes we are interested in anyway. One level yields very small clusters; even our previous designs and requirements already exceeded them.

In fact, if the solution ends up right on the edge, there is another way to scale. Since the lowest level, where servers connect, consists of ToR or leaf switches, we are not obliged to connect exactly one rack to each of them. So if the design falls short by roughly a factor of two, you can consider simply using a higher-radix switch at the bottom level and connecting, say, two or three racks to one switch. This is also an option; it has its own costs, but it works quite well and can be a good solution when you need to roughly double the size.

To sum up, we are building a two-level topology with eight fabric planes.

What about the physical layer? Very simple calculations: with two levels of spines we have three levels of switches in total, so we expect three cable segments in the network: from servers to leaf switches, then to spine-1, then to spine-2. The options available to us are twinax, multimode and single mode. Here we need to weigh what bandwidth is available, how much it will cost, what the physical dimensions are, what spans we can cover, and how we will upgrade.

Everything can be lined up by cost. Twinax is significantly cheaper than active optics, cheaper than multimode transceivers; taken per end-to-end span, it is somewhat cheaper than a 100-gigabit switch port. And, note, it costs less than single mode optics, because on spans where single mode is required it makes sense in data centers to use CWDM for a number of reasons: parallel single mode (PSM) is not very convenient to work with, since it results in very large fiber bundles. If we settle on these technologies, we end up with roughly this price hierarchy.

One more note: unfortunately, breaking 100G ports out into 4x25G does not really work for multimode. Due to transceiver design, SFP28 is not much cheaper than 100G QSFP28, so this breakout does not pay off for multimode.

Another constraint: because of the size of the computing clusters and the number of servers, our data centers end up physically large. That means at least one span will have to be single mode. And again, because of the physical size of the Pods, two twinax (copper) spans are not feasible.

As a result, if we optimize for price and take into account the geometry of this design, we get one twinax span, one multimode span and one singlemode span using CWDM. This takes into account possible upgrade paths.

This is roughly what it looks like: what has happened recently, where we are heading, and what is possible. It is at least clear how to move towards 50-gigabit SerDes for both multimode and single mode. Moreover, if you look at what is in single mode transceivers now and what is planned for 400G, then often, even when 50G SerDes are used on the electrical side, 100 Gbps per lane can go onto the optics. So it is quite possible that instead of moving to 50G there will be a jump straight to 100 Gbps SerDes and 100 Gbps per lane, because many vendors promise they will be available quite soon. The period when 50G SerDes are the fastest does not look like it will last long: the first samples of 100G SerDes are rolling out almost next year, and some time after that they will probably cost reasonable money.

Another nuance in the choice of physical layer. In principle, we could already use 400- or 200-gigabit ports based on 50G SerDes. But it turns out this does not make much sense because, as I said earlier, we want a fairly large radix on the switches, within reason, of course: we want 128. And if chip capacity is fixed and we increase the link speed, the radix naturally decreases; there are no miracles.

We can increase total capacity by adding planes without much extra cost, but if we lose radix we would have to introduce an extra level. So, with the current state of affairs and the current maximum capacity available per chip, it turns out to be more efficient to use 100-gigabit ports, because they give a larger radix.
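
The arithmetic behind this trade-off, assuming for illustration a chip capacity of 12.8 Tb/s (typical of the chip generation being discussed, but not a statement about the specific silicon used):

```python
# Fixed chip capacity: faster ports directly shrink the radix.
CHIP_CAPACITY_GBPS = 12_800          # illustrative assumption

for port_speed_gbps in (100, 200, 400):
    radix = CHIP_CAPACITY_GBPS // port_speed_gbps
    print(f"{port_speed_gbps}G ports -> radix {radix}")
# 100G -> 128 ports, 200G -> 64 ports, 400G -> 32 ports
```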

The next question is how the physical layer is organized in terms of cable infrastructure. It turns out to be organized in a rather amusing way. Cabling between the leaf switches and the first-level spines is relatively simple; there are not that many links there. But if you take a single plane, inside it you have to connect every first-level spine with every second-level spine.

On top of that there are usually some wishes about how this should look inside the data center. For example, we really wanted to combine cables into bundles and route them so that one high-density patch panel goes entirely into another patch panel, with no zoo of cable lengths. We managed to solve this problem. Looking at the logical topology first, you can see that the planes are independent and each plane can be built on its own. But when we add such bundling and want to carry an entire patch panel into a patch panel, we have to mix different planes within one bundle and introduce an intermediate stage in the form of optical cross-connects, to repack the fibers from the way they are assembled on one segment into the way they will be assembled on the next. This gives us a nice property: all the complicated switching stays within the racks. When something needs to be rearranged heavily, when you need to "unfold the planes", as it is sometimes called in Clos networks, it is all concentrated inside a single rack. We do not have fine-grained switching between racks, down to individual links.

This is what it looks like in terms of the logical organization of the cable infrastructure. In the picture on the left, the multi-colored blocks depict blocks of first-level spine switches, eight per block, and the four cable bundles coming out of them, which cross the bundles coming from the blocks of spine-2 switches.

The small squares represent intersections. At the top left is a blow-up of one such intersection: it is actually a 512-by-512 cross-connect module that repacks the cables so that they arrive entirely into one rack, which holds a single spine-2 plane. On the right is a somewhat more detailed blow-up of the same picture for several Pods at the spine-1 level: how it is packed into a cross-connect and how it arrives at the spine-2 level.

Here is what it looks like in real life: a not yet fully assembled spine-2 rack (left) and a cross-connect rack. Unfortunately there is not much to see there yet. All of this is being rolled out right now in one of our large data centers that is being expanded. It is a work in progress; it will look prettier and be more fully populated.

An important question: we have chosen a logical topology and built the physical layer, so what happens with the control plane? It is fairly well known from operating experience, and there are reports on this, that link state protocols are good and pleasant to work with, but unfortunately they do not scale well on a densely connected topology. There is one main factor preventing it: how flooding works in link state protocols. If you just take the flooding algorithm and look at how our network is structured, it becomes clear that there is a very large fanout at every step, and it will simply swamp the control plane with updates. Such topologies mix very poorly with the traditional flooding algorithm of link state protocols.
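
A rough way to see the scale of the problem: with naive flooding, every switch re-sends an update on all of its links except the one it arrived on, so a single link-state update costs on the order of 2E - N transmissions network-wide. The topology size below is an illustrative placeholder, not the real fabric.

```python
# Order-of-magnitude cost of flooding one link-state update with naive
# re-flooding (each node forwards on every link except the incoming one).

def naive_flood_transmissions(num_links: int, num_nodes: int) -> int:
    return 2 * num_links - num_nodes

# Hypothetical dense fabric: ~10,000 switches and ~500,000 links.
print(naive_flood_transmissions(500_000, 10_000))   # ~990,000 transmissions
# ...and every flapping link generates fresh updates from both of its ends.
```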

The choice is to use BGP. How to cook it properly is described in RFC 7938, on the use of BGP in large data centers. The basic ideas are simple: a minimum of prefixes per host and a minimum of prefixes in the network overall, aggregation where possible, and suppression of path hunting. We want a very tidy, very controlled propagation of updates, what is called valley-free: updates should unfold exactly once as they traverse the network. If they originate at the bottom, they go up, changing direction at most once. There should be no zigzags; zigzags are very bad.

To achieve this we use a fairly simple design that relies on BGP's basic mechanisms: eBGP running over link-local addresses, with autonomous systems assigned as follows: one autonomous system per ToR, one autonomous system for the whole block of spine-1 switches of a Pod, and one common autonomous system for the entire Top of Fabric. It is easy to check that even the ordinary behavior of BGP then gives us exactly the update propagation we want.
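
A small sketch of this ASN scheme in the spirit of RFC 7938: one private ASN per ToR, one per Pod's spine-1 block, one shared ASN for the Top of Fabric. The specific numeric ranges and helper names are illustrative assumptions, not the production numbering plan.

```python
# Illustrative ASN allocation for eBGP-over-link-local in the fabric.
# Ranges are drawn from the 32-bit private ASN space (4200000000+); the exact
# numbering plan here is made up for the example.

TOF_ASN = 4200000000            # one ASN shared by all Top-of-Fabric switches
POD_ASN_BASE = 4200001000       # one ASN per Pod (its whole spine-1 block)
TOR_ASN_BASE = 4200100000       # one ASN per ToR switch

def pod_asn(pod_id: int) -> int:
    return POD_ASN_BASE + pod_id

def tor_asn(pod_id: int, tor_id: int, tors_per_pod: int = 64) -> int:
    return TOR_ASN_BASE + pod_id * tors_per_pod + tor_id

# A ToR in Pod 3 peers over link-local eBGP with its Pod's spine-1 block,
# which in turn peers with the Top of Fabric:
print(f"ToR AS{tor_asn(3, 17)} <-eBGP-> Pod AS{pod_asn(3)} <-eBGP-> ToF AS{TOF_ASN}")
```

With this assignment, a route that has already passed through the Top of Fabric ASN cannot re-enter it because of ordinary AS_PATH loop prevention, which rules out the zigzags mentioned above.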

Naturally, addressing and address aggregation have to be designed to be compatible with the way routing is built, so that the control plane stays stable. The L3 addressing in the transport layer is tied to the topology, because otherwise aggregation is impossible and individual addresses creep into the routing system. One more thing: aggregation, unfortunately, does not mix well with multipath. With multipath and aggregation everything is fine as long as the whole network is healthy and there are no failures. Unfortunately, as soon as failures appear and the symmetry of the topology is broken, traffic can arrive at the node that announced the aggregate and find it impossible to go any further towards the actual destination. Therefore it is best to aggregate where there is no further multipath; in our case, that is the ToR switches.
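
A tiny illustration of "aggregate where there is no further multipath", using Python's ipaddress module and a documentation prefix; the /64-per-ToR layout here is an assumption made for the example, not the production addressing plan.

```python
# A ToR can announce one covering prefix for all of its servers, because a
# server is reachable through exactly one ToR (there is no multipath below it).

import ipaddress

tor_prefix = ipaddress.ip_network("2001:db8:100:1::/64")       # hypothetical ToR prefix
hosts = [tor_prefix.network_address + i for i in range(1, 4)]  # a few server addresses

for h in hosts:
    assert h in tor_prefix       # every host falls inside the aggregate

print(f"ToR announces {tor_prefix} instead of {len(hosts)}+ individual host routes")
```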

Actually, you can aggregate higher up, but carefully, if you can perform controlled disaggregation when network failures occur. This is a rather difficult task; we even wondered whether it could be done by adding extra automation and state machines that would correctly nudge BGP into the desired behavior. Unfortunately, handling the corner cases is very non-obvious and complicated, and this problem is not well solved by bolting external machinery onto BGP.

Very interesting work in this regard has been done within the framework of the RIFT protocol, which will be discussed in the next report.

Another important aspect is how the data plane scales in dense topologies with a large number of alternative paths. Several additional data structures are involved: ECMP groups, which in turn reference groups of Next Hops.

In a healthy network with no failures, when we go up the Clos topology it is enough to use a single group: everything non-local is covered by the default route, so we can just go up. When we go from top to bottom, southward, the paths are not ECMP at all; they are single paths. Everything is fine. The trouble, and the peculiarity of the classical Clos topology, is that if you look at the Top of Fabric, any element there has exactly one path to any given element below. If a failure occurs along that path, that particular element at the top of the fabric becomes invalid precisely for the prefixes that sit behind the broken path, while remaining valid for the rest, and then we have to split ECMP groups apart and introduce new state.

What does data plane scalability look like on modern devices? For LPM (longest prefix match) everything is good enough: over 100k prefixes. For Next Hop groups things are worse: 2-4 thousand. For the table that holds the Next Hop descriptions (or adjacencies), it is somewhere between 16k and 64k. And that can become a problem. Here we come to an interesting digression: what happened with MPLS in data centers? Basically, we wanted to use it.

Two things happened. First, we moved micro-segmentation to the hosts, so we no longer need to do it in the network. Second, support from different vendors was patchy, and open implementations of MPLS on white boxes even more so. And MPLS, at least in its traditional implementations, unfortunately does not combine well with ECMP. Here is why.

This is what the ECMP forwarding structure for IP looks like. A large number of prefixes can share the same group and the same Next Hops block (or adjacencies; the naming varies across documentation and devices). Each entry describes the outgoing port and what to rewrite the MAC address to in order to reach the correct next hop. For IP everything is simple: a very large number of prefixes can point to the same group and the same Next Hops block.

The classical MPLS architecture implies that, depending on the outgoing interface, the label can be rewritten to different values. Therefore we have to keep a separate group and a separate Next Hops block for every incoming label. And that, alas, does not scale.

It is easy to see that in our design we would need about 4000 ToR switches, and the maximum ECMP width is 64 paths when going from spine-1 towards spine-2. We barely squeeze into the ECMP group table, right at the limit, and only if a single prefix per ToR is advertised, and we do not fit into the Next Hops table at all.
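
The arithmetic behind this, using the talk's numbers (about 4000 ToR prefixes or labels, up to 64-way ECMP) and the table sizes quoted earlier (2-4k groups, 16-64k Next Hop entries). The exact table sizes vary per chip, so this is only an order-of-magnitude check.

```python
# Why per-label next-hop state does not fit, while shared IP state does.

TOR_PREFIXES = 4000            # roughly the number of ToR switches / labels
ECMP_WIDTH = 64                # max ECMP fanout from spine-1 towards spine-2
MAX_ECMP_GROUPS = 4000         # optimistic end of the 2-4k range
MAX_NEXT_HOP_ENTRIES = 64000   # optimistic end of the 16-64k range

# IP: prefixes with the same set of next hops share one group and one
# Next Hops block, so the "northbound default" needs only a handful of entries.
ip_next_hops = ECMP_WIDTH

# Classical MPLS: the outgoing label depends on the incoming label, so every
# label needs its own group and its own Next Hops block.
mpls_groups = TOR_PREFIXES
mpls_next_hops = TOR_PREFIXES * ECMP_WIDTH

print("IP   next-hop entries:", ip_next_hops)
print("MPLS groups:", mpls_groups, "(fits only at the limit of", MAX_ECMP_GROUPS, ")")
print("MPLS next-hop entries:", mpls_next_hops,
      "fits in", MAX_NEXT_HOP_ENTRIES, "?", mpls_next_hops <= MAX_NEXT_HOP_ENTRIES)
# 4000 * 64 = 256,000 entries, far beyond a 16-64k table.
```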

Not all is hopeless: architectures like Segment Routing assume global labels, so formally it would be possible to collapse all these Next Hops blocks back together. That requires a wildcard-style operation: take a label and rewrite it to itself, without a specific value. Unfortunately, this is largely absent from the available implementations.

And finally, we need to bring external traffic into the data center. How do we do it? Previously, traffic entered the Clos network from the top: there were border routers connected to every device at the Top of Fabric. That solution works quite well at small and medium sizes. Unfortunately, to feed traffic symmetrically into the whole network this way, you have to connect to all Top of Fabric elements at once, and when there are more than a hundred of them, you also need a large radix on the edge routers. In general this costs money: edge routers carry more functionality, their ports are more expensive, and the result is not a very elegant design.

Another option is to bring such traffic in from below. It is easy to see that the Clos topology is built so that traffic entering from below, from the ToR side, is spread evenly across the entire Top of Fabric within two hops, loading the whole network. So we introduce a special type of Pod, the Edge Pod, which provides external connectivity.

There is one more option; Facebook, for example, does this. They call it Fabric Aggregator, or HGRID: an additional spine level is introduced to connect multiple data centers. Such a design is possible if there are no extra functions or encapsulation changes at the junction points; if there are, these become additional touch points, and it gets difficult. Usually there are more functions there, and a kind of membrane separating different parts of the data center. It is not worth making such a membrane large, and if it really is needed for some reason, it is worth considering taking it out of the network, making it as wide as possible and moving it to the hosts. This is what many cloud operators do, for example: they have overlays, and they start at the hosts.

What avenues for further development do we see? First of all, better support for the CI/CD pipeline. We want to fly the way we test and test the way we fly. This does not work out very well yet, because the infrastructure is large and impossible to duplicate for testing. We need to understand how to introduce test elements into a running infrastructure without bringing it down.

Better instrumentation and better monitoring are almost never superfluous; it is all a question of the balance between effort and return. If it can be added with reasonable effort, great.

Open operating systems for network devices. Better protocols and better routing systems like RIFT. Research is also needed on the application of better congestion control schemes and perhaps the introduction, at least at some points, of RDMA support within the cluster.

Looking further into the future, we need advanced topologies and possibly networks with lower overheads. Among recent developments, there have been publications on Cray Slingshot, an HPC fabric technology based on commodity Ethernet but with the option of much shorter headers, which reduces the overhead.

Everything should be made as simple as possible, but not simpler. Complexity is the enemy of scalability. Simplicity and regular structures are our friends. If you can scale out somewhere, do it. And in general, it's great to be engaged in network technologies now. A lot of interesting things are happening. Thank you.

Source: habr.com
