How AWS "cooks" its elastic services. Network scaling

The Amazon Web Services network spans 69 Availability Zones in 22 regions around the world: the United States, Europe, Asia, Africa, and Australia. Each zone contains up to 8 data centers, and each data center houses thousands or even hundreds of thousands of servers. The network is built so that even low-probability outage scenarios are taken into account. For example, all regions are isolated from each other, and Availability Zones are separated by distances of several kilometers. Even if a cable is cut, the system switches to backup channels and only a handful of data packets are lost. Vasily Pantyukhin explains what other principles the network is built on and how it works.

How AWS "cooks" its elastic services. Network scaling

Vasily Pantyukhin started as a Unix administrator at .ru companies, spent 6 years working on big iron at Sun Microsystems, and for 11 years preached the data-centricity of the world at EMC. This naturally evolved into private clouds, and then into public ones. Now, as an Amazon Web Services architect, he uses technical advice to help people live and grow in the AWS cloud.

In the previous part of the trilogy about AWS design, Vasily dug into the design of physical servers and database scaling. Nitro cards, a custom KVM-based hypervisor, the Amazon Aurora database - all of that is covered in the article "How AWS "cooks" its elastic services. Server and database scaling". Read it to immerse yourself in the context, or watch the video recording of the talk.

In this part we will talk about network scaling, one of the most complex systems in AWS: the evolution from a flat network to the Virtual Private Cloud and its design, the internal Blackfoot and HyperPlane services, the noisy neighbor problem, and at the end the scale of the network, the backbone, and the physical cables. All of this under the cut.

Disclaimer: Everything below is Vasily's personal opinion and may not reflect the position of Amazon Web Services.

Network scaling

The AWS cloud was launched in 2006. Its network was quite primitive, with a flat structure. The private address range was shared by all cloud tenants. When you started a new virtual machine, you were randomly assigned an available IP address from this range.

How AWS "cooks" its elastic services. Network scaling

This approach was easy to implement, but it fundamentally limited how the cloud could be used. In particular, it was quite difficult to build hybrid solutions that combined on-premises private networks with networks in AWS. The most common problem was overlapping IP address ranges.

How AWS "cooks" its elastic services. Network scaling

Virtual Private Cloud

The cloud turned out to be in demand. It was time to think about scalability and about serving tens of millions of tenants. The flat network became a major hurdle. So we started thinking about how to isolate users from each other at the network level so that each of them could choose IP ranges independently.

How AWS "cooks" its elastic services. Network scaling

What's the first thing that comes to mind when you think about network isolation? VLANs and VRF (Virtual Routing and Forwarding), of course.

Unfortunately, that didn't work. A VLAN ID is only 12 bits, which gives us just 4096 isolated segments. Even the largest switches support at most 1-2 thousand VRFs. Combining VRF and VLAN gives us only a few million subnets. That is definitely not enough for tens of millions of tenants, each of whom should be able to use several subnets.
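
A quick back-of-the-envelope check of that limit (the VRF count below is just the rough figure quoted above, not a spec of any particular switch):

```python
# Rough capacity estimate for the classic isolation primitives.
vlan_id_bits = 12
vlans = 2 ** vlan_id_bits          # 4096 possible VLAN IDs
vrfs_per_switch = 2_000            # optimistic ballpark for a very large switch

isolated_segments = vlans * vrfs_per_switch
print(f"{isolated_segments:,} segments")          # ~8,192,000 - "a few million"

tenants = 10_000_000               # tens of millions of tenants
subnets_per_tenant = 4             # each tenant wants several subnets
print(isolated_segments >= tenants * subnets_per_tenant)  # False - not enough
```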

We also simply could not afford to buy the required number of big boxes from, say, Cisco or Juniper. There are two reasons: it is insanely expensive, and we did not want to depend on their development and patching policies.

There was only one conclusion: build our own solution.

In 2009 we announced VPC - the Virtual Private Cloud. The name stuck, and now many cloud providers use it too.

VPC is a virtual SDN (Software Defined Network). We decided not to invent special protocols at the L2 and L3 layers. The network runs on standard Ethernet and IP. For transmission over the network, virtual machine traffic is encapsulated in a wrapper of our own protocol. Among other things, the wrapper carries the ID of the tenant's VPC.
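
The actual AWS wire format is proprietary; purely as an illustration of the idea (the header layout below is made up), encapsulation and decapsulation boil down to something like this:

```python
import struct

def encapsulate(vpc_id: int, inner_frame: bytes) -> bytes:
    """Wrap a tenant's L2 frame in a hypothetical VPC header.

    The point is only that the wrapper carries the tenant's VPC ID, while the
    outer IP header (not shown) uses the physical addresses of the source and
    destination hosts.
    """
    header = struct.pack("!I", vpc_id)   # hypothetical 4-byte VPC ID field
    return header + inner_frame

def decapsulate(packet: bytes) -> tuple[int, bytes]:
    (vpc_id,) = struct.unpack("!I", packet[:4])
    return vpc_id, packet[4:]

wire = encapsulate(vpc_id=42, inner_frame=b"...original Ethernet frame...")
assert decapsulate(wire) == (42, b"...original Ethernet frame...")
```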

How AWS "cooks" its elastic services. Network scaling

Sounds simple. However, several serious technical problems have to be solved. For example, where and how to store the mapping data for virtual MAC/IP addresses, VPC IDs, and the corresponding physical MACs and IPs. At AWS scale this is a huge table that must respond with minimal latency. A mapping service is responsible for this; it is spread in a thin layer across the entire network.
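
Conceptually, the mapping service answers one question: given a VPC ID and a virtual address, on which physical host does it live? A toy model of that lookup (the table contents and field names are illustrative only, not the real service):

```python
# Toy model of the mapping service: (vpc_id, virtual address) -> physical host.
# The real service is a large distributed system; this only shows the lookup shape.
MAPPING = {
    ("vpc-blue", "10.0.0.2"): {"phys_ip": "192.168.0.3", "virt_mac": "0a:00:00:00:00:02"},
    ("vpc-blue", "10.0.0.3"): {"phys_ip": "192.168.1.4", "virt_mac": "0a:00:00:00:00:03"},
}

def resolve(vpc_id: str, virtual_ip: str):
    """Return the physical placement of a virtual address, or None if unknown."""
    return MAPPING.get((vpc_id, virtual_ip))

print(resolve("vpc-blue", "10.0.0.3"))  # -> {'phys_ip': '192.168.1.4', ...}
```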

In new-generation machines, encapsulation is performed by Nitro cards in hardware. In older instances, encapsulation and decapsulation are done in software.

How AWS "cooks" its elastic services. Network scaling

Let's see how this works in general terms, starting at the L2 level. Suppose we have a virtual machine with IP 10.0.0.2 on physical server 192.168.0.3. It sends data to the virtual machine 10.0.0.3, which lives on 192.168.1.4. An ARP request is generated and sent to the Nitro network card. For simplicity, we assume that both virtual machines live in the same "blue" VPC.

How AWS "cooks" its elastic services. Network scaling

The card replaces the source address with its own and forwards the ARP frame to the mapping service.

How AWS "cooks" its elastic services. Network scaling

The mapping service returns information that is necessary for transmission over the physical L2 network.

How AWS "cooks" its elastic services. Network scaling

In the ARP response, the Nitro card replaces the MAC of the physical network with an address in the VPC network.

How AWS "cooks" its elastic services. Network scaling

When transferring data, we wrap the logical MAC and IP in the VPC wrapper. All of this is transmitted over the physical network using the IP addresses of the source and destination Nitro cards.

How AWS "cooks" its elastic services. Network scaling

The physical machine that the packet is destined for performs a validation check. This is needed to prevent address spoofing. The machine sends a special request to the mapping service and asks: "From physical machine 192.168.0.3 I received a packet destined for 10.0.0.3 in the blue VPC. Is it legitimate?"

How AWS "cooks" its elastic services. Network scaling

The mapping service checks its resource allocation table and either allows or drops the packet. In all new instances this additional validation is built into the Nitro cards. It cannot be bypassed even in theory, so spoofing resources in another VPC simply will not work.
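
The check on the receiving side can be modelled as one more lookup into the same kind of table: does the sending physical host really own the claimed source address in that VPC? A toy continuation of the sketch above:

```python
# Continuing the toy model: validate the claimed source before delivery.
OWNERSHIP = {("vpc-blue", "10.0.0.2"): "192.168.0.3",
             ("vpc-blue", "10.0.0.3"): "192.168.1.4"}

def is_legitimate(vpc_id: str, src_virtual_ip: str, sender_phys_ip: str) -> bool:
    """Accept the packet only if the mapping data confirms that the sending
    physical host really owns this virtual source address in this VPC."""
    return OWNERSHIP.get((vpc_id, src_virtual_ip)) == sender_phys_ip

print(is_legitimate("vpc-blue", "10.0.0.2", "192.168.0.3"))  # True  -> deliver
print(is_legitimate("vpc-blue", "10.0.0.2", "192.168.9.9"))  # False -> drop as spoofing
```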

How AWS "cooks" its elastic services. Network scaling

The data is then delivered to the virtual machine for which it is intended.

How AWS "cooks" its elastic services. Network scaling

The mapping service also works as a logical router for transferring data between virtual machines on different subnets. Conceptually everything is simple there, so I will not go into the details.

How AWS "cooks" its elastic services. Network scaling

It turns out that the servers query the mapping service for every packet they transmit. How do we deal with the inevitable delays? With caching, of course.

The beauty is that there is no need to cache the entire huge table. A physical server hosts virtual machines from a relatively small number of VPCs, so it only needs to cache information about those VPCs. Transferring data to other VPCs in the "default" configuration is still not legitimate. If functionality such as VPC peering is used, information about the corresponding VPCs is additionally loaded into the cache.
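
A rough sketch of such a per-host cache (the query_mapping_service callback stands in for the real service and is purely hypothetical):

```python
class MappingCache:
    """Toy per-host cache: keeps mapping entries only for VPCs hosted locally
    (and, if peering is configured, their peers). Misses fall through to the
    central mapping service."""

    def __init__(self, local_vpcs: set, query_mapping_service):
        self.allowed_vpcs = set(local_vpcs)
        self.query = query_mapping_service
        self.entries = {}                    # (vpc_id, virt_ip) -> phys_ip

    def lookup(self, vpc_id: str, virtual_ip: str):
        if vpc_id not in self.allowed_vpcs:
            return None                      # VPC not present on this host: refuse
        key = (vpc_id, virtual_ip)
        if key not in self.entries:          # cache miss -> ask the service once
            self.entries[key] = self.query(vpc_id, virtual_ip)
        return self.entries[key]

cache = MappingCache({"vpc-blue"}, lambda vpc, ip: "192.168.1.4")
print(cache.lookup("vpc-blue", "10.0.0.3"))   # resolved via the service, now cached
print(cache.lookup("vpc-green", "10.1.0.7"))  # None: no such VPC on this host
```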

How AWS "cooks" its elastic services. Network scaling

That covers data transfer within a VPC.

Blackfoot

What do we do when traffic needs to go outside, for example to the Internet or over a VPN to an on-premises network? Here the internal AWS service Blackfoot comes to the rescue. It was designed by our South African team, which is why the service is named after a penguin species that lives in South Africa.

How AWS "cooks" its elastic services. Network scaling

Blackfoot decapsulates the traffic and does whatever needs to be done with it. Data bound for the Internet is sent as is.

How AWS "cooks" its elastic services. Network scaling

When a VPN is used, the data is decapsulated and wrapped again in IPsec.

How AWS "cooks" its elastic services. Network scaling

When using Direct Connect, traffic is tagged and forwarded to the appropriate VLAN.

How AWS "cooks" its elastic services. Network scaling

HyperPlane

This is an internal flow-control service. Many network services need to track the state of data flows. For example, when NAT is used, flow control must ensure that each "IP:destination port" pair gets a unique outgoing port. In the case of the NLB (Network Load Balancer), a data flow must always be directed to the same target virtual machine. Security Groups form a stateful firewall: it monitors incoming traffic and implicitly opens ports for outgoing packets.
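
To make the idea of flow state concrete, here is a toy NAT-style flow table that hands each new flow a unique outgoing port and reuses it for subsequent packets. This only illustrates the bookkeeping involved, not how HyperPlane itself is implemented:

```python
import itertools

class FlowTable:
    """Toy stateful flow table: every new flow gets its own outgoing port,
    and repeat packets of the same flow reuse it - the kind of state that
    NAT, NLB and Security Groups all need to track."""

    def __init__(self):
        self._ports = itertools.count(start=1024)   # naive allocator, no reuse
        self._flows = {}

    def outgoing_port(self, src_ip: str, src_port: int,
                      dst_ip: str, dst_port: int) -> int:
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self._flows:                  # first packet of the flow
            self._flows[key] = next(self._ports)
        return self._flows[key]                     # same flow -> same port

nat = FlowTable()
print(nat.outgoing_port("10.0.0.2", 53211, "93.184.216.34", 443))  # 1024
print(nat.outgoing_port("10.0.0.2", 53211, "93.184.216.34", 443))  # 1024 again
print(nat.outgoing_port("10.0.0.5", 41000, "93.184.216.34", 443))  # 1025
```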

How AWS "cooks" its elastic services. Network scaling

Latency requirements in the AWS cloud are extremely strict, which is why HyperPlane is critical to the health of the entire network.

How AWS "cooks" its elastic services. Network scaling

HyperPlane is built on EC2 virtual machines. There is no magic here, only cunning: these are virtual machines with a lot of RAM. Operations are transactional and are performed entirely in memory, which makes latencies of just tens of microseconds achievable. Going to disk would kill the performance.

HyperPlane is a distributed system made up of a huge number of such EC2 machines. Each virtual machine has a bandwidth of 5 GB/s. Across an entire regional network this adds up to insane terabits of bandwidth and allows millions of connections per second to be processed.
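
For a feel of the arithmetic, here is a back-of-the-envelope calculation; the per-node figure comes from the talk, while the node count is purely hypothetical:

```python
# Back-of-the-envelope scale estimate. The per-node bandwidth is from the talk;
# the node count is a made-up illustration, not a real AWS figure.
per_node_gbytes_per_s = 5
hypothetical_nodes = 2_000

aggregate_gbytes_per_s = per_node_gbytes_per_s * hypothetical_nodes
aggregate_tbits_per_s = aggregate_gbytes_per_s * 8 / 1000
print(f"{aggregate_tbits_per_s:.0f} Tbit/s")   # 80 Tbit/s - "insane terabits"
```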

HyperPlane works only with flows. VPC packet encapsulation is completely transparent to it. Even a potential vulnerability in this internal service would not allow VPC isolation to be broken; the layers below are responsible for security.

Noisy neighbor

There is still the noisy neighbor problem. Let's say we have 8 nodes that process the flows of all cloud users. It seems that everything is fine and the load should be spread evenly across all nodes: the nodes are very powerful and hard to overload.

But we build our architecture based on even unlikely scenarios. 

Low probability does not mean impossible.

Imagine a situation in which one or several users generate too much load. All HyperPlane nodes are involved in handling this load, and other users may experience some performance degradation. This breaks the whole concept of the cloud, in which tenants have no way to affect each other.

How AWS "cooks" its elastic services. Network scaling

How do we solve the noisy neighbor problem? The first thing that comes to mind is sharding: our 8 nodes are logically divided into 4 shards of 2 nodes each. Now a noisy neighbor will interfere with only a quarter of all users, but it will interfere with them badly.

How AWS "cooks" its elastic services. Network scaling

Let's do it differently. We will allocate only 3 nodes for each user. 

How AWS "cooks" its elastic services. Network scaling

The trick is to assign nodes to different users randomly. In the picture below, the blue user shares nodes with one of the other two users, green and orange.

How AWS "cooks" its elastic services. Network scaling

With 8 nodes and 3 nodes per user, the probability of overlapping with a noisy neighbor on exactly one node is 54%. It is with this probability that the blue user will affect another tenant, and even then only with part of its load. In our example this influence will be at least somewhat noticeable not to everyone, but only to a third of all users. That is already a good result.

Number of overlapping nodes | Probability
0 | 18%
1 | 54%
2 | 26%
3 | 2%

Let's bring the situation closer to reality: take 100 nodes and 5 nodes per user. In this case the node sets of two users do not overlap at all with a probability of 77%.

Number of overlapping nodes | Probability
0 | 77%
1 | 21%
2 | 1.8%
3 | 0.06%
4 | 0.0006%
5 | 0.0000013%

In a real deployment, with a huge number of HyperPlane nodes and users, the potential impact of a noisy neighbor on other users is minimal. This technique is called shuffle sharding. It also minimizes the negative effect of a node failure.
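
The numbers in both tables can be reproduced as a hypergeometric distribution: draw two random node sets of the same size and ask how many nodes they share. A short sketch of both the random assignment and the probabilities:

```python
import random
from math import comb

# Assigning nodes to a user is just a random draw from the pool.
nodes = [f"node-{i}" for i in range(8)]
user_nodes = random.sample(nodes, k=3)

def overlap_distribution(total_nodes: int, nodes_per_user: int):
    """P(two users, each given a random set of nodes_per_user out of
    total_nodes nodes, share exactly k nodes) - a hypergeometric distribution."""
    denom = comb(total_nodes, nodes_per_user)
    return {
        k: comb(nodes_per_user, k)
           * comb(total_nodes - nodes_per_user, nodes_per_user - k) / denom
        for k in range(nodes_per_user + 1)
    }

for k, p in overlap_distribution(8, 3).items():     # first table: 8 nodes, 3 per user
    print(f"{k} shared nodes: {p:.1%}")              # 17.9%, 53.6%, 26.8%, 1.8%

for k, p in overlap_distribution(100, 5).items():   # second table: 100 nodes, 5 per user
    print(f"{k} shared nodes: {p:.7%}")              # ~77%, 21%, 1.8%, 0.06%, ...
```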

Many services are built on HyperPlane: Network Load Balancer, NAT Gateway, Amazon EFS, AWS PrivateLink, AWS Transit Gateway.

Network scale

Now let's talk about the scale of the network itself. As of October 2019, AWS offers services in 22 regions, with 9 more planned.

  • Each region contains several Availability Zones (AZs). There are 69 of them worldwide.
  • Each AZ consists of data centers, no more than 8 per zone.
  • A data center houses a huge number of servers, in some cases up to 300,000.

Now average it all out, multiply, and you get an impressive figure that reflects the scale of the Amazon cloud.

There are many optical links between Availability Zones and data centers. In one of our largest regions, 388 links have been laid for communication between the AZs themselves and with the Transit Centers that connect to other regions. Altogether this gives a staggering 5000 Tbps.

How AWS "cooks" its elastic services. Network scaling

The AWS backbone is built specifically for the cloud and optimized for it. We build it on 100 Gbit/s links. We have full control over them, with the exception of the regions in China. The traffic is not shared with other companies' workloads.

How AWS "cooks" its elastic services. Network scaling

Of course, we are not the only cloud provider with a private backbone network. More and more large companies are following this path, as independent researchers such as TeleGeography confirm.

How AWS "cooks" its elastic services. Network scaling

The graph shows that the share of content and cloud providers is growing, and because of this the share of Internet traffic carried by backbone providers is constantly shrinking.

Let me explain why this is happening. In the past, most web services were accessed and consumed directly over the Internet. Now more and more servers sit in the cloud and are reached through a CDN (Content Distribution Network). To reach a resource, the user traverses the public Internet only as far as the nearest CDN PoP (Point of Presence), which is usually somewhere nearby. From there the traffic leaves the public Internet and crosses, say, the Atlantic over a private backbone, arriving directly at the resource.

I wonder how the Internet will change in 10 years if this trend continues?

Physical channels

Scientists have not yet figured out how to increase the speed of light in the Universe, but they have made great progress in transmitting it through optical fiber. We currently use cables with 6912 fibers, which significantly optimizes the cost of laying them.

In some regions, we have to use special cables. For example, in the Sydney region, we use cables with a special anti-termite coating. 

How AWS "cooks" its elastic services. Network scaling

No one is immune from trouble, and sometimes our channels get damaged. The photo on the right shows optical cables in one of the American regions that were torn up by construction workers. As a result of the accident, only 13 data packets were lost, which is amazing. Once again: only 13! The system switched to backup channels practically instantly; that is what scale does.

We have galloped through some of the Amazon cloud services and technologies. I hope you now have at least some idea of the scale of the tasks our engineers have to solve. Personally, I find it fascinating.

This is the final part of Vasily Pantyukhin's trilogy about how AWS is built. The first part covers server optimization and database scaling, and the second covers serverless functions and Firecracker.

At HighLoad++ in November, Vasily Pantyukhin will share new details about how Amazon works: he will talk about the causes of failures and the design of distributed systems at Amazon. You can still book a ticket at a good price until October 24 and pay later. We are waiting for you at HighLoad++, come and chat!

Source: habr.com