Automation for the smallest. Part two. Network design

In the first two articles I raised the question of automation and sketched out its framework; in the second I made a digression into network virtualization as a first approach to automating service configuration.
And now it's time to draw a diagram of the physical network.

If you are not on close terms with how data center networks are organized, then I strongly recommend starting with the articles about them.


The practices described in this series should be applicable to a network of any type, any scale, and any variety of vendors (no). However, it is impossible to describe one universal example of applying these approaches, so I will focus on the modern architecture of a DC network: the Clos fabric.
We will do DCI on MPLS L3VPN.

On top of the physical network runs an overlay network from the host (this can be OpenStack's VXLAN or Tungsten Fabric, or anything else that requires only basic IP connectivity from the network).


In this case, we get a relatively simple scenario for automation, because we have a lot of equipment configured in the same way.

We will take a spherical DC in a vacuum:

  • One design version everywhere.
  • Two vendors forming two network planes.
  • Every DC is identical to every other, like two peas in a pod.

Content

  • Physical topology
  • Routing
  • IP plan
  • Lab
  • Conclusion
  • Useful links

Let our provider, LAN_DC, host, for example, training videos on surviving in a stuck elevator.

In megacities, this is wildly popular, so you need a lot of physical machines.

First, I will describe the network approximately as I would like to see it. And then I'll simplify for the lab.

Physical topology

Locations

LAN_DC will have 6 DCs:

  • Russia (RU):
    • Moscow (tbs)
    • Kazan (kzn)

  • Spain (SP):
    • Barcelona (bcn)
    • Malaga (mlg)

  • China (CN):
    • Shanghai (sha)
    • Xi'an (is)


Inside the DC (Intra-DC)

All DCs have identical internal connectivity networks based on the Clos topology.
What Clos networks are and why exactly them is covered in a separate article.

Each DC has 10 racks of machines; they will be numbered A, B, C and so on.

There are 30 machines in each rack. They will not interest us.

Each rack also has a switch to which all of its machines are connected: the Top of the Rack switch, ToR, or, in Clos fabric terms, Leaf.

[Figure: general scheme of the fabric]

We will name them XXX-leafY, where XXX is the three-letter DC abbreviation and Y is a sequence number. For example, kzn-leaf11.

In my articles I will allow myself to use the terms Leaf and ToR rather loosely as synonyms. However, it should be remembered that this is not quite so.
ToR is a switch installed in a rack that machines connect to.
Leaf is the role of a device in the physical network, or a first-level switch in terms of the Clos topology.
That is, Leaf != ToR.
A Leaf can be an End of Row (EoR) switch, for example.
However, within this article we will still treat them as synonyms.

Each ToR switch is in turn connected to four higher-level aggregation switches: the Spines. One rack in the DC is allocated for the Spines. We will name them XXX-spineY.

The same rack will house the network equipment for connectivity between DCs: two routers with MPLS on board. But by and large, these are the same ToRs. That is, from the point of view of the Spine switches, it makes no difference whether it is an ordinary ToR with machines connected or a router for DCI: it is all the same.

Such special ToRs are called Edge-leaf. We will name them XXX-edgeY.
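
To make the naming scheme concrete, here is a minimal Python sketch; the per-DC counts (10 Leafs, 4 Spines, 2 Edge-Leafs) are taken from the numbers above, and the function itself is just an illustration, not part of any tooling yet:

    # A minimal sketch of the XXX-roleY naming convention described above.
    DCS = ["tbs", "kzn", "bcn", "mlg", "sha", "is"]   # the six DCs from the list above
    ROLES = {"leaf": 10, "spine": 4, "edge": 2}       # per-DC counts from this article

    def device_names(dc: str) -> list[str]:
        """All device hostnames of one DC: kzn-leaf1 ... kzn-leaf10, kzn-spine1 ... kzn-edge2."""
        return [f"{dc}-{role}{n}"
                for role, count in ROLES.items()
                for n in range(1, count + 1)]

    inventory = [name for dc in DCS for name in device_names(dc)]
    print(len(inventory), inventory[:3], inventory[-1])
    # 96 ['tbs-leaf1', 'tbs-leaf2', 'tbs-leaf3'] is-edge2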

The topology will look like this.

[Figure: topology with Edge-Leafs placed at the same level as Leafs]

In the diagram above, I did indeed place Edge and Leaf at the same level. Classic three-tier networks taught us to think of an uplink as a link going up (in fact, that is where the term comes from). And here the DCI "uplink" turns out to go back down, which slightly breaks the usual logic for some. In large networks, where a data center is divided into even smaller units, PODs (Point Of Delivery), separate Edge-PODs are allocated for DCI and for access to external networks.

For ease of perception, going forward I will still draw Edge above Spine, keeping in mind that there is no intelligence on the Spine and that there is no difference between working with an ordinary Leaf and with an Edge-Leaf (although there may be nuances, in general this is true).

[Figure: fabric diagram with Edge-Leafs]

The trinity of Leaf, Spine and Edge forms the Underlay network, or fabric.

The task of the network fabric (read: the Underlay), as we already defined in the previous issue, is very, very simple: to provide IP connectivity between machines both within one DC and between DCs.
That is why the network is called a fabric, just like, for example, the switching fabric inside modular network boxes, which you can read more about in SDSM14.

Incidentally, such a topology is called a fabric because the word fabric also means woven cloth, and it is hard to disagree.

The fabric is entirely L3. No VLANs, no broadcast: that is how wonderful the programmers we have at LAN_DC are; they can write applications that live in the L3 paradigm, and the virtual machines do not require Live Migration with preservation of the IP address.

And once again: the answer to why a fabric and why L3 is in a separate article.

DCI - Data Center Interconnect (Inter-DC)

DCI will be organized with the help of the Edge-Leafs, that is, they are our exit point onto the backbone.
For simplicity, we assume that the DCs are interconnected by direct links.
We exclude external connectivity from consideration.

I am aware that every time I remove a component, I simplify the network considerably. And while everything will be fine when we automate our abstract network, crutches will appear on a real one.
That is true. Still, the purpose of this series is to think about and work out approaches, not to heroically solve fictitious problems.

On the Edge-Leafs, the underlay is placed into a VPN and carried across the MPLS backbone (the same direct link).

This is the top-level scheme we end up with.


Routing

For routing within the DC we will use BGP.
On the MPLS backbone: OSPF+LDP.
For DCI, that is, for organizing connectivity in the underlay: BGP L3VPN over MPLS.

[Figure: general routing scheme]

There is no OSPF or IS-IS (a routing protocol banned in the Russian Federation) in the fabric.

This means there will be no auto-discovery and no shortest-path computation: only manual (actually automatic, we are talking about automation here) configuration of the protocol, neighborships and policies.

[Figure: BGP routing scheme inside the DC]

Why BGP?

There is a whole RFC, written by people from Facebook and Arista, that tells how to build very large data center networks using BGP (RFC 7938). It reads almost like fiction; I highly recommend it for a languid evening.

There is also a whole section of my article devoted to this, and that is where I refer you.

But still, in short: no IGP is suitable for the networks of large data centers, where the number of network devices runs into the thousands.

In addition, using BGP everywhere means not spreading yourself thin supporting several different protocols and keeping them synchronized.

Hand on heart, in our fabric, which most likely will not grow explosively, OSPF would be more than enough. That is really a problem of the hyperscalers and cloud titans. But let's fantasize, for just a few issues, that we need it, and we will use BGP, as Petr Lapukhov bequeathed.

Routing Policies

On the Leaf switches, we import the prefixes of the Underlay network interfaces into BGP.
Between each Leaf-Spine pair there will be a BGP session, in which these Underlay prefixes will be announced back and forth across the network.


Within one data center we will distribute the specifics that we imported on the ToRs. On the Edge-Leafs we will aggregate them and announce them to the remote DCs, and pass the remote DCs' aggregates down to the ToRs. That is, each ToR will know exactly how to reach any other ToR in the same DC and where the entry point is to reach a ToR in another DC.
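
Conceptually, the aggregation on the Edge-Leaf boils down to this: the per-rack specifics stay inside the DC, and only the DC-wide block is announced over DCI. A rough Python sketch of the idea, not router configuration; the /26-per-rack and /19-per-DC sizes and the 10.0.32.0/19 block for Kazan are taken from the IP plan later in this article:

    import ipaddress

    dc_aggregate = ipaddress.ip_network("10.0.32.0/19")              # kzn underlay block
    rack_specifics = list(dc_aggregate.subnets(new_prefix=26))[:10]  # one /26 per rack

    # Every specific must fall inside the aggregate announced over DCI.
    assert all(rack.subnet_of(dc_aggregate) for rack in rack_specifics)

    print("announced to remote DCs:", dc_aggregate)
    print("kept inside the DC:", ", ".join(str(r) for r in rack_specifics))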

Over DCI, routes will be carried as VPNv4. To do this, on the Edge-Leaf the interface facing the fabric is placed into a VRF, let's call it UNDERLAY, so the neighborship with the Spine comes up inside the VRF, while between the Edge-Leafs sessions are established in the VPNv4 family.


We will also prohibit re-announcing routes received from spines back to them.


We will not import loopbacks on the Leaf and Spine switches; we only need them there to determine the Router ID.

On the Edge-Leafs, however, we do import them into global BGP. Between their loopback addresses, the Edge-Leafs will establish BGP sessions with each other in the IPv4 VPN (VPNv4) family.

Between the Edge devices we will stretch a backbone running OSPF+LDP. All in one area. An extremely simple configuration.

Here is a picture with routing.

BGP ASN

Edge-Leaf ASN

The Edge-Leafs will have a single ASN across all DCs. It is important that there is iBGP between the Edge-Leafs so that we do not run into eBGP nuances. Let it be 65535. In reality it could be a public AS number.

Spine ASN

The Spines will have one ASN per DC. Let's start from the very first number in the private AS range: 64512, 64513 and so on.

Why one ASN per DC?

Let's decompose this question into two:

  • Why the same ASN on all spines of one DC?
  • Why are they different in different DCs?

Why the same ASN on all spines of one DC

This is what the AS-Path of an Underlay route on the Edge-Leaf will look like:
[leafX_ASN, spine_ASN, edge_ASN]
If you try to announce it back to a Spine, the Spine will reject it, because its own AS (Spine_AS) is already in the path.

However, within a DC this suits us perfectly: underlay routes that have gone up to the Edge should not be able to come back down. All communication between hosts inside the DC must happen within the spine level.
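
Here is a toy illustration of that check; it is not a BGP implementation, and the concrete ASN values are assumptions that follow the numbering scheme described below:

    def accept_update(my_asn: int, as_path: list[int]) -> bool:
        """Standard eBGP rule: reject the update if our own ASN is already in AS_PATH."""
        return my_asn not in as_path

    SPINE_DC1 = 64513              # e.g. the Spines of one DC
    EDGE = 65535                   # the common ASN of all Edge-Leafs
    LEAF_X = 64513 * 65536 + 1     # 64513.1 - a Leaf ASN per the scheme described below

    # Underlay route as seen on the Edge-Leaf, in this article's notation:
    # [leafX_ASN, spine_ASN, edge_ASN]
    path_on_edge = [LEAF_X, SPINE_DC1, EDGE]

    print(accept_update(SPINE_DC1, path_on_edge))   # False: the Spine rejects it (loop)

    # An aggregate originated by the Edge-Leaf carries only the Edge AS,
    # so devices of another DC (e.g. its Spines, ASN 64514) accept it:
    print(accept_update(64514, [EDGE]))             # True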


At the same time, the aggregated routes of other DCs will reach the ToRs without any problems: their AS-Path will contain only ASN 65535, the AS number of the Edge-Leafs, because that is where they were originated.

Why different ASNs in different DCs

Theoretically, we may need to drag the loopbacks of some service virtual machines between DCs.

For example, a host may run a Route Reflector or that same VNGW (Virtual Network Gateway), which will peer with the ToR over BGP and announce its loopback, which should be reachable from all DCs.

So this is what its AS-Path will look like:
[VNF_ASN, leafX_DC1_ASN, spine_DC1_ASN, edge_ASN, spine_DC2_ASN, leafY_DC2_ASN]

And there should not be repeated ASNs anywhere.


That is, Spine_DC1 and Spine_DC2 must be different, just like leafX_DC1 and leafY_DC2, which is exactly what we are arriving at.

As you probably know, there are hacks that let you accept routes with duplicate ASNs despite the loop-prevention mechanism (allowas-in on Cisco). It even has legitimate uses. But this is a potential hole in the stability of the network, and I personally have fallen into it a couple of times.

And if we have the opportunity not to use dangerous things, we will use it.

Leaf ASN

Each Leaf switch will have an individual ASN across the entire network.
We do this for the reasons given above: an AS-Path without loops, a BGP configuration without workarounds.

For routes between Leafs to pass smoothly, AS-Path should look like this:
[leafX_ASN, spine_ASN, leafY_ASN]
where it would be good for leafX_ASN and leafY_ASN to differ.

This is also required for the situation with the announcement of the VNF loopback between DCs:
[VNF_ASN, leafX_DC1_ASN, spine_DC1_ASN, edge_ASN, spine_DC2_ASN, leafY_DC2_ASN]

We will use 4-byte ASNs and generate them from the Spine ASN and the Leaf switch number, namely like this: Spine_ASN.0000X.
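
In asplain terms, Spine_ASN.0000X is simply spine_asn * 65536 + leaf_number; that is how asdot notation (RFC 5396) maps to a 32-bit value. A minimal sketch of such a generator (the function names are mine):

    def leaf_asn(spine_asn: int, leaf_number: int) -> int:
        """4-byte Leaf ASN: high 16 bits = the DC's Spine ASN, low 16 bits = Leaf number."""
        if not (64512 <= spine_asn <= 65534 and 1 <= leaf_number <= 65535):
            raise ValueError("spine_asn must be a private 2-byte AS, leaf_number 1..65535")
        return (spine_asn << 16) | leaf_number

    def asdot(asn: int) -> str:
        """Render a 4-byte ASN in asdot notation (RFC 5396), e.g. 64513.11."""
        return f"{asn >> 16}.{asn & 0xFFFF}"

    print(asdot(leaf_asn(64513, 11)))   # 64513.11 - e.g. kzn-leaf11, if the kzn Spines use 64513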

Here is the resulting picture with ASNs.

IP plan

Fundamentally, we need to allocate addresses for the following types of connections:

  1. Underlay network addresses between the ToRs and the machines. They must be unique within the entire network so that any machine can talk to any other. 10/8 is a great fit. /26 per rack, with a margin. We will allocate /19 per DC and /17 per region.
  2. Link addresses between Leaf/ToR and Spine.

    I would like to assign them algorithmically, that is, to compute them from the names of the devices being connected (see the sketch right after this list).

    Let it be... 169.254.0.0/16.
    Namely 169.254.00X.Y/31, where X is the Spine number and Y is the number of the P2P /31 network.
    This allows up to 128 racks and up to 10 Spines per DC. Link addresses can (and will) repeat from DC to DC.

  3. The Spine to Edge-Leaf links are organized on subnets 169.254.10X.Y/31, where X is again the Spine number and Y is the number of the P2P /31 network.
  4. Link addresses from the Edge-Leafs to the MPLS backbone. Here the situation is somewhat different: this is the place where all the pieces are joined into one pie, so the same addresses cannot be reused; the next free subnet has to be chosen. Therefore we take 192.168.0.0/16 as the basis and will carve free subnets out of it.
  5. Loopback addresses. We will give them the whole 172.16.0.0/12 range.
    • Leaf: /25 per DC - the same 128 racks. /23 per region.
    • Spine: /28 per DC - up to 16 Spines. /26 per region.
    • Edge-Leaf: /29 per DC - up to 8 boxes. /27 per region.
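
Here is the sketch promised in the list above: computing the P2P /31s from the Spine and Leaf/Edge numbers. The exact octet layout (the third octet carries the Spine number, the last one is twice the neighbour's index) is my interpretation of the 169.254.00X.Y/31 scheme, so treat it as an assumption rather than the final rule:

    import ipaddress

    def leaf_spine_link(spine: int, leaf: int) -> ipaddress.IPv4Network:
        """/31 between Spine number `spine` and Leaf number `leaf`: 169.254.00X.Y/31."""
        if not (1 <= spine <= 10 and 1 <= leaf <= 128):
            raise ValueError("up to 10 Spines and up to 128 racks per DC")
        return ipaddress.ip_network(f"169.254.{spine}.{2 * (leaf - 1)}/31")

    def spine_edge_link(spine: int, edge: int) -> ipaddress.IPv4Network:
        """/31 between Spine number `spine` and Edge-Leaf number `edge`: 169.254.10X.Y/31."""
        return ipaddress.ip_network(f"169.254.{100 + spine}.{2 * (edge - 1)}/31")

    net = leaf_spine_link(spine=2, leaf=11)
    print(net, "->", net[0], "and", net[1])   # 169.254.2.20/31 -> 169.254.2.20 and 169.254.2.21
    print(spine_edge_link(spine=2, edge=1))   # 169.254.102.0/31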

If the allocated ranges in a DC turn out not to be enough (and they will not; we are pretending to be a hyperscaler), we simply take the next block.

Here is a picture with IP addressing.


Loopbacks:

Prefix              Device role   Region   DC
172.16.0.0/23       edge
172.16.0.0/27                     ru
172.16.0.0/29                              tbs
172.16.0.8/29                              kzn
172.16.0.32/27                    sp
172.16.0.32/29                             bcn
172.16.0.40/29                             mlg
172.16.0.64/27                    cn
172.16.0.64/29                             sha
172.16.0.72/29                             is
172.16.2.0/23       spine
172.16.2.0/26                     ru
172.16.2.0/28                              tbs
172.16.2.16/28                             kzn
172.16.2.64/26                    sp
172.16.2.64/28                             bcn
172.16.2.80/28                             mlg
172.16.2.128/26                   cn
172.16.2.128/28                            sha
172.16.2.144/28                            is
172.16.8.0/21       leaf
172.16.8.0/23                     ru
172.16.8.0/25                              tbs
172.16.8.128/25                            kzn
172.16.10.0/23                    sp
172.16.10.0/25                             bcn
172.16.10.128/25                           mlg
172.16.12.0/23                    cn
172.16.12.0/25                             sha
172.16.12.128/25                           is

Underlay:

Prefix          Region   DC
10.0.0.0/17     ru
10.0.0.0/19              tbs
10.0.32.0/19             kzn
10.0.128.0/17   sp
10.0.128.0/19            bcn
10.0.160.0/19            mlg
10.1.0.0/17     cn
10.1.0.0/19              sha
10.1.32.0/19             is

Lab

Two vendors. One network. ADSM.

Juniper + Arista. Ubuntu. Good old EVE-NG.

The amount of resources on our virtual machine at Miran is still limited, so for practice we will use a network simplified to the limit.

[Figure: lab topology]

Two data centers: Kazan and Barcelona.

  • Two Spines in each: Juniper and Arista.
  • One ToR (Leaf) each: Juniper and Arista, with one connected host (we will take a lightweight Cisco IOL for this).
  • One Edge-Leaf node each (only Juniper for now).
  • One Cisco switch to rule them all.
  • In addition to the network boxes, a management virtual machine running Ubuntu is also up.
    It has access to all devices; it will run the IPAM/DCIM systems, a pile of Python scripts, Ansible, and anything else we might need.

The full configuration of all the network devices is what we will try to reproduce with automation.

Conclusion

That is the custom, isn't it? To write a short conclusion under each article?

So, we have chosen a three-level Clos network inside the DC, because we expect a lot of East-West traffic and want ECMP.

We divided the network into physical (underlay) and virtual (overlay). At the same time, the overlay starts from the host - thereby simplifying the requirements for the underlay.

We chose BGP as the routing protocol for its scalability and policy flexibility.

We will have dedicated nodes for organizing DCI: Edge-Leafs.
The backbone will run OSPF+LDP.
DCI will be implemented on the basis of MPLS L3VPN.
For P2P links, we will compute IP addresses algorithmically from the device names.
Loopbacks will be assigned sequentially according to the role of the device and its location.
Underlay prefixes: only on Leaf switches, sequentially, based on their location.

Let's assume that we don't have any equipment installed right now.
Therefore, our next steps will be to get it into the systems (IPAM, inventory), organize access, generate a configuration and deploy it.

In the next article, we will deal with Netbox, an inventory and management system for IP space in a DC.

Thanks

  • Andrey Glazkov aka @glazgoo for proofreading and editing
  • Alexandru Klimenko aka @v00lk for proofreading and editing
  • Artyom Chernobay for the lead image (KDPV)

Source: habr.com
