The story of one switch

We had six pairs of Arista DCS-7050CX3-32S switches and one pair of Brocade VDX 6940-36Q switches in our LAN aggregation. It's not that the Brocade switches bothered us much: they worked and did their job. But we were preparing to fully automate certain operations, and these switches lacked the features we needed. We also wanted to move from 40GE interfaces to 100GE to build in headroom for the next two to three years. So we decided to replace the Brocade switches with Arista.

Each data center has a pair of these LAN aggregation switches. Distribution switches (the second level of aggregation) connect directly to them and, in turn, aggregate the Top-of-Rack access switches installed in the racks with servers.

Each server connects to one or two access switches. The access switches are connected to a pair of distribution switches (two distribution switches and two physical links from each access switch, going to different distribution switches, provide redundancy).

Each server can belong to a different client, so every client is allocated a separate VLAN. The same VLAN is then assigned to any other server of that client, in whichever rack it lands. The data center consists of several such rows of racks (PODs), and each row has its own pair of distribution switches, which in turn connect to the aggregation switches.

Clients can order a server in any row; there is no way to predict in advance which row or rack a given server will be allocated in, which is why there are about 2,500 VLANs on the aggregation switches in each data center.

Equipment for DCI (Data Center Interconnect) is also connected to the aggregation switches. It provides either L2 connectivity (a pair of switches forming a VXLAN tunnel to another data center) or L3 connectivity (two MPLS routers).

As I already wrote, in order to unify the automation of service configuration on the equipment in one data center, the central aggregation switches had to be replaced. We installed the new switches next to the existing ones, joined them into an MLAG pair, and began preparations. They were immediately connected to the existing aggregation switches so that they shared a common L2 domain across all client VLANs.

Topology details

To be specific, let's call the old aggregation switches A1 and A2, and the new ones N1 and N2. Imagine that client C1 has servers hosted in POD 1 and POD 4; this client's VLAN is shown in blue. The client uses the L2 connectivity service to another data center, so its VLAN is delivered to a pair of VXLAN switches.

Client C2 hosts servers in POD 2 and POD 3; its VLAN is shown in dark green. This client also uses the inter-data-center connectivity service, but at L3, so its VLAN is delivered to a pair of L3VPN routers.

We need these client VLANs to understand what happens at each stage of the replacement, where a communication break occurs, and how long it can last. STP is not used in this scheme: the tree would be too wide, and the protocol's convergence time grows rapidly with the number of devices and links between them.

All devices connected by double links form a stack, an MLAG pair, or a VCS Ethernet fabric. For the pair of L3VPN routers no such technology is used: there is no need for L2 redundancy there, and it is enough that they have L2 connectivity with each other through the aggregation switches.
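
For reference, here is a minimal sketch of what joining two switches into an MLAG pair looks like in Arista EOS syntax. All names, VLAN numbers, and addresses are illustrative, not our production configuration; the second peer is configured symmetrically:

! MLAG peering sketch for one of the two peers (values are illustrative)
vlan 4094
   trunk group mlag-peer
interface Port-Channel10
   description MLAG peer link
   switchport mode trunk
   switchport trunk group mlag-peer
interface Vlan4094
   ip address 10.255.255.1/30
mlag configuration
   domain-id AGG
   local-interface Vlan4094
   peer-address 10.255.255.2
   peer-link Port-Channel10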

Implementation Options

When we analyzed our options, we realized there were several ways to carry out the work: from a global outage of the entire local network down to small breaks of literally 1-2 seconds in parts of the network.

Network, stop! Switches, change!

The easiest way is, of course, to declare a global downtime for all PODs and all DCI services and move all links from the A switches to the N switches.

Apart from the outage itself, whose duration we cannot predict with any certainty (yes, we know the number of links, but we don't know how many times something will go wrong, from a broken patch cord or a damaged connector to a port or transceiver failure), we also cannot know in advance whether the patch cords, DACs, and AOCs connected to the old A switches are long enough to reach the new N switches, which stand next to them but still slightly to the side, or whether the transceivers/DACs/AOCs taken from the Brocade switches will work in the Arista switches.

And all this under severe pressure from customers and technical support ("Natasha, get up! Natasha, nothing works! Natasha, we've already written to technical support, honestly! Natasha, they've already dropped everything! Natasha, how much longer will it be down? Natasha, when will it work?!"). Even with a pre-announced maintenance window and client notifications, a flood of tickets at such a time is guaranteed.

Stop, 1-2-3-4!

What if, instead of one global outage, we announce a series of small communication breaks per POD and per DCI service? During the first break we move only POD 1 to the N switches; during the second, a couple of days later, POD 2; then POD 3, then POD 4 and so on; then the VXLAN switches, and finally the L3VPN routers.

With this approach we reduce the complexity of each individual operation and buy ourselves time to solve problems if something suddenly goes wrong: POD 1 connectivity is no longer at risk while we move the other PODs and the DCI equipment. But the work stretches out over a long time; an engineer must be physically present in the data center to perform the switching, and during the work (usually carried out at night, between 2 and 5 am) a fairly highly qualified network engineer must also be online. In return we get short interruptions: as a rule, the work fits into a half-hour window with a break of up to 2 minutes (in practice, often 20-30 seconds if the equipment behaves as expected).

Clients like C1 or C2 in our example would have to be warned about maintenance with a communication break at least three times: first for the work on one POD hosting one of their servers, then for the second POD, and a third time when the DCI equipment is moved.

Switching aggregated links

Why do we talk about "expected behavior of the equipment", and how can aggregated links be switched over while minimizing the interruption? Let's imagine the following picture:

On one side of the link are the POD distribution switches D1 and D2, which form an MLAG pair with each other (stack, VCS fabric, vPC pair). On the other side, two links, Link 1 and Link 2, terminate on the MLAG pair of old aggregation switches A. On the D switches the aggregated interface is named Port channel A; on the aggregation switches A it is named Port channel D.

The aggregated interfaces use LACP: the switches on both sides regularly exchange LACPDU packets over both links to make sure that the links:

  • are operational;
  • terminate on the same pair of devices on the remote side.

Each LACPDU carries a system-id value identifying the device where the links terminate. For an MLAG pair (stack, fabric, etc.) the system-id advertised by the devices forming the aggregated interface is the same: switch D1 sends system-id D on Link 1, and switch D2 sends system-id D on Link 2.

Switches A1 and A2 analyze the LACPDUs received on the Po D interface and check that the system-id in them matches. If the system-id received over some link suddenly differs from the current operating value, that link is removed from the aggregated interface until the situation is corrected. So now on the D side the current system-id learned from the LACP partner is A, and on the A side the current system-id learned from the LACP partner is D.
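
As a sketch, the member links of such an aggregate on the D side could be configured like this in Arista-style syntax (interface and channel-group numbers are illustrative); the system-id advertised by the partner on each link can then be inspected with show lacp neighbor:

! member links of the aggregate towards the A switches (illustrative numbers)
interface Ethernet1
   description Link 1 to A1
   channel-group 100 mode active   ! LACP active mode
interface Ethernet2
   description Link 2 to A2
   channel-group 100 mode active
! verify: show lacp neighbor lists the partner system-id seen on each link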

If we need to switch the aggregated interface, we can do it in two different ways:

Method 1 - Simple
Disconnect both links from the A switches. At this point the aggregated channel is down.

Connect both links one by one to the N switches; the LACP parameters will be negotiated anew, the Po D interface will form on the N switches, and system-id N will be transmitted on the links.


Method 2 - Minimizing the Break
Disconnect Link 2 from switch A2. Traffic between A and D simply continues to flow over the remaining link, which stays part of the aggregated interface.

Connect Link 2 to switch N2. The aggregated interface Po DN is already configured on the N switches, and switch N2 starts transmitting system-id N in its LACPDUs. At this stage we can already verify that switch N2 works correctly with the transceiver used for Link 2, that the port has come Up, and that no errors occur on the port while LACPDUs are being transmitted.
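
These checks boil down to a few standard EOS show commands, roughly like this (the interface number is again illustrative):

show interfaces Ethernet2 status            ! the port is Up at the right speed
show interfaces Ethernet2 counters errors   ! no errors while LACPDUs are exchanged
show lacp neighbor                          ! the partner system-id seen on the link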

However, because switch D2 receives on Link 2 a system-id of N, different from the current operating value A for the aggregated interface Po A, the D switches cannot bring Link 2 into the aggregate. The N switches cannot bring Link 2 into service either, because they receive no confirmation of its health from the LACP partner, switch D2. As a result, no traffic passes over Link 2 yet.

Now we disconnect Link 1 from switch A1, depriving switches A and D of their working aggregated interface. With it, the current operating system-id for interface Po A on the D side disappears.

This allows switches D and N to negotiate the new system-id (N) on interfaces Po A and Po DN, and traffic starts flowing over Link 2. In practice the break here is up to 2 seconds.

Now we can calmly move Link 1 to switch N1, restoring the capacity and redundancy of interfaces Po A and Po DN. Since connecting this link does not change the current system-id on either side, there is no break.


Additional links

The switchover can also be done without an engineer present at the moment of switching. To do this we need to lay additional links in advance between the distribution switches D and the new aggregation switches N.

We lay new links between the aggregation switches N and the distribution switches of all PODs. This requires ordering and running additional patch cords and installing additional transceivers both in N and in D. We can do this because the D switches in each POD have free ports (or we free them up beforehand). As a result, each POD is physically connected by two links to the old A switches and by two links to the new N switches.

On the D switches two aggregated interfaces are now formed: Po A with links Link 1 and Link 2, and Po N with links Link N1 and Link N2. At this stage we verify that the interfaces and links are connected correctly and check the optical signal levels at both ends of the links (using DDM information from the switches); we can even test a link under load or watch optical levels and transceiver temperatures for a couple of days.
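
These checks, too, rely on standard show commands; on Arista EOS, for example:

show interfaces transceiver   ! DDM data: Tx/Rx optical power, temperature
show port-channel             ! which links ended up in Po A and Po N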

Traffic still passes through interface Po A, while interface Po N sits idle. The interface settings look something like this:

interface Port-Channel A
   switchport mode trunk
   switchport trunk allowed vlan C1,C2

interface Port-Channel N
   switchport mode trunk
   switchport trunk allowed vlan none

The D switches, as a rule, support session-based configuration; we deliberately use switch models that have this functionality. So we can change the settings of the Po A and Po N interfaces in one transaction:

configure session
interface Port-Channel A
   switchport trunk allowed vlan none
interface Port-Channel N
   switchport trunk allowed vlan C1,C2
commit

The configuration change then happens quickly enough, and in practice the break is no more than 5 seconds.
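
A useful property of EOS configure sessions is that the pending change can be reviewed and, if necessary, discarded before commit, which fits neatly into a rollback plan. A sketch (the session name is illustrative):

configure session migration
   ! ...the interface changes shown above...
show session-config diffs   ! review the pending change against the running config
abort                       ! or discard the session instead of committing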

This method lets us do all the preparatory work in advance, run all the necessary checks, coordinate the work with everyone involved, plan the actions in detail without flights of creativity when "everything goes wrong", and keep a rollback plan at hand. The work is carried out by a network engineer, without the presence on site of a data center engineer to physically move cables.

Even more importantly, with this method all the new links are monitored in advance. Errors, aggregate membership of the links, link utilization: all the necessary information is already in the monitoring system and already drawn on the maps.

D-Day

POD

We chose the switching path that was least painful for clients and least prone to "something went wrong" scenarios: the one with additional links. Over a couple of nights we moved all PODs to the new aggregation switches.

It remained to switch over the equipment providing the DCI services.

L2

For the equipment providing L2 connectivity, we could not carry out similar work with additional links, for at least two reasons:

  • Lack of free ports of the required speed on VXLAN switches.
  • Lack of session configuration change functionality on VXLAN switches.

We also did not move the links "one at a time" with a break only for the renegotiation of the system-id, because we were not 100% sure the procedure would go cleanly: a lab test showed that if "something goes wrong", we would still get a communication break, and at worst not only for clients with L2 connectivity to other data centers, but for all clients of the data center.

We had run a campaign to migrate customers away from L2 channels ahead of time, so the number of customers affected by work on the VXLAN switches was already several times lower than a year before. In the end we decided to accept a break on the L2 connectivity service, provided that the local network services within the data center kept working normally. Besides, the SLA for this service allows scheduled maintenance with an interruption.

L3

Why did we recommend that everyone switch to L3VPN when organizing DCI services? One of the reasons is the ability to carry out work on one of the routers that provide this service, simply reducing the redundancy level to N+0, without interrupting communication.

Let's take a closer look at the service delivery scheme. In this service the L2 segment stretches from the client's servers only as far as Selectel's L3VPN routers; the client network is terminated on the routers.

Each client server, such as S2 and S3 in the diagram, has its own private IP address: 10.0.0.2/24 on server S2 and 10.0.0.3/24 on server S3. The addresses 10.0.0.252/24 and 10.0.0.253/24 are assigned by Selectel to routers L3VPN-1 and L3VPN-2, respectively. The IP address 10.0.0.254/24 is the VRRP VIP address on the Selectel routers.

You can read more about the L3VPN service in our blog.

Before the switchover, everything looked roughly like in the diagram:

The two routers L3VPN-1 and L3VPN-2 were connected to the old aggregation switches A. The VRRP master for VIP address 10.0.0.254 is router L3VPN-1: it has a higher priority for this address than router L3VPN-2.

unit 1006 {
    description C2;
    vlan-id 1006;
    family inet {
        address 10.0.0.252/24 {
            vrrp-group 1 {
                priority 200;   /* higher than on L3VPN-2, so this router is master */
                virtual-address 10.0.0.254;
                preempt {
                    hold-time 120;
                }
                accept-data;
            }
        }
    }
}

Server S2 uses the gateway 10.0.0.254 to communicate with servers in other locations. So disconnecting router L3VPN-2 from the network (after first withdrawing it from the MPLS domain, of course) does not affect the connectivity of the client's servers; it only reduces the redundancy level of the scheme.
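
Before and after each such step it is worth confirming which router actually holds the VIP. On Junos this is a couple of operational commands (output omitted here):

show vrrp summary   # which router is currently master for the group
show vrrp detail    # per-group priorities, preempt settings, VIP ownership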

After that we can safely reconnect router L3VPN-2 to the pair of N switches: lay the links, change the transceivers. The router's logical interfaces on which client services depend stay disabled until we confirm that everything functions as it should.

After checking the links, transceivers, signal levels, and error counters on the interfaces, the router is brought back into service, now connected to the new pair of switches.

Next, we lower the VRRP priority of router L3VPN-1, and the VIP address 10.0.0.254 moves to router L3VPN-2. This is also done without interrupting communication.
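
In Junos set-command form this step is a one-liner, sketched here under the assumption that unit 1006 from the config above lives on an interface we will call ae1 (an illustrative name), and that L3VPN-2 runs with preempt enabled and a priority above the new value:

# lower the master's priority below the backup's so the VIP moves over
set interfaces ae1 unit 1006 family inet address 10.0.0.252/24 vrrp-group 1 priority 50
commit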

Moving the VIP address 10.0.0.254 to router L3VPN-2 lets us take router L3VPN-1 offline without interrupting the client's traffic and connect it to the new pair of aggregation switches N.

Whether to return the VRRP VIP to router L3VPN-1 afterwards is a separate question; if we do, that too is done without interrupting the connection.

Summary

After all these steps we had effectively replaced the aggregation switches in one of our data centers while minimizing interruptions for our customers.

Then all that remains is the dismantling: removing the old switches, removing the old links between switches A and D, pulling the transceivers from those links, updating monitoring, and correcting the network diagrams in the documentation and the monitoring system.

The switches, transceivers, patch cords, AOCs, and DACs left over after the switchover can be reused in other projects or for other similar migrations.

"Natasha, we switched everything!"

Source: habr.com
