Network-as-a-Service for a large enterprise: a non-standard case

How do you upgrade network equipment at a large enterprise without stopping production? Linxdatacenter project manager Oleg Fedorov describes a large-scale project carried out in "open-heart surgery" mode.

Over the past few years, we have seen growing customer demand for services related to the network component of IT infrastructure. The need to connect IT systems, services, and applications, together with monitoring and operational business management tasks in almost every industry, is forcing companies to pay closer attention to their networks.

Requests range from providing network fault tolerance to creating and managing a customer's autonomous system: acquiring a block of IP addresses, configuring routing protocols, and managing traffic according to the organization's policies.

There is also growing demand for integrated solutions for building and maintaining network infrastructure, primarily from customers whose network is either being built from scratch or is obsolete and needs serious rework.

This trend coincided with a period of growth and increasing complexity in Linxdatacenter's own network infrastructure. We expanded our presence in Europe by connecting to remote sites, which in turn required improvements to the network infrastructure.

The company launched a new service for customers, Network-as-a-Service: we take on all network tasks for customers so that they can focus on their core business.

In the summer of 2020 we completed the first big project in this area, which is what I would like to talk about.

At the start 

A large industrial complex approached us to modernize the network part of the infrastructure at one of its enterprises. The old equipment, including the network core, had to be replaced with new equipment.

The equipment at the enterprise was last modernized about 10 years ago. The new management decided to improve connectivity, starting with infrastructure upgrades at the most basic, physical level.

The project was divided into two parts: upgrading the server fleet and upgrading the network equipment. We were responsible for the second part.

The basic requirements included minimizing downtime of the enterprise's production lines during the work, and in some areas eliminating it completely. Any stoppage is a direct financial loss for the client and could not be allowed under any circumstances. Because the facility operates 24x7x365 and has never had periods of planned downtime, our task was, in effect, to perform open-heart surgery. This became the main distinguishing feature of the project.

Go

The work was planned to move from the network nodes farthest from the core toward those closest to it, and from areas with little impact on production toward those that affect it directly.

For example, a communication failure during work on a network node in the sales department does not affect production at all. At the same time, such an incident helps us, as the contractor, to verify that the chosen approach to such nodes is correct and to adjust our actions before the next stages of the project.

It is not enough to replace the nodes and wires in the network; all components must also be configured correctly for the solution to work as a whole. It was the configurations that we checked in this way: by starting work far from the core, we gave ourselves a "right to make a mistake" without putting the areas critical to the enterprise at risk.
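
The ordering principle described above can be illustrated with a minimal sketch. The cabinet names, hop counts, and the `Cabinet` type are hypothetical and only show one way such a work order could be derived; they are not taken from the actual project plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cabinet:
    name: str                  # hypothetical cabinet label
    hops_from_core: int        # assumed metric: switch hops between cabinet and core
    affects_production: bool   # True if an outage here stops a production line


def plan_order(cabinets):
    """Non-production cabinets first; within each group, farthest from the core first."""
    return sorted(cabinets, key=lambda c: (c.affects_production, -c.hops_from_core))


# Illustrative data only.
cabinets = [
    Cabinet("workshop-1", 2, True),
    Cabinet("sales", 3, False),
    Cabinet("warehouse", 3, True),
    Cabinet("office-floor-2", 1, False),
]

order = [c.name for c in plan_order(cabinets)]
print(order)
```

With this ordering, mistakes in the configuration surface first in places like the sales department, where a short outage costs nothing on the production side.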

We identified the areas that do not affect the production process, as well as the critical ones: workshops, the loading and unloading unit, warehouses, and so on. For the key areas, we agreed with the client on the allowable downtime for each network node separately: from 1 to 15 minutes. It was impossible to avoid disconnecting individual network nodes entirely, since each cable has to be physically moved from the old equipment to the new. In the process, we also had to untangle the "beard" of wires that had formed over several years of operation without proper care (one consequence of outsourcing the cable installation work).

The work was divided into several stages.

Step 1 – Audit. Preparing and agreeing on the approach to work planning, and assessing the readiness of the teams: the client, the contractor performing the installation, and our team.

Step 2 – Developing the format for the work, with detailed analysis and planning. We chose a checklist format that specifies the exact order of actions, down to the sequence in which patch cords are moved from port to port.

Step 3 – Carrying out work in cabinets that do not affect production. Estimation and adjustment of downtime for subsequent stages of work.

Step 4 – Carrying out work in cabinets that directly affect production. Estimation and adjustment of downtime for the final stage of work.

Step 5 – Carrying out work in the server room to switch the remaining equipment. Bringing up routing on the new core.

Step 6 – Sequential switching of the system core from the old network configurations to the new ones for a smooth transition of the whole system (VLANs, routing, etc.). At this stage, we connected all users and moved all services to the new hardware, checked that everything was connected correctly, and made sure that none of the enterprise's services had stopped. This ensured that any problems that did arise would be tied directly to the core, which simplified troubleshooting and final tuning.

Grooming the wire "beard"

The project was also made harder by the difficult starting conditions.

Firstly, there was a huge number of nodes and network sections, with an intricate topology and wires classified by purpose. These "beards" had to be pulled out of the cabinets and painstakingly "combed" to work out which wire came from where and where it led.

[Photos: the cable "beards" in the cabinets before the work]

Secondly, for each such task we had to prepare a file describing the process: "We take wire X from port 1 of the old equipment and plug it into port 18 of the new equipment." It sounds simple, but when the starting point is 48 fully occupied ports and downtime is not an option (remember the 24x7x365 mode), the only way out is to work in blocks. The more wires you can pull from the old equipment at once, the faster you can comb them and plug them into the new network hardware while avoiding network failures and downtime.

Therefore, at the preparatory stage, we split the network into blocks, each belonging to a specific VLAN. Each port (or group of ports) on the old equipment corresponds to one of the VLANs in the new network topology. We grouped them as follows: the first ports of the switch carried user networks, the middle ones carried production networks, and the last ones carried access points and uplinks.

This approach made it possible to pull out and comb 10-15 wires from the old equipment at a time rather than one, which sped up the work several times over.
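
As a rough illustration of working in blocks, the sketch below groups a port-by-port checklist by VLAN so that each block can be moved in one pass. The `Move` type, the cable labels, and the VLAN names are hypothetical; only the "wire X, port 1 to port 18" entry echoes the example in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    cable: str      # hypothetical cable label
    old_port: int   # port on the old switch
    new_port: int   # port on the new switch
    vlan: str       # block the cable belongs to


def check_plan(moves):
    """Raise ValueError if any old or new port appears twice in the plan."""
    for attr in ("old_port", "new_port"):
        seen = set()
        for m in moves:
            port = getattr(m, attr)
            if port in seen:
                raise ValueError(f"{attr} {port} is used more than once")
            seen.add(port)


def build_blocks(moves):
    """Group moves by VLAN; inside a block, order them by new port number."""
    blocks = defaultdict(list)
    for m in moves:
        blocks[m.vlan].append(m)
    for block in blocks.values():
        block.sort(key=lambda m: m.new_port)
    return dict(blocks)


# Illustrative data only.
moves = [
    Move("X", 1, 18, "users"),
    Move("Y", 7, 2, "users"),
    Move("Z", 30, 25, "production"),
    Move("U", 48, 47, "uplinks"),
]

check_plan(moves)
blocks = build_blocks(moves)
for vlan, block in blocks.items():
    print(f"Block {vlan}:")
    for m in block:
        print(f"  wire {m.cable}: old port {m.old_port} -> new port {m.new_port}")
```

Validating the plan before anyone touches a cable is cheap, while a duplicated port discovered in the cabinet eats directly into the agreed downtime window.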

[Photos: the wires in the cabinets after combing]

After completing the second stage, we took a break to analyze the errors and the progress of the project. For example, minor flaws surfaced immediately because of inaccuracies in the network diagrams we had been given: a wrong connector on the diagram means that the wrong patch cord was purchased and has to be replaced.

The pause was necessary because, when working in the server room, even a small failure in the process was unacceptable. If the goal was downtime of no more than 5 minutes on a network section, that limit could not be exceeded. Any possible deviation from the schedule had to be agreed with the client.

However, planning in advance and splitting the project into blocks made it possible to stay within the planned downtime at all sites and, in most cases, to avoid downtime altogether.

A challenge of the times: a project under COVID

There were additional complications, of course, and the coronavirus was one of them.

The pandemic began during the project, and not all of the specialists involved could be present at the client's site during the work. Only the installers were allowed on site. Control was exercised through a Zoom room that included a network engineer from Linxdatacenter, myself as the project manager, the client's network engineer in charge of the work, and the team doing the installation.

Unforeseen problems arose during the work, and adjustments had to be made on the fly. This made it possible to quickly catch the effects of human error (errors in the diagrams, mistakes in determining whether an interface was active, etc.).

Although the remote format of work seemed unusual at the beginning of the project, we quickly adapted to the new conditions and entered the final stage of work. 

We applied a temporary network configuration that ran the two network cores, old and new, in parallel to achieve a smooth transition. However, one extra line had not been removed from the configuration file of the new core, and the transition did not happen. We had to spend some time looking for the problem.

It turned out that the main traffic was transmitted correctly, but the control traffic did not reach the node through the new core. Thanks to the clear division of the project into stages, we were able to quickly locate the network section where the difficulty arose, identify the problem, and fix it.

And as a result

Technical results of the project 

First of all, we built a new core for the enterprise network, with physical and logical rings around it, so that every switch in the network has a "second arm", that is, a second uplink. In the old network, many switches were connected to the core over a single route, one arm (uplink). If it broke, the switch became completely unreachable. And if several switches were connected through one uplink, a failure took down an entire department or production line at the enterprise.

In the new network, even a fairly serious network incident cannot, under any circumstances, "bring down" the entire network or a significant part of it.
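
The difference between a single uplink and a ring can be checked mechanically. The sketch below is a simplified model, not the project's tooling: it removes each link in turn and reports which switches lose their path to the core. The topologies and switch names are made up for illustration.

```python
from collections import deque


def reachable(links, start):
    """Return the set of nodes reachable from start over undirected links (BFS)."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, core="core"):
    """For each link, list the nodes cut off from the core if that link alone fails."""
    all_nodes = {n for link in links for n in link}
    risky = []
    for i, link in enumerate(links):
        remaining = links[:i] + links[i + 1:]
        lost = all_nodes - reachable(remaining, core)
        if lost:
            risky.append((link, sorted(lost)))
    return risky


# Hypothetical "old" tree: everything hangs off one uplink.
old_net = [("core", "sw1"), ("sw1", "sw2"), ("sw1", "sw3")]
# Hypothetical "new" ring: every switch has a second path to the core.
new_net = [("core", "sw1"), ("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "core")]

print(single_points_of_failure(old_net))
print(single_points_of_failure(new_net))
```

In the tree, every link is a single point of failure; in the ring, no single link failure isolates a switch. The model only covers one failed link at a time and ignores switch failures and loop-prevention protocols.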

90% of all network equipment has been replaced, and the media converters (devices that convert the signal between transmission media) have been decommissioned. Connecting equipment to PoE switches, which supply power over the Ethernet cable, has eliminated the need for dedicated power lines.

In addition, all optical connections in the server room and in the field cabinets, at all key communication nodes, are now labeled. This made it possible to prepare a topology diagram of the equipment and connections that reflects the actual state of the network today.

[Figure: Network diagram]

The most important technical result: fairly large-scale infrastructure work was carried out quickly, without interfering with the operation of the enterprise and almost unnoticed by its staff.

Business results of the project

In my opinion, this project is interesting primarily not from the technical side, but from the organizational side. The difficulty was primarily in planning and thinking through the steps to implement the project tasks. 

The success of the project suggests that developing the network line within the Linxdatacenter service portfolio is the right direction for the company. A responsible approach to project management, a sound strategy, and clear planning allowed us to do the work at the proper level.

The quality of the work was confirmed by the client's request to continue the network modernization services at its other sites in Russia.

Source: habr.com
