Automation for the little ones. Part zero. Planning

SDSM is over, but the irrepressible desire to write remains.

For years, the likes of us have suffered doing routine work by hand, crossing our fingers before every commit, and losing sleep over nightly rollbacks.
But the dark times are coming to an end.

With this article I am starting a series about how automation looks to me.
Along the way we will deal with the stages of automation, storing variables, formalizing the design, REST API, NETCONF, YANG, YDK, and we will do a lot of programming.
"To me" means that a) it is not objective truth, b) it is not unconditionally the best approach, c) my opinion may change even on the way from the first article to the last. To be honest, from the draft stage to publication I rewrote everything completely twice.

Content

  1. Goals
    1. The network is like a single organism
    2. Configuration testing
    3. Versioning
    4. Monitoring and self-healing services

  2. Means
    1. Inventory system
    2. IP space management system
    3. Network Service Description System
    4. Device initialization mechanism
    5. Vendor-agnostic configuration model
    6. Vendor-specific interface driver
    7. The mechanism for delivering the configuration to the device
    8. CI / CD
    9. Backup mechanism and deviation search
    10. Monitoring system

  3. Conclusion

I will try to run ADSM in a format slightly different from SDSM. There will still be large, in-depth numbered articles, and in between I will post short notes from everyday experience. I will try to fight my perfectionism here and not polish every one of them to a shine.

It is funny that I have to walk the same road a second time.

Back then I had to write articles about networks myself, because there were none on the Runet.

Now I could not find a comprehensive document that would systematize approaches to automation and analyze the above technologies using simple practical examples.

Maybe I am wrong, so please post links to good resources. However, this will not change my determination to write, because the main goal is still to learn something myself, and making life easier for others is a pleasant bonus that flatters the gene for spreading experience.

We will take a medium-sized data center of the LAN DC network and work out the entire automation scheme on it.
Some of these things I will be doing almost for the first time, together with you.

I will not be original in the ideas and tools described here. Dmitry Figol has an excellent channel with streams on this topic.
The articles will overlap with his streams in many respects.

LAN DC has 4 data centers, about 250 switches, half a dozen routers and a couple of firewalls.
Not Facebook, but enough to make you think deeply about automation.
There is, however, an opinion that if you have more than 1 device, you already need automation.
In fact, it is hard to imagine that anyone can get by now without at least a pack of quick-and-dirty scripts.
Although I have heard that there are shops where IP addresses are tracked in Excel and each of thousands of network devices is configured by hand and has its own unique configuration. This, of course, can be passed off as contemporary art, but an engineer's feelings would definitely be hurt.

Goals

Now we will set the most abstract goals:

  • The network is like a single organism
  • Configuration testing
  • Network state versioning
  • Monitoring and self-healing services

Later in this article we will look at the means we will use, and in the following articles we will examine both the goals and the means in detail.

The network is like a single organism

The defining phrase of the whole series, although at first glance it may not seem so significant: we will configure the network, not individual devices.
In recent years we have seen a shift in emphasis towards treating the network as a single entity, hence the arrival in our lives of Software Defined Networking, Intent Driven Networks and Autonomous Networks.
After all, what do applications globally need from the network? Connectivity between points A and B (well, sometimes also C through Z) and isolation from other applications and users.

And so our task in this series is to build a system that maintains the target configuration of the entire network and decomposes it into the actual configuration of each device according to its role and location.
System-driven network management implies that to make a change we turn to the system, and it in turn calculates the desired state for each device and configures it.
Thus we reduce manual CLI work to almost zero: any change in device settings or network design must be formalized and documented, and only then rolled out to the necessary network elements.

That is, if, for example, we decide that from now on the rack switches in Kazan should announce two networks instead of one, we:

  1. First document the change in the systems
  2. Generate the target configuration for all network devices
  3. Run the network configuration update program, which calculates what needs to be removed and what needs to be added on each node, and brings the nodes to the desired state.

At the same time, we make changes with our hands only at the first step.
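
To make the workflow above more tangible, here is a minimal, self-contained Python sketch; all the data structures and function names are hypothetical stand-ins for the real systems described later in this article.

    # A toy model of steps 1-3: documented intent -> per-device target state -> patch.
    # Everything here is a placeholder for the real inventory and config systems.
    desired_intent = {"announce_networks": ["10.17.0.0/24", "10.17.1.0/24"]}  # step 1

    inventory = [
        {"name": "kazan-leaf-1", "site": "kazan", "role": "leaf",
         "current_networks": ["10.17.0.0/24"]},
        {"name": "kazan-leaf-2", "site": "kazan", "role": "leaf",
         "current_networks": ["10.17.0.0/24"]},
    ]

    def calculate_patch(current, target):
        """Step 3: what to add and what to remove to reach the target state."""
        return {"add": sorted(target - current), "remove": sorted(current - target)}

    for device in inventory:                                   # steps 2 and 3
        if device["site"] == "kazan" and device["role"] == "leaf":
            target = set(desired_intent["announce_networks"])
            print(device["name"], calculate_patch(set(device["current_networks"]), target))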

Configuration testing

It is well known that 80% of problems happen during a configuration change; indirect evidence of this is that everything is usually calm during the New Year holidays.
I have personally witnessed dozens of global outages caused by human error: a wrong command, a command executed in the wrong configuration branch, a forgotten community, MPLS demolished globally on a router, five boxes configured and the error on the sixth missed, stale changes made by another person committed. Countless scenarios.

Automation will allow us to make fewer mistakes, but on a larger scale: you can brick not just one device, but the entire network at once.

From time immemorial our grandfathers checked the correctness of changes with a keen eye, nerves of steel, and the fact that the network still worked after the changes were rolled out.
Those grandfathers whose work led to downtime and catastrophic losses left fewer offspring and should eventually die out, but evolution is a slow process, and so not everyone pre-checks changes in a lab yet.
However, at the forefront of progress are those who have automated the testing of the configuration and its further application to the network. In other words, they have borrowed the CI/CD procedure (Continuous Integration, Continuous Deployment) from developers.
In one of the parts we will look at how to implement this using a version control system, probably GitHub.

Once you get used to the idea of network CI/CD, the method of verifying a configuration by applying it to the production network will overnight seem to you like early medieval ignorance. Like hitting a warhead with a hammer.

Full-fledged versioning of the configuration is an organic continuation of the ideas of system-driven network management and CI/CD.

Versioning

We will assume that with any change, even the most insignificant one, even on a single inconspicuous device, the entire network moves from one state to another.
So we are not just executing a command on a device, we are changing the state of the network.
Let's call these states versions, shall we?

Let's say the current version is 1.0.0.
Has the IP address of a Loopback interface changed on one of the ToRs? That is a minor version; it will get the number 1.0.1.
Revised the policies for importing routes into BGP? A little more serious: 1.1.0.
Decided to get rid of IGP and switch to BGP only? That is a radical design change: 2.0.0.

At the same time, different DCs can be at different versions: the network evolves, new equipment is installed, new levels of spines are added in some places and not in others, and so on.

We will discuss semantic versioning in a separate article.

I repeat - any change (except for debugging commands) is a version update. Administrators should be notified of any deviations from the current version.

The same applies to rolling back changes: it is not an undo of the last commands, nor a rollback performed by the device's operating system; it is bringing the entire network to a new (old) version.
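
As a rough illustration of the idea (the change categories here are placeholders of my own, not a finished classification), a version bump could be computed like this:

    # Sketch of bumping the network state version depending on the kind of change.
    def bump(version, change):
        major, minor, patch = (int(x) for x in version.split("."))
        if change == "design":           # e.g. dropping IGP in favour of BGP only
            return f"{major + 1}.0.0"
        if change == "policy":           # e.g. revised BGP import policies
            return f"{major}.{minor + 1}.0"
        return f"{major}.{minor}.{patch + 1}"   # e.g. a loopback address change

    print(bump("1.0.0", "loopback"))   # 1.0.1
    print(bump("1.0.1", "policy"))     # 1.1.0
    print(bump("1.1.0", "design"))     # 2.0.0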

Monitoring and self-healing services

This task, self-evident in modern networks, is moving to a new level.
Often, large service providers take the approach that a failed service must be killed off very quickly and a new one raised in its place, instead of figuring out what happened.
"Very quickly" means that everything must be generously covered with monitoring, which within seconds will detect the slightest deviations from the norm.
And here the usual metrics, such as interface load or node availability, are no longer enough. Nor is it enough for the engineer on duty to watch them manually.
For many things there should simply be self-healing: the monitors light up red, and the system goes and applies the band-aid where it hurts, all on its own.

And here we monitor not just individual devices, but the health of the network as a whole, both whitebox, which is relatively straightforward, and blackbox, which is already more complicated.

What will we need to implement such ambitious plans?

  • Have a list of all devices on the network, their location, roles, models, software versions.
    kazan-leaf-1.lmu.net, Kazan, leaf, Juniper QFX 5120, R18.3.
  • Have a system for describing network services.
    IGP, BGP, L2/3VPN, Policy, ACL, NTP, SSH.
  • Be able to initialize the device.
    Hostname, Mgmt IP, Mgmt Route, Users, RSA-Keys, LLDP, NETCONF
  • Configure the device and bring the configuration to the required (including the old) version.
  • Test the configuration
  • Periodically check the state of all devices for deviations from the current version and report them to whoever needs to know.
    At night, someone quietly added a rule to the ACL.
  • Monitor performance.

Means

That sounds complicated enough to start decomposing the project into components.

And there will be ten of them:

  1. Inventory system
  2. IP space management system
  3. Network Service Description System
  4. Device initialization mechanism
  5. Vendor-agnostic configuration model
  6. Vendor-specific interface driver
  7. The mechanism for delivering the configuration to the device
  8. CI / CD
  9. Backup mechanism and deviation search
  10. Monitoring system

By the way, this is an example of how the view on the goals of the cycle has changed - there were 4 components in the draft.

In the illustration I depicted all the components and the device itself.
Intersecting components interact with each other.
The larger the block, the more attention should be paid to that component.

Component 1. Inventory system

Obviously, we want to know what equipment we have, where it is, and what it is connected to.
The inventory system is an integral part of any enterprise.
Most often, for network devices, an enterprise has a separate inventory system that solves more specific tasks.
As part of a series of articles, we will call it DCIM - Data Center Infrastructure Management. Although the term DCIM itself, strictly speaking, includes much more.

For our tasks, we will store the following information about the device in it:

  • Inventory number
  • Title/Description
  • Model (Huawei CE12800, Juniper QFX5120 etc)
  • Characteristic parameters (boards, interfaces, etc.)
  • Role (Leaf, Spine, Border Router, etc.)
  • Location (region, city, data center, rack, unit)
  • Interconnects between devices
  • Network topology

It is perfectly understandable that we ourselves want to know all this.
But will it help for automation purposes?
Certainly.
For example, we know that in a given data center, on Leaf switches, ACLs filtering certain traffic should be applied on the VLAN if it is a Huawei, and on unit 0 of the physical interface if it is a Juniper.
Or that a new Syslog server needs to be rolled out to all border routers in a region.

We will also store virtual network devices in it, such as virtual routers or route reflectors. And we can add DNS servers, NTP, Syslog and in general everything that is somehow related to the network.
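
To show how such inventory data can drive automation, here is a tiny hypothetical sketch of the ACL example above; in real life the device record would come from the DCIM API rather than a hard-coded dict.

    # Hypothetical sketch: the inventory tells us the vendor, the code decides
    # where the filtering ACL must be attached.
    def acl_attachment_point(device):
        if device["model"].startswith("Huawei"):
            return "vlan"            # apply the ACL on the VLAN
        if device["model"].startswith("Juniper"):
            return "unit 0"          # apply it on unit 0 of the physical interface
        raise ValueError(f"no ACL placement rule for {device['model']}")

    print(acl_attachment_point({"name": "kazan-leaf-1", "model": "Juniper QFX5120"}))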

Component 2. IP space management system

Yes, even in our time there are teams of people who keep track of prefixes and IP addresses in an Excel file. But the modern approach is still a database with an nginx/apache frontend, an API, and rich functionality for keeping track of IP addresses and networks, divided into VRFs.
IPAM stands for IP Address Management.

For our tasks, we will store the following information in it:

  • VLAN
  • VRF
  • Networks/Subnets
  • IP addresses
  • Binding addresses to devices, networks to locations and VLAN numbers

Again, it's clear that we want to make sure that when we allocate a new IP address for the ToR loopback, we don't trip over the fact that it has already been assigned to someone. Or that we used the same prefix twice at different ends of the network.
But how does this help with automation?
Easily.
We ask the system for a prefix with the Loopbacks role that has IP addresses available for allocation: if one exists, we allocate an address; if not, we request the creation of a new prefix.
Or, when creating a device configuration, we can find out from the same system in which VRF the interface should be located.
And when a new server is brought up, the script queries the system, finds out which switch and port the server is connected to and which subnet is assigned to that interface, and allocates the server's address from it.

The obvious temptation is to combine DCIM and IPAM into a single system, so as not to duplicate functions and not to maintain two similar entities.
And that is what we will do.
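
A minimal sketch of the loopback allocation described above, using only the standard library; the prefix, its role and the already-allocated addresses are made up, and in practice they would come from the IPAM database.

    # Sketch: find a free address in the prefix that carries the Loopbacks role.
    import ipaddress

    loopback_prefix = ipaddress.ip_network("10.255.0.0/24")        # role: Loopbacks
    already_allocated = {ipaddress.ip_address("10.255.0.1"),
                         ipaddress.ip_address("10.255.0.2")}

    def allocate_loopback():
        for addr in loopback_prefix.hosts():
            if addr not in already_allocated:
                already_allocated.add(addr)
                return addr
        raise RuntimeError("prefix exhausted: request a new Loopbacks prefix")

    print(allocate_loopback())   # 10.255.0.3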

Component 3. Network Service Description System

If the first two systems store variables that still need to be used somehow, then the third describes for each device role how it should be configured.
It is worth distinguishing two different types of network services:

  • Infrastructure
  • Client

The former are designed to provide basic connectivity and device management. These include VTY, SNMP, NTP, Syslog, AAA, routing protocols, CoPP, etc.
The latter organize the service for the client: MPLS L2/L3VPN, GRE, VXLAN, VLAN, L2TP, etc.
Of course, there are borderline cases: where do MPLS LDP and BGP belong? And routing protocols can be used for clients too. But this is not essential.

Both types of services are decomposed into configuration primitives:

  • Physical and logical interfaces (tagged/untagged, MTU)
  • IP addresses and VRF (IP, IPv6, VRF)
  • ACLs and traffic handling policies
  • Protocols (IGP, BGP, MPLS)
  • Routing policies (prefix lists, community, ASN filters).
  • Utility Services (SSH, NTP, LLDP, Syslog…)
  • Etc.

How exactly we will do this, I have no idea yet. We'll figure it out in a separate article.

To bring it a little closer to life, we could describe, for example, that:
a Leaf switch must have BGP sessions with all connected Spine switches, import connected networks into the process, accept only networks from a certain prefix from the Spines, and limit CoPP for IPv6 ND to 10 pps, etc.
In turn, the Spines hold sessions with all connected Leafs, act as route reflectors, and accept from them only routes of a certain length and with a certain community.
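
How exactly this will be formalized is still an open question, but purely as an illustration the Leaf role could be captured as structured data; the schema and field names below are invented for this sketch.

    # Invented schema: a declarative description of the Leaf role.
    LEAF_SERVICE = {
        "bgp": {
            "neighbors": "all-connected-spines",
            "import": ["connected"],
            "accept_from_spines": {"prefix": "10.0.0.0/8", "max_length": 24},
        },
        "copp": {"ipv6-nd": {"rate_pps": 10}},
    }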

Component 4. Device Initialization Mechanism

Under this heading I am lumping together the many actions that need to happen for a device to appear on the radar and become remotely accessible.

  1. Enter the device in the inventory system.
  2. Allocate a management IP address.
  3. Set up basic access to it:
    Hostname, management IP address, route to the management network, users, SSH keys, protocols - telnet/SSH/NETCONF

There are three approaches here:

  • Everything completely by hand. The device is brought to a staging bench, where an ordinary organic human enters it into the systems, connects to the console and configures it. This may work on small static networks.
  • ZTP, Zero Touch Provisioning. The box arrives, boots up, gets an address via DHCP, contacts a special server and configures itself.
  • A console server infrastructure, where the initial configuration is performed automatically through the console port.

We will talk about all three in a separate article.
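
Whichever of the three approaches wins, at some point a day-0 configuration has to be rendered from the inventory variables. Below is a minimal sketch with Jinja2; the template lines are Junos-flavoured but purely illustrative, and the variables are placeholders.

    # Sketch: rendering a day-0 config (hostname, mgmt IP, default route, NETCONF).
    from jinja2 import Template

    DAY0_TEMPLATE = Template("""\
    set system host-name {{ hostname }}
    set interfaces em0 unit 0 family inet address {{ mgmt_ip }}
    set routing-options static route 0.0.0.0/0 next-hop {{ mgmt_gw }}
    set system services netconf ssh
    """)

    device = {"hostname": "kazan-leaf-1",
              "mgmt_ip": "172.16.10.11/24",
              "mgmt_gw": "172.16.10.1"}
    print(DAY0_TEMPLATE.render(**device))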

Component 5: Vendor Agnostic Configuration Model

Until now, all the systems have been disparate patches providing variables and a declarative description of what we would like to see on the network. But sooner or later we will have to get down to specifics.
At this stage, for each particular device, the primitives, services and variables are combined into a configuration model that actually describes the complete configuration of that device, only in a vendor-independent way.
What does this step buy us? Why not immediately generate a device configuration that we could simply push?
In fact, this allows you to solve three problems:

  1. Not to be tied to a specific interface for interacting with the device. Whether it is CLI, NETCONF, RESTCONF or SNMP, the model stays the same.
  2. Not to keep as many templates/scripts as there are vendors in the network, and not to change the same thing in several places whenever the design changes.
  3. To download the configuration from the device (backup), decompose it into exactly the same model, and directly compare the target configuration with the existing one in order to calculate the delta and prepare a configuration patch that changes only the parts that are necessary, or to detect deviations.

As a result of this step, we get a vendor-independent configuration.
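
A toy sketch of point 3 from the list above: both the target and the backed-up configuration are parsed into the same vendor-agnostic model, so calculating the delta becomes a plain structural comparison. The model fields are illustrative only.

    # Sketch: the same model for target and backup makes the delta trivial to compute.
    target = {"interfaces": {"xe-0/0/1": {"mtu": 9100, "vlan": 120}}}
    backup = {"interfaces": {"xe-0/0/1": {"mtu": 1514, "vlan": 120}}}

    def delta(target, backup):
        patch = {}
        for ifname, params in target["interfaces"].items():
            current = backup["interfaces"].get(ifname, {})
            changed = {k: v for k, v in params.items() if current.get(k) != v}
            if changed:
                patch[ifname] = changed
        return patch

    print(delta(target, backup))   # {'xe-0/0/1': {'mtu': 9100}}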

Component 6. Vendor-specific interface driver

You should not delude yourself with hopes that it will someday be possible to configure a Cisco the same way as a Juniper, simply by sending exactly the same calls to both. Despite the growing popularity of whiteboxes and the emergence of support for NETCONF, RESTCONF and OpenConfig, the specific payload that these protocols carry differs from vendor to vendor, and this is one of their competitive differences that they will not give up easily.
This is much like OpenContrail and OpenStack: both have a REST API as their northbound interface, yet they expect completely different calls.

So the vendor-independent model from the fifth step must now take the form in which it will actually go to the hardware.
And here all means are good (not really): CLI, NETCONF, RESTCONF, even plain SNMP.

Therefore, we need a driver that translates the result of the previous step into the required format for a specific vendor: a set of CLI commands, an XML structure, and so on.
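
A minimal sketch of what such a driver might look like for the delta from the previous step; the command syntax below is approximate, not an exact vendor CLI.

    # Sketch: the same model fragment rendered by two vendor-specific drivers.
    def junos_driver(ifname, params):
        return [f"set interfaces {ifname} mtu {params['mtu']}"]

    def vrp_driver(ifname, params):
        return [f"interface {ifname}", f" mtu {params['mtu']}", "quit"]

    model_patch = {"xe-0/0/1": {"mtu": 9100}}
    for ifname, params in model_patch.items():
        print(junos_driver(ifname, params))
        print(vrp_driver(ifname, params))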

Component 7: Mechanism for delivering the configuration to the device

We have generated the configuration, but it still needs to be delivered to the devices, and obviously not by hand.
First, the question arises: what transport will we use? The choice today is not small:

  • CLI (telnet, ssh)
  • SNMP
  • NETCONF
  • RESTCONF
  • REST API
  • OpenFlow (although it's off the list because it's a way to deliver FIB, not settings)

Let's dot the i's here. CLI is legacy. SNMP... heh heh.
RESTCONF is still an unknown beast, and hardly anyone supports a REST API. Therefore, in this series we will focus on NETCONF.

In fact, as the reader has already realized, by this point we have already decided on the interface: the result of the previous step is already presented in the format of whichever interface was selected.

Second, what tools will we use to do this?
There is also a large selection:

  • A self-written script or platform. Let's arm ourselves with ncclient and asyncio and do everything ourselves. What does it cost us to build a deployment system from scratch?
  • Ansible with its rich library of network modules.
  • Salt with its meager networking support and its integration with Napalm.
  • Napalm itself, which knows a handful of vendors, and that's it, goodbye.
  • Nornir, yet another beast that we will dissect in the future.

A favorite has not been chosen here yet; we will experiment.

What else matters here? The consequences of applying the configuration.
Did it succeed or not? Do we still have access to the box or not?
It seems that a commit with confirmation and validation of what was pushed to the device will help here.
Together with a proper NETCONF implementation, this significantly narrows the range of suitable devices: not many manufacturers support proper commits. But this is just one more prerequisite for the RFP. In the end, nobody worries that not a single Russian vendor would meet a requirement for 32x100GE interfaces. Or do they?
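
A minimal sketch of delivering a change over NETCONF with ncclient and a confirmed commit, the way it was just described; the host, credentials and config snippet are placeholders, and the example assumes a Junos-style candidate datastore.

    # Sketch: NETCONF delivery with validation and a confirmed commit (auto-rollback
    # if we lose access and never confirm). Host and credentials are placeholders.
    from ncclient import manager

    config = """
    <config>
      <configuration>
        <system><host-name>kazan-leaf-1</host-name></system>
      </configuration>
    </config>
    """

    with manager.connect(host="kazan-leaf-1", port=830, username="automation",
                         password="secret", hostkey_verify=False,
                         device_params={"name": "junos"}) as m:
        m.lock(target="candidate")
        m.edit_config(target="candidate", config=config)
        m.validate(source="candidate")
        m.commit(confirmed=True, timeout="120")   # roll back automatically in 120 s
        # ... run post-change checks here ...
        m.commit()                                # confirm the change
        m.unlock(target="candidate")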

Component 8. CI/CD

At this point, we have the configuration prepared for all network devices.
I write "all" because we are talking about versioning the state of the network: even if the settings of just one switch need to change, the changes are calculated for the entire network. Obviously, for most nodes they may be zero.

But, as already mentioned above, we are not barbarians who roll everything straight into production.
The generated configuration must first pass through a CI/CD pipeline.

CI/CD stands for Continuous Integration, Continuous Deployment. It is an approach in which, instead of publishing a new major release once every six months that completely replaces the old one, the team regularly and incrementally rolls out (Deployment) new functionality in small portions, each of which is comprehensively tested for compatibility, security and performance (Integration).

To do this we have a version control system that tracks configuration changes, a lab where we check whether client services break, a monitoring system that verifies this, and, as the final step, rolling out the changes to the production network.

With the exception of debug commands, absolutely all changes on the network must go through the CI/CD pipeline; this is our guarantee of a quiet life and a long, happy career.
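
Purely to visualize the order of stages, here is a toy sketch of such a pipeline; every stage is a stub standing in for the real tooling (the version control hooks, the lab, monitoring and deployment).

    # Toy pipeline: each function is a stub for the real stage.
    def test_in_lab(change):
        print(f"lab: testing {change}")
        return True

    def monitoring_is_healthy(change):
        print(f"monitoring: {change} looks healthy in the lab")
        return True

    def deploy_to_production(change):
        print(f"prod: rolling out {change}")

    def pipeline(change):
        if test_in_lab(change) and monitoring_is_healthy(change):
            deploy_to_production(change)
        else:
            print(f"{change} rejected, production untouched")

    pipeline("network state v1.0.1")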

Component 9. Backup mechanism and deviation search

Well, there is no need to discuss backups yet again.
We will simply take them on a cron schedule or whenever the configuration changes, and put them into git.

But the second part is more interesting: someone has to keep an eye on these backups. In some cases this someone should go and put everything back the way it was, and in other cases should alert somebody that there is a mess.
For example, if a new user appears who is not registered in the variables, they should be removed, just to be safe. But a new firewall rule is better left untouched: maybe someone just turned on debugging, or maybe some careless person set up a new service outside the regulations and people are already relying on it.

We will never completely get rid of a certain small delta across the network, despite any automation systems and the iron hand of management. Nobody is going to enter debugging configuration into the systems anyway; moreover, the configuration model may not even provide for it.

For example, a firewall rule that counts packets to a specific IP in order to localize a problem is a perfectly ordinary temporary piece of configuration.
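
The mechanics of the deviation search itself can be as simple as diffing the intended configuration against the latest backup; here is a sketch with the standard library (the configuration lines are illustrative).

    # Sketch: report any delta between the intended config and the latest backup.
    import difflib

    intended = ["set system ntp server 10.0.0.1",
                "set system syslog host 10.0.0.5 any any"]
    backup   = ["set system ntp server 10.0.0.1",
                "set system syslog host 10.0.0.5 any any",
                "set firewall filter DEBUG term 1 then count"]   # night-time debugging?

    for line in difflib.unified_diff(intended, backup, "intended", "backup", lineterm=""):
        print(line)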

Component 10. Monitoring system

At first I did not intend to cover the topic of monitoring at all: it is a voluminous, controversial and complex subject. But along the way it turned out to be an integral part of automation, and it cannot be sidestepped, even if we only touch on it without much practice.

To develop the thought: monitoring is an organic part of the CI/CD process. After rolling out a configuration to the network, we need to be able to determine whether everything is now all right with it.
And this is not only, and not so much, about interface utilization graphs or node availability, but about more subtle things: the presence of the necessary routes and the attributes on them, the number of BGP sessions and OSPF neighbors, the end-to-end health of the services running on top.
Have the syslogs stopped arriving at the external server, has the SFlow agent broken down, have drops started growing in the queues, has connectivity between some pair of prefixes broken?
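
As a hint of what such checks might look like, here is a toy sketch comparing the expected number of established BGP sessions with what telemetry reports; the collector function is a stub I made up.

    # Toy health check: expected vs. observed BGP sessions per device.
    EXPECTED_SESSIONS = {"kazan-leaf-1": 4, "kazan-leaf-2": 4}

    def established_sessions(device):
        """Stub standing in for a real telemetry/monitoring query."""
        return 3 if device == "kazan-leaf-2" else 4

    for device, expected in EXPECTED_SESSIONS.items():
        actual = established_sessions(device)
        if actual != expected:
            print(f"ALERT: {device} has {actual} established BGP sessions, expected {expected}")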

In a separate article, we will reflect on this.

Conclusion

As a basis, I chose one of the modern data center network designs: an L3 Clos fabric with BGP as the routing protocol.
This time we will build the network on Juniper, because right now its JunOS interface is my one true love.

Let's complicate our lives by using only Open Source tools and a multi-vendor network, so in addition to Juniper I will pick one more lucky vendor along the way.

The plan for future publications is roughly this:
First I will talk about virtual networks, partly because I want to and partly because without it the design of the infrastructure network will not be very clear.
Then about the design of the network itself: topology, routing, policies.
Then we will assemble a lab.
We will think about, and perhaps practice, initializing devices on the network.
And after that, each component in intimate detail.

And yes, I do not promise to gracefully end this cycle with a ready-made solution. πŸ™‚

Useful links

  • Before diving into the series, it is worth reading Natasha Samoylenko's book Python for Network Engineers, and perhaps taking the course.
  • It is also useful to read the RFC on data center fabric design from Facebook, authored by Petr Lapukhov.
  • The Tungsten Fabric (formerly OpenContrail) architecture documentation will give you an idea of how overlay SDN works.

Thanks

Roman Gorge, for comments and edits.
Artyom Chernobay, for the title picture.

Source: habr.com
