How to take control of your network infrastructure. Chapter One. Retention

This article is the first in the series "How to take control of your network infrastructure." The table of contents for the whole series, with links, can be found here.

I readily admit that there are plenty of companies where an hour or even a day of network downtime is not critical. I, unfortunately or fortunately, never had a chance to work in such places. Of course, networks differ, requirements differ, approaches differ, and yet, in one form or another, the list below will in many cases be a de facto "must-do".

So, initial conditions.

You are starting a new job, you have been promoted, or you have simply decided to take a fresh look at your responsibilities. The company network is your area of responsibility. In many ways this is a new challenge for you, which somewhat justifies the mentoring tone of this article :). Still, I hope the article can also be useful to any network engineer.

Your first strategic goal is to learn how to resist entropy and maintain the level of service provided.

Many of the tasks described below can be solved by various means. I deliberately avoid the topic of technical implementation, because in principle it often matters less how you solved a particular problem than whether you actually use the result, and how. There is little use, for example, in a professionally built monitoring system if you do not look at it and do not respond to alerts.

Equipment

First you need to understand where the biggest risks are.

Again, this can vary. I admit that for some, these will be security issues; for others, issues related to service continuity; and for others, perhaps something else entirely. Why not?

Let's assume for definiteness that this is still the continuity of the service (this was the case in all the companies where I worked).

Then you need to start with the equipment. Here is a list of topics to watch out for:

  • classification of equipment according to the degree of criticality
  • redundancy of critical equipment
  • support, licenses

You have to consider possible failures, especially for equipment at the top of your criticality rating. Usually the possibility of a double failure is neglected, otherwise your solution and support would become prohibitively expensive; but for truly critical network elements, whose failure can significantly affect the business, you should think about it.

Example

Let's say we're talking about a root switch in a data center.

Since we have agreed that service continuity is the most important criterion, it is reasonable to provide "hot" redundancy for this equipment. But that is not all. You also have to decide how long, in the event of a failure of the first switch, it is acceptable to run on only one remaining switch, because there is a risk that it, too, will fail.

Important! You should not decide this issue on your own. You must describe the risks, the possible solutions and their cost to your management or the company management. They must make the decision.

So, if it was decided that, given the low probability of a double failure, running for 4 hours on a single switch is acceptable in principle, then you can simply purchase the appropriate support contract (one that guarantees equipment replacement within 4 hours).

But there is a risk that the replacement will not arrive in time. Unfortunately, we once found ourselves in exactly that situation. Instead of four hours, the replacement equipment took a week to arrive!!!

Therefore, this risk also needs to be discussed, and it may make more sense for you to buy another switch (a third one) and keep it as a spare ("cold" redundancy) or use it for lab purposes.

Important! Make a table of all the support contracts you have, with their end dates, and add those dates to your calendar so that you receive an email at least a month in advance telling you it is time to start worrying about renewing support.

You will not be forgiven if you forget to renew support and your equipment breaks the day after it expires.
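As a minimal sketch of such a reminder, run from cron or a similar scheduler, assuming you keep the contract table in a simple CSV file (the file name "contracts.csv" and its columns are purely illustrative):

```python
import csv
from datetime import date, timedelta

WARN_BEFORE = timedelta(days=30)  # start worrying a month in advance

def expiring_contracts(path="contracts.csv"):
    """Return contracts whose support end date is within the warning window."""
    soon = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # expects columns: device, support_end (YYYY-MM-DD)
            end = date.fromisoformat(row["support_end"])
            if end - date.today() <= WARN_BEFORE:
                soon.append((row["device"], end))
    return soon

if __name__ == "__main__":
    for device, end in expiring_contracts():
        # In practice you would send this by email; printing keeps the sketch simple.
        print(f"Support for {device} ends on {end} - time to renew")
```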

Emergency work

Whatever happens on your network, ideally, you should keep access to your network equipment.

Important! You must have console access to all equipment, and this access must not depend on the health of the user data network.

You should also foresee possible negative scenarios in advance and document the necessary actions. The availability of this document is itself critical, so it should not only be posted on a shared departmental resource but also saved locally on the engineers' computers.

At a minimum, it must contain

  • information required to open a case in vendor or integrator support
  • information on how to get to any equipment (console, management)

Of course, it can also contain any other useful information, for example a description of the upgrade procedure for various equipment and useful diagnostic commands.

Partners

Now you need to assess the risks associated with partners. Usually these are

  • Internet service providers and traffic exchange points (IX)
  • communication channel providers

What questions should you ask yourself? As with equipment, there are various emergency scenarios to consider. For example, for ISPs, these could be questions like:

  • What happens if ISP X stops providing you service for some reason?
  • Do you have enough bandwidth from the other providers?
  • How good will the remaining connectivity be?
  • How independent are your ISPs, and will a serious outage of one of them lead to problems with the others?
  • How many fiber entry points does your data center have?
  • What happens if one of those entry points is completely destroyed?

Regarding entry points: in my practice, in two different companies and two different data centers, an excavator destroyed cable manholes, and only by a miracle was our fiber not affected. This is not such a rare case.

And, of course, you need not just to ask these questions but, again with the support of management, to ensure an acceptable solution for each of these situations.

Backup

The next priority might be the backup of equipment configurations. In any case, this is a very important point. I will not list the ways you can lose a configuration; it is better to make regular backups and not think about it. In addition, regular backups are very useful for change control.

Important! Make a backup daily. Configurations are not a large amount of data, so there is no reason to economize on this. Every morning the engineer on duty (or you) should receive a report from the system that explicitly indicates the success or failure of the backup; in case of a failed backup, the problem should be resolved or a ticket created (see the processes of the network department).
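A minimal sketch of such a backup job, assuming Cisco IOS-like devices reachable over SSH and the third-party netmiko library (the device list, credentials and file layout below are illustrative only):

```python
from datetime import date
from pathlib import Path

from netmiko import ConnectHandler  # SSH library for network devices

# Illustrative inventory; in practice this would come from your device database.
DEVICES = [
    {"device_type": "cisco_ios", "host": "10.0.0.1", "username": "backup", "password": "secret"},
]

def backup_all(dest_dir="config-backups"):
    """Save the running configuration of every device into a dated directory."""
    out = Path(dest_dir) / date.today().isoformat()
    out.mkdir(parents=True, exist_ok=True)
    failures = []
    for dev in DEVICES:
        try:
            with ConnectHandler(**dev) as conn:
                config = conn.send_command("show running-config")
            (out / f"{dev['host']}.cfg").write_text(config)
        except Exception as exc:  # report failures instead of silently skipping them
            failures.append((dev["host"], str(exc)))
    return failures

if __name__ == "__main__":
    failed = backup_all()
    # The morning report: explicit success or failure, as described above.
    print("Backup OK" if not failed else f"Backup FAILED for: {failed}")
```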

Software versions

The question of whether or not to upgrade equipment software is not so clear-cut. On the one hand, old versions have known bugs and vulnerabilities; on the other hand, new software means, firstly, a not always painless upgrade procedure and, secondly, new bugs and vulnerabilities.

Here you need to find the option that suits you best. A few obvious recommendations:

  • install only stable versions
  • still, you should not stay on very old software versions
  • make a table of which software version is running on which equipment
  • periodically read reports on vulnerabilities and bugs in your software versions, and in case of critical problems consider upgrading

At this point, with console access to the equipment, support contract information, and a description of the upgrade procedure, you are basically ready for this step. The ideal situation is to have lab equipment on which you can rehearse the entire procedure, but unfortunately that is not often the case.

In the case of critical equipment, you can contact the vendor's support with a request to help you with the upgrade.
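As a sketch of the version table and vulnerability check recommended above (both the inventory and the list of affected versions are made up for illustration; in reality they would come from your own equipment and the vendor's security advisories):

```python
# Hypothetical inventory: device -> installed software version.
INVENTORY = {
    "core-sw-1": "15.2(4)E10",
    "core-sw-2": "15.2(4)E10",
    "edge-rtr-1": "16.9.4",
}

# Versions that a vendor advisory marked as affected (illustrative values).
AFFECTED_VERSIONS = {"15.2(4)E10"}

def devices_to_upgrade(inventory, affected):
    """Return the devices whose installed version appears in an advisory."""
    return [dev for dev, version in inventory.items() if version in affected]

if __name__ == "__main__":
    for dev in devices_to_upgrade(INVENTORY, AFFECTED_VERSIONS):
        print(f"{dev}: running an affected version, plan an upgrade")
```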

Ticket system

Now you can look around. You need to establish communication processes with other departments and within the department.

Perhaps this is not mandatory (for example, if your company is small), but I would highly recommend organizing work in such a way that all external and internal tasks go through the ticket system.

The ticket system is essentially your interface for internal and external communications, and you should describe this interface in sufficient detail.

Let's take as an example an important and frequently encountered task of opening access. I will describe an algorithm that worked perfectly in one of the companies.

Example

Let's start with the fact that people requesting access often formulate their needs in a language incomprehensible to a network engineer, namely in application terms, for example, "give me access to 1C".

Therefore, we have never accepted requests directly from such users.
And that was the first requirement

  • access requests should come from technical departments (in our case, these were unix, windows, helpdesk engineers)

The second requirement is that

  • the access must be recorded (by the technical department from which we received the request), and the request we receive is a link to that recorded access

The form of this request should be clear to us, i.e.

  • the request must specify the source and destination subnets between which access should be opened, as well as the protocol and (in the case of tcp/udp) the ports

It should also indicate

  • description of what this access is for
  • temporary or permanent (if temporary, until what date)

And a very important point: the approvals

  • from the head of the department that initiated the access (for example, accounting)
  • from the head of the technical department, from where this request came to the network department (for example, helpdesk)

At the same time, the head of the department that initiated the access (accounting, in our example) is considered the "owner" of this access and is responsible for keeping the page with the recorded accesses for their department up to date.
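To make the required fields concrete, here is a minimal sketch of how such a request could be represented and sanity-checked. The field names and example values are purely illustrative, not the actual form we used:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class AccessRequest:
    """One access request, mirroring the fields described above."""
    src_subnet: str          # e.g. "10.1.2.0/24"
    dst_subnet: str          # e.g. "10.9.0.0/24"
    protocol: str            # "tcp", "udp", "icmp", ...
    ports: Optional[str]     # required for tcp/udp, e.g. "1433"
    description: str         # what this access is for
    expires: Optional[date]  # None means permanent
    approved_by_owner: bool      # head of the initiating department
    approved_by_tech_lead: bool  # head of the technical department

def is_complete(req: AccessRequest) -> bool:
    """Reject requests missing ports for tcp/udp or missing an approval."""
    if req.protocol in ("tcp", "udp") and not req.ports:
        return False
    return req.approved_by_owner and req.approved_by_tech_lead

# Example usage with made-up values:
req = AccessRequest("10.1.2.0/24", "10.9.0.0/24", "tcp", "1433",
                    "accounting access to the 1C application server",
                    date(2024, 12, 31), True, True)
print(is_complete(req))  # True
```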

Logging

This is something you can drown in. But if you want to take a proactive approach, you need to learn how to deal with this flood of data.

Here are some practical tips:

  • check logs daily
  • in the case of a planned review (rather than an emergency), you can limit yourself to severity levels 0, 1 and 2, adding selected patterns from other levels if you see fit
  • write a script that parses the logs and discards those messages whose patterns you have added to an ignore list (a sketch of such a script follows below)

This approach will eventually let you build up an ignore list of messages you are not interested in, leaving only those you genuinely consider important.
It worked great for us.
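A minimal sketch of such a filtering script, assuming Cisco-style syslog messages collected into a text file and a hand-maintained list of regular expressions to ignore (the file name and the patterns below are examples only):

```python
import re

# Patterns for messages you have already reviewed and decided to ignore.
IGNORE_PATTERNS = [
    re.compile(r"%LINEPROTO-5-UPDOWN: .* changed state to up"),
    re.compile(r"%SYS-5-CONFIG_I: Configured from console"),
]

# Keep severities 0-2; the severity digit follows the facility name
# in messages like "%SYS-2-MALLOCFAIL: ...".
SEVERITY_RE = re.compile(r"%[A-Z0-9_]+-(\d)-")

def interesting(line: str) -> bool:
    """Return True for log lines that still deserve a human's attention."""
    if any(p.search(line) for p in IGNORE_PATTERNS):
        return False
    m = SEVERITY_RE.search(line)
    return bool(m) and int(m.group(1)) <= 2

if __name__ == "__main__":
    with open("network.log") as f:  # illustrative file name
        for line in f:
            if interesting(line):
                print(line, end="")
```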

Monitoring

It is not uncommon for a company to lack a monitoring system. You can, for example, rely on logs, but equipment may simply "die" without having time to "say" anything, or the UDP packet carrying the syslog message may be lost and never arrive. In general, active monitoring is important and necessary.

The two most requested examples in my practice are:

  • monitoring the load of communication channels and critical links (for example, uplinks to providers). This lets you proactively spot potential service degradation due to traffic loss and avoid it (a rough sketch follows this list)
  • graphs based on NetFlow. They make it easy to spot anomalies in traffic and are very useful for detecting some simple but significant types of hacker attacks
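As a rough illustration of channel-load monitoring: the calculation below derives utilization from two readings of an interface byte counter. The counter values are made up; in practice you would poll something like ifHCInOctets over SNMP at a fixed interval:

```python
def utilization_percent(octets_before: int, octets_after: int,
                        interval_s: float, link_bps: float) -> float:
    """Link utilization from two readings of an interface byte counter."""
    bits_transferred = (octets_after - octets_before) * 8
    return 100.0 * bits_transferred / (interval_s * link_bps)

# Two hypothetical readings of a 1 Gbit/s uplink counter, 300 seconds apart.
before, after = 0, 31_500_000_000  # bytes
util = utilization_percent(before, after, 300, 1e9)
print(f"Uplink utilization: {util:.0f}%")  # 84%
if util > 80:
    # Here you would raise an alert (email / SMS), as described below.
    print("Channel is close to saturation - investigate before users notice")
```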

Important! Set up SMS notification for the most critical events. This applies to both monitoring and logging. If you do not have a duty shift, the SMS should also arrive outside working hours.

Design the process in such a way that you don't wake up all the engineers. We had an engineer on duty for this.

Change control

In my opinion, it is not necessary to control every change. But in any case, you should be able to easily find out, when necessary, who made a particular change to the network, and why.

A few tips:

  • use the ticket system: describe in detail what was done within each ticket, for example by copying the applied configuration into the ticket
  • use the commenting capabilities of the network equipment (for example, commit comment on Juniper) and record the ticket number there
  • diff your configuration backups (a sketch follows below)

You can turn this into a process by reviewing all change-related tickets daily.
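A minimal sketch of diffing two configuration backups with Python's standard difflib; the file names are illustrative and assume the dated backup layout from the backup section above:

```python
import difflib
from pathlib import Path

def config_diff(old_path: str, new_path: str) -> str:
    """Unified diff between yesterday's and today's saved configuration."""
    old = Path(old_path).read_text().splitlines(keepends=True)
    new = Path(new_path).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, fromfile=old_path, tofile=new_path))

if __name__ == "__main__":
    diff = config_diff("config-backups/2024-06-01/10.0.0.1.cfg",
                       "config-backups/2024-06-02/10.0.0.1.cfg")
    # An empty diff means nothing changed on that device between the two backups.
    print(diff if diff else "No changes")
```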

Processes

You must formalize and describe the processes in your team. If you have reached this point, then at least the following processes should already be working in your team:

Daily processes:

  • work with tickets
  • working with logs
  • change control
  • daily checklist

Yearly processes:

  • renewal of warranties, licenses

Asynchronous processes:

  • response to various emergencies

Conclusion of the first part

You may have noticed that none of this is about configuring the network yet: not about design, not about network protocols, not about routing, not about security... It is everything around them. But these are, although perhaps boring, certainly very important elements of the network division's work.

So far, as you can see, you have not improved anything in your network. Security vulnerabilities, if there were any, are still there; if the design was bad, it is still bad. You have not yet applied the skills and knowledge of a network engineer that most likely took a lot of time, effort, and sometimes money to acquire. But first you need to create (or strengthen) the foundation, and only then start building.

How to find and fix errors, and then improve your infrastructure, is the subject of the next part.

Of course, it is not necessary to do everything sequentially. Time can be critical, so work in parallel if resources allow.

And an important addition: communicate, ask questions, consult with your team. In the end, they are the ones who will support and operate all of this.

Source: habr.com
