MMS system in the data center: how we automated maintenance management

Imagine that you have a whole server room's worth of engineering equipment: several dozen air conditioners, a number of diesel generator sets and uninterruptible power supplies (UPS). To keep the hardware working as it should, you regularly check its performance and keep up with preventive maintenance: run tests, check the oil level, replace parts. Even for one server room, you need to store a lot of information: a register of equipment, a list of consumables in stock, a schedule of preventive maintenance, as well as warranty documents and contracts with suppliers and contractors.

Now multiply the number of rooms by ten. Logistics questions appear. What should be stored in which warehouse, so that you do not have to run after every spare part? How do you replenish stock in time, so that an unscheduled repair does not catch you by surprise? When there is a lot of equipment, it is impossible to keep all the maintenance work in your head, and difficult to keep it on paper. This is where an MMS, or maintenance management system, comes to the rescue.

In MMS, we draw up preventive maintenance and repair schedules and store instructions for engineers. Not all data centers have such a system; many consider it too expensive. But in our experience, it is not the tool that matters but the approach to working with information. We built the first system in Excel and gradually developed it into a software product.

Together with alexdropp, we decided to share our experience of developing our own MMS. I will show how the system has evolved and how it has helped us implement maintenance best practices. Alexey will describe how he inherited MMS, what has changed since then, and how the system makes life easier for engineers now.

How we came to our own MMS

First there were folders. 8-10 years ago, information was stored in fragments. After maintenance, we signed certificates of completion, kept the paper originals in archives and the scanned copies in network folders. Information about spare parts, tools and accessories was collected the same way, in folders broken down by equipment. You can live like this if you set up a structure and access levels for these folders.
But then you run into three problems:

  • navigation: switching between folders takes a long time. If you want to see the repairs on a specific piece of equipment over several years, you will have to make a lot of clicks.
  • statistics: you will not have any, and without them it is hard to predict how quickly different equipment fails or how many spare parts to plan for next year.
  • timeliness of the reaction: no one will remind you that components are running out and you need to order more. And it is not obvious when the same equipment fails for the second or third time.

For a while we kept documents like this, but then we discovered Excel 🙂

MMS in Excel. Over time, the structure of the documentation migrated to Excel. The basis was the list of equipment; maintenance schedules, checklists and links to certificates of completion were tied to it.

The list of equipment indicated the main characteristics of each unit and its location in the data center.

The result was a kind of navigator from which you can quickly understand what is happening with the equipment and its maintenance. If necessary, you can follow the links in the maintenance schedule to the individual certificates of completion.

If you maintain the Excel document conscientiously, the solution is quite suitable for a small server room. But it is also temporary. Even with a single air conditioner serviced once a month, in five years we will accumulate a long list of records, and our Excel file will swell. If you add one more air conditioner, one diesel generator and one UPS, you need several sheets linked together. The longer the history, the harder it is to find the right information quickly.

The first "adult" system. In 2014, we passed our first Management & Operations audit under the Uptime Institute's Operational Sustainability standard. We went through it with almost the same Excel file, although over the year we had improved it considerably: we added links to instructions and checklists for engineers. The auditors found this format perfectly workable. They were able to trace all operations on the equipment and saw that the information was up to date and the processes were in place. We passed the audit with flying colors, scoring 92 points out of 100.

The question arose: what next? We decided that we needed a "serious" MMS. We looked at several paid products, but in the end we decided to write the software ourselves. The same Excel file served as a detailed technical specification. Below are the tasks we set for MMS.

What we wanted from MMS

In most cases, MMS is a set of directories and reports. Our directory hierarchy is described below, from the top level down.

The top-level directory is the list of premises: the machine rooms and warehouses where the equipment is located.

Next comes the list of engineering equipment. We grouped it by system:

  • Air conditioning system: air conditioners, chillers, pumps.
  • Power supply system: UPS, diesel generator sets, switchboards.

For each piece of equipment we record the basic data: type, model, serial number, manufacturer's data, year of manufacture, commissioning date and warranty period.
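
To make the equipment card more concrete, here is a minimal sketch of how such a record could be modeled. The class name, field names and sample values are assumptions made for illustration; the article does not describe the actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Equipment:
    """One entry of the equipment directory (field names are illustrative)."""
    system: str               # e.g. "Air conditioning" or "Power supply"
    kind: str                 # type of equipment, e.g. "chiller" or "UPS"
    model: str
    serial_number: str
    manufacturer: str
    year_of_manufacture: int
    commissioned_on: date
    warranty_until: Optional[date] = None
    location: str = ""        # room or warehouse from the top-level directory

    def under_warranty(self, today: date) -> bool:
        """True while the warranty period has not expired."""
        return self.warranty_until is not None and today <= self.warranty_until


# Hypothetical sample record; every value here is made up for the example.
chiller = Equipment(
    system="Air conditioning",
    kind="chiller",
    model="CH-100",
    serial_number="SN-0001",
    manufacturer="ExampleCo",
    year_of_manufacture=2016,
    commissioned_on=date(2017, 3, 1),
    warranty_until=date(2020, 3, 1),
    location="Machine room 1",
)
```

Keeping the commissioning date and warranty period on the card itself is what later lets the schedule and warranty checks be derived rather than typed in by hand.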

Once the list of equipment is filled in, we draw up a maintenance program for it: how and how often to do maintenance. In the maintenance program we describe a set of operations, for example: replace this battery, adjust a specific part, and so on. Operations are described in a separate directory. If an operation is repeated in different programs, you do not need to describe it again each time: just take the existing one from the directory.

The operations "Change temperature settings" and "Replace cable quick connectors" will be common to chillers and air conditioning systems from the same manufacturer.
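
The reuse of operations across programs can be sketched as a shared directory that programs reference by code. This is only an illustration under stated assumptions: the operation codes, program names and intervals are hypothetical, and only the two operation descriptions come from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    code: str
    description: str


@dataclass
class MaintenanceProgram:
    name: str
    equipment_kind: str
    interval_months: int
    operation_codes: list[str] = field(default_factory=list)

    def operations(self, directory: dict[str, Operation]) -> list[Operation]:
        """Resolve the program's operation codes against the shared directory."""
        return [directory[code] for code in self.operation_codes]


# The operations directory: each operation is described exactly once.
OPERATIONS = {op.code: op for op in [
    Operation("TEMP_SET", "Change temperature settings"),
    Operation("QUICK_CONN", "Replace cable quick connectors"),
    Operation("OIL_LEVEL", "Check the oil level"),  # hypothetical extra entry
]}

# Two hypothetical programs that reuse the same directory entries.
chiller_program = MaintenanceProgram(
    "Chiller quarterly", "chiller", 3, ["TEMP_SET", "QUICK_CONN", "OIL_LEVEL"]
)
aircon_program = MaintenanceProgram(
    "Air conditioner monthly", "air conditioner", 1, ["TEMP_SET", "QUICK_CONN"]
)

# Operations shared by both programs.
shared = set(chiller_program.operation_codes) & set(aircon_program.operation_codes)
```

Because programs store only codes, correcting the wording of an operation in the directory updates every program that uses it.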

Now we can create a maintenance schedule for each piece of equipment. We link the maintenance program to a specific unit, and the system looks up in the program how often maintenance is needed and calculates the dates of the work from the commissioning date.
You can even automate the compilation of such a schedule using Excel formulas.
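
The date calculation itself is simple enough to sketch. The version below assumes the interval is expressed in whole months and that due dates are counted from the commissioning date; the function names are hypothetical and this is not the system's actual code.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    # Length of the target month: first day of the next month minus its first day.
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    days_in_month = (next_first - date(year, month, 1)).days
    return date(year, month, min(start.day, days_in_month))


def maintenance_dates(commissioned_on: date, interval_months: int,
                      horizon_end: date) -> list[date]:
    """Due dates counted from commissioning, up to and including horizon_end."""
    if interval_months <= 0:
        raise ValueError("interval_months must be positive")
    dates = []
    n = 1
    while True:
        # Always offset from the original date so clamped days do not drift.
        due = add_months(commissioned_on, n * interval_months)
        if due > horizon_end:
            break
        dates.append(due)
        n += 1
    return dates


# Quarterly maintenance for a unit commissioned on 31 January 2017.
schedule = maintenance_dates(date(2017, 1, 31), 3, date(2017, 12, 31))
```

Each due date is computed as a multiple of the interval from the commissioning date rather than from the previous due date, so a unit commissioned on the 31st does not slide to the 28th or 30th for the rest of its life.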

A less obvious point: we keep a separate directory of postponed work. A schedule is a schedule, but we are all human and understand that anything can happen. For example, a consumable did not arrive on time and the maintenance has to be rescheduled by a week. This is a normal situation as long as you keep track of it. We keep statistics on postponed and uncompleted work and try to bring the number of maintenance cancellations down to zero.

We also keep statistics on failures and unscheduled repairs for each piece of equipment. We use these statistics to plan purchases and to find weak spots in the infrastructure. For example, if a compressor burns out in the same place three times in a row, this is a signal to look for the cause of the breakdowns.

This is the kind of maintenance and repair history that accumulates over 4 years for a specific air conditioner.
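
As a rough illustration of how a repair log turns into these signals, the sketch below counts failures per unit and part. The log entries and identifiers are invented, and counting total failures is a simplification of the "three times in a row" rule from the text.

```python
from collections import Counter
from datetime import date

# Hypothetical unscheduled-repair log: (equipment id, failed part, date).
repairs = [
    ("AC-07", "compressor", date(2016, 5, 12)),
    ("AC-07", "fan", date(2016, 9, 1)),
    ("AC-07", "compressor", date(2017, 2, 20)),
    ("UPS-02", "battery", date(2017, 6, 3)),
    ("AC-07", "compressor", date(2017, 11, 15)),
]


def repeated_failures(log, threshold=3):
    """Return (equipment, part) pairs that failed at least `threshold` times."""
    counts = Counter((equipment, part) for equipment, part, _ in log)
    return {key: n for key, n in counts.items() if n >= threshold}


def yearly_part_demand(log):
    """Count how many times each part was needed per calendar year."""
    return Counter((part, when.year) for _, part, when in log)


suspects = repeated_failures(repairs)
demand = yearly_part_demand(repairs)
```

Here `suspects` points at the unit whose compressor keeps failing, while `demand` gives the per-year figures that can feed purchase planning.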

The next directory is SPTA (spare parts, tools and accessories). It records which consumables are needed for the equipment, where they are stored and in what quantity. We also store delivery times here in order to plan warehouse receipts better.

The number of spare parts is calculated from the annual repair statistics per unit of equipment. For every spare part we specify a minimum balance: the smallest quantity that must be available at each facility. If a spare part is running out, its quantity is highlighted in the directory.

The minimum balance of high-pressure sensors is two, and only one remains. It's time to place an order.
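
The minimum-balance rule amounts to a simple comparison per stock item. Below is a minimal sketch; the class, field names, warehouse name and delivery times are assumptions, and only the high-pressure sensor figures (minimum two, one left) come from the text.

```python
from dataclasses import dataclass


@dataclass
class StockItem:
    name: str
    location: str          # where the part is stored
    quantity: int          # current balance
    minimum_balance: int   # smallest quantity that must stay in stock
    delivery_days: int     # typical delivery time, used for planning

    @property
    def needs_reorder(self) -> bool:
        """True when the balance has dropped below the required minimum."""
        return self.quantity < self.minimum_balance


def receive(item: StockItem, delivered: int) -> None:
    """Record a delivery from an invoice by increasing the balance."""
    if delivered <= 0:
        raise ValueError("delivered quantity must be positive")
    item.quantity += delivered


# Hypothetical stock; the sensor matches the example above.
stock = [
    StockItem("High-pressure sensor", "Warehouse A", quantity=1,
              minimum_balance=2, delivery_days=14),
    StockItem("Air filter", "Warehouse A", quantity=10,
              minimum_balance=4, delivery_days=7),
]

to_order = [item.name for item in stock if item.needs_reorder]
```

Storing the delivery time next to the balance makes it possible to order early enough that the part arrives before the minimum is breached again.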

As soon as a consignment of spare parts arrives, we enter the data from the invoice into the directory and indicate the storage location. We immediately see the current balance of these spare parts in the warehouse.

We maintain a separate directory of contacts. It holds the details of suppliers and of the contractors who carry out maintenance.

Certificates and electrical safety clearance groups are attached to the card of each contractor's engineer. When drawing up a schedule, we can see which specialists have the required clearance.
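
A clearance check of this kind could look roughly like the sketch below. It assumes the electrical safety group is a number where a higher group covers the lower ones, and that each certificate has an expiry date; the names, companies and certificate titles are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Engineer:
    name: str
    company: str
    electrical_safety_group: int
    # Certificate name -> date until which it is valid.
    certificates: dict[str, date] = field(default_factory=dict)


def has_clearance(engineer: Engineer, required_group: int,
                  required_certificate: str, on: date) -> bool:
    """Check the safety group and a still-valid certificate for the work date."""
    if engineer.electrical_safety_group < required_group:
        return False
    valid_until = engineer.certificates.get(required_certificate)
    return valid_until is not None and on <= valid_until


# Hypothetical contractor engineers.
engineers = [
    Engineer("I. Ivanov", "CoolService", 4, {"Chiller vendor": date(2020, 12, 31)}),
    Engineer("P. Petrov", "CoolService", 2, {"Chiller vendor": date(2020, 12, 31)}),
    Engineer("S. Sidorov", "PowerFix", 4, {"Chiller vendor": date(2018, 1, 1)}),
]

work_date = date(2019, 6, 1)
cleared = [e.name for e in engineers
           if has_clearance(e, 3, "Chiller vendor", work_date)]
```

Checking against the planned date of the work, rather than today's date, catches certificates that will expire before the engineer arrives.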

Over the lifetime of MMS, the way we handle maintenance work on site has changed. For example, we added documents with methodological instructions for maintenance. Where a set of operations used to fit into a short checklist, everything is now spelled out in detailed instructions: how to prepare, what conditions are needed, and so on.

alexdropp will show how the whole process works now, using an example.

How maintenance works in MMS

Once upon a time, work was documented after the fact. We simply carried out the maintenance and afterwards signed a certificate of completion. This is what 99% of server rooms do, but in our experience it is not enough. In order not to forget anything, we first issue a work permit. This is a document describing the work and the conditions for carrying it out. Every maintenance job and repair in our system begins with it. Here is how it works:

  1. We look at the nearest planned work in the maintenance schedule.
  2. We create a new work permit. We select the maintenance contractor, who manages the process and coordinates the work with us. We indicate where and when the work will take place, and select the type of equipment and the program we will follow.
  3. After saving the card, we move on to the details. We indicate the performer and check whether they have clearance for the required work. If there is no clearance, the field is highlighted in red and the work permit cannot be issued.
  4. We specify the equipment. Depending on the type of work, the maintenance program prescribes preliminary measures, for example: order fuel to the site, schedule an introductory briefing for engineers and notify colleagues. The list of measures appears automatically, but we can add our own items, so it is quite flexible.
  5. We save the work permit, send an email to the approver and wait for the answer.
  6. By the time the engineer arrives, we print the work permit directly from the system.
  7. The work permit includes a checklist of operations from the maintenance program. The work supervisor in the data center oversees the maintenance and ticks off the items.

    For a while, a short checklist was enough. Then we introduced methodological instructions, or MOP (method of procedure). With the help of such a document, any certified engineer can perform an inspection of any equipment. 

    Everything is described in as much detail as possible, down to templates for notification letters and weather conditions. The MOP is also available as a printed document.

    According to the Uptime Institute standards, there must be an MOP for every operation. This is quite a large amount of documentation. Based on our experience, we recommend developing the MOPs gradually, for example, one per month.

  8. After the work, the engineer issues a certificate of completion. We scan it and attach it to the card along with scans of the other documents: the work permit and the MOP.
  9. In the work permit, we mark the work as performed.
  10. The maintenance history is saved in the equipment card.
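
The steps above can be read as a small state machine around the work permit. The sketch below is one possible way to model it, not the system's actual implementation: the status names, method names and sample data are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    AWAITING_APPROVAL = "awaiting approval"
    APPROVED = "approved"
    COMPLETED = "completed"


@dataclass
class WorkPermit:
    equipment_id: str
    program: str
    performer: str
    performer_cleared: bool        # result of the clearance check (step 3)
    checklist: dict[str, bool]     # operation -> ticked off (step 7)
    attachments: list[str] = field(default_factory=list)
    status: Status = Status.DRAFT

    def _expect(self, status: Status) -> None:
        if self.status is not status:
            raise ValueError(f"expected {status.value}, got {self.status.value}")

    def submit(self) -> None:
        """Send the permit to the approver; blocked without clearance."""
        self._expect(Status.DRAFT)
        if not self.performer_cleared:
            raise PermissionError(f"{self.performer} has no clearance for this work")
        self.status = Status.AWAITING_APPROVAL

    def approve(self) -> None:
        self._expect(Status.AWAITING_APPROVAL)
        self.status = Status.APPROVED

    def tick(self, operation: str) -> None:
        """Mark one checklist operation as done."""
        self._expect(Status.APPROVED)
        if operation not in self.checklist:
            raise KeyError(operation)
        self.checklist[operation] = True

    def complete(self, act_scan: str) -> None:
        """Attach the scanned certificate and close the permit."""
        self._expect(Status.APPROVED)
        missing = [op for op, done in self.checklist.items() if not done]
        if missing:
            raise ValueError(f"operations not ticked off: {missing}")
        self.attachments.append(act_scan)
        self.status = Status.COMPLETED


# Hypothetical permit walked through the whole process.
permit = WorkPermit(
    "AC-07", "Air conditioner monthly", "I. Ivanov", True,
    {"Change temperature settings": False, "Replace cable quick connectors": False},
)
permit.submit()
permit.approve()
for operation in list(permit.checklist):
    permit.tick(operation)
permit.complete("act_AC-07.pdf")
```

Making each transition check the current status keeps the order of steps from being skipped, for example closing a permit that was never approved or that still has unticked operations.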

We have shown how our system works now. But the work on MMS is not over: several improvements are already planned. For example, we currently store a lot of information as scans. In the future, we plan to make maintenance paperless: connect a mobile application where an engineer can tick the checkboxes and save the information straight to the card.

Of course, there are many ready-made products on the market with similar functions. But we wanted to show that even a small Excel spreadsheet can be developed into a full-fledged product. You can do it yourself or involve contractors; the main thing is the right approach. And it's never too late to start.

Source: habr.com