And demonstrate, or How we passed the Operational Sustainability audit at the Uptime Institute

The head of the operations department climbed into the hatch of the underground fuel storage to show the markings on the solenoid valve.

In early February, our largest Tier III data center, NORD-4, was re-certified by the Uptime Institute (UI) for Operational Sustainability. Today we will tell you what the auditors look at and what results we finished with.

For those who are only getting acquainted with data centers, let's briefly go over the basics. The Tier Standard evaluates and certifies data centers in three stages:

  • project (Design): the package of design documentation is checked. This is where the well-known Tier level is assigned. There are four in total, Tier I–IV, and the last is the highest.
  • constructed object (Facility): the engineering infrastructure of the data center is checked for compliance with the design. The data center is tested under full design load with a variety of scenarios along these lines: one of the UPSs (or generator sets, chillers, precision air conditioners, distribution cabinets, bus ducts, etc.) is taken out of service for maintenance or repair while the city power supply is switched off. A data center of Tier III or above should handle the situation without any impact on the IT load.

    Facility certification is possible only once the data center has passed Design certification.
    NORD-4 received its Design certification in 2015 and Facility in 2016.

  • operation (Operational Sustainability): in fact, the most important and difficult certification. It comprehensively assesses the operator's processes and competencies for maintaining and managing a data center of an established Tier level (to pass Operational Sustainability, you must already hold a Facility certificate). After all, without properly built operating processes and a qualified team, even a Tier IV data center can turn into a useless building full of very expensive equipment.

    It also has its own levels: Bronze, Silver and Gold. At the recent recertification, we finished with 88.95 out of 100 possible points, which is Silver. We fell just 1.05 points short of Gold.
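The arithmetic behind that result can be sketched in a few lines. The Gold threshold of 90 follows from the figures above (88.95 + 1.05); the Silver and Bronze thresholds used here are assumptions for illustration, and the function names are hypothetical.

```python
from decimal import Decimal

# Gold threshold follows from the article: 88.95 + 1.05 = 90.
# The Silver and Bronze thresholds are assumptions for illustration.
LEVELS = [
    (Decimal("90"), "Gold"),
    (Decimal("80"), "Silver"),  # assumed
    (Decimal("70"), "Bronze"),  # assumed
]


def level_for(score: Decimal) -> str:
    """Return the first (highest) level whose threshold the score reaches."""
    for threshold, name in LEVELS:
        if score >= threshold:
            return name
    return "not certified"


def points_to_gold(score: Decimal) -> Decimal:
    """Points still missing to reach the Gold threshold (never negative)."""
    return max(Decimal("0"), LEVELS[0][0] - score)


score = Decimal("88.95")
print(level_for(score), points_to_gold(score))  # Silver 1.05
```

Decimal is used instead of float so that the 1.05-point gap comes out exactly.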

How do you check that the necessary processes are in place and working as they should? And how do you do it in two days, which is how long re-certification takes? In short, certification rests on a painstaking comparison of what is written in the regulations, the stories of “how everything works” and actual practice. Information about the latter comes from walks around the data center and conversations with its engineers: “face-to-face confrontations,” as we affectionately call them. Here is what the auditors look at.

Team

First of all, UI auditors check whether the data center has enough maintenance personnel. They take the staffing table and the duty schedule and selectively compare them with shift reports and access control system (ACS) data to make sure that the required number of engineers really was on site that day.

Auditors also look closely at the number of overtime hours. Overtime sometimes happens when a large client moves in and dozens of racks need to be installed at once. At such moments, colleagues from other shifts come to the rescue and are paid extra for it.

NORD-4 has 7 engineers on shift: 6 duty engineers and one senior engineer. They watch the monitoring system 24x7, meet clients, help with equipment installation and handle other routine requests. This is the first line of customer support. Their duties also include logging emergencies and escalating them to specialized engineers. The engineering infrastructure itself is watched by separate people, the infrastructure duty officers, also 24x7.
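The schedule-versus-ACS comparison described above can be sketched roughly as follows. This is a minimal illustration, not the auditors' actual method: the data shapes, the function name `staffing_check`, the engineer names and the date are all hypothetical; only the shift size of 7 comes from the text.

```python
from datetime import date


def staffing_check(schedule, acs_entries, day, required):
    """Compare the duty schedule for `day` with access-control records.

    schedule:    {date: set of engineer names planned for the shift}
    acs_entries: iterable of (date, engineer name) badge-in records
    Returns (enough_staff, scheduled engineers with no badge-in record).
    """
    planned = schedule.get(day, set())
    present = {name for d, name in acs_entries if d == day}
    on_site = planned & present
    return len(on_site) >= required, sorted(planned - present)


# Hypothetical sample data: seven engineers planned, six badged in.
day = date(2020, 2, 3)
schedule = {day: {f"engineer-{i}" for i in range(1, 8)}}
acs = [(day, f"engineer-{i}") for i in range(1, 7)]

ok, absent = staffing_check(schedule, acs, day, required=7)
print(ok, absent)  # False ['engineer-7']
```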

NORD's production director and site manager tells the auditors how many people are working on the site right now.

Once the numbers are sorted out, the team's qualifications are checked. Auditors randomly review engineers' files to make sure they have the diplomas, certificates and permits (such as electrical safety certificates) required for the position.

They also check how we train staff. Our training system for new duty engineers impressed the UI specialists during the last audit. We run a three-month training course for them as a paid internship, during which we introduce them to the processes and working principles of our data center.

Engineers who are already working should also undergo regular training, including training for emergency situations. Auditors will definitely check the curricula and materials of such training, and they will also quiz a sample of engineers. No one will be asked to switch over to a diesel generator set (DGU), but they will ask you to explain step by step what to do when the city power supply goes down. Following the audit, we will bring all training programs to a single standard so that they do not differ between teams.

We show the auditors a rest room for shift engineers.

Operation and maintenance of engineering systems 

In this large section of the audit, we show that all engineering equipment and systems receive regular maintenance on the schedule recommended by vendors, that the warehouse holds the necessary spare parts and accessories, that maintenance contracts with contractors are valid, and that every operation on equipment has its own procedures and algorithms for different cases.

MMS. When you operate dozens of UPSs, DGUs, air conditioners and other equipment, you need to collect all the information about this estate somewhere. We keep roughly the following dossier on each piece of equipment:

  • model and serial number;
  • marking;
  • technical characteristics and settings;
  • place of installation;
  • dates of production, commissioning, expiration of the warranty;
  • service contracts;
  • maintenance schedule and history;
  • and the whole "case history" - breakdowns, repairs.
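As a rough illustration, the dossier fields listed above map naturally onto a small record type. This is a sketch under assumptions, not the schema of our MMS: the class names, field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class HistoryEvent:
    day: date
    kind: str  # e.g. "maintenance", "breakdown", "repair"
    note: str = ""


@dataclass
class EquipmentRecord:
    model: str
    serial_number: str
    marking: str
    location: str
    settings: dict = field(default_factory=dict)
    produced: Optional[date] = None
    commissioned: Optional[date] = None
    warranty_until: Optional[date] = None
    service_contracts: list = field(default_factory=list)
    history: list = field(default_factory=list)  # the "case history"

    def under_warranty(self, today: date) -> bool:
        """True while the warranty expiry date has not passed."""
        return self.warranty_until is not None and today <= self.warranty_until

    def breakdowns(self) -> list:
        """All recorded breakdown events, in insertion order."""
        return [e for e in self.history if e.kind == "breakdown"]


# Hypothetical example record.
ups = EquipmentRecord(
    model="UPS-X", serial_number="SN-0001", marking="UPS-1.1",
    location="UPS room 1", warranty_until=date(2021, 1, 1),
)
ups.history.append(HistoryEvent(date(2020, 1, 15), "breakdown", "fan failure"))
print(ups.under_warranty(date(2020, 2, 1)), len(ups.breakdowns()))
```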

How and where to collect all this information is up to each data center operator. UI does not restrict the tools. It can be plain Excel (that is where we started) or a home-grown Maintenance Management System (MMS), as we have now. By the way, our service desk, warehouse accounting, network log and monitoring are also home-grown.

Every piece of equipment has a “personal file” like this.

We showed our practices in this area using, among other things, the example of this infrastructure UPS (pictured), which donated one of its parts to a UPS serving the IT load. Under the standard, only infrastructure equipment, which powers air conditioners and emergency lighting but not the IT load, may act as such a “donor”.

The auditors then asked to see the corresponding ticket in the Service Desk and the UPS profile in MMS.

SPTA (spare parts, tools and accessories). For timely maintenance and emergency repairs of engineering equipment, we keep our own stock of spare parts and accessories. There is a common warehouse for large spare parts and small cabinets with spare parts in the engineering rooms (so that you do not have to run far).

In the photo: we are checking the availability of spare parts for the diesel generator sets. We counted 12 filters, then checked the figure against the data in MMS.

A similar exercise was done in the main warehouse, where large spare parts are stored: compressors, controllers, automation, fans, steam humidifiers and hundreds of other items. We copied down a sample of markings and looked them up in MMS.

Spare parts inventory data. Red marks what is missing and needs to be restocked.
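The count-versus-records exercise can be sketched as a simple reconciliation. This is an illustration only: the function name, the data shapes and every number except the 12 counted filters are hypothetical, and the shortage list loosely corresponds to the rows shown in red.

```python
def reconcile(counted, recorded, minimum):
    """Compare a physical count with the inventory records.

    counted:  {part: quantity found on the shelf}
    recorded: {part: quantity listed in the inventory system}
    minimum:  {part: quantity that should always be in stock}
    Returns (mismatches, shortages), both dicts keyed by part name.
    """
    mismatches = {}
    shortages = {}
    for part in sorted(set(counted) | set(recorded) | set(minimum)):
        found = counted.get(part, 0)
        listed = recorded.get(part, 0)
        if found != listed:
            mismatches[part] = (found, listed)
        needed = minimum.get(part, 0)
        if found < needed:
            shortages[part] = needed - found
    return mismatches, shortages


# 12 filters as in the article; the other figures are made up.
counted = {"DGU filter": 12}
recorded = {"DGU filter": 12, "fan": 2}
minimum = {"DGU filter": 10, "fan": 3}

print(reconcile(counted, recorded, minimum))
# ({'fan': (0, 2)}, {'fan': 3})
```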

Preventive maintenance. In addition to routine maintenance and repairs, UI recommends preventive maintenance. It helps turn a potential accident into a planned repair. For each parameter, we set threshold values in monitoring. If they are exceeded, those responsible receive alarms and take the necessary action. For example, we:

  • check electrical panels with a thermal imager to catch defects in electrical installations in time: poor contact, local overheating of a conductor or circuit breaker;
  • monitor the vibration and current consumption of the refrigeration system pumps, which lets us spot deviations in time and plan part replacements without haste;
  • analyze the fuel and oil of the DGUs and compressors;
  • test the glycol concentration in the refrigeration system.

Graph of pump vibration before and after repair.
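The threshold logic described above can be sketched like this. It is a minimal illustration, not our monitoring system: the function name, parameter names, units and limit values are all hypothetical.

```python
def check_thresholds(readings, limits):
    """Return alarm messages for readings outside their limits.

    readings: {parameter: measured value}
    limits:   {parameter: (low, high)}; None means no bound on that side
    Parameters without configured limits never raise an alarm.
    """
    alarms = []
    for name, value in readings.items():
        low, high = limits.get(name, (None, None))
        if low is not None and value < low:
            alarms.append(f"{name}: {value} is below {low}")
        elif high is not None and value > high:
            alarms.append(f"{name}: {value} is above {high}")
    return alarms


# Hypothetical limits and readings.
limits = {
    "pump-1 vibration, mm/s": (None, 4.5),
    "glycol concentration, %": (35, 45),
}
readings = {
    "pump-1 vibration, mm/s": 6.1,
    "glycol concentration, %": 40,
}
for alarm in check_thresholds(readings, limits):
    print(alarm)
```

In a real system the alarm would go to the responsible engineer rather than to standard output.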

Work with contractors. Equipment maintenance and repairs are carried out by external contractors. On our side, dedicated specialists in DGUs, air conditioners and UPSs supervise their work. They check that the contractors have the necessary tools and materials for the repair or maintenance, professional certificates, electrical safety certificates and approvals. They also sign off on all completed work.

This is what the acceptance checklist for air conditioner maintenance looks like.

At the pass office, we check whether passes were issued to the contractors' authorized representatives, whether they came in for maintenance at the stated time and whether they have read the rules.

Documentation. Well-established maintenance processes for systems and equipment are only half the battle. Every procedure that a person performs in a data center must be documented. The purpose is simple: nothing should depend on one specific person, and in the event of an accident any engineer should be able to pick up clear instructions and carry out all the operations needed to eliminate it.

UI has its own methodology for such documentation.

For simple and repetitive actions, Standard Operating Procedures (SOP) are written. For example, there are SOPs for switching a chiller on or off and for putting a UPS on bypass.

For maintenance or complex operations, such as replacing the batteries in a UPS, Methods of Procedure (MOP) are created. They may include SOPs. Each type of engineering equipment should have its own MOPs.

Finally, there are Emergency Operating Procedures (EOP): instructions for accidents. A list of specific emergencies is compiled and instructions are written for each. Here is part of that list; each entry describes in detail the signs of the accident, the actions to take, the persons responsible and the persons to notify:

  • shutdown of city power supply: DGU started/did not start;
  • UPS failures; 
  • accidents on the data center monitoring system;
  • overheating of the machine room;
  • leakage of the refrigeration system;
  • failure of network and computing equipment,

and so on.

Compiling such a volume of documentation is a laborious task in itself. It is even more difficult to keep it up to date (by the way, auditors also check this). And most importantly, the staff must know these instructions, work according to them and make improvements if necessary.

Yes, instructions should be available where they might be needed, and not just gather dust in the archives.

Change records in the maintenance regulations for the data center's engineering systems.

During the audit, they also look at the technical documentation for the systems, the as-built and working documentation, and the commissioning certificates.

Marking. During the tour of the data center, the auditors checked the marking everywhere they could reach. Where they could not reach, they used a stepladder :). They looked for it on every switchboard, circuit breaker and valve, and checked that it was unique, unambiguous and consistent with the current as-built diagrams. In the photo below: we are comparing the markings on the solenoid valves with the as-built diagram in the fuel storage pump room.

Everything matched the as-built diagram, but one parameter did not match the local "decorative" axonometric diagram on the wall.
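A marking check of this kind (uniqueness plus agreement with the diagram) can be sketched with plain set operations. The function name and the sample labels are hypothetical; this only illustrates the idea, not the auditors' procedure.

```python
from collections import Counter


def check_markings(field_labels, diagram_labels):
    """Compare labels read off the equipment with labels on the diagram.

    field_labels:   list of labels found on site (may contain repeats)
    diagram_labels: iterable of labels shown on the as-built diagram
    Returns (duplicates, missing_from_diagram, missing_on_site).
    """
    counts = Counter(field_labels)
    on_site = set(field_labels)
    on_diagram = set(diagram_labels)
    duplicates = sorted(label for label, n in counts.items() if n > 1)
    return duplicates, sorted(on_site - on_diagram), sorted(on_diagram - on_site)


# Hypothetical labels for solenoid valves in a pump room.
field = ["SV-1", "SV-2", "SV-2", "SV-4"]
diagram = ["SV-1", "SV-2", "SV-3"]
print(check_markings(field, diagram))
# (['SV-2'], ['SV-4'], ['SV-3'])
```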

Diagrams of the systems located in each room of the data center should also hang on its walls. In an emergency, they help you quickly work out where everything is and make an informed decision. In the photo, for example, is a single-line diagram in the main switchboard room.

The diagrams were checked for accuracy as follows: the auditors named the marking of an element on the diagram and asked to be shown it “in the flesh”.

Here the auditor photographs the trip unit settings of the main switchboard breakers in order to compare them later with the values on the paper and electronic copies of the single-line diagram. On one breaker, QF-3, the setting did not match the paper diagram, and we earned a penalty point. Now two engineers will check the markings on the single-line diagrams against the actual equipment.

This is not all that the auditors checked in terms of maintenance processes. Here is what else was on the agenda:

  • monitoring system. Here we earned extra karma for good visualization, a mobile application and situational screens placed in the data center corridors. We have written separately, in detail, about how our monitoring is set up.

    This is the MCC, with visual information on the state of the main engineering systems of NORD-4 and of our other data centers on the site.

  • life cycle planning of engineering equipment;
  • capacity management;
  • budgeting (we have touched on it in another post);
  • accident analysis procedure;
  • the process of acceptance, commissioning and testing of equipment (we have written about the tests separately).

What else UI looked at

Security and access control. The audit also checks the operation of safety and security systems. For example, the auditor tried to enter one of the rooms he had no access to, and then checked whether the attempt was recorded in the ACS and whether the guards had been notified (spoiler: it was, and they had).

In our data centers, if the door to any room stays open for more than two minutes, an alert is triggered at the guard post. To test this, the auditors propped one of the doors open with a fire extinguisher. Admittedly, we never got to hear the sirens: the guards saw on the video cameras that something was wrong and reached the "crime scene" first.
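The two-minute door rule can be sketched as follows. Only the two-minute limit comes from the text; the event format, the function name and the door names are assumptions for illustration, not how our ACS works.

```python
from datetime import datetime, timedelta

MAX_OPEN = timedelta(minutes=2)  # the limit mentioned in the article


def doors_held_open(events, now):
    """Return doors that have been open for longer than MAX_OPEN at `now`.

    events: iterable of (timestamp, door, state) in chronological order,
            where state is "open" or "closed".
    """
    opened_at = {}
    for ts, door, state in events:
        if state == "open":
            opened_at.setdefault(door, ts)  # keep the first opening time
        else:
            opened_at.pop(door, None)       # door closed, reset the timer
    return sorted(door for door, ts in opened_at.items() if now - ts > MAX_OPEN)


# Hypothetical events: one door propped open, another opened and closed.
t0 = datetime(2020, 2, 3, 10, 0, 0)
events = [
    (t0, "ventilation corridor", "open"),
    (t0 + timedelta(seconds=30), "server hall", "open"),
    (t0 + timedelta(seconds=60), "server hall", "closed"),
]
print(doors_held_open(events, t0 + timedelta(minutes=3)))
# ['ventilation corridor']
```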

Order and cleanliness. The auditors check for dust and stray equipment boxes, and ask how often the premises are cleaned. Here, for example, the auditors took an interest in an unidentified object in the ventilation corridor. It is a unit from the ventilation system that was about to be installed in its place. They still asked us to label it.

Still on the topic of order in the data center: these cabinets in the main switchboard room hold all the tools needed for emergency work on the equipment.

Location. The data center is also assessed on its surroundings: whether there are military bases, airports, rivers, volcanoes or other hazardous sites nearby. In the photo, we are showing that no nuclear power plants or oil storage facilities have sprung up around the data center since the last certification in 2017. Over there, though, the new NORD-5 data center is being built, and it too will have to go through all the stages of Uptime Institute Tier III certification. But that is a completely different story.

Source: habr.com
