Monitoring in the data center: how we changed the old BMS to the new one. Part 3

We continue our story about how we replaced the BMS in our data centers (Part 1, Part 2). We did not simply swap one vendor's solution for another; we developed the system from scratch to suit our requirements. In this final part, we share the results of the work and some interesting solutions that may be useful to you.

New interface

Here, as the saying goes, it is better to see it once.

Racks.

Let's analyze the differences.

  • First, it is both good-looking and convenient. Notice how easy it has become to track the load on each PDU module ("bank") and the combined load of paired modules fed from the two inputs. On the rack model in the new BMS, we immediately see that the lower paired PDU banks are overloaded (the total current is above the allowable 16 A, which raises a “blue” notification), while the upper ones are underloaded. If one of the inputs is lost, the entire load is transferred to the other, and the lower module that remains energized will then trip on overload. To prevent this, the data center support service warns the client in advance and sends a recommendation on how to redistribute the load.
  • Easy addition of equipment. In the new BMS, virtual sensors for the sums of module currents and for rack power are already part of the standard rack templates and are created automatically when a PDU is added to a rack. In the old BMS, they had to be created manually and then dragged onto the map, which increased the likelihood of human error.
  • Unlimited scope for creativity. We now have no restrictions when creating virtual sensors and can build arbitrary mathematical models over any variables. This lets us create complex virtual sensors (previously it was only possible to sum values) and better analyze the statistics and trends of engineering systems, which improves the quality of decisions on system tuning, equipment replacement and resource management.
  • Clear interface. The new interface has no clutter of icons; fans spin and switches “click”. Most convenient of all is the ability to show the status of PDU lines A and B inside the racks. We tried to do something similar in the old BMS, but the number of overlapping icons per square centimeter of the map made us abandon the idea.
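
The paired-bank check described in the first bullet can be sketched as a small virtual sensor. This is only an illustration under stated assumptions: the class and function names are hypothetical and are not taken from the BMS itself; only the 16 A limit comes from the text.

```python
from dataclasses import dataclass

BANK_LIMIT_A = 16.0  # allowable current per bank, as stated in the article


@dataclass
class BankPair:
    """Matching banks on the feed A and feed B PDUs of one rack (hypothetical model)."""
    name: str
    current_a: float  # amps measured on the feed A bank
    current_b: float  # amps measured on the feed B bank

    @property
    def total(self) -> float:
        # Virtual sensor: the load the surviving bank would carry
        # if the other input were lost.
        return self.current_a + self.current_b

    @property
    def overloaded(self) -> bool:
        return self.total > BANK_LIMIT_A


def rack_report(pairs):
    """Map each bank pair to 'overload' or 'ok'."""
    return {p.name: ("overload" if p.overloaded else "ok") for p in pairs}


pairs = [BankPair("upper", 3.0, 2.5), BankPair("lower", 9.0, 8.5)]
print(rack_report(pairs))  # {'upper': 'ok', 'lower': 'overload'}
```

Neither lower bank exceeds 16 A on its own, which is why the sum, rather than the individual readings, has to be monitored.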

Now it's pleasing to the eye:

Server.

Fragment of the main switchboard.

Ventilation control board.

And the new BMS can be decorated for the New Year 🙂

One page - mutual understanding at a glance and without a TOR

For a very long time, we had wanted to implement one more feature in the BMS: to place the main parameters of the data center on a single page, so that one glance at the screen is enough to assess the status of the main systems. However, we did not fully understand what it should look like.

Before development of the new BMS began, we toured a dozen data centers in the Netherlands. One of the goals was to see examples of such a page.

But no data center showed it to us: in some it did not exist, in some “it was being developed right now”, and in some it was a “big trade secret”. As a result, our terms of reference (TOR) for the new BMS contained no exact description of this page, which was very important to us.

In the end, we came up with it literally “on the go”. At that moment, I had to advise colleagues in the data center remotely. Flipping through BMS pages on a phone in search of scattered data was very inconvenient, so the first version of One page was sketched on a napkin. The developers implemented it from a photo of that sketch.

Following the example of our cautious Dutch colleagues, we will not show the final version of our main page, especially since each data center is unique and there is no point in copying it. But we will describe the two main principles behind it:

  1. It is a table designed for a smartphone screen held vertically (or a monitor in portrait orientation), with all important information shown on one screen. Above the table is a "summary" of active incidents; placing the two together turned out to be most convenient in a vertical format.
  2. The arrangement of cells in the table mirrors the architecture of the data center (physical or logical). We abandoned alphabetical ordering of systems, tempting as it was at first. The sequence follows the visual associations of the data center staff, as if they were physically walking through all the rooms and systems. This makes information easier to find.

As a result, all the key characteristics of the data center are now grouped on one screen of the smartphone or monitor of the responsible engineer and manager, and they are tied to the physical and logical topography of the data center.
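
To make the second principle concrete, here is a minimal sketch of a table whose rows follow a walk-through order rather than the alphabet, with each cell showing the worst status among its systems. The room names, system names and status levels are all hypothetical; the real page is not shown in this article.

```python
# Severity order used to roll several systems up into one cell (assumed levels).
SEVERITY = {"ok": 0, "warning": 1, "alarm": 2}

# Cells listed in the order an engineer would walk the site, not alphabetically.
# All names here are made up for illustration.
LAYOUT = [
    ("Power intake", ["main_switchboard", "diesel_generators"]),
    ("UPS room", ["ups_a", "ups_b"]),
    ("Data hall 1", ["cooling", "racks"]),
]


def worst(statuses):
    """Return the most severe status from an iterable of status names."""
    return max(statuses, key=SEVERITY.__getitem__)


def one_page(layout, system_status):
    """Build the table rows, top to bottom, in layout order."""
    return [
        (room, worst(system_status[s] for s in systems))
        for room, systems in layout
    ]
```

Because the order lives in a single list, changing how the page reads means reordering `LAYOUT`, not touching the roll-up logic.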

Here is a photo of that very first draft; of course, this version was later rethought and refined.


Acknowledgment and summary of incidents

Let's talk about another concept that was new to us and that emerged from the project to update the monitoring system.

Acknowledgment is a term, rather unusual for us, that the developer of the new BMS suggested using. It means the operator confirms having seen the incident and takes responsibility for resolving it.

The word has taken root, and now we "acknowledge" incidents.

The algorithm included in the basic version of the new BMS did not suit us. In effect, acknowledgments were just comments in the event log: resolved incidents did not disappear from the log, and accepted (“acknowledged”) incidents were not separated from new ones.

As a result, we developed a window called the "summary", in which:

  1. Only active incidents and devices in service mode are displayed (no commercial “blue” notifications).
  2. New and accepted incidents are clearly separated.
  3. It is shown who accepted each incident.

The duty algorithm in the new BMS is as follows:

  1. New incidents appear in the summary and wait for acknowledgment. They must not stay in this section for long: the duty officer responsible for the equipment must take over the incident immediately.
  2. The employee takes over the incident by clicking the checkmark on the right. Since every employee works under a unique account, the summary automatically shows who accepted the incident. A comment can be added if necessary.
  3. The incident moves to the "Acknowledged" section, so the other duty officers and the manager understand that the responsible employee is dealing with it.
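
The steps above can be sketched as a small state model. This is an assumption-laden illustration, not the BMS implementation: the class, method and field names are hypothetical, and the sketch only captures the rules stated in the text (new versus acknowledged, who accepted, resolved incidents leave the summary).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    incident_id: int
    text: str
    accepted_by: Optional[str] = None  # None means the incident is still new
    comment: str = ""


class Summary:
    """Holds only active incidents, split into new and acknowledged."""

    def __init__(self):
        self._active = {}

    def raise_incident(self, incident_id, text):
        self._active[incident_id] = Incident(incident_id, text)

    def acknowledge(self, incident_id, user, comment=""):
        incident = self._active[incident_id]
        if incident.accepted_by is not None:
            raise ValueError(f"already accepted by {incident.accepted_by}")
        incident.accepted_by = user  # taken from the unique user account
        incident.comment = comment

    def resolve(self, incident_id):
        # Resolved incidents disappear from the summary entirely.
        del self._active[incident_id]

    def new(self):
        return [i for i in self._active.values() if i.accepted_by is None]

    def acknowledged(self):
        return [i for i in self._active.values() if i.accepted_by is not None]
```

Rejecting a second acknowledgment is a design choice made for this sketch so that each incident has exactly one responsible person; the article does not say how the real system handles that case.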

An example of a summary window with a new and already acknowledged message.

Combining the summary window with the One page table gives us a complete BMS main screen, on which you can immediately see:

  • the state of the main data center systems;
  • the presence of new unprocessed incidents;
  • the presence of accepted incidents and who exactly is resolving them.

Browser access and pop-up alerts on your phone

The web interface, accessible from any device anywhere in the world, is a striking contrast to the old "thick" client, which was completely closed to outside users.

The old approach brought a whole set of inconveniences, from difficulties in organizing remote work for monitoring staff to the need to install “thick” clients from distribution kits on staff workstations in the data center.

Now every page in the BMS has a unique address, which lets you share direct links not only to a page or device but also to specific graphs and reports.

Access to the system now goes through LDAP authentication against Active Directory, which improves its security.

Mobility is now a key factor in the quality of the duty engineers' work. Besides watching the monitoring screens in the duty room, engineers make rounds and perform routine work outside it and, thanks to the BMS main screen optimized for mobile devices, never lose sight of what is happening in the data halls.

The quality of control is also improved by work chats. They speed up workflows by letting you tie the duty engineers' correspondence to the BMS. For example, we use the Teams application, which lets us conduct internal correspondence and receive all BMS messages on the phone as pop-up push notifications, so the duty engineer does not have to watch the phone screen constantly.

Push notification on a smartphone screen.

And this is how notifications look in the Teams app.

At the same time, pop-up notifications are configured to report only new incidents, which minimizes distraction. The staff know: if a Teams push notification appears on the smartphone screen, they need to go to the BMS page and accept the incident. Messages about incident resolution are tracked on the BMS page itself.
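
The "push only new incidents" rule can be sketched as a filter placed in front of the chat integration. The article does not describe how the BMS actually talks to Teams, so everything here is an assumption: the event field names are hypothetical, and the output is a plain JSON body with a `text` field of the kind simple chat webhooks commonly accept. Nothing is sent over the network.

```python
import json


def push_payload(event):
    """Return a JSON body for a push message, or None if the event must not be pushed.

    `event` is a dict with hypothetical keys: 'type', 'device', 'message'.
    """
    if event["type"] != "incident_raised":
        # Resolutions and other events stay on the BMS page only.
        return None
    return json.dumps({"text": f"[BMS] {event['device']}: {event['message']}"})


raised = {"type": "incident_raised", "device": "PDU-12", "message": "bank current above 16 A"}
resolved = {"type": "incident_resolved", "device": "PDU-12", "message": "back to normal"}

print(push_payload(raised))
print(push_payload(resolved))  # None
```

Keeping the filter separate from the delivery code makes the notification policy easy to change without touching the integration.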

The photo shows the BMS interface on a smartphone.

Summing up

With the cost of updating the BMS from our old vendor comparable to the cost of developing a new system from scratch (about $100), the difference in functionality turned out to be enormous. We got a flexible system optimized for our business tasks and processes. We have also achieved significant savings on ongoing support and system upgrades.

But, of course, there were difficulties. 

  • Firstly, we underestimated the number of changes that had to be made to the base version of the new BMS and did not meet the planned deadlines. For us this was not critical, since we hedged our bets until the very end and kept working on the old system, and the process was creative and complex, so it sometimes went slower than expected. We could also always see that our developer was making every effort to achieve the best result. Still, the story turned out to be very long, and our key specialists spent much more time and effort on it than planned.
  • Secondly, it took several rounds of testing to fine-tune the redundancy algorithm for virtual machines and communication channels. Initially there were failures both on the BMS side and in the configuration of the virtual machines and the network. This debugging also took time. Fortunately, the contractor was given a test site in the form of a cloud service, where all settings and new features were tested first.
  • Thirdly, the final system turned out to be harder for the end user to edit. Previously the map was a background image (a graphic file) with icons that were easy to change or move; now it is a complex animated graphical interface that requires certain editing skills.

The radical update of our BMS can be called our most important project of the past year, and it will seriously affect the quality of operational management of our sites in the future.

Of course, we didn’t throw out the old hardware server; we “lightened” it instead: we cleared it of thousands of “commercial” virtual sensors and PDUs and left only a few dozen of the most critical devices, such as diesel generator sets, UPSs, air conditioners, pumps, and leak and temperature sensors. In this mode it has regained its former speed and can serve as a "backup of the backup". By the way, after removing the PDUs from the old BMS, we have freed up about 1000 now unnecessary licenses. Do you happen to know what to do with them?

Source: habr.com
