Monitoring in the data center: how we changed the old BMS to the new one. Part 2


In the first part, we talked about why we decided to replace the old BMS in our data centers with a new one. And not just replace it, but develop one from scratch to suit our own requirements. In the second part, we tell how we did it.

Market analysis

Taking into account the wishes from the first part and the decision not to upgrade the existing system, we wrote a TOR to search for a solution on the market and sent inquiries to several large companies that specialize in building industrial SCADA systems.

The very first answers showed that the leaders of the monitoring systems market still run mostly on bare-metal servers, although migration to the cloud has already begun in this segment. As for redundancy of virtual machines, nobody supported this option. Moreover, we got the feeling that none of the notable developers on the market even understood why redundancy was needed: "the cloud does not fall" was the most frequent answer. In effect, we were being offered to place the monitoring of a data center in a cloud physically located in that same data center.

Here we need to make a small digression about the process of choosing a contractor. Price, of course, matters, but during any tender for a complex project, at the stage of dialogue with suppliers, you begin to feel which of the candidates is more interested in the work and more capable of carrying it out.

This is especially noticeable on complex projects. 

By the nature of their clarifying questions about the TOR, contractors can be divided into those interested simply in selling (with the standard pressure from a sales manager) and those interested in developing the product: the ones who hear and understand the customer, make constructive changes to the TOR even before the final choice (despite the real risk of improving someone else's TOR and losing the tender), and are, in the end, simply ready to take on a professional challenge and make a good product.

All this made us pay attention to a relatively small local developer, the Sunline group of companies, which responded to most of our requirements right away and was ready to meet all our needs regarding the new BMS.

Risks

While the big players were still trying to understand what we wanted and corresponding with us through pre-sales specialists, the local developer arranged a meeting at our office with its technical team present. At this meeting, the contractor once again demonstrated its willingness to participate in the project and, most importantly, explained how the required system would be implemented.

Before the meeting, we saw two risks of working with a team that does not have the resources of a large national or international company behind it:

  1. The specialists could overestimate their capabilities and simply fail to cope: for example, they might use overly complex software or design impracticable redundancy algorithms.
  2. After the project is completed, the project team might fall apart, putting product support at risk.

To minimize these risks, we invited our own development specialists to the meeting. They questioned the potential contractor's employees in detail about what the system is built on, how redundancy was planned to be implemented, and other matters in which we, as an operations department, are not competent enough.

The verdict was positive: the architecture of the existing BMS platform is modern, simple and reliable and can be improved, and the proposed redundancy and synchronization scheme is logical and workable.

That took care of the first risk. The second was eliminated once the contractor confirmed they were ready to hand over the system's source code and documentation to us, and by the choice of Python as the programming language, which our specialists know well. This guaranteed that we could maintain the system on our own, without difficulties or a long period of training for employees, should the developer company ever leave the market.

An additional advantage of the platform was that it runs in Docker containers: the core, the web interface and the product database all work in this environment. This approach brings many benefits, including faster deployment of the solution compared to the "classic" approach and easy addition of new devices to the system. The "everything together" principle simplifies implementation as much as possible: unpack the system and you can start operating it right away.

With such a solution it is easier to make copies of the system, and you can improve it and roll out upgrades in a separate environment without stopping the operation of the solution as a whole.
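To give a sense of what this looks like in practice, below is a minimal sketch of bringing up a separate staging copy of such a containerized stack with the Docker SDK for Python. The image names, network name and port mapping here are hypothetical and purely illustrative; they are not the contractor's actual images or settings.

# Minimal sketch: a staging copy of a containerized BMS-like stack.
# Image, network and container names are hypothetical, for illustration only.
import docker

client = docker.from_env()

# An isolated network so the staging copy does not interfere with production.
client.networks.create("bms-staging", driver="bridge")

services = [
    {"image": "bms/core:latest", "name": "bms-core-staging"},
    {"image": "bms/web:latest", "name": "bms-web-staging", "ports": {"8080/tcp": 8081}},
    {"image": "postgres:15", "name": "bms-db-staging",
     "environment": {"POSTGRES_PASSWORD": "change-me"}},
]

for spec in services:
    client.containers.run(
        spec["image"],
        detach=True,
        name=spec["name"],
        network="bms-staging",
        ports=spec.get("ports"),
        environment=spec.get("environment"),
    )

A copy like this can be brought up next to the production instance, used to try out an upgrade, and then deleted without touching the running system.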

After both risks were minimized, the contractor provided a quotation in which all the most important parameters of the BMS system were worked out.

Redundancy

The new BMS system had to be in the cloud, on a virtual machine. 

No hardware, no servers, and none of the inconveniences and risks associated with that deployment model: the cloud solution allowed us to get rid of them for good. It was decided that the system would run in our cloud at two data center sites, in St. Petersburg and Moscow: two fully functional instances operating in active-standby mode, with access for all authorized specialists.

The two instances back each other up, providing full redundancy of both computing power and data transmission channels. Additional safeguards are also configured, including backups of the data and channels, of the systems and virtual machines as a whole, and a separate monthly backup of the database (the most valuable resource in terms of management and analysis).

Note that redundancy as a feature of the BMS solution was developed specifically at our request. The redundancy scheme itself looked like this:

[Diagram: redundancy scheme of the two BMS instances]
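To make the active-standby idea a bit more tangible, here is a minimal sketch of how a standby instance might decide to take over. The health-check URL, the thresholds and the promotion step are our assumptions for illustration, not the contractor's actual implementation.

# Minimal sketch of an active-standby health check.
# The /health endpoint, URL, thresholds and promotion step are hypothetical.
import time
import urllib.request

ACTIVE_HEALTH_URL = "https://bms-spb.example.internal/health"  # primary site
FAILURES_BEFORE_FAILOVER = 3
CHECK_INTERVAL_SEC = 30

def is_alive(url, timeout=5.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def promote_standby():
    # In the real system this would switch polling and alerting to the
    # standby instance; here it is only a placeholder.
    print("Primary unreachable, promoting the standby instance")

failures = 0
while True:
    failures = 0 if is_alive(ACTIVE_HEALTH_URL) else failures + 1
    if failures >= FAILURES_BEFORE_FAILOVER:
        promote_standby()
        break
    time.sleep(CHECK_INTERVAL_SEC)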

Support

The most important point for the effective operation of a BMS solution is technical support. 

Everything is simple here: the new system would cost us 35,000 rubles per month for the "response within 8 hours" SLA, i.e. 35,000 x 12 = 420,000 rubles per year. The first year is free.

For comparison, maintaining the old BMS through the vendor cost about $18,000 a year, and the amount grew with every new device added! At the same time, the company did not provide a dedicated manager; all interaction went through a sales manager, who was interested in us as a potential buyer, with the corresponding emphasis in handling our requests.

For less money, we got full product support, with an account manager who would take part in product development, a single point of contact, and so on. Support became an order of magnitude more flexible thanks to direct access to the developers for prompt adjustments to any aspect of the system, integration via API, and more.

Updates

According to the proposed quotation, in the new BMS all updates are included in the cost of support, i.e. they do not require additional payment. The exception is the development of additional functionality beyond what is specified in the TOR.

The old system involved paying both for updates of free third-party software (such as Java) and for bug fixes. There was no way to opt out: without updates, the system as a whole "slowed down" because of outdated versions of its internal components.

And, of course, it was impossible to update the software without buying a support package.

Flexible approach

Another fundamental requirement concerned the interface. We wanted access to it through a web browser from anywhere, without requiring an engineer to be physically present at the data center. In addition, we wanted an animated interface so that the dynamics of the infrastructure would be more visual for the engineers on duty.

Also, the new system had to support formulas for calculating virtual sensors in the engineering systems, for example, for optimal distribution of electrical power among equipment racks. For this, all the usual mathematical operations had to be applicable to sensor readings.
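As an illustration of the idea, here is a minimal sketch of such virtual sensors in Python. The sensor names, the readings and the formulas are made-up examples, not the actual configuration of our BMS.

# Minimal sketch of "virtual sensors" computed by formulas over real sensor
# readings. Sensor names and formulas are hypothetical, for illustration only.
readings = {
    "rack12_phase_a_kw": 2.0,   # per-phase power meters on one rack's PDU
    "rack12_phase_b_kw": 1.5,
    "rack12_phase_c_kw": 1.5,
    "rack12_limit_kw": 7.0,     # contracted power limit for the rack
}

def rack_total_power(r):
    """Virtual sensor: total rack power as the sum of the three phases."""
    return r["rack12_phase_a_kw"] + r["rack12_phase_b_kw"] + r["rack12_phase_c_kw"]

def rack_power_headroom(r):
    """Virtual sensor: remaining headroom against the rack's power limit."""
    return r["rack12_limit_kw"] - rack_total_power(r)

print(rack_total_power(readings))     # 5.0 kW
print(rack_power_headroom(readings))  # 2.0 kW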

Further, access to the SQL database was required, with the ability to extract from it the necessary data on equipment operation, namely, all monitoring records for two thousand devices and two thousand virtual sensors, which together generate approximately 20 thousand variables.
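For example, a query of roughly this kind could pull a day's worth of readings for a single device. The table and column names are hypothetical, since the real schema belongs to the contractor's platform; SQLite stands in here for the production database.

# Minimal sketch of pulling monitoring records from the BMS database.
# Table and column names are hypothetical; SQLite stands in for the real DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " device_id TEXT, variable_name TEXT, value REAL, recorded_at TEXT)"
)
conn.execute(
    "INSERT INTO measurements VALUES (?, ?, ?, datetime('now'))",
    ("pdu-rack12-a", "phase_a_current_amps", 7.4),
)

# All of the last day's readings for one device, oldest first.
rows = conn.execute(
    "SELECT device_id, variable_name, value, recorded_at"
    " FROM measurements"
    " WHERE device_id = ? AND recorded_at >= datetime('now', '-1 day')"
    " ORDER BY recorded_at",
    ("pdu-rack12-a",),
).fetchall()

for device_id, variable_name, value, recorded_at in rows:
    print(device_id, variable_name, value, recorded_at)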

A rack inventory module was also needed, giving a graphical view of the placement of devices in each rack, unit by unit, with a calculation of the total weight of the hardware, a library of devices, and detailed information about each element.
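A minimal sketch of the data behind such a module might look like this; the field names and figures are hypothetical, for illustration only.

# Minimal sketch of a rack inventory: which units a device occupies and the
# total weight per rack. Field names and values are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Device:
    model: str
    start_unit: int     # lowest rack unit the device occupies
    height_units: int   # size in units (1U, 2U, ...)
    weight_kg: float

@dataclass
class Rack:
    name: str
    total_units: int = 47
    devices: List[Device] = field(default_factory=list)

    def total_weight(self) -> float:
        return sum(d.weight_kg for d in self.devices)

    def free_units(self) -> int:
        return self.total_units - sum(d.height_units for d in self.devices)

rack = Rack("A-12")
rack.devices.append(Device("1U server", start_unit=10, height_units=1, weight_kg=14.5))
rack.devices.append(Device("2U storage", start_unit=11, height_units=2, weight_kg=28.0))
print(rack.total_weight(), rack.free_units())  # 42.5 kg, 44 free units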

Coordination of TOR and signing of the contract

By the time we needed to start work on the new system, the correspondence with the "big" companies was still very far from discussing the cost of their proposals, so we compared the quotation we had received with the cost of upgrading the old BMS (see the first part), and it turned out to be more attractive both in price and in meeting our requirements.

The choice has been made.

After choosing a contractor, the lawyers began to draw up the contract, while the technical teams on both sides polished the TOR. As you know, a detailed and competent technical specification is the basis for the success of any work. The more specifics in the TOR, the less disappointment of the "but that's not what we wanted" kind.

I will give two examples of the level of detail of requirements in the TOR:

  1. The data center duty engineers are authorized to add new devices to the BMS; most often these are PDUs. In the old BMS this required the "administrator" level, which also allowed changing the variable settings of all devices, and these functions could not be separated. That did not suit us. In the basic version of the new platform the scheme was similar. We stated in the TOR right away that we wanted these roles separated: only an authorized employee should change settings, while the duty engineers should still be able to add devices. This scheme was accepted for implementation.
  2. In any standard BMS there are three typical categories of notifications: RED means respond immediately, YELLOW means keep watching, BLUE is informational. We have traditionally used blue notifications to track when business parameters are exceeded, such as a customer exceeding a rack's power limit. In our case this type of notification was intended for managers and was of no interest to the operations service, but in the old BMS it regularly cluttered the list of active incidents and interfered with operational work. We considered the very logic and color coding of notifications successful and kept it; however, the TOR specifically stated that the "blue" notifications should, without distracting the duty staff, silently fall into a separate section where commercial specialists would deal with them (a minimal sketch of such routing follows this list).
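As an illustration of that second requirement, here is a minimal sketch of such routing. The category names follow the article, but the queues, field names and messages are hypothetical.

# Minimal sketch of routing notifications by severity: RED and YELLOW go to
# the operations incident list, BLUE drops silently into a separate section
# for commercial specialists. Queue and field names are hypothetical.
from collections import deque

active_incidents = deque()        # visible to duty engineers, may trigger alarms
business_notifications = deque()  # reviewed later by commercial specialists

def route(notification):
    severity = notification["severity"]
    if severity in ("RED", "YELLOW"):
        active_incidents.append(notification)
        if severity == "RED":
            print("ALARM:", notification["message"])  # demands an immediate response
    elif severity == "BLUE":
        business_notifications.append(notification)   # no sound, no popup

route({"severity": "RED", "message": "CRAC unit 3 failure"})
route({"severity": "BLUE", "message": "Rack A-12 exceeded contracted power limit"})
print(len(active_incidents), len(business_notifications))  # 1 1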

With a similar degree of detail, we prescribed the formats for plotting graphs and displaying reports, the outlines of the interfaces, the list of devices to be monitored, and many other things.

It was a truly creative effort by three working groups: our own service as the customer, which dictated its requirements and conditions; the technical specialists on both sides, whose task was to translate these conditions into technical documentation; and the contractor's programmers, who implemented the customer's requirements according to that documentation. As a result, we adapted some of our non-critical requirements to the functionality of the existing platform, and the contractor undertook to add some things specifically for us.

Parallel operation of two systems

It's time for implementation. In practice, this meant that we gave the contractor the opportunity to deploy a BMS prototype in our virtual cloud and provided network access to all the devices that needed monitoring.

However, the new system was not yet ready for operation. At this stage it was important for us to keep monitoring in the old system while at the same time giving the new system access to the devices. You cannot properly build a system without seeing the devices in it, and those devices, in turn, could not be disconnected from monitoring by the old system.

Whether the devices would survive being polled by two systems at the same time was unclear without actual testing. There was a chance that double simultaneous polling would lead to devices frequently failing to respond, and we would get a flood of device-unavailability errors, which in turn would block the work of the old monitoring system.

The network department ran virtual routes from the BMS prototype deployed in the cloud to the devices, and we got the following results:

  • devices connected via the SNMP protocol practically never dropped their connections because of simultaneous polling,
  • devices connected via Modbus TCP protocol gateways did have problems, which were solved by a reasonable reduction in their polling frequency (see the sketch after this list).
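To illustrate the idea of slowing down polling for the problematic devices, here is a minimal sketch of a scheduler with per-protocol intervals. The intervals, device names and the poll placeholder are made up; the real poller is part of the contractor's platform.

# Minimal sketch of a poller that uses a longer interval for devices behind
# Modbus TCP gateways than for SNMP devices. Intervals and device names are
# hypothetical; the real poller is part of the contractor's platform.
import time

POLL_INTERVAL_SEC = {"snmp": 30, "modbus_tcp": 120}  # slower for gateway devices

devices = [
    {"name": "pdu-rack12-a", "protocol": "snmp"},
    {"name": "chiller-1", "protocol": "modbus_tcp"},
]

def poll(device):
    # Placeholder for the actual SNMP GET / Modbus read.
    print(f"polling {device['name']} over {device['protocol']}")

next_poll = {d["name"]: 0.0 for d in devices}

for _ in range(3):  # a few scheduler ticks for demonstration
    now = time.monotonic()
    for d in devices:
        if now >= next_poll[d["name"]]:
            poll(d)
            next_poll[d["name"]] = now + POLL_INTERVAL_SEC[d["protocol"]]
    time.sleep(1)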

And then we watched a new system being built before our eyes: devices already familiar to us appeared in it, but in a different interface, one that was convenient, fast, and accessible even from a phone.

We will talk about what happened as a result in the third part of our article.

Source: habr.com
