Load balancing in Openstack (Part 2)

In the last article we talked about our attempts to use Watcher and presented a test report. We periodically run such tests on balancing and other critical functions of a large enterprise or carrier cloud.

The high complexity of the problem to be solved may require several articles to describe our project. Today we are publishing the second article in a series dedicated to balancing virtual machines in the cloud.

Some terminology

VMware introduced the DRS (Distributed Resource Scheduler) utility to load-balance the virtualization environment it developed and offered.

According to searchvmware.techtarget.com/definition/VMware-DRS:
"VMware DRS (Distributed Resource Scheduler) is a utility that balances computing workloads against available resources in a virtual environment. The utility is part of a virtualization suite called VMware Infrastructure.

With VMware DRS, users define the rules for distributing physical resources among virtual machines (VMs). The utility can be configured for manual or automatic control. VMware resource pools can easily be added, removed, or reorganized. If desired, resource pools can be isolated between different business units. If the workload on one or more virtual machines changes dramatically, VMware DRS redistributes the virtual machines among the physical servers. If the overall workload decreases, some physical servers may be temporarily shut down and the workload consolidated."

Why do we need balancing?


In our opinion, DRS is a mandatory feature of a cloud, although that does not mean it should be used always and everywhere. Depending on the purpose and needs of the cloud, the requirements for DRS and for balancing methods may differ. There may be situations where balancing is not needed at all, or is even harmful.

To better understand where and for which clients DRS is needed, let's consider their goals and objectives. Clouds can be divided into public and private. Here are the main differences between these clouds and customer goals.

Private clouds / Large enterprise clients

Main criterion and goals of the operator:
  • providing a reliable service or product.

Service requirements:
  • reliability at all levels and in all elements of the system;
  • guaranteed performance;
  • prioritization of virtual machines into several categories;
  • information and physical data security;
  • SLA and 24/7 support.

Client features:
  • a very wide range of applications;
  • legacy applications inherited by the company;
  • complex architectures customized for each client;
  • affinity rules;
  • software running non-stop in 24/7 mode;
  • on-the-fly backup tools;
  • predictable, cyclic client load.

Implications for architecture:
  • geoclustering;
  • centralized or distributed storage;
  • a redundant backup system.

Balancing goals:
  • uniform load distribution;
  • maximum application responsiveness;
  • minimum delay when balancing;
  • balancing only when clearly necessary;
  • taking part of the equipment out of service for preventive maintenance.

Public clouds / Small and medium business, individuals

Main criterion and goals of the operator:
  • reducing the cost of services in order to compete in the market.

Service requirements:
  • maximum simplicity of the service;
  • relatively simple services;
  • responsibility for the data lies with the client;
  • VM prioritization is not required;
  • information security at the level of typical services, with responsibility on the client;
  • crashes may occur;
  • no SLA, quality is not guaranteed;
  • email support;
  • backup is not required.

Client features:
  • typical applications: network balancing, Apache, web, VPN, SQL;
  • the application can be stopped for a while;
  • arbitrary distribution of VMs in the cloud is allowed;
  • backup is the customer's responsibility;
  • with a large number of clients, the average load is statistically predictable.

Implications for architecture:
  • local data storage on compute nodes.

Balancing goals:
  • reducing the cost of the service and the operator's costs;
  • shutting down some resources under low load;
  • energy saving;
  • reducing staff costs.

We draw the following conclusions for ourselves:

For private clouds provided to large corporate customers, DRS can be applied subject to the following restrictions:

  • information security and accounting for affinity rules when balancing;
  • availability of sufficient resources in reserve in case of an accident;
  • virtual machine data is located on a centralized or distributed storage system;
  • separating administration, backup, and balancing procedures in time;
  • balancing only within the aggregate of client hosts;
  • balancing only under strong imbalance, using the most efficient and safest VM migrations (after all, a migration may fail);
  • balancing relatively "calm" virtual machines (migrating "noisy" virtual machines can take a very long time);
  • balancing taking into account the "cost" - the load on the storage system and the network (with customized architectures for large customers);
  • balancing taking into account the individual characteristics of the behavior of each VM;
  • balancing preferably during non-working hours (night, weekends, holidays).
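As an illustration, the restrictions above can be encoded as a filter applied to each candidate migration before it is attempted. This is a minimal sketch with hypothetical data structures (`VM`, `Host`), not our actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical data model for illustration only.
@dataclass
class VM:
    name: str
    host: str
    aggregate: str                      # client host aggregate the VM must stay in
    affinity_group: Optional[str] = None
    noisy: bool = False                 # "noisy" VMs can take very long to migrate

@dataclass
class Host:
    name: str
    aggregate: str
    reserved: bool = False              # spare capacity kept in reserve for accidents

def migration_allowed(vm: VM, target: Host, peers: List[VM]) -> bool:
    """Check the private-cloud restrictions for one candidate move."""
    if target.aggregate != vm.aggregate:    # balance only within the client's aggregate
        return False
    if target.reserved:                     # never consume the accident reserve
        return False
    if vm.noisy:                            # migrate "calm" VMs only
        return False
    # anti-affinity: no VM from the same group may already sit on the target
    if vm.affinity_group is not None:
        for p in peers:
            if (p.name != vm.name and p.affinity_group == vm.affinity_group
                    and p.host == target.name):
                return False
    return True
```

A real implementation would also check the time window (nights, weekends) and the "cost" of the move in storage and network load.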

For public clouds that provide services to small customers, DRS can be applied much more often, with expanded capabilities:

  • lack of information security restrictions and affinity rules;
  • balancing within the cloud;
  • balancing at any reasonable time;
  • balancing of any VM;
  • balancing "noisy" virtual machines (so as not to disturb the rest);
  • virtual machine data is often located on local disks;
  • accounting for the average performance of storage and network (the architecture of the cloud is unified);
  • balancing according to generalized rules and available data center behavior statistics.

Complexity of the problem

The complexity of balancing lies in the fact that DRS must work with a large number of uncertain factors:

  • the behavior of the users of each client's information systems;
  • algorithms for the operation of information system servers;
  • behavior of DBMS servers;
  • load on computing resources, storage systems, network;
  • the interaction of servers with each other in the struggle for cloud resources.

The load that a large number of virtual application and database servers place on cloud resources builds up over time; the consequences can appear, and overlap one another, with unpredictable effect after an unpredictable delay. Even to control relatively simple processes (e.g. motor control, or a domestic hot-water heating system), automatic control systems have to use fairly complex proportional-integral-derivative (PID) feedback algorithms.
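For reference, such a feedback controller can be sketched in a few lines of generic, textbook-style Python; this is not code from our system, and the gains and the one-line "plant" model are purely illustrative:

```python
class PID:
    """Minimal discrete PID controller: u = Kp*e + Ki*sum(e)*dt + Kd*de/dt."""
    def __init__(self, kp, ki, kd, setpoint, dt=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint, self.dt = setpoint, dt
        self._integral = 0.0
        self._prev_error = None

    def update(self, measured):
        error = self.setpoint - measured
        self._integral += error * self.dt
        derivative = (0.0 if self._prev_error is None
                      else (error - self._prev_error) / self.dt)
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative

# Toy example: drive a water-heater temperature toward 60 degrees.
pid = PID(kp=0.5, ki=0.05, kd=0.1, setpoint=60.0)
temp = 20.0
for _ in range(400):
    temp += 0.1 * pid.update(temp)      # crude plant model: applied power heats the water
```

Even this simple loop needs careful gain tuning to avoid oscillation; a cloud with thousands of interacting VMs is a far harder control problem.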


Our task is many orders of magnitude more complicated, and there is a risk that the system will not be able to settle the load at stable values in a reasonable time, even in the absence of external influences from users.


History of our developments

To solve this problem, we decided not to start from scratch, but to build on existing experience, and began working with specialists experienced in this area. Fortunately, our understanding of the problem coincided with theirs completely.

Step 1

We used a system based on neural network technology and tried to optimize our resources based on it.

The interest of this stage lay in testing a new technology; its importance lay in applying a non-standard approach to a problem where, other things being equal, standard approaches had practically exhausted themselves.

We launched the system, and balancing really did start. The scale of our cloud did not let us obtain the optimistic results claimed by the developers, but it was clear that balancing was working.

At the same time, we had quite serious limitations:

  • To train a neural network, it is necessary that virtual machines run without significant changes for weeks or months.
  • The algorithm is designed for optimization based on the analysis of earlier "historical" data.
  • A sufficiently large amount of data and computational resources are required to train a neural network.
  • Optimization and balancing can be done relatively rarely - once every few hours, which is clearly not enough.

Step 2

Since we were not satisfied with this state of affairs, we decided to modify the system, and to do that we had to answer the main question: who are we doing this for?

First, for corporate clients. That means we need a system that works quickly, under those corporate restrictions that only simplify the implementation.

The second question: what is meant by the word "quickly"? After a short debate, we decided that we could start with a response time of 5-10 minutes, so that short-term spikes would not drive the system into resonance.

The third question: how many servers should a balanced group contain?
This issue resolved itself. As a rule, clients do not make host aggregates very large, and this is in line with the recommendations in the article to limit aggregates to 30-40 servers.

In addition, by segmenting the server pool, we simplify the task of the balancing algorithm.

The fourth question: how suitable is a neural network, with its long training process and infrequent balancing? We decided to abandon it in favor of simpler operational algorithms that produce a result in seconds.


A description of a system using such algorithms, and of its shortcomings, can be found here.

We implemented and launched this system and got encouraging results: it now regularly analyzes the cloud's load and makes recommendations for moving virtual machines, which are largely correct. Even now it is clear that we can free up 10-15% of resources for new virtual machines while improving the quality of service for the existing ones.


When an imbalance in RAM or CPU is detected, the system issues commands to the Tionix scheduler to perform a live migration of the required virtual machines. As the monitoring system shows, a virtual machine moved from one (upper) host to another (lower) host and freed memory on the upper host (highlighted with yellow circles), taking it up accordingly on the lower one (highlighted with white circles).
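The decision logic can be illustrated with a simplified sketch: measure the spread of host loads and, if it exceeds a threshold, recommend moving the cheapest suitable VM from the most loaded host to the least loaded one. All names and thresholds here are illustrative; the real system issues live-migration commands through the Tionix scheduler rather than returning tuples:

```python
import statistics

def recommend_migration(hosts, vms, max_stdev=0.10):
    """Suggest one live migration, or None if the pool is balanced enough.

    hosts: {host_name: RAM-used fraction, 0..1}
    vms:   {vm_name: (host_name, RAM fraction the VM occupies)}
    """
    if statistics.pstdev(hosts.values()) <= max_stdev:
        return None                        # imbalance is not "clearly necessary"
    source = max(hosts, key=hosts.get)     # most loaded host
    target = min(hosts, key=hosts.get)     # least loaded host
    # Cheapest safe move: the smallest VM on the source whose move still
    # leaves the target less loaded than the source was.
    for size, vm in sorted((s, v) for v, (h, s) in vms.items() if h == source):
        if hosts[target] + size < hosts[source]:
            return (vm, source, target)
    return None
```

In production, each recommendation would additionally pass the affinity, aggregate, and "noisy VM" filters described earlier before a migration command is issued.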

We are now trying to assess the effectiveness of the current algorithm more precisely and to find possible errors in it.

Step 3

It might seem that we could stop here, wait for the effectiveness to be proven, and close the topic.
But the following clear optimization opportunities push us toward a new stage:

  1. Statistics, for example here and here, show that two- and four-processor systems deliver significantly lower per-processor performance than single-processor systems. This means that users get much less return on the CPU, RAM, SSD, LAN, and FC purchased in multiprocessor systems than in single-processor ones.
  2. Resource schedulers themselves can suffer from serious bugs; here is one of the articles on this topic.
  3. The technologies offered by Intel and AMD for monitoring RAM and cache make it possible to study the behavior of virtual machines and place them so that "noisy" neighbors do not interfere with "calm" virtual machines.
  4. Expanding the set of parameters (network, storage, virtual machine priority, cost of migration, readiness for migration).
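As an illustration of item 3, per-VM cache statistics (of the kind exposed by Intel RDT cache-monitoring counters or AMD's equivalents) could be used to tag VMs as "noisy" or "calm" and spread the noisy ones across hosts. The threshold and the data below are synthetic:

```python
def classify_vms(llc_miss_rate, noisy_threshold=1_000_000):
    """Split VMs into 'noisy' and 'calm' by last-level-cache miss rate (misses/s)."""
    noisy = sorted(vm for vm, rate in llc_miss_rate.items() if rate >= noisy_threshold)
    calm = sorted(vm for vm, rate in llc_miss_rate.items() if rate < noisy_threshold)
    return noisy, calm

def spread_noisy(noisy_vms, hosts):
    """Round-robin the noisy VMs over hosts, so no host holds two of them
    until every host already holds one."""
    return {vm: hosts[i % len(hosts)] for i, vm in enumerate(noisy_vms)}
```

The interesting engineering work is in collecting the counters reliably per VM; the placement itself can stay this simple.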

Summary

The result of our work on improving the balancing algorithms is the unequivocal conclusion that modern algorithms can achieve significant resource optimization (25-30%) in data centers while simultaneously improving the quality of customer service.

An algorithm based on neural networks is certainly an interesting solution, but it needs further development, and due to its current limitations it is not suitable for such problems at the volumes typical of private clouds. At the same time, in public clouds of significant size, the algorithm showed good results.

We will tell you more about the capabilities of processors, schedulers and high-level balancing in the following articles.

Source: habr.com
