Orchestrator and VIP as an HA Solution for a MySQL Cluster

At Citymobil, we use a MySQL database as our main persistent data store. We have several database clusters for various services and purposes.

The continuous availability of the master is a critical indicator of the health of the entire system and its individual parts. Automatic cluster recovery in the event of a master failure greatly reduces incident response time and system downtime. In this article, I will look at a High Availability (HA) scheme for a MySQL cluster based on MySQL Orchestrator and virtual IP addresses (VIPs).

VIP-based HA solution

First, I will briefly describe our data storage system.

We use the classic replication scheme with one writable master and multiple read-only replicas. A cluster may contain an intermediate master - a node that is both a replica and a master for other nodes. Clients access the replicas through HAProxy, which allows for even load distribution and easy scaling. HAProxy is used for historical reasons; we are currently migrating to ProxySQL.

Replication runs in semi-synchronous mode and is based on GTID. This means that at least one replica must log a transaction before it is considered successful. This replication mode provides an optimal balance between performance and data safety in the event of a master node failure. In general, all changes are transferred from the master to the replicas using row-based replication (RBR), but some nodes may use the mixed binlog format.
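
For reference, the mode described above is controlled by a handful of standard server variables. A quick way to inspect them on any node might look like this (host name and credentials are placeholders; the semi-sync variables assume the semisync plugins are installed):

    # Inspect the GTID, binlog format and semi-sync settings described above.
    mysql -h db-node.example -u monitor -p -e "
      SHOW GLOBAL VARIABLES WHERE Variable_name IN
        ('gtid_mode',
         'enforce_gtid_consistency',
         'binlog_format',
         'rpl_semi_sync_master_enabled',
         'rpl_semi_sync_slave_enabled',
         'rpl_semi_sync_master_timeout');"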

The orchestrator periodically updates the state of the cluster topology, analyzes the information it receives and, in case of problems, can start the automatic recovery procedure. The recovery procedure itself is the responsibility of the developer, since it can be implemented in different ways: based on VIPs, DNS, service discovery, or custom mechanisms.
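
Such a custom procedure is typically wired into Orchestrator through its external recovery hooks. A minimal sketch, assuming the PostMasterFailoverProcesses key and the {failedHost}/{successorHost} placeholders available in Orchestrator's configuration, and a hypothetical switch-vip.sh script (jq is used here only to illustrate the resulting JSON; normally the fragment is added to orchestrator.conf.json by hand):

    # Register a hypothetical VIP-switching script as a post-failover hook.
    jq '.PostMasterFailoverProcesses =
          ["/usr/local/bin/switch-vip.sh {failedHost} {successorHost}"]' \
        /etc/orchestrator.conf.json > /tmp/orchestrator.conf.json.new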

One simple way to recover the master when it fails is to use floating VIP addresses.

Things to know about this solution before moving on (a sketch of the basic VIP operations follows the list):

  • A VIP is an IP address that is not tied to a specific physical network interface. When a node goes down or during scheduled maintenance, we can switch the VIP to another resource with minimal downtime.
  • Releasing and assigning a virtual IP address is a cheap and fast operation.
  • Working with a VIP requires either SSH access to the server or special utilities such as keepalived.
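
A minimal sketch of these operations, assuming plain SSH access and the iproute2 tools (the interface name, VIP and host names are placeholders; in practice this logic lives inside the recovery scripts):

    VIP="10.0.0.100/32"
    IFACE="eth0"

    # release the VIP on the old master (requires SSH access to the node)
    ssh root@old-master "ip addr del ${VIP} dev ${IFACE}"

    # assign the VIP on the new master
    ssh root@new-master "ip addr add ${VIP} dev ${IFACE}"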

Let's consider the possible problems with our master and imagine how the automatic recovery mechanism should behave.

Network connectivity to the master is lost, or a hardware-level failure makes the server unavailable

  1. The orchestrator updates the cluster topology: each replica reports that the master is unavailable. The orchestrator starts selecting a suitable replica for the new master role and begins the recovery.
  2. We try to remove the VIP from the old master, but the attempt fails.
  3. The selected replica is promoted to the master role. The topology is rebuilt.
  4. A new network interface with the VIP is added on the new master. Since the VIP could not be removed from the old master, we start sending gratuitous ARP requests periodically in the background. This type of request/reply updates the IP-to-MAC mapping tables on the connected switches, announcing that our VIP has moved. This minimizes the chance of split brain when the old master returns (a sketch of this loop follows the list).
  5. All new connections are immediately routed to the new master. Old connections fail, and the application retries its database calls.
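
A rough sketch of the background announcement from step 4, assuming arping from the iputils package (the interface, address and interval are placeholders):

    VIP="10.0.0.100"
    IFACE="eth0"

    # keep announcing the VIP from the new master so that switches update
    # their IP-to-MAC tables; -U sends unsolicited (gratuitous) ARP
    while true; do
        arping -U -I "${IFACE}" -c 1 "${VIP}"
        sleep 5
    done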

The server is working normally, but there is a failure at the DBMS level

The algorithm is similar to the previous case: the topology is updated and the recovery process starts. Since the server is available, we successfully release the VIP on the old master, transfer it to the new one, and send a few gratuitous ARP requests. The possible return of the old master should not affect the rebuilt cluster or the operation of the application.

Other problems

Failure of replicas or intermediate masters does not lead to automatic actions and requires manual intervention.

The virtual network interface is always added temporarily; after a server reboot the VIP is not assigned automatically. Each database instance starts in read-only mode by default: the orchestrator automatically switches the new master to read-write and tries to set read_only on the old master. These actions are aimed at reducing the likelihood of split brain.
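
In plain SQL, the promotion step roughly amounts to the following statements (a sketch only: the orchestrator performs this itself, and the host names and credentials are placeholders). Note that super_read_only additionally blocks writes from users with the SUPER privilege:

    # new master: allow writes (turning read_only off also turns super_read_only off)
    ssh root@new-master 'mysql -e "SET GLOBAL read_only = OFF;"'

    # old master, if it is reachable: make sure it cannot accept writes
    ssh root@old-master 'mysql -e "SET GLOBAL super_read_only = ON;"'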

Problems may arise during the recovery process, and they should be reported through the orchestrator UI in addition to the standard monitoring tools. We have extended the REST API with this capability (the PR is currently under review).

The general scheme of the HA solution is presented below.

[Diagram: general scheme of the Orchestrator and VIP based HA solution]

Choosing a new master

The orchestrator is smart enough to try to choose the most suitable replica as the new master, based on the following criteria:

  • the replica's lag behind the master;
  • the MySQL version of the master and the replica;
  • the replication type (RBR, SBR, or mixed);
  • location in the same or a different data center;
  • the presence of errant GTIDs, i.e. transactions that were executed on the replica but are missing on the master;
  • custom selection rules are also taken into account.

Not every replica is an ideal candidate for the master role. For example, a replica might be used for data backups, or the server might have a weaker hardware configuration. Orchestrator supports manual promotion rules that let you rank candidates from most preferred to ignored.
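
One way to register such preferences is through orchestrator-client; a hedged sketch (host names are placeholders; check your Orchestrator version for the exact client syntax and the supported rules, typically prefer, neutral, prefer_not and must_not):

    # a well-provisioned replica we would like to promote first
    orchestrator-client -c register-candidate -i db-replica-1:3306 --promotion-rule prefer

    # the backup replica must never become master
    orchestrator-client -c register-candidate -i db-backup-1:3306 --promotion-rule must_not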

Response and Recovery Time

In the event of an incident, it is important to minimize system downtime, so let's look at the MySQL parameters that affect how the orchestrator builds and updates the cluster topology (a sketch of applying these settings follows the list):

  • slave_net_timeout - the number of seconds the replica waits for new data or a heartbeat signal from the master before the connection is considered lost and it reconnects. The lower the value, the faster the replica can determine that communication with the master is broken. We set this value to 5 seconds.
  • MASTER_CONNECT_RETRY - the number of seconds between reconnection attempts. In case of network problems, a low value allows the replica to reconnect quickly and prevents the cluster recovery process from starting unnecessarily. The recommended value is 1 second.
  • MASTER_RETRY_COUNT - the maximum number of reconnection attempts.
  • MASTER_HEARTBEAT_PERIOD - the interval in seconds at which the master sends a heartbeat signal. By default it is half the value of slave_net_timeout.
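
On a replica, applying these values can look roughly like this (the MASTER_RETRY_COUNT value shown is just the server default, included only for completeness; CHANGE MASTER TO requires the replication threads to be stopped):

    mysql -e "
      SET GLOBAL slave_net_timeout = 5;
      STOP SLAVE;
      CHANGE MASTER TO
        MASTER_CONNECT_RETRY = 1,
        MASTER_RETRY_COUNT = 86400,
        MASTER_HEARTBEAT_PERIOD = 2.5;
      START SLAVE;"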

Orchestrator options (an illustrative configuration fragment follows the list):

  • DelayMasterPromotionIfSQLThreadNotUpToDate - if set to true, the master role is not applied to the candidate replica until its SQL thread has applied all outstanding transactions from the relay log. We use this option to avoid losing transactions when all candidate replicas are lagging.
  • InstancePollSeconds - how often the topology is built and updated.
  • RecoveryPollSeconds - how often the topology is analyzed. If a problem is found, topology recovery is started. This is a constant equal to 1 second.
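
An illustrative way to set the configurable options above (the InstancePollSeconds value is an example, not a recommendation; RecoveryPollSeconds is omitted because it is a constant; jq is used only to produce the JSON, which is normally edited by hand):

    jq '.InstancePollSeconds = 5
        | .DelayMasterPromotionIfSQLThreadNotUpToDate = true' \
        /etc/orchestrator.conf.json > /tmp/orchestrator.conf.json.new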

Each cluster node is polled by the orchestrator once every InstancePollSeconds seconds. When a problem is detected, the cluster state is forcibly refreshed, and then the final decision is made to perform the recovery. By experimenting with various database and orchestrator parameters, we managed to reduce the response and recovery time to 30 seconds.

Test bench

We started testing the HA scheme by building a local test bench, and then rolled it out to the test and production environments. The local bench is fully automated and based on Docker; it allows us to experiment with the orchestrator and network configuration, scale the cluster from 2-3 servers to several dozen, and run drills in a safe environment.

During a drill, we choose one of several ways to emulate the problem: kill the master process instantly with kill -9, terminate the process gracefully and stop the server (docker-compose stop), or simulate network problems with iptables -j REJECT or iptables -j DROP (see the sketch after the list below). We expect the following results:

  • the orchestrator will detect problems with the master and update the topology in no more than 10 seconds;
  • the recovery procedure will automatically start: the network configuration will change, the role of the master will be transferred to the replica, the topology will be rebuilt;
  • the new master will become writable, and live replicas will not be lost during the rebuild;
  • data will begin to be written to the new master and replicated;
  • the total recovery time will be no more than 30 seconds.
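
The failure-emulation commands we use on the bench look roughly like this (container and service names are placeholders; running iptables inside a container requires the NET_ADMIN capability):

    # 1. kill the mysqld process abruptly inside the master container
    docker exec mysql-master pkill -9 mysqld

    # 2. stop the master container gracefully
    docker-compose stop mysql-master

    # 3. simulate network problems: reject or silently drop MySQL traffic
    docker exec mysql-master iptables -A INPUT -p tcp --dport 3306 -j REJECT
    docker exec mysql-master iptables -A INPUT -p tcp --dport 3306 -j DROP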

As you know, a system can behave differently in test and production environments due to different hardware and network configurations, differences between synthetic and real load, and so on. Therefore, from time to time we run drills under real conditions, checking how the system behaves when network connectivity is lost or individual parts degrade. In the future, we want to build completely identical infrastructure for both environments and automate its testing.

Conclusions

Keeping the main node of the storage system healthy is one of the primary tasks of the SRE and operations teams. Implementing the orchestrator and a VIP-based HA solution allowed us to achieve the following results:

  • reliable detection of problems with the database cluster topology;
  • automatic and rapid response to incidents associated with the master, which reduces system downtime.

However, the solution has its limitations and disadvantages:

  • scaling the HA scheme to several data centers will require a single L2 network between them;
  • before assigning a VIP on the new master, we need to release it on the old one. The process is sequential, which increases the recovery time;
  • releasing a VIP requires SSH access to the server, or some other way of invoking remote procedures. Since the server or the database is experiencing the very problems that triggered the recovery, we cannot be sure that removing the VIP will succeed. This can lead to two servers carrying the same virtual IP address and, as a result, to split brain.

To avoid split brain, you can use the STONITH method ("Shoot The Other Node In The Head"), which completely isolates or shuts down the problem node. There are other ways to implement cluster high availability: a combination of VIP and DNS, service discovery and proxy services, synchronous replication, and other approaches, each with its own advantages and disadvantages.

I have described our approach to building a MySQL failover cluster. It is easy to implement and provides an acceptable level of reliability under our current conditions. As the system as a whole and the infrastructure in particular develop, this approach will undoubtedly evolve.

Source: habr.com
