Distributed DBMS for the Enterprise

The CAP theorem is the cornerstone of distributed systems theory. Of course, the controversy around it does not subside: the definitions in it are not canonical, and there is no rigorous proof ... Nevertheless, standing firmly on the positions of everyday common sense™, we intuitively understand that the theorem is true.

Distributed DBMS for the Enterprise

The only thing that is not obvious is the meaning of the letter "P". When the cluster is divided, it decides whether not to respond until a quorum is reached, or whether to give the data that is available. Depending on the results of this selection, the system is classified as either CP or AP. Cassandra, for example, can behave this way and that, depending not even on the cluster settings, but on the parameters of each specific request. But if the system is not "P" and it splits, then what?

The answer to this question is somewhat unexpected: the CA cluster cannot split.
What is this cluster that cannot split?

An essential attribute of such a cluster is a shared data storage system. In the vast majority of cases, this means connection via a SAN, which limits the use of CA solutions to large enterprises capable of maintaining a SAN infrastructure. In order for several servers To work with the same data, a cluster file system is required. Such file systems are available in the portfolios of HPE (CFS), Veritas (VxCFS), and IBM (GPFS).

OracleRAC

The Real Application Cluster option first appeared in 2001 with the release of Oracle 9i. In such a cluster, multiple instances Server work with the same database.
Oracle can work both with a clustered file system and with its own solution - ASM, Automatic Storage Management.

Each instance maintains its own journal. The transaction is executed and committed by one instance. In the event of an instance failure, one of the surviving cluster nodes (instances) reads its log and recovers the lost data - due to this, availability is ensured.

All instances maintain their own cache, and the same pages (blocks) can reside in the caches of several instances at the same time. Moreover, if one instance needs some page, and it is in the cache of another instance, it can get it from the “neighbor” using the cache fusion mechanism instead of reading from disk.

Distributed DBMS for the Enterprise

But what happens if one of the instances needs to change the data?

The peculiarity of Oracle is that it does not have a dedicated lock service: if the server wants to lock a row, then the lock record is placed directly on the memory page where the row being locked is located. Thanks to this approach, Oracle is the performance champion among monolithic databases: the lock service never becomes a bottleneck. But in a clustered configuration, this architecture can lead to heavy network traffic and deadlocks.

Once a record is locked, the instance notifies all other instances that the page holding the record has been exclusive-locked. If another instance needs to change a record on the same page, it must wait until the changes to the page are committed, i.e. the change is logged to disk (while the transaction can continue). It may also happen that the page will be modified by several copies in succession, and then when writing the page to disk, you will have to find out who keeps the current version of this page.

Randomly updating the same pages through different RAC nodes results in a dramatic decrease in database performance, to the point that cluster performance can be lower than that of a single instance.

The correct use of Oracle RAC is to physically partition the data (for example, using the partitioned table mechanism) and access each set of partitions through a dedicated node. The main purpose of RAC was not horizontal scaling, but fault tolerance.

If a node stops responding to heartbeat, then the node that detected this first starts the disk voting procedure. If the missing node is not marked here either, then one of the nodes takes over the responsibility for data recovery:

  • “freezes” all pages that were in the cache of the missing node;
  • reads the logs (redo) of the missing node and reapplies the changes recorded in these logs, checking in the process if other nodes have more recent versions of the pages being changed;
  • rolls back pending transactions.

To simplify switching between nodes, Oracle has the concept of a service - a virtual instance. An instance can serve multiple services, and a service can move between nodes. An application instance serving a certain part of the database (for example, a group of clients) works with one service, and the service responsible for this part of the database moves to another node when a node fails.

IBM Pure Data Systems for Transactions

The cluster solution for DBMS appeared in the portfolio of the Blue Giant in 2009. Ideologically, it is the successor of the Parallel Sysplex cluster, built on "normal" hardware. In 2009, the DB2 pureScale product was released, which is a software suite, and in 2012, IBM offers an appliance called Pure Data Systems for Transactions. It should not be confused with Pure Data Systems for Analytics, which is nothing but a rebranded Netezza.

The pureScale architecture is similar at first glance to Oracle RAC: in the same way, several nodes are connected to a common storage system, and each node runs its own instance of the DBMS with its own memory areas and transaction logs. But, unlike Oracle, DB2 has a dedicated locking service, represented by a set of db2LLM* processes. In a cluster configuration, this service is placed on a separate node, which is called coupling facility (CF) in Parallel Sysplex, and PowerHA in Pure Data.

PowerHA provides the following services:

  • lock manager;
  • global buffer cache;
  • area of ​​interprocess communications.

To transfer data from PowerHA to database nodes and back, remote memory access is used, so the cluster interconnect must support the RDMA protocol. PureScale can use both Infiniband and RDMA over Ethernet.

Distributed DBMS for the Enterprise

If a node needs a page, and this page is not in the cache, then the node requests the page in the global cache, and only if it is not there, reads it from disk. Unlike Oracle, the request goes only to PowerHA, and not to neighboring nodes.

If the instance is going to change the row, it locks it in exclusive mode, and the page where the row is located in shared mode. All locks are registered in the global lock manager. When the transaction completes, the node sends a message to the lock manager, which copies the modified page to the global cache, releases locks, and invalidates the modified page in the caches of other nodes.

If the page containing the line being modified is already locked, then the lock manager will read the modified page from the memory of the node that made the changes, release the lock, invalidate the modified page in the caches of other nodes, and give the page lock to the node that requested it.

"Dirty", that is, modified, pages can be written to disk both from a regular node and from PowerHA (castout).

If one of the pureScale nodes fails, recovery is limited to only those transactions that were not yet completed at the time of the failure: pages modified by this node in completed transactions are in the global cache on PowerHA. The node restarts in a stripped-down configuration on one of the cluster servers, rolls back pending transactions, and releases locks.

PowerHA runs on two servers, and the master node replicates its state synchronously. If the primary PowerHA node fails, the cluster continues to work with the standby node.
Of course, if you access the data set through a single node, the overall performance of the cluster will be higher. PureScale can even notice that a certain area of ​​data is processed by one node, and then all locks related to this area will be processed locally by the node without communication with PowerHA. But as soon as the application tries to access this data through another node, centralized lock processing will be resumed.

IBM internal tests on a load of 90% reads and 10% writes, which is very similar to a real industrial workload, show almost linear scaling up to 128 nodes. The test conditions, alas, are not disclosed.

HPE NonStop SQL

The Hewlett-Packard Enterprise portfolio also has its own highly available platform. This is the NonStop platform launched on the market in 1976 by Tandem Computers. In 1997, the company was taken over by Compaq, which in turn merged with Hewlett-Packard in 2002.

NonStop is used to build critical applications such as HLR or bank card processing. The platform is delivered as a software and hardware complex (appliance), which includes computing nodes, a data storage system and communication equipment. The ServerNet network (in modern systems - Infiniband) serves both for exchange between nodes and for access to the data storage system.

In early versions of the system, proprietary processors were used, which were synchronized with each other: all operations were performed synchronously by several processors, and as soon as one of the processors made a mistake, it turned off, and the second one continued to work. Later, the system switched to conventional processors (first MIPS, then Itanium, and finally x86), and other mechanisms began to be used for synchronization:

  • messages: each system process has a “shadow” twin, to which the active process periodically sends messages about its state; if the main process fails, the shadow process starts from the moment specified by the last message;
  • voting: the storage system has a special hardware component that accepts several identical requests and executes them only if the requests match; instead of physical synchronization, processors work asynchronously, and the results of their work are compared only at I / O moments.

Since 1987, a relational DBMS has been running on the NonStop platform - first SQL / MP, and later - SQL / MX.

The entire database is divided into parts, and a separate Data Access Manager (DAM) process is responsible for each part. It provides data logging, caching, and a locking mechanism. Data processing is carried out by executor processes (Executor Server Process) running on the same nodes as the corresponding data managers. The SQL/MX scheduler divides tasks between workers and merges the results. If it is necessary to make consistent changes, the two-phase commit protocol provided by the Transaction Management Facility (TMF) library is used.

Distributed DBMS for the Enterprise

NonStop SQL is able to prioritize processes so that long analytical queries do not interfere with the execution of transactions. However, its purpose is precisely the processing of short transactions, and not analytics. The developer guarantees the availability of the NonStop cluster at the level of five "nines", that is, downtime is only 5 minutes per year.

SAP HANA

The first stable release of the HANA DBMS (1.0) took place in November 2010, and the SAP ERP package switched to HANA from May 2013. The platform is based on purchased technologies: TREX Search Engine (column storage search), P*TIME and MAX DB DBMS.

The word “HANA” itself is an acronym, High performance ANalytical Appliance. This DBMS is delivered in the form of code that can run on any x86 servers, however, industrial installations are only allowed on certified equipment. There are solutions from HP, Lenovo, Cisco, Dell, Fujitsu, Hitachi, NEC. Some Lenovo configurations even allow operation without a SAN - the GPFS cluster on local disks plays the role of a shared storage system.

Unlike the platforms listed above, HANA is an in-memory DBMS, i.e. the primary data image is stored in RAM, and only logs and periodic snapshots are written to disk for recovery in case of a disaster.

Distributed DBMS for the Enterprise

Each node of the HANA cluster is responsible for its part of the data, and the data map is stored in a special component - Name Server, located on the coordinator node. Data between nodes is not duplicated. Lock information is also stored on each node, but the system has a global deadlock detector.

When connecting to a cluster, the HANA client loads its topology and can then directly access any node, depending on what data it needs. If the transaction affects the data of a single node, then it can be executed by this node locally, but if the data of several nodes changes, then the initiating node contacts the coordinator node, which opens and coordinates the distributed transaction, fixing it using an optimized two-phase commit protocol.

The coordinator node is duplicated, so if the coordinator fails, the backup node immediately takes over. But if a node with data fails, then the only way to access its data is to restart the node. As a rule, a backup (spare) server is kept in HANA clusters in order to restart the lost node on it as soon as possible.

Source: habr.com

Buy reliable hosting for sites with DDoS protection, VPS VDS servers 🔥 Buy reliable website hosting with DDoS protection, VPS VDS servers | ProHoster