Building a fault-tolerant solution based on Oracle RAC and AccelStor Shared-Nothing architecture

Many enterprise applications and virtualization systems have their own mechanisms for building fault-tolerant solutions. In particular, Oracle RAC (Oracle Real Application Cluster) is a cluster of two or more Oracle database servers that work together to provide load balancing and fault tolerance at the server/application level. Working in this mode requires shared storage, a role that is usually played by a storage system.

As we have already discussed in one of our articles, a storage system by itself, despite having duplicated components (including controllers), still contains points of failure, primarily in the form of the single data set it holds. Therefore, to build an Oracle solution with increased reliability requirements, the "N servers - one storage system" scheme has to be made more elaborate.


First, of course, we need to decide which risks we are trying to protect against. Within the scope of this article, we will not consider protection against threats like "a meteorite strike", so building a geographically dispersed disaster recovery solution will remain a topic for a future article. Here we will look at the so-called Cross-Rack disaster recovery solution, where protection is built at the level of server cabinets. The cabinets themselves can be located in the same room or in different ones, but usually within the same building.

Each of these cabinets must contain the entire set of hardware and software that allows the Oracle databases to keep working regardless of the state of the "neighbor". In other words, with a Cross-Rack disaster recovery solution we eliminate the following failure risks:

  • Oracle application servers
  • Storage systems
  • Switching systems
  • Complete failure of all equipment in a cabinet:
    • Power failure
    • Cooling system failure
    • External factors (people, nature, etc.)

Redundancy of Oracle servers is inherent to the very principle of Oracle RAC operation and is handled at the application level. Duplicating the switching equipment is not a problem either. But duplicating the storage system is not so simple.

The simplest option is to replicate data from the primary storage system to a backup one, synchronously or asynchronously depending on the storage capabilities. With asynchronous replication, the question of ensuring data consistency with respect to Oracle immediately arises. But even if there is software integration with the application, a failure of the primary storage system will still require manual intervention by administrators to switch the cluster over to the backup storage.

A more complex option is software and/or hardware storage "virtualizers", which eliminate the consistency problems and the need for manual intervention. But the complexity of deployment and subsequent administration, as well as the far from modest cost of such solutions, deters many.

The AccelStor NeoSapphire™ H710 All Flash array, built on a Shared-Nothing architecture, is a perfect fit for scenarios such as Cross-Rack disaster recovery. This model is a dual-node storage system that uses the proprietary FlexiRemap® technology for working with flash drives. Thanks to FlexiRemap®, the NeoSapphire™ H710 is capable of delivering up to 600K IOPS@4K random write and 1M+ IOPS@4K random read, which is unattainable with classic RAID-based storage systems.

But the main feature of the NeoSapphire™ H710 is that its two nodes are separate enclosures, each holding its own copy of the data. The nodes are synchronized through an external InfiniBand interface. Thanks to this architecture, the nodes can be placed in different locations up to 100 m apart, thereby providing a Cross-Rack disaster recovery solution. Both nodes operate fully synchronously. From the host side, the H710 looks like an ordinary dual-controller storage system, so no additional software or hardware options are needed and no particularly complex configuration has to be performed.

If we compare all the above Cross-Rack disaster recovery solutions, then the AccelStor option stands out noticeably from the rest:

                                AccelStor NeoSapphire™        Software or hardware        Replication-based
                                Shared-Nothing architecture   storage "virtualizer"       solution

Availability

Server failure                  No downtime                   No downtime                 No downtime
Switch failure                  No downtime                   No downtime                 No downtime
Storage failure                 No downtime                   No downtime                 Downtime
Failure of the entire cabinet   No downtime                   No downtime                 Downtime

Cost and complexity

Solution cost                   Low*                          High                        High
Deployment complexity           Low                           High                        High

*AccelStor NeoSapphire™ is still an All Flash array, which by definition is not cheap, especially given its doubled storage capacity. However, when comparing the final cost of a solution based on it with similar offerings from other vendors, the cost can be considered low.

The topology for connecting application servers and nodes of the All Flash array will look like this:

[Diagram: connection topology of the application servers and the All Flash array nodes]

When planning the topology, it is also highly recommended to duplicate the management switches and server interconnects.

Hereinafter, we will talk about connecting via Fibre Channel. In the case of iSCSI, everything is the same, apart from the types of switches used and slightly different array settings.

Preparatory work on the array

Used hardware and software

Server and Switch Specifications

Component                                                  Description
Oracle Database 11g servers                                Two
Server operating system                                    Oracle Linux
Oracle Database version                                    11g (RAC)
Processors per server                                      2 x Intel® Xeon® CPU E5-2667 v2 @ 3.30GHz (16 cores)
Physical memory per server                                 128GB
FC network                                                 16Gb/s FC with multipathing
FC HBA                                                     Emulex LPe-16002B
Dedicated public 1GbE ports for cluster management         Intel Ethernet adapter, RJ45
16Gb/s FC switch                                           Brocade 6505
Dedicated private 10GbE ports for data synchronization     Intel X520

AccelStor NeoSapphire™ All Flash Array Specification

Component                    Description
Storage system               NeoSapphire™ high availability model: H710
Image version                4.0.1
Total number of drives       48
Drive size                   1.92TB
Drive type                   SSD
FC target ports              16x 16Gb ports (8 per node)
Management ports             1GbE Ethernet, connected to the hosts via an Ethernet switch
Heartbeat port               1GbE Ethernet, connected directly between the two storage nodes
Data synchronization port    56Gb/s InfiniBand

Before the array can be used, it must be initialized. By default, the management address of both nodes is the same (192.168.1.1). You need to connect to them one by one, set new (now distinct) management addresses and configure time synchronization, after which the management ports can be connected to a single network. Then the nodes are joined into an HA pair by assigning subnets for the interlink connections.


After the initialization is completed, you can manage the array from any node.

Next, we create the necessary volumes and publish them to the application servers.


It is highly recommended to create multiple volumes for Oracle ASM, as this increases the number of targets for the servers, which ultimately improves overall performance (more on queues in another article).

Test configuration

Storage volume name     Volume size
Data01 – Data10         200GB each
Grid01 – Grid06         1GB each
Redo01 – Redo10         100GB each

A few explanations about the array's operating modes and what happens in emergency situations


Each node's data set has a "version number" parameter. After the initial initialization it is the same on both nodes and equal to 1. If for some reason the version numbers differ, the data is always synchronized from the higher version to the lower one, after which the lower version number is brought up to match, meaning the copies are identical again. Reasons why the versions may differ:

  • A scheduled reboot of one of the nodes
  • An accident on one of the nodes due to a sudden shutdown (power, overheating, etc.)
  • A lost InfiniBand connection that makes synchronization impossible
  • An accident on one of the nodes due to data corruption. In this case a new HA group has to be created and the data set fully resynchronized.

In any case, the node that remains online increments its version number by one, so that when the connection with its partner is restored, the data sets will be synchronized.

If the Ethernet connection is broken, the heartbeat temporarily switches to InfiniBand and switches back within 10 seconds once the Ethernet link is restored.

Hosts setup

For fault tolerance and better performance, MPIO support must be enabled for the array. To do this, add the following lines to the /etc/multipath.conf file and then restart the multipath service:

devices {
    device {
        vendor                  "AStor"
        path_grouping_policy    "group_by_prio"
        path_selector           "queue-length 0"
        path_checker            "tur"
        features                "0"
        hardware_handler        "0"
        prio                    "const"
        failback                immediate
        fast_io_fail_tmo        5
        dev_loss_tmo            60
        user_friendly_names     yes
        detect_prio             yes
        rr_min_io_rq            1
        no_path_retry           0
    }
}
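After restarting the service, it is worth checking that each LUN is visible through paths to both array nodes. A minimal sketch using standard multipath-tools commands (the device names in the output will depend on the volumes published from the array):

# Reload multipathd and rebuild the multipath maps
systemctl restart multipathd
multipath -r
# Each published volume should list active paths through both H710 nodes
multipath -ll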

Next, in order for ASM to work with MPIO through ASMLib, change the /etc/sysconfig/oracleasm file as follows and then run /etc/init.d/oracleasm scandisks:

# ORACLEASM_SCANORDER: Matching patterns to order disk scanning
ORACLEASM_SCANORDER="dm"

# ORACLEASM_SCANEXCLUDE: Matching patterns to exclude disks from scan
ORACLEASM_SCANEXCLUDE="sd"
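If ASMLib is used, the published volumes still have to be labeled as ASM disks on one of the cluster nodes. A sketch of this step with purely illustrative multipath device names (the actual aliases depend on user_friendly_names and the bindings file):

# Label the partitions created on the multipath devices (run on one node only)
/etc/init.d/oracleasm createdisk DATA01 /dev/mapper/mpatha1   # device names are examples
/etc/init.d/oracleasm createdisk GRID01 /dev/mapper/mpathb1
/etc/init.d/oracleasm createdisk REDO01 /dev/mapper/mpathc1
# On the remaining cluster nodes it is enough to rescan and list the labels
/etc/init.d/oracleasm scandisks
/etc/init.d/oracleasm listdisks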

Note

If you do not want to use ASMLib, you can rely on UDEV rules instead, which ASMLib itself is based on.

Starting with Oracle Database 12.1.0.2, the ASMFD (ASM Filter Driver) option is also available as part of the installation.
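For reference, a UDEV-based alternative might look like the sketch below. The rule file name, the WWID placeholder and the symlink name are assumptions and have to be adapted to the actual environment (the WWID is reported by multipath -ll):

# /etc/udev/rules.d/99-oracle-asmdevices.rules (illustrative)
KERNEL=="dm-*", ENV{DM_UUID}=="mpath-<WWID-of-Data01>", SYMLINK+="oracleasm/asm-data01", OWNER="grid", GROUP="asmadmin", MODE="0660"

# Apply the rules without a reboot
udevadm control --reload-rules
udevadm trigger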

Make sure that the disks created for Oracle ASM are aligned to the block size the array physically works with (4K); otherwise performance problems are possible. Therefore, create the volumes with the appropriate parameters:

parted /dev/mapper/device-name mklabel gpt mkpart primary 2048s 100% align-check optimal 1
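A quick way to double-check the result, assuming 512-byte logical sectors (so a 4K boundary corresponds to a start sector divisible by 8); device-name is a placeholder as above:

parted /dev/mapper/device-name unit s print             # the Start value must be divisible by 8
parted /dev/mapper/device-name align-check optimal 1    # should report that partition 1 is aligned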

Distribution of databases across created volumes for our test configuration

All storage volumes are mapped to all data ports of the storage system. Every ASM disk group uses Normal redundancy and a 4MB allocation unit size.

Storage volumes      Volume size   ASM disk group   Purpose
Data01 – Data10      200GB each    DGDATA           Data files
Grid01 – Grid03      1GB each      DGGRID1          Grid: CRS and Voting
Grid04 – Grid06      1GB each      DGGRID2          Grid: CRS and Voting
Redo01 – Redo05      100GB each    DGREDO1          Redo log of thread 1
Redo06 – Redo10      100GB each    DGREDO2          Redo log of thread 2
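For reference, the disk groups from the table above could be created from the ASM instance roughly as shown below. This is only a sketch: the ORCL:* disk names assume ASMLib labels, and in practice the groups are usually created through the Grid Infrastructure installer or ASMCA with the same attributes. The remaining groups (DGGRID2, DGREDO1, DGREDO2) follow the same pattern.

sqlplus / as sysasm
CREATE DISKGROUP DGDATA NORMAL REDUNDANCY
  DISK 'ORCL:DATA01','ORCL:DATA02','ORCL:DATA03','ORCL:DATA04','ORCL:DATA05',
       'ORCL:DATA06','ORCL:DATA07','ORCL:DATA08','ORCL:DATA09','ORCL:DATA10'
  ATTRIBUTE 'au_size'='4M', 'compatible.asm'='11.2';
CREATE DISKGROUP DGGRID1 NORMAL REDUNDANCY
  DISK 'ORCL:GRID01','ORCL:GRID02','ORCL:GRID03'
  ATTRIBUTE 'au_size'='4M', 'compatible.asm'='11.2';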

Database settings

  • Block size = 8K
  • Swap space = 16GB
  • Disable AMM (Automatic Memory Management)
  • Disable Transparent Huge Pages (a sketch for this step follows below)
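A sketch of how Transparent Huge Pages can be checked and switched off on Oracle Linux; the exact sysfs path and the way to make the change persistent depend on the kernel (UEK) version in use:

# The value shown in square brackets is the active mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable until the next reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# For a persistent change, add transparent_hugepage=never to the kernel boot parameters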

Other settings

# vi /etc/sysctl.conf
fs.aio-max-nr = 1048576
fs.file-max = 6815744
kernel.shmmax = 103079215104
kernel.shmall = 31457280
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048586
vm.swappiness = 10
vm.min_free_kbytes = 524288    # do not set this if you are using Linux x86
vm.vfs_cache_pressure = 200
vm.nr_hugepages = 57000
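To apply the parameters and confirm that the huge page pool (57000 pages of 2 MiB each, roughly 111 GiB, presumably sized for the SGA) has actually been reserved:

sysctl -p
grep Huge /proc/meminfo    # HugePages_Total should report 57000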

# vi /etc/security/limits.conf
grid    soft    nproc    2047
grid    hard    nproc    16384
grid    soft    nofile   1024
grid    hard    nofile   65536
grid    soft    stack    10240
grid    hard    stack    32768
oracle  soft    nproc    2047
oracle  hard    nproc    16384
oracle  soft    nofile   1024
oracle  hard    nofile   65536
oracle  soft    stack    10240
oracle  hard    stack    32768
soft    memlock 120795954
hard    memlock 120795954

sqlplus "/as sysdba"
alter system set processes=2000 scope=spfile;
alter system set open_cursors=2000 scope=spfile;
alter system set session_cached_cursors=300 scope=spfile;
alter system set db_files=8192 scope=spfile;
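Since AMM is disabled, the SGA and PGA sizes have to be set explicitly so that the instance actually uses the reserved huge pages. A minimal sketch with purely illustrative values, which must be matched to the actual vm.nr_hugepages pool and the physical memory of the servers:

alter system set memory_target=0 scope=spfile;
alter system set sga_target=100G scope=spfile;            -- illustrative, must fit into the huge page pool
alter system set pga_aggregate_target=16G scope=spfile;   -- illustrative
alter system set use_large_pages='ONLY' scope=spfile;     -- refuse to start the instance without huge pages (available since 11.2.0.2)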

Fault tolerance test

For demonstration purposes, HammerDB was used to emulate an OLTP workload. HammerDB configuration:

Number of Warehouses           256
Total Transactions per User    1000000000000
Virtual Users                  256

The result was 2.1M TPM, which is far from the performance limit of the H710 array but is a "ceiling" for the current hardware configuration of the servers (primarily their processors) and their number. The purpose of this test is to demonstrate the fault tolerance of the solution as a whole, not to achieve maximum performance, so we will simply build on this figure.


Test of the failure of one of the storage nodes


The hosts lost some of the paths to the storage and continued to work through the remaining paths to the second node. Performance dropped for a few seconds while the paths were being rebuilt and then returned to normal. There was no service interruption.

Test of the failure of an entire cabinet with all of its equipment


In this case, performance also dipped for a few seconds while the paths were being rebuilt, and then settled at half of the original value, since one application server was excluded from operation. Again, there was no service interruption.

If you need to implement a fault-tolerant Cross-Rack disaster recovery solution for Oracle at a reasonable cost and with little deployment/administration effort, then the combination of Oracle RAC and the AccelStor Shared-Nothing architecture is one of the best options. Instead of Oracle RAC there can be any other cluster-aware software, for example another DBMS or a virtualization system; the principle of building the solution remains the same. The bottom line is an RTO and RPO of zero.

Source: habr.com
