Building a fault-tolerant solution based on Oracle RAC and AccelStor Shared-Nothing architecture
Many enterprise applications and virtualization systems provide their own mechanisms for building fault-tolerant solutions. In particular, Oracle RAC (Oracle Real Application Clusters) is a cluster of two or more Oracle database servers that work together to provide load balancing and fault tolerance at the server/application level. Operating in this mode requires shared storage, which is usually a storage system.
As we discussed in one of our earlier articles, a storage system itself, despite having duplicated components (including controllers), still has points of failure, chiefly in the form of the single data set it holds. Therefore, to build an Oracle solution with stricter reliability requirements, the "N servers - one storage system" scheme needs to be made more elaborate.
First, of course, we need to decide which risks we are trying to insure against. In this article we will not consider protection against threats such as a meteorite strike, so building a geographically dispersed disaster recovery solution remains a topic for a future article. Here we consider the so-called Cross-Rack disaster recovery solution, where protection is built at the level of server cabinets. The cabinets themselves can be located in the same room or in different ones, but usually within the same building.
These cabinets must contain the complete set of hardware and software that allows the Oracle databases to keep working regardless of the state of their "neighbor". In other words, a Cross-Rack disaster recovery solution eliminates the following failure risks:
Failure of the Oracle application servers
Failure of the storage systems
Failure of the switching systems
Complete failure of all equipment in a cabinet:
Power failure
Cooling system failure
External factors (human, nature, etc.)
Duplication of the Oracle servers is inherent in the very operating principle of Oracle RAC and is implemented at the application level. Duplicating the switching facilities is not a problem either. But duplicating the storage system is not so simple.
The easiest option is to replicate data from the primary storage system to a backup one, synchronously or asynchronously, depending on the storage capabilities. With asynchronous replication, the question of ensuring data consistency with respect to Oracle immediately arises. But even if there is software integration with the application, a failure of the primary storage will still require manual intervention by administrators to switch the cluster over to the backup storage.
A more complex option is software and/or hardware storage "virtualizers", which eliminate the consistency problems and the need for manual intervention. But the complexity of deployment and subsequent administration, as well as the downright indecent cost of such solutions, deters many.
For scenarios such as Cross-Rack disaster recovery, the AccelStor NeoSapphire™ H710 All Flash array, built on a Shared-Nothing architecture, is a perfect fit. This model is a dual-node storage system that uses the proprietary FlexiRemap® technology for working with flash drives. Thanks to FlexiRemap®, the NeoSapphire™ H710 can deliver up to 600K IOPS@4K random write and 1M+ IOPS@4K random read, which is unattainable for classic RAID-based storage systems.
But the main feature of the NeoSapphire™ H710 is that its two nodes are separate enclosures, each holding its own copy of the data. The nodes are synchronized over an external InfiniBand link. Thanks to this architecture, the nodes can be placed in different locations up to 100 m apart, which makes a Cross-Rack disaster recovery solution possible. Both nodes operate fully synchronously. From the hosts' side, the H710 looks like an ordinary dual-controller storage system, so no additional software or hardware options are required and no particularly complex configuration needs to be performed.
If we compare all the above Cross-Rack disaster recovery solutions, then the AccelStor option stands out noticeably from the rest:
| | AccelStor NeoSapphire™ Shared-Nothing architecture | Software or hardware storage "virtualizer" | Replication-based solution |
| --- | --- | --- | --- |
| **Availability** | | | |
| Server failure | No Downtime | No Downtime | No Downtime |
| Switch failure | No Downtime | No Downtime | No Downtime |
| Storage failure | No Downtime | No Downtime | Downtime |
| Failure of the entire cabinet | No Downtime | No Downtime | Downtime |
| **Cost and complexity** | | | |
| Solution cost | Low* | High | High |
| Deployment complexity | Low | High | High |
*AccelStor NeoSapphire™ is still an All Flash array and, by definition, does not come cheap, especially since it holds two copies of the data. Nevertheless, when the final cost of a solution based on it is compared with similar offerings from other vendors, the cost can be considered low.
The topology for connecting application servers and nodes of the All Flash array will look like this:
When planning the topology, it is also highly recommended to duplicate the management switches and server interconnects.
Hereinafter we will talk about connectivity via Fibre Channel. With iSCSI everything is the same, adjusted for the types of switches used and slightly different array settings.
Preparatory work on the array
Hardware and software used

Server and switch specifications:

| Component | Description |
| --- | --- |
| Oracle Database 11g servers | Two |
| Server operating system | Oracle Linux |
| Oracle Database version | 11g (RAC) |
| Processors per server | Two 16-core Intel® Xeon® CPU E5-2667 v2 @ 3.30GHz |
| Physical memory per server | 128GB |
| FC network | 16Gb/s FC with multipathing |
| FC HBA | Emulex LPe-16002B |
| Dedicated public 1GbE ports for cluster management | Intel Ethernet adapter, RJ45 |
| 16Gb/s FC switch | Brocade 6505 |
| Dedicated private 10GbE ports for data synchronization | Intel X520 |
AccelStor NeoSapphire™ All Flash array specifications:

| Component | Description |
| --- | --- |
| Storage system | NeoSapphire™ high availability model: H710 |
| Image version | 4.0.1 |
| Total number of drives | 48 |
| Drive size | 1.92TB |
| Drive type | SSD |
| FC target ports | 16x 16Gb ports (8 per node) |
| Management ports | 1GbE Ethernet, connected to the hosts via an Ethernet switch |
| Heartbeat port | 1GbE Ethernet, connected between the two storage nodes |
| Data synchronization port | 56Gb/s InfiniBand cable |
Before the array can be used, it must be initialized. By default, both nodes have the same management address (192.168.1.1). You need to connect to them one at a time, set new (different) management addresses, and set up time synchronization, after which the management ports can be connected to a single network. Then the nodes are joined into an HA pair by assigning subnets for the interlink connections.
After the initialization is completed, you can manage the array from any node.
Next, we create the necessary volumes and publish them to the application servers.
It is highly recommended to create multiple volumes for Oracle ASM, as this will increase the number of targets for servers, which will ultimately improve overall performance (more on queues in another article).
Test configuration
| Volume name | Volume size |
| --- | --- |
| Data01 | 200GB |
| Data02 | 200GB |
| Data03 | 200GB |
| Data04 | 200GB |
| Data05 | 200GB |
| Data06 | 200GB |
| Data07 | 200GB |
| Data08 | 200GB |
| Data09 | 200GB |
| Data10 | 200GB |
| Grid01 | 1GB |
| Grid02 | 1GB |
| Grid03 | 1GB |
| Grid04 | 1GB |
| Grid05 | 1GB |
| Grid06 | 1GB |
| Redo01 | 100GB |
| Redo02 | 100GB |
| Redo03 | 100GB |
| Redo04 | 100GB |
| Redo05 | 100GB |
| Redo06 | 100GB |
| Redo07 | 100GB |
| Redo08 | 100GB |
| Redo09 | 100GB |
| Redo10 | 100GB |
Some notes on the array's operating modes and on what happens in emergency situations
Each node's data set has a "version number" parameter. After initial initialization it is the same on both nodes and equals 1. If for some reason the version numbers differ, the data is always synchronized from the higher version to the lower one, after which the lower version number is brought up to match, i.e. the copies become identical. The versions can differ for the following reasons:
Scheduled reboot of one of the nodes
A crash of one of the nodes due to a sudden shutdown (power loss, overheating, etc.)
Loss of the InfiniBand connection, making synchronization impossible
A crash of one of the nodes due to data corruption. This case requires creating a new HA group and fully resynchronizing the data set.
In any of these cases, the node that remains online increments its version number by one, so that when the connection with its pair is restored, the data sets will be synchronized.
If the Ethernet heartbeat link is broken, the heartbeat temporarily switches to InfiniBand and switches back within 10 s once the Ethernet link is restored.
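The version-number rule described above can be sketched as toy shell logic (purely an illustration, not AccelStor's code; the variable names are invented):

```shell
# Toy illustration of the version-number rule (not AccelStor's actual code).
# The surviving node bumps its version; on reconnect, data flows from the
# higher-numbered copy to the lower one, then the versions are equalized.
node_a=1; node_b=1

# node B goes offline; node A stays online and increments its version
node_a=$((node_a + 1))

# node B comes back; sync from the higher-numbered copy to the lower one
if [ "$node_a" -gt "$node_b" ]; then
  echo "sync: node A -> node B"
  node_b=$node_a
fi
echo "versions equalized: A=$node_a B=$node_b"
```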
Hosts setup
For fault tolerance and better performance, MPIO support must be enabled for the array. To do this, add the appropriate lines to the /etc/multipath.conf file and then restart the multipath service.
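The article omits the actual multipath.conf contents, so the following device section is only a sketch: the vendor/product strings and the policy values are assumptions, and the real ones should be taken from AccelStor's documentation:

```conf
# Sketch of a multipath.conf device section for the array.
# The vendor/product strings and policies below are assumptions,
# not values confirmed by the article; consult vendor documentation.
devices {
    device {
        vendor "AStor"
        product "NeoSapphire"
        path_grouping_policy multibus
        path_selector "round-robin 0"
        path_checker tur
        failback immediate
        no_path_retry 12
    }
}
```

After editing, restart the service (e.g. `service multipathd restart`) and verify the paths with `multipath -ll`.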
Next, for ASM to work with MPIO through ASMLib, edit the /etc/sysconfig/oracleasm file and then run /etc/init.d/oracleasm scandisks:
# ORACLEASM_SCANORDER: Matching patterns to order disk scanning
ORACLEASM_SCANORDER="dm"
# ORACLEASM_SCANEXCLUDE: Matching patterns to exclude disks from scan
ORACLEASM_SCANEXCLUDE="sd"
Note
If you don't want to use ASMLib, you can use UDEV rules, which ASMLib itself is based on.
Starting with Oracle Database version 12.1.0.2, the ASMFD (ASM Filter Driver) option is available for installation.
Be sure to check that the disks created for Oracle ASM are aligned to the block size the array physically works with (4K); otherwise performance problems are possible. It is therefore necessary to create the volumes with the appropriate alignment parameters.
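As an illustration of the alignment requirement (the device path and parted invocation are assumptions, not taken from the article): partitions created with `parted /dev/mapper/data01 mkpart primary 2048s 100%` start at sector 2048, i.e. on a 1 MiB boundary, which is 4K-aligned. The arithmetic can be checked in shell:

```shell
# Check that a partition's starting sector is 4K-aligned (512-byte sectors).
# A partition started at sector 2048 sits at 2048 * 512 = 1 MiB,
# which is a multiple of 4096 bytes and therefore 4K-aligned.
is_4k_aligned() {
  start_sector=$1
  [ $(( (start_sector * 512) % 4096 )) -eq 0 ]
}

is_4k_aligned 2048 && echo "sector 2048: aligned"
is_4k_aligned 63   || echo "sector 63: misaligned"
```

(Sector 63 is the legacy DOS partition start, a classic source of misalignment.)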
In addition, increase the operating-system resource limits for the grid and oracle users:

# vi /etc/security/limits.conf
grid soft nproc 2047
grid hard nproc 16384
grid soft nofile 1024
grid hard nofile 65536
grid soft stack 10240
grid hard stack 32768
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536
oracle soft stack 10240
oracle hard stack 32768
oracle soft memlock 120795954
oracle hard memlock 120795954
Several database instance parameters should also be increased:

sqlplus "/as sysdba"
alter system set processes=2000 scope=spfile;
alter system set open_cursors=2000 scope=spfile;
alter system set session_cached_cursors=300 scope=spfile;
alter system set db_files=8192 scope=spfile;
Fault tolerance test
For demonstration purposes, HammerDB was used to emulate an OLTP workload. HammerDB configuration:
| Parameter | Value |
| --- | --- |
| Number of Warehouses | 256 |
| Total Transactions per User | 1000000000000 |
| Virtual Users | 256 |
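A configuration like this can be scripted through HammerDB's command-line interface (hammerdbcli); the following is a rough sketch, where the connection values are placeholders and the exact dictionary keys should be verified against the HammerDB documentation:

```tcl
# Rough hammerdbcli sketch (TPC-C schema against Oracle).
# Connection values are placeholders, not from the article.
dbset db ora
diset connection instance racdb
diset tpcc count_ware 256
diset tpcc total_iterations 1000000000000
loadscript
vuset vu 256
vucreate
vurun
```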
The result was 2.1M TPM. This is far from the performance limit of the H710 array, but it is the ceiling for the current server hardware configuration (primarily the processors) and server count. The purpose of this test is to demonstrate the fault tolerance of the solution as a whole rather than to achieve maximum performance, so we will simply build on this figure.
Test for failure of one of the nodes
The hosts lost some of their paths to the storage and continued working through the remaining paths to the second node. Performance dropped for a few seconds while the paths were rebuilt and then returned to normal. There was no service interruption.
Cabinet failure test with all equipment
In this case, performance also dipped for a few seconds while the paths were rebuilt, and then returned to half of the original value. The figure halved because one application server was excluded from operation. There was again no service interruption.
If you need to implement a fault-tolerant Cross-Rack disaster recovery solution for Oracle at a reasonable cost and with little deployment and administration effort, Oracle RAC combined with the AccelStor Shared-Nothing architecture is one of the best options. Instead of Oracle RAC there can be any other clustering software, such as another DBMS or a virtualization system; the principle of the solution remains the same. And the bottom line is RTO and RPO both equal to zero.