Whether on-premises or in the cloud, high availability (HA) and disaster recovery (DR) solutions are still required for many critical applications due to cloud outages, system updates, natural disasters, and various operational failures. Unplanned downtime can be very costly, but the complexity of ensuring 99.99% availability for critical solutions like SAP HANA can be very taxing for IT staff. SAP ERP systems are usually configured and deployed by IT architects and turned over to less technical IT admins for day-to-day operations. In the event of a failure, the skilled IT experts needed to find and correct the issue are often not available. They are also unavailable to write the scripts needed to configure DR protection, trigger a failover, or restore normal operation once the issue is corrected.
How does one best ensure the HA of an SAP HANA database? If something happens to the server supporting the in-memory database, you’ll need a way to fail over to another server, populate memory with a replicated instance of the database, and update all the references within the SAP landscape to point to SAP HANA on new infrastructure, all within seconds. That calls for a failover clustering strategy, which, particularly in a Linux environment, poses challenges of its own. Creating a failover cluster using open source-based tools such as SUSE HAE or Red Hat Enterprise Linux Pacemaker requires a significant investment in complicated scripting, and there are many opportunities for things to go wrong if there are unchecked errors in the script or if the script is out of date. If you want to add a third or fourth system to your infrastructure (as a third node in a failover cluster or as a remote system for DR support in case the entire HA failover cluster goes offline), the scripting grows even more complicated, and so does the chance that an inadvertent scripting error will cause something to go wrong.
An alternative approach to the kind of scripting required by open-source tools is to rely on a third-party product designed to provide clustering support and failover automation. Such solutions can simplify the creation of a multi-node Linux cluster and provide various levels of support to monitor the overall health of the cluster. Some tools will monitor only the heartbeat of the cluster nodes, but others will provide full-stack monitoring support. The tools providing full-stack support offer obvious advantages insofar as they can monitor everything from the underlying hardware to storage-, network-, and even application-level events. More sophisticated failover cluster management applications can even provide proactive responses to event alerts. An intelligent solution capable of detecting and automatically restarting stalled processes can preemptively mitigate issues that could otherwise degrade application performance or trigger a chain of faults that eventually causes a failover to the secondary cluster node. By resolving such problems locally, this kind of solution avoids unnecessary failovers and keeps your operations running without interruption.
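The "restart first, fail over only if that does not help" behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the logic of any particular clustering product: the class name `RecoveryPolicy`, the `max_local_restarts` parameter, and the `restart` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryPolicy:
    """Toy policy: attempt local restarts before escalating to a failover."""

    max_local_restarts: int = 2  # hypothetical tuning knob
    log: List[str] = field(default_factory=list)

    def handle_failure(self, restart: Callable[[], bool]) -> str:
        """Call `restart` until it succeeds or the retry budget is spent.

        `restart` stands in for whatever restarts the stalled process and
        returns True if the process is healthy afterwards.
        """
        for attempt in range(1, self.max_local_restarts + 1):
            self.log.append(f"restart attempt {attempt}")
            if restart():
                return "recovered_locally"
        # Local recovery did not work, so hand over to the secondary node.
        self.log.append("escalating to failover")
        return "failover"


if __name__ == "__main__":
    policy = RecoveryPolicy(max_local_restarts=2)
    print(policy.handle_failure(lambda: False))
    print(policy.log)
```

The point of the sketch is the ordering of responses: a failover is the last resort, reached only after cheaper local recovery has been tried and has failed.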
In addition to a solution designed to facilitate clustering, automated failover, and full-stack monitoring, your HA infrastructure for SAP HANA will need a data replication solution that can ensure that the data in memory on the secondary node of the failover cluster is an exact replica of the data that had been in memory on the primary node. If the secondary node is suddenly called into service, it can be brought online and made operational in a minimal amount of time because the data in the secondary instance of SAP HANA is identical to the data that had been in use on the primary node.
SAP provides a solution called SAP HANA System Replication Services, which automatically replicates data from one node of your failover cluster to another. If the two nodes are in separate but geographically proximate datacenters or cloud availability zones (which they should be for an HA configuration), SAP HANA System Replication Services will replicate any updates to the database synchronously. A transaction won’t be considered committed until SAP HANA System Replication Services verifies that it has been written to both the active and secondary instances of the SAP HANA database. Synchronous replication keeps the database instances consistent, maintaining an exact copy of the data in memory across the two nodes. This precise synchronization allows you to enable read operations on the SAP HANA database on the secondary node. Using the secondary node for read-intensive tasks spares the primary node from potential performance degradation and spreads CPU load more evenly across both nodes.
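To make the synchronous commit rule concrete, here is a minimal Python sketch. It is a simulation only and does not use any SAP API: `Node`, `commit_sync`, and the `reachable` flag are hypothetical names. It also assumes the simplest possible policy, in which a commit is rejected whenever the secondary cannot confirm the write; how a real system behaves when the secondary is unreachable depends on its configuration.

```python
class Node:
    """Toy database node that records transactions it has written."""

    def __init__(self, name: str, reachable: bool = True) -> None:
        self.name = name
        self.reachable = reachable
        self.log: list[str] = []

    def write(self, txn: str) -> bool:
        """Persist the transaction; return False if the node is unreachable."""
        if not self.reachable:
            return False
        self.log.append(txn)
        return True


def commit_sync(primary: Node, secondary: Node, txn: str) -> bool:
    """Report a commit only when BOTH nodes have written the transaction."""
    if not primary.write(txn):
        return False
    if not secondary.write(txn):
        primary.log.pop()  # toy rollback so the two logs never diverge
        return False
    return True


if __name__ == "__main__":
    a, b = Node("primary"), Node("secondary")
    print(commit_sync(a, b, "txn-1"), a.log == b.log)
```

Because the commit is acknowledged only after both writes succeed, the two logs are identical at every acknowledged commit, which is the property that lets the secondary take over without losing data.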
As an aside, it’s worth noting that SAP HANA System Replication Services can also replicate the SAP HANA database from a primary or secondary node to a remote DR infrastructure. In that scenario, though, one would typically replicate asynchronously because of the latency between sites. The primary and secondary instances of SAP HANA would commit the transaction without waiting for confirmation from the DR site that the transaction had been committed there. If all the nodes of your failover cluster were to go offline, you’d still be able to bring up and run your SAP HANA database from your DR site, but the database on the DR site might be lacking transactions that had been committed on the primary and secondary instances of SAP HANA before they went offline (but had not yet been replicated to the DR site). That’s a risk that always accompanies an asynchronous replication mechanism and one reason the recovery point objectives (RPOs) associated with a DR configuration are never as low as those associated with an HA configuration relying on synchronous replication.
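The exposure that asynchronous replication creates can be illustrated with a small simulation. Again, this is a sketch under assumptions rather than a model of SAP's implementation: `AsyncReplicator`, `ship`, and `lost_if_primary_fails_now` are hypothetical names, and network latency is represented simply by transactions waiting in a queue.

```python
from collections import deque


class AsyncReplicator:
    """Toy model: commits are acknowledged locally and shipped to DR later."""

    def __init__(self) -> None:
        self.primary_log: list[str] = []
        self.dr_log: list[str] = []
        self.pending: deque[str] = deque()  # committed but not yet at DR

    def commit(self, txn: str) -> None:
        """Commit on the primary without waiting for the DR site."""
        self.primary_log.append(txn)
        self.pending.append(txn)

    def ship(self, count: int = 1) -> None:
        """Deliver up to `count` queued transactions to the DR site."""
        for _ in range(min(count, len(self.pending))):
            self.dr_log.append(self.pending.popleft())

    def lost_if_primary_fails_now(self) -> list[str]:
        """Transactions the DR site would be missing after a total outage."""
        return list(self.pending)


if __name__ == "__main__":
    rep = AsyncReplicator()
    for txn in ("t1", "t2", "t3"):
        rep.commit(txn)
    rep.ship(2)
    print(rep.lost_if_primary_fails_now())
```

Whatever is still in the queue at the moment of failure is the data at risk, which is why an asynchronous DR link cannot offer the same RPO as a synchronous HA pair.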
However, SAP HANA System Replication Services do not inherently include a mechanism for orchestrating a failover to the secondary SAP HANA instance. That’s where you need to look for a failover cluster management solution specifically designed to work with SAP HANA and SAP HANA System Replication Services. As you already know all too well, each component in an SAP landscape needs to be brought online in a carefully choreographed sequence. An automated failover cluster management system designed to work with SAP HANA and SAP HANA System Replication Services will take the guesswork and the scripting out of the failover process. Its failover management services have been designed to bring up the new SAP HANA infrastructure automatically and with the choreography of your unique landscape in mind. Moreover, if the failover cluster management solution is fully SAP HANA-aware and supports the “takeover with handshake” feature of SAP HANA, your automated failover will take place that much faster because the solution can prepopulate the in-memory database and bring the entire landscape online with minimal interruption.
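One way to picture the "carefully choreographed sequence" is as a dependency graph from which a safe start order is derived. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The component names and the dependencies between them are purely illustrative assumptions; a real landscape has its own components and ordering rules.

```python
from graphlib import TopologicalSorter

# Hypothetical landscape: each component maps to the components that must
# already be running before it can start.
DEPENDENCIES = {
    "hana_db": set(),
    "central_services": set(),
    "app_server": {"hana_db", "central_services"},
    "web_dispatcher": {"app_server"},
}


def startup_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one start order in which every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, component in enumerate(startup_order(DEPENDENCIES), start=1):
        print(f"{step}. start {component}")
```

Deriving the order from declared dependencies, rather than hard-coding it in a script, is the kind of work an SAP-aware cluster manager takes off the administrator's hands.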
Finally, one question that is rarely asked when it comes to ensuring HA in an SAP HANA environment is this: After your infrastructure fails over to that secondary node, and after whatever crisis caused the primary node to go offline has been remediated, how will you restore your SAP HANA landscape to its original primary/secondary configuration? A well-designed failover cluster management solution should be capable of reconfiguring SAP HANA System Replication Services, reversing the replication direction while the secondary infrastructure serves as the production node. This means that as soon as possible, the system starts replicating data from the secondary node back to the primary node, ensuring that the repaired primary node holds an exact replica of the database running in memory on the active secondary node. With this setup, you have the flexibility to initiate an automated failover back to the original primary/secondary configuration at your convenience. Absent such an automated solution, you would have to depend on scripts that may not have been written yet, whereas an automated solution lets you return promptly to normal business operations.
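The failover, reverse-replication, and failback cycle can be summarized as a small state model. This is an illustrative sketch only: `ReplicationPair` and its methods are hypothetical, and the single `standby_in_sync` flag stands in for whatever checks a real cluster manager performs before allowing a switch back.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPair:
    """Toy model of two nodes that swap the active and standby roles."""

    active: str
    standby: str
    standby_in_sync: bool = True

    def fail_over(self) -> None:
        """The standby takes over; the failed node becomes a stale standby."""
        self.active, self.standby = self.standby, self.active
        self.standby_in_sync = False

    def resync_standby(self) -> None:
        """Replication now runs from the new active node to the repaired one."""
        self.standby_in_sync = True

    def fail_back(self) -> None:
        """Swap roles again, but only once the standby has caught up."""
        if not self.standby_in_sync:
            raise RuntimeError("standby is not in sync; failback refused")
        self.active, self.standby = self.standby, self.active


if __name__ == "__main__":
    pair = ReplicationPair(active="node_a", standby="node_b")
    pair.fail_over()
    pair.resync_standby()
    pair.fail_back()
    print(pair.active)
```

The guard in `fail_back` captures the key constraint: the original primary can resume production only after reversed replication has brought it fully up to date.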
The power of an SAP HANA solution is unparalleled, but so is its complexity. Ensuring the HA of SAP HANA can be very challenging if one is not using the automation tools that can ensure full data replication and an elegant, predictable failover in the event of an emergency. Luckily, such tools are available, and the IT personnel you currently have are more than capable of using them to ensure ongoing access to your critical SAP HANA landscape.