Monday, 05 August 2019 17:39

Protecting SQL Server Databases in Multiple AWS Cloud Zones and Regions

By Dave Bermingham, Technical Evangelist at SIOS Technology

High availability and disaster recovery protections both require redundant resources configured to minimize or eliminate single points of failure. Because failures sometimes occur on a large scale, a best practice is to put some geographical distance between some of these resources. Amazon Web Services meets this need by offering multiple Availability Zones and Regions to facilitate business continuity during all likely failures—from a single server crashing to a widespread natural disaster.

This article provides practical guidance to help database and system administrators tasked with protecting SQL Server databases running in the AWS cloud. The high availability (HA) and disaster recovery (DR) provisions available with the AWS cloud and the SQL Server software are covered first in separate sections. This is followed by a third section outlining how these provisions can be used in a cost-effective configuration that combines HA and DR protections in a failover cluster spanning multiple AWS Availability Zones and Regions.

Multiple Availability Zones and Regions in the AWS Cloud

Fully protecting applications, including those with SQL Server databases, from all possible outages requires recognizing the differences between “failures” and “disasters” because those differences determine the different provisions needed for HA and DR. Failures are short in duration and small in scale, affecting a server, rack, or the power or cooling in a datacenter. Disasters have more widespread and enduring impacts, affecting multiple facilities, including offices and datacenters alike, in ways that preclude rapid localized recovery.

The most consequential difference involves the location of the redundant resources (systems, software and data), which can be local—on a Local Area Network—for recovering from a localized failure. By contrast, the redundant resources required to recover from a widespread disaster must span a Wide Area Network. For database applications that require high transactional throughput performance, the ability to replicate the active instance’s data synchronously across the LAN enables the standby instance to be “hot” and ready to take over immediately and automatically in the event of a failure. Such rapid response should be the goal of all HA provisions.

Because the latency inherent in the WAN would adversely impact throughput performance in the active instance when using synchronous replication, data is usually replicated asynchronously in DR configurations. This means that updates being made to the standby instance always lag behind updates being made to the active instance, which makes the standby instance “warm” and results in an unavoidable delay during the manual recovery process.

AWS Availability Zones (AZs) offer the best of both by combining the synchronous replication available on a LAN with some geographical separation previously possible only in the WAN. AZs connect multiple datacenters within an AWS region via a low latency, high throughput network that facilitates synchronous commit with negligible impact on database performance. In many regions, the latency across AZs is less than one millisecond, which has made the use of multi-zone configurations a new best practice for HA failover clusters.
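The effect of network latency on synchronous commit can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model, not a benchmark: it assumes a commit completes only after the local write plus one network round trip to the standby, and every latency figure in it is an assumed value chosen for illustration (only the roughly one-millisecond cross-AZ figure echoes the text above).

```python
def sync_commit_latency_ms(local_write_ms: float, round_trip_ms: float) -> float:
    """Simplified model: a synchronous commit completes only after the local
    write and the standby's acknowledgement, one network round trip away."""
    return local_write_ms + round_trip_ms


def serial_commits_per_second(commit_latency_ms: float) -> float:
    """Upper bound on commits per second for one session committing back to back."""
    return 1000.0 / commit_latency_ms


# Assumed values for illustration only; measure your own environment.
LOCAL_WRITE_MS = 1.0
ROUND_TRIP_MS = {
    "same datacenter (LAN)": 0.2,
    "cross-AZ": 1.0,
    "cross-Region (WAN)": 70.0,
}

for link, rtt in ROUND_TRIP_MS.items():
    latency = sync_commit_latency_ms(LOCAL_WRITE_MS, rtt)
    rate = serial_commits_per_second(latency)
    print(f"{link:24s} commit latency {latency:6.1f} ms, "
          f"at most {rate:7.1f} serial commits/s")
```

Under these assumptions the cross-AZ link adds about a millisecond per commit, while the cross-Region link dominates commit time entirely, which is why asynchronous replication is the usual choice across the WAN.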

For additional protection against major disasters that could affect multiple Availability Zones, AWS operates multiple Regions throughout the world. Amazon employs encrypted Virtual Private Cloud (VPC) peering among Regions to deliver highly reliable and secure communications. As expected, replicating data across AWS Regions will need to be done asynchronously for SQL Server databases, and to ensure minimal or no data loss, the recovery will need to be performed manually. The resulting delay in DR provisions is tolerable, however, because Region-wide disasters are rare.

SQL Server’s Always On Availability Groups and Failover Cluster Instances

SQL Server offers two of its own options for HA and DR protections: Failover Cluster Instances (FCIs) and Always On Availability Groups. FCIs have two notable advantages: the feature is included in the less expensive Standard Edition, and it protects the entire SQL Server instance, including user and system databases. A major disadvantage is that Windows Server Failover Clustering (WSFC) requires shared storage, such as a storage area network (SAN), to share data between the active and standby instances. The problem is that shared storage has not historically been available in the AWS cloud or in any other public cloud.

The lack of shared storage in the cloud was addressed in the Datacenter Edition of Windows Server 2016 with Storage Spaces Direct (S2D), which also received concurrent support in SQL Server 2016. S2D is software-defined storage that creates a virtual SAN, enabling data to be shared between multiple instances. S2D requires that the servers reside within a single datacenter, however, making it incompatible with Availability Zones. For this reason, using FCI for HA and/or DR protections across multiple AWS AZs and Regions requires using a third-party solution for data replication.

The other SQL Server option is Always On Availability Groups. This option is more capable than FCIs for both HA and DR, and it possesses some other notable advantages, such as readable secondaries (with appropriate licensing) and no restrictions on the size of databases. But it requires licensing the more expensive Enterprise Edition, and that makes this option cost-prohibitive for many database applications. Another limitation is that only user databases are replicated, so separate provisions are needed to protect the rest of the SQL Server instance, including the system databases.

Using an application-specific HA/DR solution like Always On Availability Groups has another disadvantage: Separate HA and/or DR provisions will be needed to protect all other applications, including those using a different database. Having multiple HA/DR solutions can substantially increase complexity and costs for licensing, training, implementation and ongoing operations. This is yet another reason why both database and system administrators increasingly prefer to use general-purpose failover clustering solutions.

Consolidating HA and DR Protections in a SANless Failover Cluster

The lack of shared storage in the cloud has long been addressed by third-party failover clustering solutions purpose-built for HA and DR protections in private, public and hybrid cloud environments. These solutions are implemented entirely in software to enable creating, as their designation implies, a cluster of servers and storage—sans SANs—and with rapid, automatic failover to assure high availability at the application level.

Versions for Windows Server are designed to work seamlessly with WSFC by providing real-time block-level data replication both on-premises and in a cloud-based SANless environment. A major advantage with SQL Server is support for FCIs without imposing any need to compromise availability or performance. These solutions usually also overcome the two-node limit that the Standard Edition of SQL Server imposes on FCIs in a failover cluster. As will be shown in the example below, the ability to have a two-node cluster spanning Availability Zones, along with a third instance in a different Region, affords mission-critical HA/DR protections in a single configuration.

Versions for Linux, which lacks a fundamental clustering capability equivalent to WSFC, must provide a total HA/DR solution that includes data replication, continuous application-level monitoring and configurable failover/failback recovery policies. Linux is becoming increasingly popular for SQL Server databases and other enterprise applications, and third-party failover clustering solutions now make configuring HA/DR protections nearly as easy as it is for Windows Server. Without such a solution, administrators would be forced to struggle to make open source software work dependably in full, application-specific HA/DR stacks. For this reason, only the very largest organizations have the wherewithal (skill set and staffing) needed to even consider taking on such ongoing efforts.

While specific to the operating system, most failover clustering software is application-agnostic, enabling administrators to have a single, universal HA/DR solution. Most such solutions also offer a variety of value-added capabilities. Examples include data compression and other forms of WAN optimization to reduce bandwidth utilization in multi-region clusters, minimalist “warm” standby configurations that also reduce costs, and manual switchover of active and standby instances to facilitate planned maintenance and routine backups with minimal disruption to the applications.

“Undersizing” standby instances can afford considerable savings. Because the standby instance rarely runs a production workload, it is possible to reduce costs by allocating minimal resources (e.g. CPU, memory and network bandwidth) while it functions in its normal standby mode. The tradeoff is that, in the event of a failover, the allocation will need to be resized before the instance can become the active node. This extra step adds to the recovery time because it requires a reboot. There are other factors to consider, as well, such as I/O requirements and the storage limitations of smaller instance types. But when viable, the cost savings can be significant.
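A rough way to size up the savings from undersizing is to compare the monthly cost of a full-size standby with that of a small standby that is resized only during failovers. The sketch below is illustrative arithmetic only: the hourly rates are hypothetical placeholders rather than AWS prices, and the 730-hour month is an assumed average.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def undersizing_savings(full_rate: float, small_rate: float,
                        failover_hours: float = 0.0) -> float:
    """Monthly savings from running the standby on a smaller instance.

    full_rate and small_rate are hourly prices (hypothetical inputs);
    failover_hours is the time the standby spends resized to full capacity.
    """
    standby_hours = HOURS_PER_MONTH - failover_hours
    undersized_cost = small_rate * standby_hours + full_rate * failover_hours
    full_size_cost = full_rate * HOURS_PER_MONTH
    return full_size_cost - undersized_cost


# Hypothetical rates: 2.00 per hour full size, 0.50 per hour undersized,
# with 10 hours of the month spent running as the active node.
print(undersizing_savings(full_rate=2.0, small_rate=0.5, failover_hours=10.0))
```

The savings figure says nothing about the longer recovery time the resize and reboot introduce, so it is best weighed against the recovery time objective for the application.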

Additional savings are afforded by compressing the data that traverses the WAN, especially in hybrid cloud configurations. The higher the compression, the higher the CPU utilization, so some tweaking is usually needed to achieve the optimal balance.
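The tradeoff between compression level and CPU time is easy to observe with a general-purpose compressor. The sketch below uses Python's standard zlib module purely as a stand-in; it is not the algorithm any particular replication product uses, and the repetitive sample payload is an assumption that compresses far better than real replicated blocks typically would.

```python
import time
import zlib
from typing import Tuple


def compression_ratio_and_time(payload: bytes, level: int) -> Tuple[float, float]:
    """Compress payload at the given zlib level (1 = fastest, 9 = smallest)
    and return (compression ratio, elapsed seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return len(payload) / len(compressed), elapsed


# Assumed sample data: highly repetitive SQL text, chosen only for illustration.
payload = b"INSERT INTO orders (id, status) VALUES (42, 'shipped');\n" * 20000

for level in (1, 6, 9):
    ratio, seconds = compression_ratio_and_time(payload, level)
    print(f"level {level}: ratio {ratio:8.1f}x, {seconds * 1000:7.2f} ms CPU-bound work")
```

Running a measurement like this against a sample of real replication traffic gives a starting point for choosing a level that saves bandwidth without starving the database of CPU.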

The diagram shows a popular AWS configuration that provides both HA and DR protections in a VPC that distributes three SQL Server instances across multiple Availability Zones and Regions. For clusters spanning multiple Availability Zones within a single AWS Region, the data replication is synchronous, enabling rapid automatic failovers from all localized failures. For clusters spanning multiple AWS Regions, the data replication must be asynchronous to avoid adversely impacting throughput performance, and failovers will need to employ manual processes to minimize the potential for data loss.

[Diagram: SIOS AWS multi-zone, multi-region configuration]

This popular SANless failover cluster configuration consists of a two-node HA cluster spanning two AWS Availability Zones, along with a third instance deployed in a separate AWS Region to facilitate a full recovery after a widespread disaster.
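The replication and failover rules for this three-node layout can be expressed as a small policy function. The sketch below is only a model of the rule described above (synchronous and automatic within a Region, asynchronous and manual across Regions); it is not the configuration format of any clustering product, and the node names and the choice of Regions and zones are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str    # hypothetical instance name
    region: str  # AWS Region, e.g. "us-east-1"
    zone: str    # Availability Zone within that Region


def replication_mode(source: Node, target: Node) -> str:
    """Synchronous between zones of one Region, asynchronous across Regions."""
    return "synchronous" if source.region == target.region else "asynchronous"


def failover_type(source: Node, target: Node) -> str:
    """Automatic failover within a Region, manual recovery across Regions."""
    return "automatic" if source.region == target.region else "manual"


# Hypothetical layout: two HA nodes in separate zones, one DR node in another Region.
primary = Node("sql-primary", "us-east-1", "us-east-1a")
standby = Node("sql-standby", "us-east-1", "us-east-1b")
dr_node = Node("sql-dr", "us-west-2", "us-west-2a")

for target in (standby, dr_node):
    print(f"{primary.name} -> {target.name}: "
          f"{replication_mode(primary, target)} replication, "
          f"{failover_type(primary, target)} failover")
```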

It is also possible to have two- and three-node configurations in a hybrid cloud environment for HA and/or DR purposes. One such three-node configuration is a two-node HA cluster located in an enterprise datacenter with a third instance located in the AWS cloud for DR protection—or vice versa.

Confidence in the AWS Cloud

As of this writing, AWS has 61 Availability Zones deployed in 20 Regions, making the AWS Global Infrastructure eminently capable of providing carrier-class HA/DR protection for SQL Server databases. But with a purpose-built failover clustering solution, such carrier-class high availability need not mean paying a carrier-like high cost. Because SANless failover clustering software makes effective and efficient use of all AWS compute, storage and networking resources, while also being easy to implement and operate, these solutions minimize ongoing costs, resulting in robust HA and DR protections being more affordable than ever before.

The security, agility, scalability and high availability made possible by overlaying SANless failover clusters atop multiple, geographically dispersed Availability Zones and Regions should give even the most risk-averse administrators the confidence needed to migrate mission-critical SQL Server databases and other applications to the AWS cloud.

About the Author

David Bermingham is Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 8 years: 6 years as a Cluster MVP and 2 years as a Cloud and Datacenter Management MVP. David holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare and education.