From an IT perspective, ensuring business continuity involves much more than building out a disaster recovery infrastructure at a geographically remote site. That’s the obvious part. What’s less obvious—and far more important—is the need to understand and address the underlying expectations about SQL Server availability that inform the broader notion of business continuity. Organizational leadership needs to examine and determine what constitutes acceptable SQL Server availability goals before IT can effectively deploy a solution designed to meet those goals.
That involves a lot of hard questions: What level of SQL Server availability does the organization really require? How is availability defined—in terms of unplanned application downtime only? In terms of total application uptime? Who defines these metrics (and does that person control the organization’s purse strings)?
Assessing your organization’s real availability requirements
What level of SQL Server availability does your organization really require for business continuity? If your mission-critical applications are relying on SQL Server on a 24×7 basis and if those interactions are generating millions of dollars of revenue every day, then your C-Suite occupants may be expecting ongoing, uninterrupted availability. That’s understandable, but it’s unrealistic. Even a “fault tolerant” system promises only 99.999% availability (allowing for as much as 5 minutes of annual downtime), and to achieve that such systems require very expensive specialty hardware.
For most organizations, the expense of a fully fault tolerant solution is not justifiable, particularly when configured for disaster recovery in geographically separate regions. Much more affordable (and accessible via industry standard hardware), is a SQL Server infrastructure configured for high availability (HA) and disaster recovery (DR). An HA configuration can ensure 99.99% availability—more on that in a moment—while the DR aspect of the configuration provides for the inclusion of a geographically remote instance of your SQL Server configuration which can come online quickly if a disaster takes down the entire region.
The nuances of high availability
Still, there’s a lot in that 99.99% availability figure you need to unpack. First, are you configuring your SQL Server infrastructure on-premises or in the cloud? We’ll assume the latter because what you do to configure for HA in the cloud is easily transferable to an on-premises configuration, while what you might do to configure for HA on-premises does not always transfer easily to the cloud—and all of this makes a difference when it comes to the application availability expectations of your C-suite.
All the major cloud providers offer configuration options designed to ensure 99.99% availability. However, what the cloud providers mean by 99.99% availability is infrastructure availability. You can configure a Windows Failover Cluster Instance (FCI), for example, with virtual machines (VMs) running on servers in physically separate data centers, and the 99.99% availability service level agreement (SLA) will promise that 99.99% of the time at least one of the VMs in your FCI will be available.
That sounds great, but VM availability isn’t the same as application availability. To achieve SQL Server availability at 99.99% you need to take extra steps to ensure the replication of your SQL Server data among all the nodes in your FCI. That’s an important consideration because in the cloud you can’t configure an FCI with shared storage the way you might in a non-cloud configuration. There are ways to accomplish that, which we’ll cover shortly.
Who owns the objective?
There are several other elements to this 99.99% availability metric that are important. First, what constitutes availability and who within your organization owns that definition? Does a 99.99% availability objective – which less than 52 minutes of downtime per year in total – allow for scheduled application unavailability (for maintenance and updates)? Or does that objective apply only to unscheduled downtime (application unavailability due to hardware failure, software bugs, etc.)?
Another question: what does application availability mean to key stakeholders? Does it mean that SQL Server (and not just the VM) is up and running? Does it mean that SQL Server is fully operational and able to handle queries and transactions? Again, the distinctions are important. Depending on how you configure your HA/DR solution, you could have a failover instance of SQL Server up and running well within the maximum 52 minutes of annual downtime. But if your data replication strategy depends on restoring the database from a backup or from SQL Server log files, then the failover instance of SQL Server may be up and running but unable to provide full production support for a period of time which could stretch beyond the maximum acceptable period of downtime. In that scenario, SQL Server may be technically available within the terms of the SLA, but in practical terms it may not be because the total time to failover and recover the database takes far longer. That’s not a failure of the underlying hardware provider, whose VM was available within the scheduled window, but from a business continuity perspective it’s definitely not the availability expected of the HA/DR configuration you’ve developed.
Defining what constitutes availability and quantifying the expectations for application availability are important for all these reasons. Moreover, IT really needs to have the organization’s critical stakeholders buy into and own these definitions—because they are the ones who will be screaming at the IT team if SQL Server becomes unavailable. And since people are known to change their minds about what constitutes availability when the mission critical applications cannot access the HA database, it’s important to have a signed policy document you can point to that shows they’ve agreed to the terms under which you’re working.
Building the solution that meets your business continuity objectives
That signed policy document also plays a critical role when it comes to configuring your SQL Server infrastructure for HA and DR. Once IT has a clear mandate to provide SQL Server availability to a specified level, options can be weighed and proper configuration decisions can be made.
As noted earlier, all the major cloud providers offer options which enable you to configure SQL Server for HA and DR. An HA configuration is going to involve building at least two failover cluster nodes in separate data centers within a single region, so if the infrastructure in one data center goes offline or SQL Server becomes unresponsive the system will failover to the cluster node configured in the second data center. A DR configuration is going to add at least one more cluster node—optimally in a geographically distinct region that can reasonably be expected to remain operational even if all the data centers in the original region go offline due to a region-wide catastrophe.
This, then, brings you to the question of how best to ensure each node in your SQL Server cluster has access to the same SQL Server data. In the cloud, you have basically two options for replicating data among your HA/DR cluster nodes: Either use SQL Server Availability Groups (AGs) or use a third-party SANLess Clustering solution.
SQL Server Availability Groups
If you have multiple SQL Server databases and are configuring for HA and DR, you would need to use the Always On AG solution that is built into SQL Server Enterprise Edition 2012 and later. Always On AGs ensure that SQL Server will automatically replicate all user-defined databases to all the nodes in the cluster. The network latencies between data centers in a single AWS, Azure, or Google region are so low you can configure the Always On AG solution to perform synchronous database replication among all the nodes configured for HA. You will need to use asynchronous replication to the node(s) configured in a remote region.
There are only two obvious downsides to Always On AG replication. First, it does not replicate all your SQL Server databases to the other nodes in your SQL Server cluster. System databases holding jobs, passwords, and other data stores are not replicated, which could present you with problems if the circumstances that led to failover are sustained and you cannot fail back to the original configuration in minimal time. You can run without these databases for some period of time, but eventually you will want to have them in your production environment.
The other downside to Always On AG replication is that it requires the deployment of SQL Server Enterprise Edition 2012 or later. If your applications do not require SQL Server Enterprise Edition or if they have been tuned for a version of SQL Server earlier than 2012, opting to use the Always On AG feature to provide the data replication services required for HA/DR is going to force you to pay for an expensive SQL Server Enterprise license your applications don’t otherwise demand.
SQL Server Standard Edition offers a Basic AG option, but a Basic AG can support only two nodes in a cluster (and a single SQL database per AG). It’s not a viable replication solution if you are building a business continuity solution designed for HA and DR because you’ll need at least two SQL Server cluster nodes in one region for HA and at least one more in a separate region for DR.
An alternative to either AG replication solution involves configuring your HA/DR infrastructure as a SANLess failover cluster. Here, you would license a third-party SANless clustering tool, which provides block-level data replication among all the nodes in your failover cluster. Like the AG approach, a SANLess cluster performs synchronous data replication between all the nodes within a region (and asynchronous replication to the DR nodes outside the region).
Unlike the AG approach, however, the SANLess cluster approach replicates all SQL Server databases. There is also no limit to the number of SQL databases you can synchronize among nodes, nor a limit on the number of nodes you can synchronize (as in the Basic AG offering). Moreover, a SANless cluster can be configured with any edition of SQL Server—even SQL Server Standard Edition back to 2008. If you don’t otherwise need SQL Server Enterprise licenses, taking the SANLess cluster approach can save you a significant amount of money when configuring for HA and DR—without in any way compromising your ability to meet the business continuity needs of your organization.