Practically speaking, high availability (HA) and disaster recovery (DR) are two distinct elements of resilience that together make up a greater whole. The key difference is that HA is pre-emptive, while DR is reactive.

While high availability focuses on preventing downtime, disaster recovery is a reactive process intended to restore normal operations as quickly as possible following a significant failure. It presumes disruptions will happen and prepares for swift recovery when needed.

Together, they form a comprehensive approach to IT service continuity management.

High availability – pre-empting disruption

High availability refers to a system’s ability to remain operational and always-on, leveraging built-in redundancy and fault tolerance. HA is pre-emptive, involving strategies and technologies that detect potential failures and take immediate action to prevent downtime.

Key components of high availability:

  • Hardware redundancy: Redundant storage, error-correcting memory, backup power sources and similar technologies ensure that a single hardware failure doesn't bring down the entire system.
  • Software redundancy: Techniques such as clustering, load balancing and self-healing systems distribute workloads across multiple servers, enhancing the system’s ability to withstand failures.
  • Environmental redundancy: Spreading systems across geographically dispersed data centers, or across separate cloud locations, ensures that a localized issue doesn't affect the entire system.
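A rough calculation shows why the redundancy above pays off. The sketch below assumes each replica is independently available a fraction `a` of the time and that the group stays up as long as at least one replica is up, so the combined availability is 1 - (1 - a)^n. Independence is a simplifying assumption: a shared power, network or software fault can take out every replica at once. The function name `parallel_availability` is purely illustrative.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of a redundant group that stays up while at least one replica is up.

    Assumes replicas fail independently (a simplification: shared power,
    network or software faults break this assumption).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The group is down only if every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # Compare one, two and three replicas of a 99%-available component.
    for n in (1, 2, 3):
        print(n, f"{parallel_availability(0.99, n):.6f}")
```

Under the independence assumption, two 99%-available replicas give roughly 99.99% availability and three give roughly 99.9999%.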

By incorporating these redundancies, HA systems provide the resilience needed to keep critical applications and services running without interruption.

Disaster recovery – responding to disruption

Disaster recovery encompasses the tools and procedures designed to enable the recovery or continuation of vital systems and infrastructure. As the last resort in a disaster, some form of DR should always be in place, no matter how highly available existing systems are.

DR strategies typically involve maintaining secondary systems, in another data center or in the cloud, that workloads can be failed over to.

Core concepts of disaster recovery:

  • Recovery time objective (RTO): The maximum acceptable delay between the interruption of service and its restoration. It sets the target duration for recovery after a disruption (e.g. service must be restored within 15 minutes).
  • Recovery point objective (RPO): The maximum acceptable amount of data loss, measured in time. It determines how recent the point you recover to must be (e.g. an RPO of 4 hours means no more than the last 4 hours of data may be lost).
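The two objectives translate into simple checks. The sketch below assumes recovery points (backups, snapshots or replicas) are taken at a fixed interval and ignores how long each one takes to complete; the function names are hypothetical. In the worst case a disruption strikes just before the next recovery point, so the interval bounds the data lost. Because RTO runs from the interruption to restoration, the time taken to detect the problem and decide to recover counts against it, not just the restore itself.

```python
from datetime import timedelta


def meets_rpo(recovery_point_interval: timedelta, rpo: timedelta) -> bool:
    """Hypothetical helper; assumes recovery points are taken at a fixed interval."""
    # Worst case: the disruption hits just before the next recovery point,
    # so up to one full interval of data is lost.
    return recovery_point_interval <= rpo


def meets_rto(time_to_detect: timedelta, time_to_restore: timedelta, rto: timedelta) -> bool:
    """Hypothetical helper; downtime runs from the interruption until restoration."""
    # Detection and the decision to recover count against the RTO,
    # not just the technical restore time.
    return time_to_detect + time_to_restore <= rto


if __name__ == "__main__":
    rpo = timedelta(hours=4)     # the 4-hour RPO example above
    rto = timedelta(minutes=15)  # the 15-minute RTO example above
    print(meets_rpo(timedelta(hours=24), rpo))  # nightly backups: False
    print(meets_rpo(timedelta(hours=1), rpo))   # hourly snapshots: True
    print(meets_rto(timedelta(minutes=10), timedelta(minutes=10), rto))  # 20 minutes: False
```

In this example, a 10-minute restore still misses a 15-minute RTO once 10 minutes of detection time are added, which is why recovery procedures, not just recovery technology, need testing against the targets.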

Two halves of resilience

To minimize interruption – be it from a minor fault or a major disaster – organizations must first ensure their infrastructure meets their availability requirements. That way, minor faults and disruptions go unnoticed by end users and customers.

But in the event of a more significant disruption, disaster recovery procedures should be in place to meet defined targets for both RTO and RPO.

Changing priorities

Historically, architecting a system for high availability was an expensive undertaking, but it addressed most causes of IT downtime. Forward-thinking, uptime-focused organizations invested heavily in HA and, as a result, reduced their spend on DR.

In many instances, DR for these organizations simply meant recovering from backups, rather than failing over to replicated systems with clean recovery points from before an infection. But emerging cyber threats have exposed the flaws in this approach.

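Today, cyberattacks are the leading cause of downtime and data loss, and organizations that have invested disproportionately in HA are left unprotected against system-wide attacks and breaches, because redundant systems faithfully replicate encrypted or corrupted data to every copy.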

Security, availability and recoverability are not absolutes – there is a risk-cost balance to be struck. With even the best-designed HA systems subject to long recovery times in the event of an attack, the question is how best to allocate resources to maintain business continuity.

To this end, IT service continuity budgets must be either rebalanced or increased to fund the adoption or improvement of dedicated disaster recovery solutions.

ABOUT THE AUTHOR

James Watts

James Watts is the managing director of Databarracks, the business and technology resilience specialists.
