

Thursday, 07 February 2019 18:15

Disaster Recovery: Past, Present, and Future

By Alex Winokur, founder of Axxana

 

Disaster recovery is now on the list of top concerns of every CIO. In this article we review the evolution of the disaster recovery landscape, from its inception until today. We look at the current understanding of how disasters behave and, consequently, at how disaster recovery processes are designed. We also try to cautiously anticipate the future, outlining the main challenges associated with disaster recovery.

The Past

The computer industry is relatively young. The first commercial computers appeared in the early 1950s—not even seventy years ago. The history of disaster recovery (DR) is even shorter. Table 1 outlines the appearance of the various technologies necessary to construct a modern DR solution.


Table 1 – Early history of DR technology development

 

From Magnetic Tapes to Data Networks

The first magnetic tapes for computers were used as input/output devices. That is, input was punched onto punch cards, whose contents were then transferred offline to magnetic tapes. Later, UNIVAC I, one of the first commercial computers, was able to read these tapes and process their data. Later still, output was similarly directed to magnetic tapes that were connected offline to printers. Tapes began to be used as a backup medium only after 1954, with the introduction of the mass storage device (RAMAC).

Figure 1 – First Storage System (RAMAC)

Although modern wide-area communication networks date back to 1974, data has been transmitted over long-distance communication lines since 1837 via telegraphy systems. These telegraphy communications have since evolved into data transmission over telephone lines using modems.

Modems were first deployed on a large scale in 1958 to connect United States air defense systems; however, their throughput was very low compared to what we have today. The FAA’s clustered system relied on communication links that were designed for computers to communicate with their peripherals (e.g., tapes). Local area networks (LANs) as we now know them had not yet been invented.

Early Attempts at Disaster Recovery

It wasn’t until the 1970s that concerns about disaster recovery started to emerge. In that decade, the deployment of IBM System/360 computers reached a critical mass, and they became a vital part of almost every organization. Until the mid-1970s, the perception was that if a computer failed, it would be possible to fall back to paper-based operation as was done in the 1960s. However, the widespread adoption of digital technologies in the 1970s brought a corresponding increase in technology failures; at the same time, theoretical calculations, backed by real-world evidence, showed that switching back to paper-based work was no longer practical.

The emergence of terrorist groups in Europe like the Red Brigades in Italy and the Baader-Meinhof Group in Germany further escalated concerns about the disruption of computer operations. These left-wing organizations specifically targeted financial institutions. The fear was that one of them would try to blow up a bank’s data centers.

At that time, communication networks were in their infancy, and replication between data centers was not practical.

Parallel workloads. IBM came up with the idea of using the FAA clustering technology to build two adjoining computer rooms, separated by a steel wall, with one cluster node in each room. The idea was to run the same workload twice and to be able to fail over immediately from one system to the other if one of them was attacked. A closer analysis revealed that in the case of a terror attack, the only surviving object would be the steel wall, so the plan was abandoned.

Hot, warm, and cold sites. The inability of computer vendors (IBM was the main vendor at the time) to provide an adequate DR solution made way for dedicated DR firms like SunGard to provide hot, warm, or cold alternate sites. Hot sites, for example, were duplicates of the primary site; they independently ran the same workloads as the primary site, since communication between the two sites was not available at the time. Cold sites served as repositories for backup tapes. Following a disaster at the primary site, operations would resume at the cold site by allocating equipment, restoring data from the backup tapes, and restarting the applications. Warm sites were a compromise between hot and cold sites: they had hardware and connectivity already established, but recovery still required restoring data from backups before the applications could be restarted.

Backups and high availability. The major advances in the 1980s were around backups and high availability. On the backup side, regulations requiring banks to have a testable backup plan were enacted. These were probably the first DR regulations to be imposed on banks; many more followed through the years. On the high availability side, Digital Equipment Corporation (DEC) made the most significant advances in LAN communications (DECnet) and clustering (VAXcluster).

The Turning Point

On February 26, 1993, the first bombing of the World Trade Center (WTC) took place. This was probably the single most significant event shaping the disaster recovery architectures of today. People realized that the existing disaster recovery solutions, which were mainly based on tape backups, were not sufficient: too much data would be lost in a real disaster.

SRDF. By this time, communication networks had matured, and EMC became the first vendor to introduce storage-to-storage replication software, called Symmetrix Remote Data Facility (SRDF).

 

Behind the Scenes at IBM

At the beginning of the nineties, I was with IBM’s research division. At the time, we were busy developing a very innovative solution to shorten the backup window: backups were the foundation of all DR, and the existing backup windows (dead hours during the night) were no longer sufficient to complete the daily backup. The solution, called concurrent copy, was the ancestor of all snapshotting technologies, and it was the first intelligent function running within the storage subsystem. The 1993 WTC bombing left IBM fighting yesterday’s battles of developing a better backup solution, while giving EMC the opportunity to introduce storage-based replication and become the leader in the storage industry.

 

The first few years of the 21st century will always be remembered for the events of September 11, 2001—the date of the complete annihilation of the World Trade Center. Government, industry, and technology leaders realized then that some disasters can affect the whole nation, and therefore DR had to be taken much more seriously. In particular, the attack demonstrated that existing DR plans were not adequate to cope with disasters of such magnitude. The notion of local, regional, and nationwide disasters crystalized, and it was realized that recovery methods that work for local disasters don’t necessarily work for regional ones.

SEC directives. In response, the Securities and Exchange Commission (SEC) issued a set of very specific directives in the form of the “Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System.” These regulations, still in effect today, bind all financial institutions. The DR practices codified in them quickly propagated to other sectors, and disaster recovery became a major area of activity for every organization that relies on IT infrastructure.

The essence of these regulations is as follows:

  1. The economic stability of the United States cannot be compromised under any circumstances.
  2. The relevant financial institutions are obliged to resume operations correctly, without any data loss, by the next business day following a disaster.
  3. Alternate disaster recovery sites must use different physical infrastructure (electricity, communication, water, transportation, and so on) from that of the primary site.

Note that Requirements 2 and 3 above are somewhat contradictory. Requirement 2 necessitates synchronous replication to facilitate zero data loss, while Requirement 3 basically dictates long distances between sites—thereby making the use of synchronous replication impossible. This contradiction is not addressed within the regulations and is left to each implementer to deal with at its own discretion.
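To see the tension concretely: synchronous replication delays every committed write by at least one network round trip to the remote site. The back-of-the-envelope sketch below (assuming signal propagation of roughly 200 km per millisecond in optical fiber and ignoring equipment and protocol overhead) shows how quickly that per-write penalty grows with distance.

```python
# Rough estimate of the latency penalty of synchronous replication.
# Assumes light travels through fiber at ~200 km per millisecond and
# ignores switching, queuing, and protocol overhead.

def sync_write_penalty_ms(distance_km: float) -> float:
    """Added latency per write when every write must be acknowledged
    by the remote site before the application sees a commit."""
    fiber_speed_km_per_ms = 200.0            # ~2/3 the speed of light in vacuum
    one_way_ms = distance_km / fiber_speed_km_per_ms
    return 2 * one_way_ms                    # request + acknowledgment

for distance in (10, 50, 100, 500, 1000):
    print(f"{distance:>5} km  ->  ~{sync_write_penalty_ms(distance):.2f} ms per write")
```

At metropolitan distances the penalty is small, but at the hundreds of kilometers implied by true infrastructure independence it becomes prohibitive for write-intensive workloads, which is the heart of the contradiction.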

The secret to resolving this contradiction lies in the ability to reconstruct missing data if or when data loss occurs. The nature of most critical data is such that there is always at least one other instance of this data somewhere in the universe. The trick is to locate it, determine how much of it is missing in the database, and augment the surviving instance of the database with this data. This process is called data reconciliation, and it has become a critical component of modern disaster recovery. [See The Data Reconciliation Process sidebar.]

 

The Data Reconciliation Process

If data is lost as a result of a disaster, the database becomes misaligned with the real world. The longer this misalignment persists, the greater the risk of application inconsistencies and operational disruptions. Therefore, following a disaster, it is very important to realign the databases with the real world as soon as possible. This realignment process is called data reconciliation.

The reconciliation process has two important characteristics:

  1. It is based on the fact that the data lost in a disaster exists somewhere in the real world, and thus it can be reconstructed in the database.
  2. The duration and complexity of the reconciliation is proportional to the recovery point objective (RPO); that is, it’s proportional to the amount of data lost.

One of the most common misconceptions in disaster recovery is that RPO (for example, RPO = 5 minutes) refers simply to how many minutes of data the organization is willing to lose. What RPO really means is that the organization must be able to reconstruct and reconsolidate (i.e., reconcile) the last five minutes of missing data. Note that the higher the RPO (and therefore the greater the data loss), the longer the RTO and the costlier the reconciliation process. Catastrophes typically occur when the RPO is exceeded and the reconciliation process takes much longer than planned.

In most cases, the reconciliation process is quite complicated, consisting of time-consuming procedures to identify the data gaps and then resubmit the missing transactions so that the databases once again reflect the real-world status. This is a costly, largely manual, error-prone process that greatly prolongs the recovery time of the systems and magnifies the risks associated with downtime.
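To make the reconciliation idea concrete, here is a minimal sketch. The transaction records, the external source, and the replay step are hypothetical placeholders; real reconciliation typically draws on several independent sources (payment gateway logs, partner confirmations, paper records) and replays transactions through the application layer.

```python
# Minimal illustration of data reconciliation after a failover.
# `surviving_db` holds the transactions that reached the secondary site;
# `external_record` stands in for an independent record of what the
# real world actually submitted during the data-loss window.

def find_data_gap(surviving_db: dict, external_record: dict) -> dict:
    """Return transactions known to the outside world but missing from
    the recovered database."""
    return {tx_id: tx for tx_id, tx in external_record.items()
            if tx_id not in surviving_db}

def reconcile(surviving_db: dict, external_record: dict) -> dict:
    """Resubmit the missing transactions to realign the database with
    the real world, and report what was replayed."""
    gap = find_data_gap(surviving_db, external_record)
    for tx_id, tx in sorted(gap.items()):
        surviving_db[tx_id] = tx   # in practice: replay through the application
    return gap

surviving_db = {1: "wire $500", 2: "deposit $120"}
external_record = {1: "wire $500", 2: "deposit $120", 3: "wire $75"}  # last few minutes
replayed = reconcile(surviving_db, external_record)
print(f"Reconciled {len(replayed)} missing transaction(s): {sorted(replayed)}")
```

The larger the data-loss window (the RPO), the larger the gap to identify and replay, which is why the reconciliation effort, and with it the effective RTO, grows with the amount of data lost.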

 

The Present

The second decade of the 21st century has been characterized by new types of disaster threats, including sophisticated cyberattacks and extreme weather hazards caused by global warming. It is also characterized by new DR paradigms, like DR automation, disaster recovery as a service (DRaaS), and active-active configurations.

These new technologies are, for the most part, still in their infancy. DR automation tools attempt to orchestrate a complete site recovery through the invocation of a single “site failover” command, but they are still very limited in scope. A typical tool in this category is VMware Site Recovery Manager (SRM). DRaaS attempts to reduce the cost of a DR-compliant installation by locating the secondary site in the cloud. The new active-active configurations try to reduce equipment costs and recovery time by using techniques developed for high availability; that is, techniques designed to recover from a component failure rather than a complete site failure.
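For illustration only, the sketch below shows the general shape of such “one command” failover automation: a fixed recovery plan executed step by step, with each step verified before the next begins. The step names are hypothetical and do not correspond to VMware SRM’s actual workflow or API.

```python
# Hypothetical sketch of a "site failover" orchestration loop. Each step
# returns True on success; a real tool would drive storage arrays,
# hypervisors, DNS, and application stacks rather than print messages.

def stop_replication() -> bool:              return True
def promote_secondary_storage() -> bool:     return True
def power_on_vms_in_order() -> bool:         return True
def redirect_network_traffic() -> bool:      return True
def run_application_health_checks() -> bool: return True

RECOVERY_PLAN = [
    stop_replication,
    promote_secondary_storage,
    power_on_vms_in_order,
    redirect_network_traffic,
    run_application_health_checks,
]

def site_failover() -> bool:
    """Execute the recovery plan in order; stop on the first failure."""
    for step in RECOVERY_PLAN:
        print(f"Running {step.__name__} ...", end=" ")
        if not step():
            print("FAILED - manual intervention required")
            return False
        print("ok")
    return True

if __name__ == "__main__":
    site_failover()
```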

Disasters vs. Catastrophes

The following definitions of disasters and disaster recovery have been refined over the years to draw a clear distinction between the two main aspects of business continuity: high availability protection and disaster recovery. This distinction is important because it crystalizes the difference between disaster recovery and recovery from a single component failure, which is covered by highly available configurations; in doing so, it also accounts for the limitations of using active-active solutions for DR.

A disaster in the context of IT is either a significant adverse event that causes an inability to continue operation of the data center or a data loss event where recovery cannot be based on equipment at the data center. In essence, disaster recovery is a set of procedures aimed to resume operations following a disaster by failing over to a secondary site.

From a DR procedures perspective, it is customary to classify disasters into 1) regional disasters like weather hazards, earthquakes, floods, and electricity blackouts and 2) local disasters like local fires, onsite electrical failures, and cooling system failures.

Over the years, I have also noticed a third, independent classification: disasters can also be catastrophes. In principle, a catastrophe is a disaster in the course of which something very unexpected happens, causing the disaster recovery plans to dramatically miss their service level agreement (SLA); typically, this means exceeding the recovery time objective (RTO).

When DR procedures go as planned for regional and local disasters, organizations fail over to a secondary site and resume operations within pre-determined parameters for recovery time (i.e., RTO) and data loss (i.e., RPO). The organization’s SLAs, business continuity plans, and risk management goals align with these objectives, and the organization is prepared to accept the consequent outcomes. A catastrophe occurs when these SLAs are compromised.

Catastrophes can also result from simply failing to execute the DR procedures as specified, typically due to human errors. However, for the sake of this article, let’s be optimistic and assume that DR plans are always executed flawlessly. We shall concentrate only on unexpected events that are beyond human control.

Most of the disaster events reported in the news recently (for example, the Amazon Prime Day outage in July 2018 and the British Airways bank holiday outage in 2017) have been catastrophes related to local disasters. Had DR been properly applied to the disruptions at hand, nobody would have noticed that there had been a problem, because the DR procedures were designed to provide almost zero recovery time and hence zero downtime.

The following two examples provide a closer look at how catastrophes occur.

9/11 – Following the September 11 attack, several banks experienced major outages. Most of them had a fully equipped alternate site in Jersey City—no more than five miles away from their primary site. However, the failover failed miserably because the banks’ DR plans called for critical personnel to travel from their primary site to their alternate site, but nobody could get out of Manhattan.

A data center power failure during a major snowstorm in New England – Under normal DR operations at this organization, data was synchronously replicated to an alternate site. However, 90 seconds before a power failure at the primary site, the central communication switch in the area lost power too, cutting all WAN communications. As a result, the primary site continued to produce data for 90 seconds without replicating it to the secondary site; that is, until the primary site itself lost power. When the organization finally failed over to the alternate site, 90 seconds of transactions were missing; and because the DR procedures were not designed to address recovery where data loss has occurred, the organization experienced catastrophic downtime.

The common theme of these two examples is that in addition to the disaster at the data center there was some additional—unrelated—malfunction that turned a “normal” disaster into a catastrophe. In the first case, it was a transportation failure; in the second case, it was a central switch failure. Interestingly, both failures occurred to infrastructure elements that were completely outside the control of the organizations that experienced the catastrophe. Failure of the surrounding infrastructure is indeed one of the major causes for catastrophes. This is also the reason why the SEC regulations put so much emphasis on infrastructure separation between the primary and secondary data center.

Current DR Configurations

In this section, I’ve included examples of two traditional DR configurations that separate the primary and secondary centers, as stipulated by the SEC. These configurations have predominated over the past decade or so, but they cannot ensure zero data loss in rolling disasters and other disaster scenarios, and they are being challenged by new paradigms such as the one introduced by Axxana’s Phoenix. While a detailed discussion is outside the scope of this article, suffice it to say that Axxana’s Phoenix makes it possible to avoid catastrophes such as those just described—something that is not possible with traditional synchronous replication models.


Figure 2 – Typical DR configuration

 

Typical DR configuration. Figure 2 presents a typical disaster recovery configuration. It consists of a primary site, a remote site, and an additional set of equipment at the primary site that serves as a local standby.

The main goal of the local standby installation is to provide redundancy for the production equipment at the primary site. The standby equipment is designed to provide nearly seamless failover in case of an equipment failure—not in a disaster scenario. The remote site is typically located at a distance that guarantees infrastructure independence (communication, power, water, transportation, etc.), to minimize the chances of a catastrophe. It should be noted that the typical DR configuration is very wasteful: essentially, an organization has to triple its spending on equipment and software licenses—not to mention the increased personnel costs and the cost of high-bandwidth communications—to support the configuration shown in Figure 2.
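As a rough way to picture that overhead, the following sketch (with placeholder cost units, not a sizing model) simply tallies the three sets of equipment described above and the fraction that is utilized in normal operation.

```python
# Illustrative tally of the equipment in the typical DR configuration
# (Figure 2). Roles come from the text; the cost figures are placeholders.

from dataclasses import dataclass

@dataclass
class EquipmentSet:
    role: str
    location: str
    relative_cost: float        # hardware + licenses, production = 1.0
    used_in_normal_ops: bool

configuration = [
    EquipmentSet("production",     "primary site", 1.0, True),
    EquipmentSet("local standby",  "primary site", 1.0, False),  # covers equipment failures
    EquipmentSet("remote standby", "remote site",  1.0, False),  # covers site disasters
]

total = sum(e.relative_cost for e in configuration)
active = sum(e.relative_cost for e in configuration if e.used_in_normal_ops)
print(f"Relative cost: {total:.0f}x production; utilized in normal operation: {active:.0f}x")
```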


Figure 3 – DR cost-saving configuration

 

Traditional ideal DR configuration. Figure 3 illustrates the traditional ideal DR configuration, in which the remote site serves both DR and high availability purposes. Such configurations are sometimes realized in the form of extended clusters, like Oracle RAC One Node on Extended Distance. Although traditionally considered the ideal, they are a trade-off among survivability, performance, and cost. The organization saves the cost of one set of equipment and licenses, but it compromises both survivability and performance. That’s because the two sites have to be in close proximity in order to share the same infrastructure, so they are more likely to both be affected by the same regional disaster; at the same time, performance is compromised by the increased latency that comes from separating the two cluster nodes.


Figure 4 – Consolidation of DR and high availability configurations with Axxana’s Phoenix


True zero-data-loss configuration. Figure 4 represents a cost-saving solution based on Axxana’s Phoenix. In case of a disaster, Axxana’s Phoenix provides zero-data-loss recovery at any distance. With the help of Oracle’s high availability features (fast-start failover and transparent application failover), Phoenix provides functionality very similar to that of an extended cluster. With Phoenix, however, this functionality can be implemented over much longer distances and with much lower latency, providing true cost savings over the typical configuration shown in Figure 2.

The Future

In my view, the future is going to be a constant race between new threats and new disaster recovery technologies.

New Threats and Challenges

In terms of threats, global warming creates new weather hazards that are fiercer, more frequent, and far more damaging than in the past—and in areas that have not previously experienced such events. Terror attacks are on the rise, thereby increasing threats to national infrastructures (potential regional disasters). Cyberattacks—in particular ransomware, which destroys data—are a new type of disaster. They are becoming more prolific, more sophisticated and targeted, and more damaging.

At the same time, data center operations are becoming more and more complex. Data is growing exponentially. Instead of getting simpler and more robust, infrastructures are getting more diversified and fragmented. In addition to legacy architectures that aren’t likely to be replaced for a number of years to come, new paradigms like public, hybrid, and private clouds; hyperconverged systems; and software-defined storage are being introduced. Adding to that are an increasing scarcity of qualified IT workers and economic pressures that limit IT spending. All combined, these factors contribute to data center vulnerabilities and to more frequent events requiring disaster recovery.

So much for the threats. What does the technology side hold for us?

New Technologies

Of course, Axxana’s Phoenix is at the forefront of new technologies that guarantee zero data loss in any DR configuration (and therefore ensure rapid recovery), but I will leave the details of our solution to a different discussion.

AI and machine learning. Apart from Axxana’s Phoenix, the most promising technologies on the horizon revolve around artificial intelligence (AI) and machine learning. These technologies enable DR processes to become more “intelligent,” efficient, and predictive by using data from DR tests, real-world DR operations, and past disaster scenarios; in doing so, disaster recovery processes can be designed to better anticipate and respond to unexpected catastrophic events. If correctly applied, these technologies can shorten RTO and significantly increase the success rate of disaster recovery operations. The following examples suggest only a few of their potential applications in various phases of disaster recovery (a toy sketch follows the list):

  • They can be applied to improve the DR planning stage, resulting in more robust DR procedures.
  • When a disaster occurs, they can assist in the assessment phase to provide faster and better decision-making regarding failover operations.
  • They can significantly improve the failover process itself, monitoring its progress and automatically invoking corrective actions if something goes wrong.
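As a purely illustrative toy example of the predictive use described above, the sketch below applies a simple nearest-neighbor vote to made-up records of past DR tests in order to flag a planned failover that is likely to miss its RTO. The features, data, and method are hypothetical and far simpler than anything a production system would use.

```python
# Toy illustration: use past DR test results to predict whether the next
# failover is likely to miss its RTO. Data and features are made up.

from math import dist

# (hours of replication lag, number of failed runbook steps) -> missed RTO?
history = [
    ((0.1, 0), False), ((0.2, 1), False), ((0.5, 0), False),
    ((1.5, 2), True),  ((2.0, 1), True),  ((3.0, 3), True),
]

def predict_missed_rto(sample: tuple, k: int = 3) -> bool:
    """Majority vote among the k most similar past DR tests."""
    neighbors = sorted(history, key=lambda rec: dist(rec[0], sample))[:k]
    votes = sum(1 for _, missed in neighbors if missed)
    return votes > k // 2

print(predict_missed_rto((0.3, 0)))   # low lag, clean runbook  -> False
print(predict_missed_rto((2.5, 2)))   # high lag, failed steps  -> True
```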

When these technologies mature, the entire DR cycle from planning to execution can be fully automated. They carry the promise of much better outcomes than processes done by humans because they can process and better “comprehend” far more data in very complex environments with hundreds of components and thousands of different failure sequences and disaster scenarios.

New models of protection against cyberattacks. The second front where technology can greatly help with disaster recovery is cyberattacks. Right now, organizations are spending millions of dollars on various intrusion prevention, intrusion detection, and asset protection tools. The evolution should be from protecting individual organizations to protecting the global network. Instead of fragmented, per-organization defense measures, the global communication network should be “cleaned” of threats that can cause data center disasters. For example, phishing attacks that could compromise a data center’s access control mechanisms should be filtered out in the network—or in the cloud—instead of reaching the end points and being filtered there.

Conclusion

Disaster recovery has come a long way—from naive tape backup operations to complex site recovery operations and data reconciliation techniques. The expenses associated with disaster protection don’t seem to go down over the years; on the contrary, they are only increasing.

The major challenge of DR readiness is in its return on investment (ROI) model. On one hand, a traditional zero-data-loss DR configuration requires organizations to implement and manage not only a primary site, but also a local standby and remote standby; doing so essentially triples the costs of critical infrastructure, even though only one third of it (the primary site) is utilized in normal operation.

On the other hand, if a disaster occurs and the proper measures are not in place, the financial losses, reputation damage, regulatory backlash, and other risks can be devastating. As organizations move into the future, they will need to address the increasing volumes and criticality of data. The right disaster recovery solution will no longer be an option; it will be essential for mitigating risk, and ultimately, for staying in business.