From hurricanes, wildfires and floods to brown-outs and network failures, disasters are going to occur, whether they are natural or manmade, innocent or malicious. The ability to protect against such outages and site incidents, and to recover quickly from them, remains at the top of virtually all IT priority lists. In fact, millions upon millions are being spent on disaster recovery (DR) solutions because of the “what if” question. “What if we were to face a disaster? Could we protect our data and service availability?” “What if we faced the disaster and lost data and/or access?” “What if we couldn’t recover certain data and/or services for hours… days… weeks… ever?” The answers to the last two questions are rather unpleasant for most. Consequently, it is clear why building a bullet-proof DR strategy, backed by the best technology possible, remains such a high priority.
The Importance of Creating a Data Value Hierarchy
The first step in creating a sound disaster recovery plan is to quantify the value of your organization’s data by breaking it into classes based on importance. The following figure represents a typical tiered, pyramidal hierarchy illustrating the relative value of the five classes of data within an organization.
During the process of determining the data value hierarchy, it becomes clear that some services and applications can tolerate an outage better than others. Certain applications, such as emergency services, sales and order processing, and other customer-facing applications, demand virtually instant recovery, while others can withstand downtime of up to a few hours. Therefore, a key step during this initial planning stage is for the stakeholders in your organization to agree on recovery priorities and benchmarks for successful recovery at each level of the data hierarchy.
It is likewise important that your organization recognize that the value of data may change over time, and that it is the value of the data that should determine the type of recovery (not the other way around). In other words, data that is mission-critical just prior to the point of disaster (e.g., a potential sale on an e-commerce application 30 seconds before disaster strikes) may not hold the same value for your business after power is lost and cannot be restored for four hours. Clearly, the application running these types of transactions is critical and must be restored immediately, even if that particular customer does not return immediately to complete the transaction. Therefore, careful thinking and a long-term, wide vision are critical at this stage of the planning process.
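One way to record the benchmarks that stakeholders agree on is a small table keyed by data class. The sketch below is only an illustration: the class labels, the field names and every number are placeholders invented for this example (and only three of the five classes are shown), not values taken from any real plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryBenchmark:
    data_class: str       # hypothetical label for one level of the hierarchy
    max_downtime_s: int   # longest outage the stakeholders will accept
    max_data_loss_s: int  # largest window of lost writes they will accept


# Placeholder values for illustration only; real figures come from stakeholders.
BENCHMARKS = [
    RecoveryBenchmark("class 1: customer-facing transactions", 60, 0),
    RecoveryBenchmark("class 2: internal business applications", 4 * 3600, 15 * 60),
    RecoveryBenchmark("class 3: archives", 72 * 3600, 24 * 3600),
]


def recovery_succeeded(benchmark, downtime_s, data_loss_s):
    """True when an observed recovery met the agreed benchmark for its class."""
    return (downtime_s <= benchmark.max_downtime_s
            and data_loss_s <= benchmark.max_data_loss_s)
```

Writing the benchmarks down in this form makes it possible to check a DR drill against them mechanically rather than by impression.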
Once your organization’s data is bucketed into classes based on value, the next step can begin: defining the parameters for protecting and recovering data at each of these levels. There are of course many different DR strategies and technologies available today. So how do you decide? The best approach for your organization is one that utilizes as many relevant methods and technologies as possible, and is tailored to the data values predominant in your organization, as well as your budget. In other words, one size does not fit all when it comes to DR. However, there are core capabilities that your solution should include.
Core DR Capabilities
Regardless of the DR methods and technologies selected, every sound DR solution should, at a minimum:
- Be able to fail back after failover
- Be able to resume data backup operations quickly (this may also be required by regulations)
- Address the limited capacity and performance, and the high cost, of operating from the disaster site
- Protect against the loss of cached data in the event of a power failure, through one or more of the following methods:
  - Automatic switchover to cache write-through mode (cached data is flushed from the cache at the moment of disaster)
  - UPS integration (for automatic, graceful shutdown, and coordination with UPS shutdown services at point of disaster)
  - Battery backup of cached data (if the system continues operating in write-back mode)
A particularly challenging disaster situation is a power failure, as many application servers operate in write-back cache mode in order to maximize I/O performance. In write-back mode, as the initiator sends data to the controller to write to the storage device, these writes are buffered in cache, and the controller sends a confirmation to the initiator right away. The buffered data is only written to the storage when absolutely necessary, so this configuration has the advantage of offering a very high level of throughput, since the timing of buffered writes can be optimized for high performance. However, one of the dangers of this arrangement in a power failure is that the cached data can easily be lost, since it has often not yet been written to storage at the time of the crash. Therefore, a system operating in write-back mode must utilize a safeguard, such as one of the aforementioned methods, in order to protect itself from data loss in this type of situation.
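The exposure described here can be shown with a toy model. The following is a minimal sketch, not a real controller: the class `CachedController` and all of its method names are hypothetical, the cache and disk are plain dictionaries, and `on_power_warning` stands in for whatever signal a UPS integration would deliver.

```python
class CachedController:
    """Toy storage controller with a volatile write cache (illustration only)."""

    def __init__(self, write_back=True):
        self.write_back = write_back
        self.cache = {}  # volatile: contents vanish on power loss
        self.disk = {}   # persistent storage

    def write(self, block, data):
        if self.write_back:
            self.cache[block] = data  # acknowledged before it reaches the disk
        else:
            self.disk[block] = data   # write-through: persisted, then acknowledged
        return "ack"

    def flush(self):
        """Write every buffered block to disk and empty the cache."""
        self.disk.update(self.cache)
        self.cache.clear()

    def on_power_warning(self):
        """Safeguard: flush the cache and switch to write-through mode."""
        self.flush()
        self.write_back = False

    def power_loss(self):
        """Simulate a crash; return how many acknowledged blocks were lost."""
        lost = len(self.cache)
        self.cache.clear()
        return lost
```

Without the safeguard, every acknowledged write still sitting in the cache is lost at `power_loss()`. With `on_power_warning()` called first, later writes go straight to disk and nothing is left in the cache to lose.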
DR Methods in IP-Storage
Today, several DR methods exist for IP-SAN (storage area network) appliances that balance efficiency, reliability and speed against cost in protecting applications and data during disasters. The following diagram illustrates the major types of DR available for SANs, and their corresponding value for each level of the data value hierarchy.
Here we see that real-time replication with file versioning (also called remote replication or mirroring) is at the pinnacle in terms of performance, covers the most sensitive data and is (not surprisingly) the most expensive solution. At the other end of the spectrum, tape-based backups are quite affordable, but they are slow in restore and recover operations and prone to error, so they are generally suitable only for the least sensitive data types.
Close-up: Remote Replication / Mirroring
Two methods are most commonly used to perform DR through remote replication. The first is synchronous replication, in which an input/output (I/O) operation is completed for an initiating agent only after it has been written to both the local and the remote storage device. The second is asynchronous replication, in which the remote write is initiated before the I/O is completed, but the I/O is completed as soon as the local write finishes, without waiting for the remote write to complete.
Synchronous replication offers the minimum window for disaster data loss, since the initiator is assured of the mirror’s integrity when it receives a completion signal for an I/O operation. Asynchronous replication reduces latency, and consequently improves response time for I/O operations, but in many cases it does not significantly decrease the amount of bandwidth required. In addition, asynchronous replication allows a slightly greater window for data loss during DR, because I/O operations that have been completed at the initiator but not at the remote site may be lost in the event of a disaster.
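The difference in the data-loss window can be made concrete with a small simulation. This is a sketch under simplifying assumptions: `ReplicatedVolume` and its methods are hypothetical names, the network link is modeled as an in-memory queue, and "waiting for the remote write" is represented by applying it immediately.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a local volume with a remote mirror (illustration only)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = {}
        self.remote = {}
        self.in_flight = deque()  # remote writes sent but not yet applied

    def write(self, block, data):
        self.local[block] = data
        if self.synchronous:
            # Synchronous: the I/O completes only after the remote copy is written.
            self.remote[block] = data
        else:
            # Asynchronous: the remote write is started but not awaited.
            self.in_flight.append((block, data))
        return "complete"

    def drain(self, count=None):
        """Apply up to `count` pending remote writes (all of them if None)."""
        pending = len(self.in_flight)
        n = pending if count is None else min(count, pending)
        for _ in range(n):
            block, data = self.in_flight.popleft()
            self.remote[block] = data

    def exposed_blocks(self):
        """Blocks completed locally whose latest data the remote site lacks."""
        return {b for b, d in self.local.items() if self.remote.get(b) != d}
```

In synchronous mode `exposed_blocks()` is always empty once `write` returns. In asynchronous mode it contains whatever is still queued, which is exactly the data that would be lost if the primary site failed at that moment.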
Balancing RTO and RPO
When the goal is simply the lowest recovery time objective (RTO) and recovery point objective (RPO) possible, in order to restore applications that require continuous uptime no matter what the cost, remote replication/mirroring is a great way to go. Remote replication of backup data protects against both disasters and complete site failures.
As noted earlier, however, DR should not be approached as one size fits all. From the data hierarchy exercise it is clear that in most cases a balanced combination of several techniques is needed for near-complete DR capability, since not all data is equally served by all types of data protection. Therefore, for want of a better term, a “graduated” approach is usually best to achieve the balance laid out as a goal in the DR planning stages. This balanced approach is often found in a software-based DR solution that encompasses several of the different DR strategies within a single software stack.
Point-in-time replication (also known as snapshot-assisted replication), a less expensive yet similarly robust method of DR, represents a good balance in terms of cost, performance, and the types of data and applications it can protect and restore. As shown in Figure 3, it covers nearly the complete spectrum of data types and sits near the middle in terms of cost, yet offers excellent DR capability, low RTO and RPO, and the ability to get organizations back up and running within a matter of seconds or minutes, in most cases. In comparison to remote replication, snapshot-assisted replication boasts similar performance in DR, lower total operating cost, support for heterogeneous storage, and vastly better rollback capability. In fact, the underlying architecture of snapshot-assisted remote replication, which operates by sending incremental snapshot information in a compressed form over the link between the remote and local systems, makes it ideal for many types of recovery situations.
One important distinction between snapshot-assisted replication and real-time replication is that while real-time replication cannot protect against data corruption, snapshots (in both copy and write mode) handle this extremely well because they can recover an uncorrupted picture of the data from the numerous snapshots they have stored, and can generally do this extremely quickly.
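The rollback idea can be sketched in a few lines. This is an illustration only: `latest_clean_snapshot` is a hypothetical helper, snapshots are modeled as `(timestamp, data)` pairs ordered oldest first, and `is_clean` stands in for whatever application-specific integrity check an organization actually uses.

```python
def latest_clean_snapshot(snapshots, is_clean):
    """Return the newest (timestamp, data) pair that passes is_clean, or None."""
    for taken_at, data in reversed(snapshots):
        if is_clean(data):
            return taken_at, data
    return None


# A mirror holds only the current state, so corruption is copied to it as well.
snapshots = [(1, "orders=10"), (2, "orders=11")]
primary = "##corrupted##"
mirror = primary                 # the real-time mirror now holds the corruption too
snapshots.append((3, primary))   # the newest snapshot captured it as well

restore_point = latest_clean_snapshot(snapshots, lambda d: d.startswith("orders="))
```

Here the mirror has nothing usable to offer, while the snapshot history still contains the state from before the corruption occurred.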
Close-up: Snapshot-Assisted Replication (SAR)
While it is difficult to generalize, since the data protection and recovery needs of each organization vary so greatly, snapshot-assisted replication is very attractive as a DR methodology from the perspective of overall balance of RPO, RTO and cost. The diagram below highlights the architecture that underpins snapshot-assisted replication:
Classic SAR implementations take repeated snapshots of the storage system at close intervals, and rebuild the data onto the remote mirror by mounting successive snapshots and writing them to the remote system. This method has the advantage of allowing the process of data backup to be performed independently of the I/O operations. It also saves bandwidth when multiple repeated or frequent writes occur at the same I/O location, because only the final write is transmitted as part of the snapshot. The drawback of this method in comparison to remote replication is the same as that of other asynchronous replication methods: it provides a slightly larger time window for data loss during disasters. Additionally, since each snapshot must be rebuilt at the remote system, bandwidth is not necessarily reduced by a large amount. However, these concerns are often outweighed by the speed and ease of recovery that SAR offers.
Point-in-time replication of backup data to an off-site facility offers the added benefit of reducing or eliminating the need for tape-based backups. Appliances protected with snapshot-assisted replication enable near-instant recovery of access to backup data sets from a different location with no backup data loss. Your organization can recover from site outages almost instantaneously through quick re-syncs of out-of-sync backup storage repositories.
While snapshots present a consistent state of the volume to the user in terms of the outstanding I/Os to the storage server, this state may not be consistent with respect to the application using the volume, in terms of its recoverability. Furthermore, the application might be using multiple volumes on the storage server for various purposes. For example, Microsoft® Exchange server uses a log volume and a data volume to maintain, manage and protect its data. Thus the snapshots of both volumes must be synchronized in order to achieve application recoverability. Also, the I/Os are often cached on the initiator in write-back mode, causing delayed writes to the volume. Thus a snapshot taken from within the storage server does not always assure application-consistent data recoverability after a disaster.
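The bandwidth saving from repeated writes to the same location can be sketched as a snapshot difference. The code below is a simplified illustration, not how any particular appliance works: the function names are hypothetical, a volume is modeled as a dictionary from block name to string contents, block deletion is ignored, and `zlib` with JSON stands in for whatever compressed wire format a real product uses.

```python
import json
import zlib


def snapshot_delta(previous, current):
    """Blocks whose contents differ between two point-in-time snapshots."""
    return {blk: data for blk, data in current.items()
            if previous.get(blk) != data}


def encode_delta(delta):
    """Serialize and compress a delta for transmission to the remote site."""
    return zlib.compress(json.dumps(delta, sort_keys=True).encode("utf-8"))


def apply_delta(remote, payload):
    """Decompress a delta and write its blocks onto the remote mirror."""
    remote.update(json.loads(zlib.decompress(payload).decode("utf-8")))


volume = {"blk0": "A", "blk1": "B"}
snap1 = dict(volume)        # first snapshot
remote = dict(snap1)        # remote mirror is in sync with snap1

for version in ("C1", "C2", "C3"):
    volume["blk1"] = version  # three writes to the same location

snap2 = dict(volume)        # second snapshot
delta = snapshot_delta(snap1, snap2)
apply_delta(remote, encode_delta(delta))
```

Only the final contents of `blk1` appear in `delta`; the two intermediate writes never cross the link. Any writes made after `snap2` remain unprotected until the next snapshot is shipped, which is the larger data-loss window mentioned above.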
For these reasons there needs to be some sort of application- and host-aware agent on the application server which communicates with the storage server in order to create application-aware, consistent snapshots. This host agent must be able to quiesce the application and trigger the snapshots on all the application-specific volumes (i.e., the consistency group).
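The sequence such a host agent follows can be sketched as quiesce, snapshot every volume in the group, then resume. Everything named below is hypothetical and heavily simplified: `Application` is a stand-in that buffers writes on the host, volumes are lists of records, and the "snapshot" is just a copy taken while the application is paused.

```python
from contextlib import contextmanager


class Application:
    """Stand-in for an application that buffers writes on the host."""

    def __init__(self, volumes):
        self.volumes = volumes  # volume name -> list of committed records
        self.pending = []       # (volume name, record) still buffered on the host
        self.quiesced = False

    def write(self, volume, record):
        if self.quiesced:
            raise RuntimeError("application is quiesced")
        self.pending.append((volume, record))

    def flush(self):
        """Push every buffered write down to its volume."""
        for volume, record in self.pending:
            self.volumes[volume].append(record)
        self.pending.clear()


@contextmanager
def quiesce(app):
    """Flush host-side buffers and block new writes until the block exits."""
    app.flush()
    app.quiesced = True
    try:
        yield
    finally:
        app.quiesced = False  # always resume, even if the snapshot fails


def snapshot_consistency_group(app, group):
    """Snapshot all volumes in the group at one application-consistent point."""
    with quiesce(app):
        return {name: list(app.volumes[name]) for name in group}
```

Because the buffers are flushed and writes are held off while every volume in the group is copied, a log volume and a data volume captured this way describe the same moment in the application's history.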
The task of selecting a method of DR is often met with a bit of trepidation. After all, there are so many potential choices to consider. Simply keep in mind that since no one type of DR offers a complete solution by itself, the best solution is a combination of different elements of DR that balances cost, RPO and RTO. Whatever the choice, no final decisions should be made before key individuals within your organization are in agreement on the data value hierarchy, and acceptable recovery benchmarks have been set.
Justin Bagby, director of the StorTrends Division of American Megatrends (AMI), has more than 15 years of technology experience in storage, RAID, virtualization (VMware, MS Hyper-V, Citrix and RHVS), BIOS, remote management, utilities, and server db applications (MS SQL, Exchange and Oracle). Bagby is responsible for the sales, business development, marketing, marketing programs, solutions engineering, and support engineering facets of the StorTrends product division. Bagby earned his bachelor of arts degree in business management at Georgia Southern University. He enjoys spending time with his family, hunting, fishing, and playing the role of a part-time farmer on their farm outside of Millen, Ga.