DR Object Stores Evolve Beyond the “Cold Data Tier”
The cost-efficient scalability of object storage makes it an ideal resource for data protection, especially when delivered through a cloud-based service. However, current approaches to data protection with object storage typically employ the object store as a “cold data tier” that is tightly coupled to the disaster recovery (DR) platform’s datastore in the recovery site. It’s essentially an archive layer for the DR platform, which continues to run with redundant compute, storage and network resources, waiting idly for a failover operation.
A New Role for Object Storage in DR
An alternative approach is to make the object store the primary target for all replicated data. As data is written to primary storage in the protected environment, it is concurrently replicated directly to the object store. There are two critical requirements here: First, data must be continuously intercepted and replicated to the object store from the moment it is written to primary storage in the protected site; and second, data must be presented to the appropriate container in the object store directly, as a continuous stream of objects rather than blocks or files. In this manner, the object store can serve as the exclusive data repository for a cloud-based continuous data protection (CDP) platform.
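As a rough illustration of the second requirement, the sketch below batches intercepted writes and emits them as a sequence of compressed objects. It is a minimal sketch under stated assumptions, not a description of any particular product: the `ObjectStreamer` class, the `put_object` callable, the `stream/` key prefix, and the JSON record layout are all hypothetical.

```python
import json
import zlib


class ObjectStreamer:
    """Batches intercepted writes and streams them out as numbered objects."""

    def __init__(self, put_object, container, max_bytes=1 << 20):
        self.put_object = put_object  # assumed callable(container, key, blob)
        self.container = container
        self.max_bytes = max_bytes    # flush once this much data is pending
        self.pending = []
        self.pending_bytes = 0
        self.sequence = 0

    def on_write(self, vm_id, offset, data):
        """Call for every write intercepted on primary storage."""
        self.pending.append({"vm": vm_id, "offset": offset, "data": data.hex()})
        self.pending_bytes += len(data)
        if self.pending_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        """Send pending writes as one compressed object; return its key."""
        if not self.pending:
            return None
        key = f"stream/{self.sequence:012d}"  # zero-padded keys sort in write order
        blob = zlib.compress(json.dumps(self.pending).encode("utf-8"))
        self.put_object(self.container, key, blob)
        self.sequence += 1
        self.pending, self.pending_bytes = [], 0
        return key
```

In this sketch the data leaves the protected site already shaped as objects, so nothing on the object store side has to translate blocks or files.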
To support recovery from an actual disaster, the objects in the object store container must be self-describing, so that their “rehydration” (the extraction and deployment of systems and data into a recovery environment) has no dependencies on either the protected site (which may not be accessible at all) or the destination recovery environment (which may be selected on the fly during the execution of the failover). In other words, everything required to re-create the protected cluster must be accessible from the object store container. This includes virtual machines (VMs) along with their operating systems, applications and data, but it also includes configuration information such as network configurations, virtual resource allocations and access permissions.
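One way to picture a self-describing object is a payload that carries its own header, so that a recovery environment needs nothing but the object itself to interpret it. The following is a minimal sketch under assumed conventions: the length-prefixed JSON header, the field names, and the `pack_object` and `unpack_object` helpers are hypothetical and chosen only for illustration.

```python
import json
import struct
import zlib


def pack_object(domain_id, vm_id, offset, data, config=None):
    """Build an object whose header describes the payload that follows it."""
    header = json.dumps({
        "domain_id": domain_id,
        "vm_id": vm_id,
        "offset": offset,
        "length": len(data),
        "config": config or {},  # e.g. network or resource settings
    }).encode("utf-8")
    # Layout: 4-byte big-endian header length, header, compressed payload.
    return struct.pack(">I", len(header)) + header + zlib.compress(data)


def unpack_object(blob):
    """Recover header and payload from the object alone, with no outside metadata."""
    (header_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    data = zlib.decompress(blob[4 + header_len:])
    if len(data) != header["length"]:
        raise ValueError("payload length does not match header")
    return header, data
```

Because `unpack_object` reads only the blob, the same routine could run in any recovery environment that can fetch objects from the container.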
Integrated and Decoupled
When DR software in the protected environment can connect directly to a remote cloud object store and stream data to it in compressed objects, taking full advantage of the object store’s security, scalability and multi-tenancy features, we say that the DR software is integrated with the object store. When the object store does not need to receive data through intermediary DR software in its own environment, and the protected systems, data and configuration information can be extracted wholly from the object store for deployment into any recovery environment with no additional metadata required, we say that the DR software is decoupled from the object store. The result is a relationship in which the DR software and the object store are simultaneously integrated and decoupled.
In an integrated/decoupled DR service offering based on an object storage platform, the service provider has the advantage of maintaining recovery copies of their customers’ systems and data in a secure, highly scalable, multi-tenant and cost-efficient object store and rehydrating protected systems and data into a recovery runtime environment “on demand” — that is, at the time of a failover operation. Importantly, the object store and recovery runtime environment can be located in physically separate locations — and even maintained by different service providers. This is a true multi-cloud model for DR services, capable of supporting a variety of deployment models.
Garbage Collection in the Object Store
Two key requirements of this operational model are garbage collection and protection domain management. Let’s look at garbage collection first. In the primary storage environment, data is constantly being overwritten and deleted. But in the object store, data is just continuously appended in new objects. Without some means of deleting objects containing obsolete or invalidated data, the object store containers would grow without bound. However, our model calls for no DR-specific intelligence coupled to the object store. So how can the growth of data in the object store be managed? The solution is to monitor the data that is overwritten or deleted in the primary storage environment and request deletion of the corresponding objects. In cases where an object’s data is “mostly” obsolete, the still-valid data may be written into a new object that is sent to the object store before the “stale” object is deleted. In this manner, garbage collection in the object store may be executed from the DR software running in the protected site.
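The bookkeeping this implies can be sketched as follows. This is an illustrative sketch only: the `GarbageCollector` class, the block-address granularity, the in-memory dictionary standing in for the container, and the 75% threshold used to decide that an object is “mostly” obsolete are all assumptions, not details taken from any real product.

```python
class GarbageCollector:
    """Tracks which objects still hold valid data, from the protected site."""

    def __init__(self, store, compact_threshold=0.75):
        self.store = store      # stand-in container: object key -> {address: data}
        self.location = {}      # block address -> key of object holding it
        self.live = {}          # object key -> addresses that are still valid
        self.next_id = 0
        self.threshold = compact_threshold

    def _put(self, blocks):
        key = f"obj-{self.next_id:08d}"
        self.next_id += 1
        self.store[key] = dict(blocks)
        self.live[key] = set(blocks)
        for addr in blocks:
            self.location[addr] = key
        return key

    def write(self, blocks):
        """Append newly written blocks as one object, then collect stale ones."""
        superseded = set()
        for addr in blocks:
            old = self.location.get(addr)
            if old is not None:
                self.live[old].discard(addr)  # overwritten on primary storage
                superseded.add(old)
        key = self._put(blocks)
        for old in superseded:
            self._collect(old)
        return key

    def _collect(self, key):
        total = len(self.store[key])
        live = self.live[key]
        obsolete_fraction = 1 - len(live) / total
        if live and obsolete_fraction < self.threshold:
            return  # still mostly valid: keep the object as it is
        if live:
            # Mostly obsolete: copy survivors to a new object before deleting.
            self._put({addr: self.store[key][addr] for addr in live})
        del self.store[key]
        del self.live[key]
```

Note that every decision is made by the caller of `write`, which matches the model above: the store itself only receives put and delete requests.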
Protection Domains and Recovery
A protection domain (or just “domain”) is a set of VMs that are protected together. The VMs that share a domain typically have the same level of criticality, may be connected (e.g., vApps), and share a common datastore. Additionally, all VMs and data in a protection domain are replicated to a single, dedicated container within the object store. In the event of a disaster, the protection domain is the unit of failover granularity: VMs in the same domain all fail over together.
A key objective for the DR solution is that following a failover, the continuous protection of the VMs must remain uninterrupted, regardless of the condition of the protected site, including a “whole site failure,” in which the protected site cannot be reached in any way. This introduces several requirements. First, the DR software should also be available in the recovery environment and should be able to continue replication into the object store. Second, the protection domain should obtain the information necessary to locate and authenticate access to its container from special “domain information objects” that are obtained from the container along with the objects containing the VMs and their data, etc. Finally, following failover, the recovery site should take over exclusive ownership of the protection domain and its access to the container.
Changing domain ownership is straightforward if the protected site fails completely and the rehydration into the recovery environment is not interrupted. However, conflicts can occur in which the ownership of the protection domain is contested. For example, if the failure is partial, the domain in the protected site may continue to try to update the container. Also, if the protected site is recovered quickly before the failover has completed, the protected site may attempt to reclaim ownership while the recovery site is requesting it. The domain should be owned by only one site at any time, but in the decoupled model, sites do not communicate with each other, only with the container. Therefore, domain ownership and status are included in the domain information objects in the container. When ownership or status changes, the domain information object is updated accordingly. When a change in ownership is requested, the request is granted or denied based on the metadata obtained from the domain information objects.
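A minimal sketch of how such arbitration could work through the container alone is shown below. It assumes the container offers some form of conditional write (modeled here as `put_if_version`), which the text above does not specify. The `Container` class, the `domain-info` key, the `epoch` field, and the grant rules are likewise hypothetical.

```python
import threading

DOMAIN_INFO_KEY = "domain-info"  # hypothetical key of the domain information object


class Container:
    """In-memory stand-in for a container that supports conditional writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_version(self, key, value, expected_version):
        """Write only if the stored version is still the one the caller read."""
        with self._lock:
            version, _ = self._objects.get(key, (0, None))
            if version != expected_version:
                return False
            self._objects[key] = (version + 1, value)
            return True


def request_ownership(container, site, failover=False):
    """Return True if `site` owns the domain after this request."""
    version, info = container.get(DOMAIN_INFO_KEY)
    if info is not None and info["owner"] == site:
        return True
    if info is not None and not failover:
        return False  # another site owns the domain and no failover was declared
    epoch = 1 if info is None else info["epoch"] + 1
    new_info = {"owner": site, "status": "active", "epoch": epoch}
    # The conditional write fails if another site changed the object first.
    return container.put_if_version(DOMAIN_INFO_KEY, new_info, version)


def may_replicate(container, site):
    """A site checks this before updating the container."""
    _, info = container.get(DOMAIN_INFO_KEY)
    return info is not None and info["owner"] == site
```

In this sketch, a protected site that comes back after a failover finds the recovery site recorded as owner, so its plain reclaim request is denied and `may_replicate` tells it to stop writing. A real implementation would also need to guard the data writes themselves, since a check followed by a write is not atomic.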
Object storage has proved to be a secure, scalable and cost-effective resource for data protection based on traditional backups. With advances in DR software, it can also bring these advantages to solutions for continuous data protection and disaster recovery. By applying an approach in which the DR software is integrated with yet decoupled from the object store, service providers have increased options for providing DR services for their customers.