By Leigh Sleeman, partner and senior IT program manager of T2 Tech Group
Planning ahead for a worst-case scenario can save your organization significant time and money when it matters most. An IT disaster recovery (DR) solution is instrumental in ensuring systems and data are recovered in the event of an emergency. All too often, business stakeholders realize they need a DR solution only after an emergency has struck their company or a competitor. To be proactive, a company must first recognize that even the most resilient data centers can experience power failures or operational errors that result in unforeseen outages, leaving the organization without applications for hours, days and possibly weeks.
What is a Disaster Recovery Solution?
A DR solution is a set of policies, tools and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural disaster or human-induced error. The solution outlines the process required to restore an application after an unanticipated event has caused an outage in the primary system running it. Such plans detail the actions to take before, during and after disaster strikes. Simply put, a DR solution reassures an organization that systems will be restored and running properly in a timely fashion.
Overall, leaders must put a workable DR plan, including proper documentation, in place to limit the loss of customer satisfaction and revenue during an unpredictable system failure.
What is the biggest challenge an organization faces when deciding to invest in a comprehensive DR plan? Prioritization. There is no shortage of priorities and projects, and DR plans tend to rank lower because it is difficult to showcase an immediate benefit. However, it is crucial for senior leadership to understand the importance of being prepared for a potential system loss. IT leaders are encouraged to have an open dialogue with key stakeholders to stress the importance of creating a DR plan. Examples from peers that have suffered damaging system failures keep these discussions from being limited to a hypothetical failure.
Steps for Creating a Disaster Recovery Solution
Disaster recovery situations vary dramatically with circumstances specific to each organization, but three problems commonly surface when a system fails: outdated documentation, incomplete or missing technology, and lack of DR testing and training.
Before focusing on implementing a DR solution, first ask yourself, “How can our organization have a DR solution that addresses these three areas without repeatedly impacting productivity or disrupting business operations?” The answer isn’t as hard as it seems. Organizations must use a multi-phased approach to minimize system downtime and control costs during the creation and maintenance of a DR solution.
- Pre-planning Phase
- Creating and implementing a DR solution requires pre-planning work to understand the organization’s system priorities in an emergency. During this phase, numerous questions about the organization’s business operations, potential financial impact and customer experience must be answered. A careful analysis and criticality ranking of each application provides a fundamental understanding of what is vital to the business. Once the pre-planning phase is complete, each application will go through three steps: assessment, documentation and testing.
- Creation Phase
- The three steps outlined below should be completed for each application deemed critical during the pre-planning phase:
- Step 1: Assessment
- The goal of the assessment step is to gather information about the identified application and its associated documentation. The teams should obtain an accurate list of all servers associated with each application and determine the appropriate replication/data protection tool for the types of servers identified. This step is critical because it gives teams the information they need to start the disaster recovery process without unanticipated roadblocks.
- Step 2: Building and Documentation
- In the past, DR plans were often voluminous; the challenge is that no one has time to keep such plans up to date. We recommend including only the minimum information required to successfully perform a failover, which includes the following components:
- DR process flow diagram: This document outlines the workflow across the departmental and technical components
- Architecture diagram: A comprehensive architecture diagram should include items such as:
- Server name
- Core service dependencies
- Replication method
- Playbook: This provides a sequenced set of tasks that can be used to track the progress during an actual failover event
- Runbook: This guides the technical staff through the specific steps required to successfully fail over the critical applications to the DR site
- Business impact analysis: This establishes the acceptable amount of time the application can be unavailable (recovery time objective, or RTO) and the maximum amount of data that can acceptably be lost (recovery point objective, or RPO)
- Keep in mind that, as you build out your DR technologies, you should be simultaneously updating the documentation as outlined above.
- Step 3: Test
- This step involves testing the individual application by running the playbooks to verify that the documentation and systems work properly. If the testing is not successful, troubleshooting must occur. Ideally, testing is performed without taking down the production site, causing operational disruption or degrading normal business activity.
- Master Planning and Tabletop Testing Phase (ongoing)
- Once each application has gone through the assessment, documentation and testing steps, the individual plans need to be stitched together to create a master plan. This process involves prioritizing the sequence in which the applications will be restored. The master plan designates roles within the organization, showing who is responsible for communication, coordination, liaison and other duties. It is worth noting that some of the steps identified for an application may take a considerable amount of time to complete; therefore, they may need to be the first items started in a master plan, even if they are not the first priority to finish.
- Lastly, it is time to perform a tabletop exercise or mock disaster. It should be conducted to educate the appropriate staff and validate readiness for a potential disaster. The goal of this exercise is to validate that the three key areas of up-to-date documentation, working technology and trained staff are all in place. This tabletop DR scenario should be performed every 6-12 months so that the technical staff and organization leadership can be confident that they are ready for the unexpected.
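The point about long-running steps needing to start first can be illustrated with a small scheduling sketch. This is only one possible way to reason about it, not the method used in the engagements described here: it assumes each application has an RTO from its business impact analysis and an estimated failover duration, and it orders work by the latest moment each task can begin and still meet its RTO. The class, function names and sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTask:
    """One application's failover work (all fields are illustrative assumptions)."""
    application: str
    priority: int          # 1 = most critical, as ranked during pre-planning
    rto_hours: float       # recovery time objective from the business impact analysis
    duration_hours: float  # estimated time to complete the failover steps

    @property
    def latest_start_hours(self) -> float:
        # Hours after the disaster by which work must begin to meet the RTO.
        return self.rto_hours - self.duration_hours


def start_order(tasks):
    """Order tasks by latest allowable start; break ties by business priority."""
    return sorted(tasks, key=lambda t: (t.latest_start_hours, t.priority))


def at_risk(tasks):
    """Tasks whose estimated duration already exceeds their RTO."""
    return [t for t in tasks if t.latest_start_hours < 0]


# Hypothetical sample data, not taken from any real environment.
tasks = [
    RecoveryTask("EHR", priority=1, rto_hours=4, duration_hours=1),
    RecoveryTask("Email", priority=2, rto_hours=8, duration_hours=2),
    RecoveryTask("Imaging archive", priority=3, rto_hours=12, duration_hours=10),
]

for task in start_order(tasks):
    print(f"{task.application}: start within {task.latest_start_hours} h "
          f"(priority {task.priority})")
```

With these sample numbers, the imaging archive is started first even though it has the lowest business priority, because its ten-hour restore leaves the least slack before its RTO. A real master plan would also have to account for dependencies between applications and for how many tasks the staff can run in parallel, which this sketch ignores.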
Case Example: Sharp HealthCare
Sharp HealthCare, a T2 Tech Group customer, is a not-for-profit integrated regional health care delivery company with four acute-care hospitals, three specialty hospitals and three affiliated medical groups, with 18,000 employees and 2,700 affiliated physicians. In 2016, Sharp entrusted T2 Tech Group to create and implement a disaster recovery solution. At the time, Sharp had a primary data center and a secondary data center, which are the two essential pieces needed to avoid data loss in the event of a disaster. However, Sharp needed to improve its detailed DR solution and the processes to follow should a disaster occur.
T2 Tech Group used the multi-phased approach outlined above to create a new DR solution with Sharp. The team developed architectural diagrams, built out solutions and developed runbooks and playbooks for 30 of Sharp’s most critical applications.
The highest priority during the process was maintaining hospital operations and delivering uninterrupted services to patients and staff while testing a disaster scenario. Together, the two teams efficiently went through the steps of creating a DR solution for each application without taking down the production site or causing any disruption or degradation of normal clinical activity. This process was streamlined by a combination of the Sharp IT team’s knowledge of the applications and the T2 Tech hybrid-Agile methodology, which balances upfront planning with an iterative execution approach. The new DR solution implemented with T2 Tech Group will help ensure the health care leader can serve its patients without interruption even in a disastrous situation.
Conclusion
For organizations that rely on IT to provide core services to communities and patients, such as health care systems, it’s critical to have a DR solution that details the people, processes and technology needed to restore system function. Without a solution in place, the inability to effectively manage an unexpected outage can threaten an organization’s ability to fulfill its ongoing mission. Maintaining an up-to-date DR solution, which includes testing the plan every 6-12 months and making updates or adjustments as needed, is essential for long-term organizational success. Development of DR solutions is not a one-time exercise, but rather an ongoing commitment of due diligence that needs to be made by an organization and its IT department.
About Leigh Sleeman:
Leigh Sleeman is a partner and senior IT program manager at T2 Tech Group, with over 19 years of experience in healthcare information technology. Throughout his career, Leigh has seamlessly transitioned business needs into technical requirements, forged excellent relationships with executive leaders, functioned as a primary IT liaison, and directed teams of high-caliber senior project managers.