Print this page
Wednesday, 10 October 2018 07:45

Disaster Recovery Components within Policies and Procedures


While many would consider a discussion about disaster recovery policies and procedures boring (I certainly don’t), in reality, policies and procedures are 110% vital to a successful DR. Your organization could have the greatest technology in the world, but without a solid plan and policy guide in place, your disaster recovery efforts are doomed to fail.

A tad hyperbolic, perhaps. But the lack of properly updated documentation is one of the biggest flaws I see in most companies’ DR plans.

A disaster recovery plan is a master plan of a company’s approach to disaster recovery. It includes or references items like runbooks, test plans, communications plan, and more. These plans detail the steps an organization will take before, during, and after a disaster, and are usually related specifically to technology or information. Having it all written down ahead of time helps streamline complex scenarios, ensures no steps are missing from each process, and provides guidance around all elements associated with the DR plan (e.g. runbooks and test plans).

Creating a plan also provides the opportunity for discussion around topics that have likely not been considered before or are assumed to be generally understood.

*Which applications or hardware should be protected?
*When, specifically, should a disaster be declared, who can make that declaration, and who needs to be notified?
*Have response-tiers been identified depending on the type of disaster?

*Which applications correspond to each tier?

The most critical condition of a successful DR plan is that it be kept updated and current—frequently. An outdated DR plan is a weak DR plan. Applications change. Hardware changes. And organizations change, both in terms of people and locations. Dealing with a disaster is hard enough, but no one needs the added pressure of trying to correlate an outdated organization chart with a current one. Or trying to map old server names and locations to existing ones. Pick a time-metric and a change-metric for when your DR plan will be update (e.g. every six months, every year, upon a major application update to a mission-critical system). Pick some conditions and stick to them.

1) Runbooks
Runbooks are step-by-step procedure guides for select tasks within an IT organization. These reference guides are tailored to describe how your organization configured and implemented a specific technology or software and focuses on the tasks the relevant teams would need to perform in the event of a disaster.

*How to startup or shutdown an application/database/server.
*How to fail-over a server/database/storage array to another site.
*How check if an application/database has started-up correctly.

The goal is to make your runbooks detailed enough that any proficient IT professional could successfully execute the instructions, regardless of their association with your organization. A runbook can consist one big book, or several smaller ones. They can be physical or electronic (or both). Ideally, they are stored in multiple locations.

Nobody likes documentation. But in a disaster, emotions and stress can run very high. So why leave it all up to memory? Having it all documented gives you a reliable back-up option.

Depending on the type of disaster, it’s possible the necessary staff members wouldn’t be able to get online, specifically the person who specializes in Server X, Y, or Z. Perhaps the entire regional team is offline, and a server/application has failed. A Linux admin is available, but he doesn’t support this server day in and day out. Now suddenly, he’s tasked with starting up the server and applications. Providing this admin with guide on what to do, what scripts to call, and in what order, might just be the thing that literally saves your company.

And if your startup is automated—first off, great. But how do you check to be sure everything started up correctly? Which processes should be running? Or what log to check for errors? Is there a status code that can referenced? Maybe this is a failover scenario: the server is no longer located in Philadelphia, and as such, certain configuration values need to be changed. Which values are they and what should they be changed to?

Runbooks leave nothing to memory or chance. They are the ultimate reference guide and as such should detail each detail of your organization’s DR plan.

2) Test Plans
Test Plans are documents that detail the objects, resources, and processes necessary to test a specific piece of software or hardware. Like runbooks, they serve as a framework or guideline to aide in testing, and can help eliminate the unreliable memory-factor from the disaster equation. Usually, test plans are synonymous with Quality Assurance departments. But in a disaster, they can be a massive help in organization and accuracy.

Test Plans catalog the test’s objectives, and the steps needed to test those objectives. They also define acceptable pass/fail criteria, and provide a means of documenting any deviations or issues encountered during testing. They are generally not as detailed as runbooks, and in many cases will reference the runbooks required for a specific step. 

3) Crisis Communication Plan
A Crisis Communication Plan outlines the basics of who, what, where, when, and how information gets communicated in a crisis. As with the above, the goal of a Crisis Communication Plan is to get many items sorted out beforehand, so they don’t need to be made up and/or decided upon in the midst of a trying situation. Information should be communicated accurately and consistently, and made available to everyone who needs it. This includes not only technical engineers but also your Marketing or Public Relations teams.

Pre-defined roles and responsibilities help alleviate the pressure on engineers to work in many different directions at once and can allow them to focus on fixing the problems while providing a nexus for higher-level managers to gather information and make decisions.
Remember, the best DR plans prepare your organization before, during and after a disaster,  are focused equally on people as well as data and computers, and its creators have taken the time to and money to test, implement, and update it over time – engaging the entire company for a holistic approach.

Hank YeeAs an Anexinet Delivery Manager in Hybrid IT & Cloud Services, Hank Yee helps design, implement and deliver quality solutions to clients. Hank has over a decade of experience with Oracle database technologies and Data Center Operations for the Pharmaceutical industry, with a focus on disaster recovery, and datacenter and enterprise data migrations.