Increasingly complex operations have produced a proportional expansion in the number of priorities to be juggled across an enterprise. With the level of resources needed to keep up with day-to-day requirements, it is increasingly difficult to ensure that potential threats get sufficient attention. The challenge is greater still for those tasked with ensuring operational resilience through activities that consume time and energy with seemingly no near-term payoff.
For environments in which there is limited tolerance for disruption, resilience should itself be a well-established priority: something that needs to be done, not done as time permits. It needs to be on the A-list, embedded in the fabric of essential processes, a necessary component rather than a desirable accessory.
In many organizations, resilience is a collection of activities that get inserted into plans that were established without giving it consideration. Documentation, testing, and awareness, for example, are each initiated as an external requirement. When justified, what can be done to elevate continuity to the status of a fundamental requirement?
Perhaps the administration of resilience within an organization through a series of seemingly unrelated tasks is partially to blame. By finding a common thread among these activities, an organization can string them together to establish an annual “cycle,” a series of things that need to be delivered predictably each year to support resilience across the enterprise.
The following schedule seeks to raise the profile of resilience in this way:
The foundation of any program is the completion and/or review of the business impact analysis (BIA). Ideally these documents would be reviewed annually or as changes are introduced. For annual reviews, limit the requirement to changes made since the previous year and set the delivery date early in the 12-month cycle.
An essential element of the BIA is a determination of process criticality, as this supports the recovery time objective (RTO) for each corresponding application. The RTO, in turn, is a fundamental parameter in the development of a test plan. A solid test demonstrates not only the recoverability of an application, but also the ability to complete that recovery within expectations. For this reason, annual test planning intersects with the BIA process through the identification of system criticality as defined by an RTO.
Some organizations test the recovery of only the most critical applications: those with an RTO of less than 12 hours. From the BIA, the list of applications in scope for a test is easily established. This list can then be reconciled against the recovery documentation for the corresponding applications, a check that ensures such documentation exists and is current. Once this process is annualized, an attestation from an engineer who has used the document to recover the application, either in an actual event or during a test, adds expert validation of its accuracy and completeness.
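The scoping and reconciliation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Application` record, the function names, the sample data, and the one-year definition of "current" are all hypothetical, and only the 12-hour threshold comes from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

RTO_THRESHOLD_HOURS = 12            # threshold mentioned in the text
MAX_DOC_AGE = timedelta(days=365)   # assumption: "current" means reviewed within a year


@dataclass
class Application:
    """Hypothetical record of what a BIA might capture per application."""
    name: str
    rto_hours: float
    doc_last_reviewed: Optional[date] = None  # None means no recovery document exists


def in_test_scope(apps, threshold=RTO_THRESHOLD_HOURS):
    """Return the applications whose RTO is below the threshold."""
    return [a for a in apps if a.rto_hours < threshold]


def documentation_gaps(apps, today):
    """Map application name to 'missing' or 'stale' for each documentation gap."""
    gaps = {}
    for a in apps:
        if a.doc_last_reviewed is None:
            gaps[a.name] = "missing"
        elif today - a.doc_last_reviewed > MAX_DOC_AGE:
            gaps[a.name] = "stale"
    return gaps


# Invented sample data, for illustration only.
apps = [
    Application("payments", 4, date(2024, 3, 1)),
    Application("hr-portal", 48, date(2023, 1, 15)),
    Application("trading", 2, None),
    Application("ledger", 8, date(2022, 6, 30)),
]
scope = in_test_scope(apps)
gaps = documentation_gaps(scope, today=date(2024, 9, 1))
print([a.name for a in scope])
print(gaps)
```

Each entry in `gaps` would become a tracking item rather than a mark against the team that owns the document.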
In this environment, it helps if staff understand that incomplete or incorrect documentation is simply reported as needing work, and that the only consequence is the opening of a tracking item to ensure the identified flaw is resolved.
At this point, an application recovery test can be conducted, including the validation of related documentation. In line with the concept of an annual resilience cycle, conducting tests at a similar time each year offers many advantages. It permits staff to anticipate something that needs to be done annually and to keep the date free of potential conflicts with other equally important priorities. Additional considerations: complete annual testing before vacation season and avoid testing late in the year. In this way, if a test needs to be postponed, there will be sufficient time to ensure it is still completed during the year. It also leaves room for the re-testing of applications that either completely failed to recover or did not meet the RTO.
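The two re-test conditions named above, a complete failure to recover and a recovery that exceeded the RTO, can be expressed as a simple check. The sketch below is hypothetical: the `RecoveryResult` record, the `retest_queue` function, and the sample figures are invented for illustration, and it assumes that a recovery taking exactly the RTO counts as meeting it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryResult:
    """Hypothetical outcome of one application recovery test."""
    app: str
    rto_hours: float
    recovery_hours: Optional[float]  # None means the application did not recover


def retest_queue(results):
    """Return (application, reason) pairs for every result that warrants a re-test."""
    queue = []
    for r in results:
        if r.recovery_hours is None:
            queue.append((r.app, "failed to recover"))
        elif r.recovery_hours > r.rto_hours:
            # Assumption: recovering in exactly the RTO still meets it.
            queue.append((r.app, "missed RTO"))
    return queue


# Invented sample results, for illustration only.
results = [
    RecoveryResult("payments", 4, 3.5),
    RecoveryResult("trading", 2, 2.75),
    RecoveryResult("ledger", 8, None),
]
queue = retest_queue(results)
print(queue)
```

Scheduling the original tests early in the year is what leaves time to work through a queue like this before the cycle ends.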
The results of the executed test will yield a list of items requiring correction. Ideally, these issues are assigned and tracked to ensure that problems revealed through the test find their way into workflows and can be prioritized for resolution.
From a disaster recovery standpoint, a validation of the resolutions for each of the issues raised in testing would serve as the close of the cycle.
Outside of the cycle, the resilience team can both enhance visibility and provide value through regular participation in the change management process. This is especially important in environments where the project “go-live” process doesn’t actively ensure that continuity has been considered. While it could prove impractical to block changes for which resilience has been incompletely considered, participation in the process provides a level of oversight to ensure that decisions to proceed are fully informed and consciously account for all potential risks.
Unfortunately, the explicit value of resilience is only appreciated when something comes close to failing or actually fails. Elevating the profile for this very important insurance reinforces the implied value of this essential operational component.