
Companies across all industries and verticals have an ever-increasing dependence on technology. Secure access and robust, minute-to-minute IT performance are an expectation, if not a competitive necessity. Downtime is not an option. Yet service interruptions keep increasing, causing significant economic and competitive marketplace disruption and tarnishing brand reputations.

Not all events are created equal, and incidents are not events. So when a critical event or service outage occurs, how prepared and confident are you in deciding to invoke or declare at the time of the event? Have you considered how you will assess the type of event, your decision criteria, and the implications and outcomes of invoking, including the impact on your established recovery point and time objectives? Are all your key stakeholders aware of and aligned around go/no-go decisions? Have scenario-based tests been conducted to explore the known implications? Let’s delve into this a bit.

Being ill-prepared for, slow to respond to, or reacting poorly to glitches, performance issues, and outages needlessly damages customer confidence, revenue, and marketplace competitiveness.

Common service interruptions occur across several categories:

  • Human error
  • Release management/patching
  • Cyber events
  • Power/utility
  • IT environmental software/hardware performance
  • Service provider degradation/outages
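
Tracking interruptions against these categories is a simple but useful discipline. The sketch below is illustrative only (the incident records, category labels, and downtime figures are assumptions, not data from this article) and shows one way to tally incidents and downtime per category to surface where change or investment is most needed.

```python
from collections import Counter

# Hypothetical incident log; each record is tagged with one of the
# interruption categories listed above (labels are illustrative).
incidents = [
    {"id": "INC-101", "category": "human_error", "minutes_down": 42},
    {"id": "INC-102", "category": "release_management", "minutes_down": 180},
    {"id": "INC-103", "category": "cyber_event", "minutes_down": 95},
    {"id": "INC-104", "category": "release_management", "minutes_down": 30},
]

# Tally the number of interruptions and the total downtime per category.
counts = Counter(rec["category"] for rec in incidents)
downtime = Counter()
for rec in incidents:
    downtime[rec["category"]] += rec["minutes_down"]

for category, n in counts.most_common():
    print(f"{category}: {n} incident(s), {downtime[category]} minutes of downtime")
```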

Issues such as regulatory compliance, pandemics, weather events, and supply chain disruptions generally fall under the broader business continuity management (BCM) umbrella of risk. Regardless, it is critical that those involved are cognizant of all risk categories and secure a funded enterprise resiliency strategy that proactively plans for the demonstrable and timely execution of a comprehensive resiliency response framework.

For decades, regulated organizations have been led by compliance when it comes to technology resilience. This approach is reflective of the previously available technology, tools, and methods, and it was the right practice for the times.

The preferred indicators for measuring application resilience have typically been the recovery point objective (RPO) and the recovery time objective (RTO). However, these application-based measures often bear little relation to what the end customer cares about: service availability, the delivery of uninterrupted, exceptional customer experiences.
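
As a minimal sketch of how these two measures work in practice, the snippet below compares an incident's achieved recovery point and recovery time against stated objectives. The timestamps and objective values are hypothetical, and "meeting" both numbers still says nothing about the customer experience during the outage itself.

```python
from datetime import datetime, timedelta

# Illustrative objectives for a single application (values are assumptions).
RPO = timedelta(minutes=15)   # maximum tolerable data loss
RTO = timedelta(hours=2)      # maximum tolerable time to restore service

# Hypothetical incident timeline.
last_good_backup = datetime(2023, 3, 1, 9, 45)
outage_start     = datetime(2023, 3, 1, 10, 0)
service_restored = datetime(2023, 3, 1, 12, 30)

achieved_rpo = outage_start - last_good_backup   # data written after the backup is at risk
achieved_rto = service_restored - outage_start   # how long customers actually waited

print(f"RPO met: {achieved_rpo <= RPO} (achieved {achieved_rpo}, objective {RPO})")
print(f"RTO met: {achieved_rto <= RTO} (achieved {achieved_rto}, objective {RTO})")
```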

Even in catastrophic events, many organizations choose not to invoke or fail over, knowing that, for a multitude of reasons (unproven capability, dynamic environments, misaligned configurations), doing so would exacerbate the situation or prolong the event.

A related and material challenge is that in heterogeneous, complex, or legacy environments, testing ‘as you would recover’ is significantly more complex than most actual disruptive events. As a result, most organizations adhere to structured, elemental testing, with financial institutions driven largely by their regulators. Significant and evolving challenges continue to add complexity to your preparation and to your confidence in making an invocation determination.

Regulators are pushing hard for compliance

Regulators are pushing for organizations to take a more proactive approach to operational resilience. Both in the US and the UK, recent papers, such as the FCA discussion paper on operational resilience, have focused on the need for true operational resilience. For example, the FCA paper urges firms to focus on how their response to disruptions impacts the end-user and points toward greater accountability for decision-makers.

The IT estate is evolving

The tech stack is changing rapidly. In the era of cloud, and all that comes with it, the future is going to be fundamentally different. The exponential growth of data and analytics, together with IoT and SMART venues, presents new risks and complexities. With those risks, however, come new opportunities to make data-driven decisions and to combine new technologies with human orchestration.

Your eco-partners matter

The range of service providers continues to grow, both in the choices available and in their ability and desire to provide ‘mission-critical’ products and services. The architecture that defines the way your customers consume your revenue-generating products and services must have resilience built in to maintain the competitive differentiation that protects your brand.

More change means more risk

The pace of change driven by the race to digital transformation is creating more risk. Most outages have their root cause in change, and the more change an organization must make to keep up, the more complexity it faces, resulting in significant economic and operational impacts when an outage occurs. Just look at the 2018 TSB Bank failure. During an attempt to move to a new IT system, the bank’s computer systems failed, locking nearly 1.9 million customers out of their accounts for weeks. The debacle cost the company £366 million, of which £130 million went toward customer compensation and £25 million paid for an incident report, ordered by the Financial Conduct Authority (FCA) and the Prudential Regulation Authority, which found that TSB’s Spanish parent company Sabadell had “cut corners” with critical IT testing. In addition, the incident caused the company to lose an estimated 80,000 customers.

Evolving to a resilience culture

Evolving from a compliance-driven posture of resilience capabilities that can be exercised, measured, and validated to a true operational resilience execution posture is a formidable challenge. Key success criteria include executive sponsorship and governance, a revised operational framework, a focused commitment of resources, and the automation and integration of services and tools to provide visibility across services, infrastructure, and hosted solutions.

Invocation considerations

So, when a significant event occurs, when do you pull the trigger? Increasing your confidence levels sufficiently to invoke or declare will continue to be challenging. The key deciding factors range from testing effectively, to achieving demonstrable compliance, to having confidence in your operational resilience posture, and finally to determining whether you should invoke based on the event itself and your pre-emptive analysis and preparation.
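
One way to make those factors concrete is to encode them as a pre-agreed go/no-go checklist that is evaluated at event time. The sketch below is illustrative only; the criteria names and the simple all-or-nothing rule are assumptions, and a real checklist would come out of your own scenario testing and stakeholder alignment.

```python
# Hypothetical go/no-go criteria for an invocation decision (illustrative).
CHECKLIST = [
    ("recovery_recently_tested", "Failover for the affected service passed a recent test"),
    ("rto_breach_forecast", "Projected repair time exceeds the service RTO"),
    ("configs_in_sync", "Recovery environment configuration matches production"),
    ("stakeholders_aligned", "Key stakeholders have confirmed the go/no-go call"),
]

def invocation_decision(answers: dict) -> str:
    """Return 'invoke' only if every criterion on the checklist is satisfied."""
    failed = [desc for key, desc in CHECKLIST if not answers.get(key, False)]
    if failed:
        return "hold: " + "; ".join(failed)
    return "invoke"

# Example event assessment (values are assumptions for illustration).
print(invocation_decision({
    "recovery_recently_tested": True,
    "rto_breach_forecast": True,
    "configs_in_sync": False,
    "stakeholders_aligned": True,
}))
```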

My experience around this critical element of resilience evolution continues to focus on isolated active-active workloads or environments; pre-defined, predictable, scenario-based ‘decision criteria’ checklists; and extensive stress testing of interdependencies to identify gaps or exposures. Analyzing previous test events and making the required revisions to plans and processes is an absolute must. Implementing the ability to identify and flag material changes to the production environment, so that they immediately result in runbook, plan, and process updates, will increase confidence in execution. There’s no getting around it: being prepared matters.
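
As a minimal sketch of that last point, the snippet below fingerprints a production configuration snapshot and flags the runbook for review when it drifts from the baseline captured at the last plan update. The configuration keys and values are hypothetical; in practice the comparison would run against your actual inventory or configuration management data.

```python
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable hash of a configuration snapshot."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

# Hypothetical snapshots: the baseline captured at the last runbook review,
# and the configuration observed in production today.
baseline = {"db_replicas": 3, "dns_failover": "region-b", "app_version": "4.2.1"}
observed = {"db_replicas": 3, "dns_failover": "region-c", "app_version": "4.2.1"}

if fingerprint(observed) != fingerprint(baseline):
    changed = {key for key in observed if observed[key] != baseline.get(key)}
    print(f"Material change detected ({', '.join(sorted(changed))}): "
          "flag the runbook and recovery plan for immediate review.")
```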

Conclusion

These challenges are the impetus for all of us to work collaboratively to develop near-real-time executable resiliency strategies, leveraging dynamic, easily customizable, ready-for-use runbooks and plans based on the innovative platform and tool capabilities available today.

It’s easy to become overwhelmed by the considerations, but it is crucial to see your move toward better operational resilience as a journey, not a destination. Tackling one consideration at a time, improving preparation and practice, and leveraging the new tooling available to provide automation, control, and advanced visibility are three steps that will set you on the path to effective resilience you can be fully confident in.

ABOUT THE AUTHOR

Steve Piggott

Steve Piggott is the head of enterprise resilience - global accounts at Cutover. He works across all facets of Cutover to drive market awareness and revenue growth and to align with customers’ requirements and expectations for exceptional enterprise resiliency results. Piggott brings a wealth of experience in driving successful customer outcomes across business transformation, operational resilience, disaster recovery, and business continuity program development.
