Companies across all industries and verticals have an ever-increasing dependence on technology. Secure access and robust, minute-to-minute IT performance are an expectation, if not a competitive necessity. Downtime is not an option, yet increasingly frequent service interruptions are causing significant economic and competitive disruption and tarnishing brand reputations.
Not all events are created equal, and incidents are not events. So, when a critical event or service outage occurs, how prepared and confident are you in deciding whether to invoke or declare at the time of the event? Have you considered the time needed to assess the type of event, your decision criteria, and the implications and outcomes of invoking? What are the impacts on your established recovery point and recovery time objectives? Are all your key stakeholders aware of, and aligned around, go/no-go decisions? Have scenario-based tests been conducted to explore the known implications? Let’s delve into this a bit.
Being ill-prepared, slow, or reacting poorly to glitches, performance issues, or outages needlessly impacts customer confidence, revenue, and marketplace competitiveness.
Common service interruptions occur across several categories:
- Human error
- Release management/patching
- Cyber events
- IT environmental software/hardware performance
- Service provider degradation/outages
Issues such as regulatory compliance, pandemics, weather events, and supply chain disruptions generally fall under the broader business continuity management (BCM) umbrella of risk. Regardless, it is critical that those involved are cognizant of all risk categories and secure a funded enterprise resiliency strategy, one that proactively plans for the demonstrable and timely execution of a comprehensive resiliency response framework.
For decades, regulated organizations have been led by compliance when it comes to technology resilience. This approach is reflective of the previously available technology, tools, and methods, and it was the right practice for the times.
The preferred indicators for measuring application resilience have typically been the recovery point objective (RPO) and the recovery time objective (RTO). However, these application-based measures often bear little relation to what the end customer cares about – which is service availability – the delivery of uninterrupted, exceptional customer experiences.
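To make the distinction between these two measures concrete, here is a minimal sketch of how achieved RPO (the data-loss window) and achieved RTO (the downtime window) might be computed for a single outage and compared against their targets. All timestamps and target values are illustrative assumptions, not figures from any real incident.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for one outage (illustrative values only).
last_good_backup = datetime(2024, 3, 1, 2, 0)    # last consistent recovery point
outage_start     = datetime(2024, 3, 1, 9, 30)   # service interruption begins
service_restored = datetime(2024, 3, 1, 11, 0)   # customers can transact again

# Achieved RPO: how much data could be lost (time since last recovery point).
achieved_rpo = outage_start - last_good_backup
# Achieved RTO: how long the service was unavailable.
achieved_rto = service_restored - outage_start

# Targets agreed with the business (also illustrative).
target_rpo = timedelta(hours=8)
target_rto = timedelta(hours=2)

print(f"Achieved RPO: {achieved_rpo}, target met: {achieved_rpo <= target_rpo}")
print(f"Achieved RTO: {achieved_rto}, target met: {achieved_rto <= target_rto}")
```

Note that both numbers can be green while customers still experienced degraded service, which is exactly the gap between application-based measures and end-to-end service availability that the paragraph above describes.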
Even in catastrophic events, many organizations choose not to invoke or fail over, knowing that, for a multitude of reasons (unproven capability, dynamic environments, misaligned configurations), doing so would exacerbate the situation or prolong the event.
A related and material challenge is that in heterogeneous, complex, or legacy environments, testing ‘as you would recover’ is significantly more complex than most actual disruptive events. As a result, most organizations adhere to structured, elemental testing, with financial institutions doing so largely at the behest of regulators. Meanwhile, significant and evolving challenges continue to add complexity to your preparation for, and confidence in, making an invocation determination.
Regulators are pushing hard for compliance
Regulators are pushing for organizations to take a more proactive approach to operational resilience. In both the US and the UK, recent papers, such as the FCA discussion paper on operational resilience, have focused on the need for true operational resilience. For example, the FCA paper urges firms to focus on how their response to disruptions impacts the end-user and points toward greater accountability for decision-makers.
The IT estate is evolving
The tech stack is changing rapidly. In the era of cloud, and all that comes with it, the future will be fundamentally different. The exponential growth of data and analytics, together with IoT and SMART venues, presents new risks and complexities. These risks, however, also bring new opportunities to make data-driven decisions and combine new technologies with human orchestration.
Your eco-partners matter
The range of service providers continues to grow, both in the choices available and in their ability and desire to provide ‘mission-critical’ products and services. The architecture that defines how your customers consume your revenue-generating products and services must have resilience built in to maintain the competitive differentiation that protects your brand.
More change means more risk
The pace of change driven by the race to digital transformation is creating more risk. Most outages have their root cause in change: the more change an organization must make to keep up, the more complexity it faces, and the greater the economic and operational impact when an outage occurs. The 2018 TSB Bank failure is a case in point. During an attempted migration to a new IT system, the bank’s computer systems failed, locking nearly 1.9 million customers out of their accounts for weeks. The debacle cost the company £366 million, of which £130 million went toward customer compensation and £25 million paid for an incident report, ordered by the Financial Conduct Authority (FCA) and the Prudential Regulation Authority, that found TSB’s Spanish parent company Sabadell had “cut corners” with critical IT testing. The incident also cost the company an estimated 80,000 customers.
Evolving to a resilience culture
Evolving from a compliance-driven posture to a true operational resilience execution posture, with capabilities that can be exercised, measured, and validated, is a formidable challenge. Key success criteria include executive sponsorship and governance, a revised operational framework, a focused commitment of resources, and the automation and integration of services and tools to provide visibility across service, infrastructure, and hosted solutions.
So, when a significant event occurs, when do you pull the trigger? Raising your confidence to the point of invoking or declaring will remain challenging. The key deciding factors form a progression: testing effectively, achieving demonstrable compliance, building confidence in your operational resilience posture, and finally determining whether to invoke based on the event and your pre-emptive analysis and preparation.
My experience with this critical element of resilience evolution continues to focus on isolated active-active workloads or environments; pre-defined, predictable, scenario-based ‘decision criteria’ checklists; and extensive stress testing of interdependencies to identify gaps or exposures. Analyzing previous testing events and making the required revisions to plans and processes is an absolute must. Implementing the ability to identify and flag material changes to the production environment, and to translate them immediately into runbook, plan, and process updates, will increase confidence in execution. There is no getting around it: being prepared matters.
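A pre-defined decision-criteria checklist of the kind described above can be made executable rather than left on paper. The sketch below shows one possible shape, assuming a simple all-criteria-must-pass policy; the criteria names and the policy itself are illustrative assumptions, not a prescribed standard.

```python
# A minimal go/no-go invocation checklist. Criteria names are hypothetical
# examples; a real checklist would be scenario-specific and pre-agreed
# with stakeholders well before any event.
INVOCATION_CRITERIA = {
    "failover_tested_recently": True,      # e.g. exercised in the last quarter
    "production_config_in_sync": True,     # recovery site matches production
    "runbooks_current": True,              # updated after last material change
    "estimated_outage_exceeds_rto": True,  # staying put would breach targets
    "stakeholders_aligned_on_go": True,    # go/no-go owners have signed off
}

def invocation_decision(criteria):
    """Return ('GO' | 'NO-GO', list of failed criteria).

    Policy assumed here: every criterion must pass; any single failure
    yields NO-GO so gaps are surfaced explicitly rather than argued
    about mid-incident.
    """
    failed = [name for name, ok in criteria.items() if not ok]
    return ("GO" if not failed else "NO-GO", failed)

decision, gaps = invocation_decision(INVOCATION_CRITERIA)
print(decision, gaps)
```

The value of this form is less the trivial logic than the discipline it enforces: the criteria are written down, testable in advance, and auditable after the fact, which is exactly what pre-emptive analysis and preparation are meant to produce.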
These challenges are the impetus for all of us to work collaboratively to develop near-real-time executable resiliency strategies, leveraging dynamic, easily customizable, ready-for-use runbooks and plans based on the innovative platform and tool capabilities available today.
It’s easy to become overwhelmed by the considerations, but it is crucial to see your move toward better operational resilience as a journey, not a destination. Tackling one consideration at a time, improving preparation and practice, and leveraging new tooling that provides automation, control, and advanced visibility are three steps that set you on the path to resilience you can be fully confident in.