As the world gears up for 5G and the next level of cloud computing, resiliency is a cornerstone of scaling these capabilities. In many cases, however, resiliency is still an afterthought. This paper provides a multi-layered model to make resiliency a design consideration.
Introduction
Why has resiliency become so important today? How is cloud influencing resiliency? Does Telco cloud involve different aspects of resiliency compared to IT cloud? Such questions are genuine and worth considering, to the point that it may be essential to overturn the hegemony of infrastructure as the only ontological construct required to build ‘resiliency’ as a capability within the organization.
There is considerable scepticism about how Telcos will scale their capability. The prospect of download speeds between 1 Gbps and 10 Gbps, and latency of just 1 millisecond, makes a 5G network fascinating for industry. It is worth noting that in most failure cases on the Internet, there is an alternate path that has better characteristics in terms of bandwidth, packet loss or round-trip time (Fressancourt and Gagnaire, 2015). This leads to a pressing concern that resiliency is being ignored as a design consideration, or not being used to its full capability, owing to a lack of awareness and seriousness about operationalizing it.
The scope of resiliency also broadens with the advent of cloud, as critical workloads must be resilient to a variety of potential threats. We find support in the work of Forrester in their series “Seven Best Practices for Cloud Resilience.” In some organizations, “resilience” is a catch-all term that covers everything from guaranteeing maximum uptime to implementing recovery time objectives (RTOs) and recovery point objectives (RPOs) for business continuity procedures that go beyond IT (Sustar, Ellis and Chhabra, 2022). There is evidence and sound advice about building in another aspect of resiliency: operational resiliency. “We believe now is the right time for organizations to take a longer view and consider what the operational resilience agenda means for the target operating model in four to five years’ time” (Galaski, Allard and Ruys, 2021).
Operational resilience is defined as the ability of an organization to continue operations through adverse events or changing business conditions. This could mean a cyber incident, natural disaster, system failure, or sudden change in market conditions. (Cisco)
These observations point to the need for an agreed model of resiliency for the cloud. One can mitigate the risk of outages by building critical customer-facing applications at the web layer while maintaining data between two public clouds. That’s easier said than done, but for critical applications, it can be a necessary step.
We can then see that there is no single dimension in which we can optimize our ontological assumption about resiliency. Rather, it is essential that we treat resiliency as a multi-layered approach that mitigates these risks. This paper attempts to do that by building a generalizable multi-layered resiliency model that can help incumbent organizations build resiliency as a core tenet and practice within their strategy.
This depends on crisply defining resiliency as a concept that builds on high availability and multi-site resiliency. In the words of Sarah Garrington, “the road to organizational resilience is a journey and not a destination – a state of utopian resiliency is not something which can ever be achieved. It must be an objective which is embedded into all elements of organizational strategy; understanding the goalposts will continuously move as the organization seeks to learn and evolve along the way. Organizational resilience can only be achieved where a business chooses to build resilience into all of its functions, decisions, and strategies.” (Journal of Disaster Recovery)
Resiliency
In today’s socially connected and mobile world, users expect to be connected wherever and whenever they want, across multiple platforms and locations.
An organization’s resilience is its ability to maintain acceptable service levels while withstanding severe disruptions to its people, its processes and the systems that support them.
Organizations are based on three key mainstays (people, process and technology), and their goal is to have resilience across all three (Leong, 2022).
Resilient people are aware of situations, their own emotional reactions, and the behaviour of those around them. By remaining aware, they can maintain control of a situation and think of new ways to tackle problems. They know how to strike the right balance across personal and professional life, and not to stop living just because they are working. They cultivate the required emotional intelligence to show empathy yet stay firm on the ask. They earn respect from co-workers while leading by example, instead of operating from a position of authority.
Synchronized, standardized, and consolidated business processes are key to growth and resilience in times of disruption. Non-standardized processes result in leakages, inefficiencies, increasing operational costs, and dropping margins, making the organization inefficient and resistant to change.
To overcome these challenges, organizations resort to quick fixes without sufficient groundwork to understand their current state and emerging risks.
The third mainstay, technology, is the supporting system and platform through which services are delivered to the end user. This includes applications and databases, among other infrastructure components. Resiliency at this level is achieved with different methodologies.
One way of being resilient is to make your service highly available. High Availability is a subset of resiliency.
Figure 1: Main Idea
Figure 1 illustrates that achieving 100% system availability comes at an effectively infinite cost, since there is no upper limit on what can be spent in pursuit of maximum availability. The organization therefore decides what level of resilience is acceptable, based on its SLAs and KPIs. This is the balance organizations need to strike: the cost of being highly resilient to failure versus the potential loss of revenue due to outages.
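To make this trade-off concrete, an availability target can be translated into the downtime it permits per year. The sketch below is illustrative only: the function name and the sample targets are assumptions chosen for the example, not SLA values taken from this paper.

```python
# Illustrative sketch: the availability targets used here are assumed examples.
MINUTES_PER_YEAR = 365 * 24 * 60  # minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Each additional "nine" cuts the permitted downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

Each additional nine shrinks the downtime budget tenfold, which is why the cost of pursuing it has to be weighed against the revenue at risk.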
Disaster Recovery
We treat disaster recovery as one dimension of resiliency. Resiliency is made up of both disaster recovery and availability, as shown in Figure 2. Ultimately, the goal is to make sure the application continues working and serving its business functions. To achieve this, there is a need to architect for resiliency.
Figure 2: Two Dimensions of Resiliency
Disaster recovery is an important part of resiliency. It should be approached as a mechanism to restore the service, as shown in Figure 3.
The objective of disaster recovery is business continuity after the loss of service. Disaster recovery is typically associated with events that have a geographical impact, and can be characterized as follows:
Large-scale, less frequent events
- Natural disasters
- Technical failures
- Human actions/errors
Measured per one-time event
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
- Recovery Access Objective (RAO)
The DR strategy should take the above points into account when it is devised, to mitigate the risk to business continuity.
Figure 3: Disaster Recovery Strategies for Service Restoration
Availability, by contrast, deals with small-scale, more frequent events
- Component failures
- Network issues
- Storage issues
It is measured as a mean over time (“the nines”).
Recovery Point Objective (RPO) is the point in time (prior to the outage) to which systems and data must be restored. It represents the tolerable loss of data in the event of a disaster or failure.
Recovery Time Objective (RTO) is the period of time after an outage within which systems and data must be restored to the predetermined RPO. It represents the maximum tolerable outage time.
Recovery Access Objective (RAO) is the time required to reconnect users to the recovered application, regardless of where it is recovered (Bocian, M, 2009).
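The three objectives defined above can be checked against the timeline of a recovery exercise. The following is a minimal sketch under stated assumptions: the class and function names are hypothetical, and RAO is assumed to be measured from the moment the application is restored until users are reconnected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore systems and data
    rao: timedelta  # maximum tolerable time to reconnect users (assumed to start at restore)


def evaluate_recovery(objectives: RecoveryObjectives,
                      last_good_copy: datetime,
                      outage_start: datetime,
                      restored_at: datetime,
                      users_reconnected_at: datetime) -> dict:
    """Compare one recovery timeline against the RPO, RTO and RAO targets."""
    data_loss_window = outage_start - last_good_copy
    restore_duration = restored_at - outage_start
    reconnect_duration = users_reconnected_at - restored_at
    return {
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": restore_duration <= objectives.rto,
        "rao_met": reconnect_duration <= objectives.rao,
    }
```

A DR test could feed its recorded timestamps into `evaluate_recovery` to see which objective, if any, was missed.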
High Availability
High availability is the concept of eliminating single points of failure so that if one element, such as a server, fails, the service is still available. It is of great significance for mission-critical systems, where any service disruption could lead to an adverse business impact.
Downtime can have adverse effects on an organization’s business health irrespective of what may have caused it. The impact of downtime can manifest in multiple ways, including lost productivity, lost business opportunities, lost data and damage to the organization’s image (Lema et al., 2017). The costs linked with downtime can also cause a budget imbalance that puts a major dent in the organization’s financial health. However, avoiding downtime is just one of the major reasons why you need high availability. Some of the other reasons are:
- Planned Outages
- Unplanned outages
- Disaster recovery
- Keeping up with SLAs
- Delighted customers
- Brand reputation
- Data security
High availability eliminates single points of failure to ensure minimal service interruption. Disaster recovery, on the other hand, is the process of getting a disrupted system back to an operational state after a service outage. As such, we can say that when high availability fails, disaster recovery kicks in.
High Availability is not Fault Tolerance
Fault Tolerance (FT) is one of the critical approaches for cloud computing to improve the reliability of cloud systems. These FT techniques will improve both the efficiency and availability of cloud applications (Namdari and Tunc, 2021). A fault tolerant environment has no service interruption but a significantly higher cost, while a highly available environment has a minimal service interruption.
Fault tolerance relies on specialized hardware to detect a hardware fault and instantaneously switch to a redundant hardware component—whether the failed component is a processor, memory board, power supply, I/O subsystem, or storage subsystem.
High Availability is not Redundancy
As mentioned earlier, high availability is a level of service that comes with minimal probability of downtime. The primary goal of high availability is to ensure system uptime even in the event of a failure.
Redundancy, on the other hand, is the provision of additional software or hardware as a backup in case the main software or hardware fails. It can be achieved via high availability, load balancing, failover, or load clustering in an automated fashion.
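The relationship between redundancy and availability can be illustrated with the standard formula for components in parallel. This is a sketch under a strong assumption that is not made in the text above: the redundant copies fail independently of each other, and the service stays up as long as at least one copy is up. The function name is hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components in parallel.

    Assumes independent failures; the service is up if at least one copy is up.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies
```

Under these assumptions, two copies of a component that is available 99% of the time give roughly 99.99% combined availability. Correlated failures, such as a shared power feed, would reduce that figure, which is why redundancy alone does not amount to high availability.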
n-Way Redundancy
Workloads that aim to achieve cloud-based resiliency should be deployed on a highly reliable cloudified system that provides a high-availability redundancy design. Consider a snapshot of a telco cloud that supports fast fault reporting, overload control, and VNF data backup and restoration. The high availability of NFs within a DC can be considered along the following dimensions.
Figure 4: n-Way Redundancy in Telco Cloud Exhibit
Hardware Redundancy: The internal components (including network adapters and hard disks) of a server are redundantly configured to render reliable services. Servers are also redundantly configured so that services can be taken over by another server if a server is faulty.
Virtual machine (VM) Redundancy: A potentially large number of virtual machines (VMs) must be either restarted or evacuated, especially in scenarios where multiple physical servers are affected by the event and only a short time window is available to complete the resiliency processes (Salapura and Harper, 2018). VM redundancy can be implemented by 1+1 backup or N+M load sharing.
1+1 Backup: Active and standby VMs are deployed on different devices (affinity/anti-affinity consideration). If a VM is faulty, VMs or related processes can be switched over to restore services.
n-Way Load Sharing: The number of VMs can be dynamically calculated based on the traffic model and system capacity (Saxena, Bharti and Bhagaria, 2021). The minimum value of n is the number of VMs required by the maximum system load plus m spare VMs. When n VMs work at the same time, the load is evenly distributed. If one VM is faulty, the remaining VMs carry all the load of the current system. The faulty VM then performs self-healing. If a VM is faulty due to a host fault, the VM is migrated to another host and the faulty host is rebuilt.
Process Redundancy: Key processes work in active/standby mode, with active and standby processes configured on different VMs. If a process is faulty, its standby process on another VM takes over services. This ensures service provision in the event of a single point of failure (SPOF).
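The sizing rule for n-Way load sharing described above can be sketched as follows. The function names, the notion of a fixed per-VM capacity, and the reading of m as a count of spare VMs are assumptions made for illustration.

```python
import math


def size_pool(peak_load: float, vm_capacity: float, spare_vms: int) -> int:
    """Minimum pool size n: VMs needed for peak load plus m spare VMs."""
    if peak_load < 0 or vm_capacity <= 0 or spare_vms < 0:
        raise ValueError("load and spares must be non-negative, capacity positive")
    required = math.ceil(peak_load / vm_capacity)
    return required + spare_vms


def load_per_vm(total_load: float, total_vms: int, failed_vms: int = 0) -> float:
    """Load carried by each surviving VM when load is shared evenly."""
    healthy = total_vms - failed_vms
    if healthy <= 0:
        raise ValueError("no healthy VMs left to carry the load")
    return total_load / healthy
```

For example, with an assumed peak of 10,000 sessions, 2,500 sessions per VM and one spare, `size_pool` returns 5. If one VM then fails, `load_per_vm(10000, 5, 1)` shows each of the four survivors carrying 2,500 sessions, which is exactly at capacity.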
Multi-layered Resiliency
Build multiple layers of redundancy so that resiliency is considered across the lifecycle, as shown in Figure 5.
Figure 5: Multi-layered Resiliency
Site Resiliency
Local high availability and backup features are implemented for both cloud and non-cloud applications.
Disaster recovery is implemented only at the application level, i.e., per application, irrespective of whether it is hosted on a cloud or non-cloud platform.
While this achieves the objective of resiliency from a business impact perspective, it is not achieved through multi-site resiliency at the cloud infrastructure level:
Local High availability
- Host Level High availability in virtualized environment
- Clustered Nodes (DB)
- Load Balanced Application and Web Servers
Backup
- File Level Backup
- Image Level Backup
- Database Backup
- Cloud Backup (Limited Applications on request basis)
Disaster Recovery
- Active Passive (Hot Standby, Warm Standby and Cold Standby)
- Site Recovery Manager
- SAN to SAN replication
- SQL Mirroring
Multisite Resiliency
Multisite availability is a site level defence which includes technology solutions that will provide either non-disruptive proactive avoidance or fast recovery from outages or major disaster related events that threaten the productivity of an entire site. These site level availability solutions can fall into proactive disaster avoidance solutions or rapid recovery models under the category of disaster recovery.
A multisite end-to-end cloud strategy aligns with a roadmap to design, build or migrate, and deploy business services on the cloud, in order to meet enhanced resiliency requirements and achieve the following corporate strategic goals:
- Minimize CAPEX and OPEX
- Minimize or mitigate any impact on the organization’s reputation
- Improve customer satisfaction
- Minimize or mitigate any risk of regulatory non-compliance
Multisite Resiliency can be further categorized into non-Cloud resiliency or Cloud Architecture Resiliency.
Non-cloud IT infrastructure:
Implementation will be through on-demand Infrastructure and Services.
- Application Tier can be multi-zone
- Application Tier can be Active-Active
- Data tier can’t be multi-zoned and should act as Active-Passive; switchover time will be slower
- Stretched clusters can be used for achieving Active-Active with certain stringent conditions due to complex application component integrations & dependencies
Cloud Resiliency:
An Availability Zone is a single physical location that contains all the infrastructure necessary to run a cloud.
A Multi-Zone Region is a single geography with two or more availability zones.
- Application Layer & Data Layer will be cloud-enabled
- Network Layer will be configured as a VPN that utilizes premium/regulated network resources
- The Application & Data Tiers will be in Active-Active Mode
- Switchover time will be faster
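The difference between Active-Active and Active-Passive operation across zones can be sketched as a simple routing decision. This is a toy illustration, not a description of any particular load balancer: the function name, the zone names and the boolean health map are all hypothetical.

```python
def serving_zones(zones: list, health: dict, mode: str) -> list:
    """Return the zones that should receive traffic.

    zones  -- zone names in order of preference (the first is the primary)
    health -- maps zone name to True (healthy) or False (failed)
    mode   -- "active-active" or "active-passive"
    """
    healthy = [zone for zone in zones if health.get(zone, False)]
    if mode == "active-active":
        # Every healthy zone serves traffic, so losing one zone only
        # reduces capacity; no switchover step is needed.
        return healthy
    if mode == "active-passive":
        # Only the most preferred healthy zone serves; a standby takes
        # over when the zones ahead of it have failed.
        return healthy[:1]
    raise ValueError(f"unknown mode: {mode}")
```

With `zones = ["zone-a", "zone-b"]` and `zone-a` marked unhealthy, both modes route to `zone-b`. The practical difference is that in Active-Active mode `zone-b` was already serving traffic, whereas in Active-Passive mode it first has to be promoted.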
Cloud Multisite Recommendations
- Explore and improve the multi-site security architecture
- Explore possible failover options, with benchmarking details for data recovery
- Properly plan application disposition and placement as per business needs
- Put in place a multi-site governance and operations model
- Automate and manage resources based on workload requirements in multi-cloud environments
- Develop a multi-cloud application onboarding roadmap and journey
Hybrid Cloud Resiliency
Hybrid cloud resiliency is resiliency across active sites and different deployment zones of cloud infrastructure. The architecture should take the following considerations into account.
- A hybrid cloud resiliency approach is recommended to meet DC/DR business requirements
- Moving and onboarding resources can reap the cost and efficiency benefits of public cloud services
- Introducing a hybrid cloud model helps integrate and leverage the security framework of the public cloud alongside the existing one
- A hybrid cloud model helps migrate load on an as-needed basis with improved efficiency
- A hybrid cloud model helps process workloads in the public cloud, where additional resources are low cost and easily accessible
- Big data loads and processing can leverage public cloud services on an as-needed basis while adhering to the data resiliency architecture framework
Data Resiliency
The following diagram explains the need for a data-resilient architecture. Data resiliency works across three main considerations: Protection, Economics and Simplicity, and Efficiency.
Protection
- Copy data where it originates, wherever it is hosted: physical, virtual, containers
- Instant recovery
- Encrypt data at rest and in flight
- Continued support for legacy applications
- Grow support for new applications
- Modern data protection services: reuse, agility
Economics and Simplicity
- Extreme scalability
- Convergence
- Common infrastructure
- Common management
Efficiency
- Single instancing, compression, deduplication
- Automation and orchestration
- Leverage AI / ML, Virtualization, Virtual data pipelines
- Extensive partner ecosystem
Figure 6: Key considerations for a Data resilient infrastructure
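The efficiency techniques listed above, single instancing (deduplication) combined with compression, can be illustrated with a toy content-addressed store. This is a minimal sketch rather than a production design: the class name is hypothetical, data is held in memory only, and whole chunks are hashed with SHA-256 from the Python standard library.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self) -> None:
        self._chunks = {}  # SHA-256 hex digest -> compressed chunk bytes

    def put(self, data: bytes) -> str:
        """Store a chunk and return its digest; duplicates are not stored again."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._chunks:
            self._chunks[digest] = zlib.compress(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the original bytes for a previously stored digest."""
        return zlib.decompress(self._chunks[digest])

    def unique_chunks(self) -> int:
        """Number of distinct chunks physically held by the store."""
        return len(self._chunks)
```

Storing the same backup block twice returns the same digest and consumes space only once, which is the effect single instancing aims for.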
We can thus derive the main drivers of the Data Resiliency Architecture as follows:
- Cloud adoption for compute, storage, DR, SaaS and cost optimization
- Application modernization driving the adoption of containers and Kubernetes
- Data resilience against ransomware and malware, as well as compliance concerns
- Simplicity driving automation, AI, and ML capabilities for intelligent data management
Figure 7: Data Resiliency Best Practices
Telco & IT Cloud Resiliency
Cloud-native architecture helps with the rapid launch of new services and opportunities with minimum downtime. With its increased bandwidth of up to 20 Gigabits per second (Gbps), low latency of 1 millisecond (ms), high device density of one million devices per square kilometer, and virtualization technologies, 5G is generating new opportunities in Telco cloud (Loghin et al., 2020). This requires building Telco- and IT-specific resiliency into the architecture.
Telco and IT Cloud Resiliency differ
Measurements of inter-region cloud connections show that the throughput is almost always lower than 100 Mbps, while the latency exceeds 300 ms in some cases. These values fall far short of the requirements of 5G. In a study by McKinsey, it is estimated that an operator needs to spend up to 300 percent more on infrastructure to cope with a 50 percent increase in data volume (Grijpink, Ménard and Vucevic, 2019). The following questions need consideration (Reul, 2021):
- Will the provider become the primary or sole form of storage for critical data?
- Does your enterprise require an ability to operate amid a major service disruption from your cloud service provider? If so, does your IT department support failover to (or already operate in) data centres or other CSPs?
- Have contingency plans been created to address unrecoverable data loss?
- Are service-level agreements and liability terms understood and acceptable?
- Is there an available mechanism to measure and report on service-level delivery?
- Does your organization have up-to-date disaster recovery and business continuity plans?
Figure 8: Telco Cloud to support 5G resiliency
Figure 9: Resiliency considerations for Hybrid Cloud Platform
Telco Cloud Application Requirement
We can consider resiliency as a design consideration and identify the following requirements:
- Agile Launch – As demonstrated by Goldenberg and Oreg (2007), even if a small portion of lagging migrations can be persuaded to leapfrog earlier than they naturally would, firms’ profits increase substantially because of the acceleration in the entire adoption process.
- Flexible & Differentiated User experience – It’s easy to assume that shifts underpinning the age of the customer only apply to interactions between a consumer-facing business and its customers. But those shifts also touch the industrial sector and transform the internal operation of a business and its relationship with suppliers, distributors, resellers, aggregators, and more (O’Donnell et al., 2021).
- Flexible & diversified service scenarios and short time-to-market (TTM) – The “as a service” model introduced a wide set of unprecedented benefits in terms of investments, delivery time and scalability, enabling the diffusion of novel (mobile) services and the adoption of new technologies such as Big Data, IoT and machine learning (Sfondrini, Motta and Longo, 2018).
- Control plane functions abstracted into multiple network function services (NFSs)
- User plane functions are decentralized, flexibly deployed close to the access network or as per service requirement
- Selection of Edge, Region and Center Databases based on Application type & its requirements
Dimension | IT Cloud Resiliency | Telco Cloud Resiliency |
Bandwidth Requirement | Variable-bandwidth data center | High-bandwidth data center |
Edge Deployments | Centralized and orchestrated workflows that lead to complex application recovery/resiliency frameworks | Data sites close to the user for virtualized UPF/PGW functions, so edge sites are widely considered |
Failover to Alternate Site | Tightly coupled applications that form clusters; the entire cluster can be load balanced | Dependency on interworking of multiple Network Functions (NFs) leads to a tightly coupled resiliency/recovery framework |
High Availability | Relatively lower number of high-availability applications | High traffic and high transaction volume lead applications into a high-availability architecture model (N+M model of deployment) |
QoS | SLA-driven QoS dependency | Network QoS dependency and traffic marking |
Telco Cloud Datacentre Topology
Taking resiliency into design consideration, the following strategic Telco cloud datacentre topology options are identified.
Figure 11: Option #1: Two Site Topology – Active / Standby
Figure 12: Option #2: Two Site Topology – Active – Active
Figure 13: Option #3: Two Site Topology – Active – Active – Standby
Business Workload / Service Resiliency
The following factors need consideration to ensure that resiliency is dimensioned across the Business Services and Business Workloads.
- Protect data and infrastructure configurations against cyber-attacks and ransomware threats. Cyberthreats have increased with cloud adoption: cloud cyber-attacks accounted for 20% of all cyber-attacks in 2020, making cloud computing platforms the third most-targeted cyber environment (Thierer, 2012; Morgan, 2022)
- Minimize impact on business operations
- Reduce the time to recover from, and the costs associated with, cyber-attacks
- Reduce the exposure to outages which could impact the business
- Ensure compliance and fulfillment of regulatory oversight requirements across company communications, and the existence of adequate safeguards to protect against insider risks that could impact compliance or data privacy
Figure 14: Design practices to improve Business Services Resiliency
Conclusion
Cloud Business Continuity: Cloud business continuity is achieved with a well-established resiliency framework. It helps forecast the impact of changes and identify the steps necessary to avoid disruption. It is recommended to have a regular monitoring framework, periodic compliance checks, and an in-place policy framework of regular checks for continuous improvement and timely mitigation of risks.
Automated resiliency configuration can handle exceptional situations that fall outside the boundaries of the system in place, with less human intervention, resulting in improved SLA performance and a better customer experience (CX).
Application Resiliency: Application resiliency helps applications re-adapt and avoid critical failures so that they can respond to service disruptions in a timely manner. End users expect applications to always be available and responsive, so a lag of even a microsecond affects customer experience and can result in financial loss in today’s competitive market. Application resiliency can be achieved through diversified deployments, adoption of a microservices approach, and redundant code bases/services.
Data Resiliency: Organizations make extensive use of data mining, analytics, and AI/ML on data that spans an ever-expanding horizon, so it is essential to protect and manage data at a faster pace. Key features of data resiliency include data protection from malware attacks, end-to-end data security, and the traditional approach of data backup and recovery. Data solutions should be intelligent enough to self-recover in case of disaster using a geo-redundancy approach.
Network Resiliency: Connectivity failure for even a fraction of a second may result in large financial losses and, most importantly, damage to the organization’s reputation. To mitigate such risks, it is important for an organization to have an adequately redundant and resilient solution for continuous connectivity, for example fast access with high-end fibre connectivity as the primary link and a secondary link with minimum switchover time.
Data Centre Resiliency: With the growth of cloud adoption, solution vulnerability, natural disasters and unnatural chaotic situations are the main issues in site continuity. Organizations must ensure that their data centre approach is resilient enough to handle such events. Some of the key parameters for evaluation are application criticality, disaster recovery approach, and impact on end users.
Acknowledgement
This work was supported by stc’s DR project. The authors would like to thank Abdullah Alfakhri for his advice on various technical issues examined in this paper. The authors, however, bear full responsibility for the paper.
Contacts
Dr. Rajesh K. Saxena, rajeshsaxena@in.ibm.com
Amitabh Kumar, amitabhk@ae.ibm.com
Muhammed Farukh Khan, Muhammed.Farukh.Khan@ibm.com
Atul Pandey, Atul.Pandey1@ibm.com
Muhammed Imran Azam, mmazam.c@stc.com.sa
Mohammed Asif Sheroz, msheroz.c@stc.com.sa
Yahya A. Alfaifi, yalfaifi@stc.com.sa