By Andrew Oliver, Senior Director of Product Marketing at MariaDB
The world is changing and becoming more perilous. Hurricanes can hit New York. California endures wildfires every year on top of earthquakes. In recent years, Germany and Seoul have flooded, while the UK experienced a record heatwave that caused fatalities, not to mention power grid disruption. There is hope on the horizon as deploying renewable energy has become cheaper than even maintaining existing coal-fired plants, but regardless of progress, data infrastructure must be more resilient than ever.
In disaster recovery and high availability, redundancy is the whole ball game. Modern applications, especially in the cloud, are no different. What has changed are the data volumes, the business and user expectations, and the boundaries of what is possible. The tape backups of yore could never keep up. For high-volume applications, even cross-region (WAN) replication usually cannot keep up with the transaction rate. Fortunately, a bevy of new technologies and capabilities are changing the game.
Field of Expectations
While the world has become more perilous, user expectations for reliability have grown. In the late 1990s and early 2000s, website outages were fairly frequent and accepted. Now, if a major retail site goes down, not only is it a major business event for that company, it makes headlines. To some degree, outages are like crime: while they have actually gone down, awareness and, in many cases, the impacts have gone up. Users expect absolute reliability even if a hurricane has taken out a major data center – which, with multiple power and network redundancies, is not as likely as it once was.
Natural disasters are less common than more immediate problems caused by human error. When Facebook went down in October 2021, the cause was not a fire, hurricane, flood, or earthquake but a network misconfiguration. Meanwhile, software developers are under increased pressure to deliver early and often. Fast development cycles mean less time for quality assurance and more bugs.
If all of that were not enough, business on the Internet is “spiky.” One day traffic can lag; the next can require almost unbounded capacity. What does a user who sees an hourglass (or spinning orb) do? Refresh, of course, doubling the number of failed requests. Luckily, the cloud and modern virtualization technologies make it possible to scale hardware on demand – at least up to a point.
Availability Zones
One of the best things about modern cloud providers is that they provide a sophisticated framework for high availability, simplified into the concept of availability zones. Is the network redundant? Is the facility separate but not too far away? What used to be a complicated set of questions and agreements is now just a matter of choosing different availability zones in the same region. This has greatly simplified facility, network, and hardware concerns.
With virtualization and orchestration technologies like Kubernetes, it is possible to develop and deploy services that spin up seamlessly across multiple zones. Storage, however, remains the biggest issue: Amazon Web Services EBS, for example, does not stripe across zones, so storage reliability must be handled at a higher level.
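To make the zone concept concrete, here is a minimal sketch of zone-aware scheduling using the official Kubernetes Python client. The deployment name, labels, and image are hypothetical placeholders.

```python
# A minimal sketch of zone-aware scheduling, assuming the official
# Kubernetes Python client (pip install kubernetes). The deployment,
# labels, and image names here are hypothetical.
from kubernetes import client

labels = {"app": "web"}

# Spread replicas evenly across availability zones; refuse to schedule
# a pod if doing so would leave zones unbalanced by more than one pod.
spread = client.V1TopologySpreadConstraint(
    max_skew=1,
    topology_key="topology.kubernetes.io/zone",
    when_unsatisfiable="DoNotSchedule",
    label_selector=client.V1LabelSelector(match_labels=labels),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=6,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                topology_spread_constraints=[spread],
                containers=[client.V1Container(name="web",
                                               image="example/web:latest")],
            ),
        ),
    ),
)

# With cluster credentials configured, this would submit the deployment:
# client.AppsV1Api().create_namespaced_deployment("default", deployment)
```

Losing a zone then costs only a fraction of the replicas – but, as noted above, the pods’ data still needs its own cross-zone story.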
The most obvious examples of this are databases. Until recently, the market had a choice between client-server relational databases that handled data integrity well but could only scale vertically, and so-called “NoSQL” databases that often lacked the features, sophistication, and transactional reliability of an Oracle or DB2. This has changed. NoSQL databases now often supply transactional idioms. Relational databases can handle unstructured and JSON data, and there are now distributed SQL offerings (including MariaDB Xpand) that meet all of the common expectations of a relational database while scaling horizontally, including across availability zones (rack awareness).
It is now possible to deploy a distributed database that scales by just adding nodes and can even “scale back” by removing them. These databases keep redundant copies of data and enable virtually unlimited scale in both data size and user capacity. When a node or availability zone is lost, they continue operating, reintegrating and catching up nodes that return quickly, or automatically restoring redundancy if they do not. Combined with load balancers, routing protocols, and DNS-level redundancy, it is possible to deploy applications and infrastructure that tolerate multiple faults across a cloud region, even at the data layer.
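From the application side, tolerating the loss of a node or zone can be as simple as trying the next endpoint. Below is a minimal client-side failover sketch assuming MariaDB Connector/Python; the hostnames and credentials are hypothetical, and a real deployment would more likely put a load balancer such as MariaDB MaxScale in front of the nodes.

```python
# A minimal client-side failover sketch, assuming MariaDB Connector/Python
# (pip install mariadb). Hostnames and credentials are hypothetical.
import mariadb

# One endpoint per availability zone.
NODES = ["db-zone-a.example.com", "db-zone-b.example.com", "db-zone-c.example.com"]

def connect_with_failover():
    """Try each node in turn and return the first healthy connection."""
    last_error = None
    for host in NODES:
        try:
            return mariadb.connect(
                host=host, port=3306,
                user="app", password="secret", database="appdb",
                connect_timeout=2,  # fail fast so we can move to the next node
            )
        except mariadb.Error as exc:
            last_error = exc  # node or zone may be down; try the next one
    raise RuntimeError("all database nodes unreachable") from last_error

conn = connect_with_failover()
```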
Global Regions
Surviving the loss of a datacenter or two in a region is nice. However, when a mega-hurricane takes out us-east-1a, us-east-1b, and us-east-1c (aka Virginia), it would be nice to be able to fail over to us-east-2 (aka Ohio). With modern routing protocols and DNS, that is not so hard. Once again, however, the database is the linchpin.
In order to maintain some level of failure tolerance, the database must replicate in real or near-real time. Fortunately, most databases support some form of cross-region replication. This capability is usually asynchronous with eventual consistency to avoid impacting regular operational performance. For financial and other applications with a high write volume, a key issue is whether the source cluster (Virginia) can keep the replica cluster (Ohio) current even at peak load. To answer this requirement, distributed SQL databases such as MariaDB Xpand have added parallel replication to utilize the full processing and network capacity of all the nodes on the source and target clusters, enabling even distant replication to scale with the workload.
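Whatever the topology, it is worth watching whether the replica is actually keeping up. Here is a sketch of a lag check assuming a conventional MariaDB primary/replica pair (not Xpand-specific tooling) and MariaDB Connector/Python; the host and credentials are hypothetical.

```python
# A sketch of monitoring asynchronous replica lag on a conventional
# MariaDB primary/replica pair, using MariaDB Connector/Python.
import mariadb

conn = mariadb.connect(host="replica.ohio.example.com",
                       user="monitor", password="secret")
cur = conn.cursor()
cur.execute("SHOW SLAVE STATUS")
row = cur.fetchone()
if row is None:
    raise RuntimeError("this server is not configured as a replica")

# Map column names to values for readability.
status = dict(zip([d[0] for d in cur.description], row))

# Seconds_Behind_Master is NULL if replication is broken, 0 if caught up.
lag = status["Seconds_Behind_Master"]
if lag is None:
    print("replication stopped -- failover target is stale")
elif lag > 60:
    print(f"replica is {lag}s behind; writes may be outrunning the WAN link")
else:
    print(f"replica healthy, {lag}s behind")
```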
For some applications, performance matters more than whether every last drop of data made it over during “the big one”; data can be recovered or restored by other means. For others, the latency cost is worth paying to ensure both sides stay up to date. Some distributed SQL databases can span clusters across regions. While that kind of data redundancy and replication comes with a continuous performance impact, it can be critical for applications that need to fail over with no potential for data loss and minimal disruption.
When deciding on cross-region failover, it is important to understand the costs of ongoing operations. Cloud providers charge for data transfer: ingress is usually free, but egress usually costs money. The cost of replicating to a second region is not just storage and compute but also IOPS and egress.
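A back-of-the-envelope estimate makes the point. The transfer rate, write volume, and row size below are illustrative assumptions; check your provider’s current pricing.

```python
# Back-of-the-envelope egress cost for cross-region replication.
# All three inputs are assumptions for illustration only.
EGRESS_PER_GB = 0.02          # assumed inter-region transfer rate, USD/GB

writes_per_second = 5_000     # hypothetical sustained write rate
bytes_per_write = 2_048       # hypothetical average replicated row + overhead

gb_per_month = writes_per_second * bytes_per_write * 86_400 * 30 / 1e9
print(f"~{gb_per_month:,.0f} GB/month replicated")
print(f"~${gb_per_month * EGRESS_PER_GB:,.2f}/month in egress alone")
```

At those assumed figures, replication traffic alone runs to roughly 26,000 GB and a bit over $500 per month – before storage, compute, or IOPS.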
Online Backup
Despite strong security, multiple redundancies, and best practices, bad things can still happen. Human folly can change data in ways that were never intended, and that change will replicate throughout the infrastructure; nothing about the redundancies will fix it. Despite all of our best efforts toward fail-proof infrastructure, online backups are still necessary.
It is no longer practical to take the system down for backups. High-volume databases need a backup solution that does not interrupt the system while ensuring the data is captured correctly. Distributed databases offer parallel backup and restore, which captures the current state, ensures full transactional integrity in an active system under load, and shortens recovery time in the event a restore is required.
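As one concrete (if non-distributed) illustration, MariaDB Server ships the mariabackup tool for online backups; the sketch below drives it from Python. Paths and credentials are hypothetical, and distributed databases such as Xpand ship their own parallel backup tooling, which this does not attempt to reproduce.

```python
# A minimal online-backup sketch using MariaDB's mariabackup tool.
# The backup directory and credentials are hypothetical.
import subprocess
from datetime import datetime, timezone

target = f"/backups/full-{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}"

# --backup copies data files while the server keeps serving traffic;
# redo log records captured alongside keep the copy consistent.
subprocess.run(
    ["mariabackup", "--backup", f"--target-dir={target}",
     "--user=backup", "--password=secret"],
    check=True,
)

# --prepare applies the captured redo log so the backup is
# transactionally consistent and ready for restore.
subprocess.run(["mariabackup", "--prepare", f"--target-dir={target}"],
               check=True)
```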
Configuration Management
Remember that the Facebook outage was not a natural disaster or a security failure, and likely not even a source code change; it was a configuration mistake. While modern DevOps has made administration and change management easier, it has also made it more complex. Where there is complexity, there is the opportunity for human error.
New tools are available to reduce the chance of this kind of error. One of the most promising is GitOps. GitOps deploys agents alongside the various components of a system. Administrators check changes into a revision control repository; rather than being pushed to the nodes of the system, those changes are pulled by the agents and applied. Moreover, changes can be rolled back automatically in the event communication is lost.
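In spirit, the pull loop looks something like the sketch below. This is a conceptual illustration, not any particular product (Flux, Argo CD, and others implement it far more robustly); the repository path, branch, and apply script are hypothetical placeholders.

```python
# A conceptual sketch of a pull-based GitOps agent loop.
import subprocess
import time

REPO_DIR = "/opt/config-repo"   # hypothetical local clone of the config repo
POLL_SECONDS = 60

def git(*args):
    """Run a git command in the config repo and return its output."""
    result = subprocess.run(["git", "-C", REPO_DIR, *args],
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()

last_good = git("rev-parse", "HEAD")   # last revision known to apply cleanly
failed = None                          # revision we should not blindly retry

while True:
    git("fetch", "origin")
    target = git("rev-parse", "origin/main")
    if target != git("rev-parse", "HEAD") and target != failed:
        git("reset", "--hard", target)  # pull the reviewed change
        try:
            # apply_config.sh is a placeholder for whatever applies the
            # checked-in state to this node.
            subprocess.run(["./apply_config.sh"], cwd=REPO_DIR,
                           check=True, timeout=300)
            last_good = target
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            # The change failed or hung: roll back to the last good revision.
            failed = target
            git("reset", "--hard", last_good)
            subprocess.run(["./apply_config.sh"], cwd=REPO_DIR, check=True)
    time.sleep(POLL_SECONDS)
```

Because the agent pulls rather than being pushed to, a node that loses contact simply keeps running its last good configuration instead of drifting.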
The Big One
Surviving the big one is about preparation: multiple redundancies, backups, and configuration management, including at the database layer. For cloud applications, it means deploying new technologies, including distributed databases that span multiple availability zones and replicate across regions. However, the “Big One” is often no act of God but a human error. Surviving that requires backups as well as a way to manage and apply configuration and roll back changes that do not work. Those who apply these practices can not only perform and scale but carry on even when bad things happen.
Author Bio
Andrew C. Oliver is a columnist and software developer with a long history in open source, databases, and cloud computing. He founded Apache POI and served on the board of the Open Source Initiative. Oliver also helped with marketing at startups including JBoss, Lucidworks, and Couchbase. He is currently the Senior Director of Product Marketing for MariaDB Corporation.