Wednesday, 23 October 2019 16:38

The Five Hidden Risks of IT Disaster Recovery Failures

Written by  JOSEPH NOONAN

In today’s world – where data can reside almost anywhere – protecting your business from threats can be complex and costly. Businesses often struggle with confidence in their ability to recover from disaster and defend against evolving threats like ransomware.

The 2019 Atlantic hurricane season has been true to form with the unprecedented devastation of Category 5 Hurricane Dorian. According to the Federal Emergency Management Agency (FEMA), 40 percent of businesses fail to reopen after a disaster, such as flood, fire, hurricane, blizzard, or earthquake.

Not only are natural disasters leading to business and IT concerns, but ransomware is also on the rise and becoming a board-level conversation. Experts believe up to 40 percent of all email spam contains ransomware. Ransomware has become so prolific it is no longer a question of if you are going to get hit with this type of malware, but when.

In light of these threats, unexpected recovery failure is a scenario every IT team fears. Teams diligently back up critical servers, protect the data by moving it offsite and, if necessary, archive it for long-term retention. But even when you need it most, such as during a natural disaster, disaster recovery can fail.

This continues to happen at a surprisingly high rate, even while new technologies are making backup, retention, and disaster recovery easier, more reliable, and more cost-effective than ever. So, why do recoveries continue to fail?

This article examines five common problems that are often responsible for unexpected IT recovery failures.

1. Failure to identify and understand dependencies in both backup and recovery

For today’s IT teams, recovery is no longer the simple process of making a copy of data and loading it on a new server. IT faces the challenge of meeting ever-higher end-user expectations to access critical applications while simultaneously managing increasingly complex infrastructures that combine physical, virtual and cloud environments.

Disaster recovery plans often include backup and data retention strategies that do not thoroughly map the dependencies and requirements needed for smooth disaster recovery. Many are learning the hard way that failing to align backup plans with specific restore expectations can have devastating consequences.

To identify critical dependencies, brainstorm a variety of downtime scenarios and walk through specific steps you will follow to restore service to end users. Examine each step in the process for potential dependencies or barriers to disaster recovery. Then, document critical dependencies – boot orders, application requirements, etc. – and incorporate them in the recovery steps.
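One way to make documented boot orders checkable is to record each dependency explicitly and derive the startup sequence from it. The sketch below is illustrative only: the service names and the DEPENDENCIES map are hypothetical, and it assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
DEPENDENCIES = {
    "domain-controller": [],
    "database": ["domain-controller"],
    "app-server": ["database", "domain-controller"],
    "web-frontend": ["app-server"],
}


def recovery_order(dependencies):
    """Return a boot order in which every service starts after its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency in recovery plan: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, service in enumerate(recovery_order(DEPENDENCIES), start=1):
        print(f"{step}. {service}")
```

Deriving the order from the map, rather than writing it by hand, means a newly added dependency changes the recovery steps automatically, and a circular dependency is reported while planning instead of during an outage.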

2. Lack of understanding of software compatibility issues

There is a wide range of software compatibility issues that can render data unrecoverable. Microsoft Shadow Copy (also known as Volume Snapshot Service or VSS) is a common source of compatibility issues in Windows. VSS captures and copies stable images for backups on running servers (and other systems) without degrading the performance and stability of the services they provide. Conflicting software may cause those backups, and the recoveries that depend on them, to fail.

Resolving this class of errors can be complex, requiring hours of troubleshooting to identify and fix the offending VSS writer errors.

However, new backup and cloud disaster recovery technologies are integrating advanced self-healing software to address software compatibility problems. This technology automatically detects VSS compatibility issues, misconfigurations, and a wide range of threats to recoverability. Without intervention from IT, this type of software mediates VSS conflicts, restarts backups, and performs a variety of other steps to remediate backup issues before they threaten recovery.

3. Inadequate testing

IT teams continue to struggle to find the time and resources needed to perform disaster recovery testing frequently enough to ensure recoveries will happen as planned. Recovery testing can be costly and take valuable IT resources away from more value-added activities.

Testing should be done at least monthly and reflect a realistic disaster recovery scenario. That means no shortcuts, such as testing only annually (or not at all), preloading tapes in tape libraries, prestaging servers, and substituting spot checking for full testing of restores.

4. Failure to protect against data corruption and malware

There are myriad causes of backup data corruption that can cause recoveries to fail – from solar flare bit flipping, to unexpected power outages, to XFS and filesystem issues, to various hardware failures, and human error.
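Whatever its origin, corruption of this kind can be caught by comparing checksums recorded at backup time with checksums of the restored files. The following is a minimal sketch under stated assumptions, not a description of any particular backup product: the function names build_manifest and find_corruption are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backup files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256, at backup time."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_corruption(manifest, restored_root):
    """Return (relative_path, problem) pairs for missing or altered files."""
    restored_root = Path(restored_root)
    problems = []
    for rel_path, expected in manifest.items():
        candidate = restored_root / rel_path
        if not candidate.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(candidate) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

A manifest built when the backup is taken and stored separately from the backup media lets a test restore be verified file by file, which is a stronger check than confirming that the restore job merely completed.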

Despite the growing frequency of headline-grabbing incidents, failing to detect malware in backup environments continues to be among the most common issues causing disaster recovery failures. Ransomware creators have become increasingly sophisticated – creating programs that lie dormant long enough to be included in data backups, eliminating the ability to defend against attacks with a simple recovery of the latest data.

Whatever the cause, IT teams can take practical steps to protect backups and recoveries from corruption and malware. For example, select a backup and recovery technology that is Linux-based, as most malware infections target Windows-based systems. Additionally, ensure your backup and recovery technology can detect early warning signs of malware infection. These technologies use AI to detect and automatically alert you to anomalous patterns in the backup environment, such as increases in encryption density, that indicate a likely impending incident.
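To give a rough sense of what an "encryption density" signal might look like, the sketch below measures Shannon entropy per byte and reports the share of data blobs that look encrypted. This is a simple heuristic swapped in for illustration, not the AI-based detection the commercial products use; the function names and the 7.5 bits-per-byte threshold are assumptions, and compressed files also score high, so a real system would need to account for them.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Assumed heuristic: near-maximal entropy suggests encrypted content."""
    return shannon_entropy(data) >= threshold


def encrypted_share(blobs, threshold: float = 7.5) -> float:
    """Fraction of blobs in one backup that look encrypted."""
    blobs = list(blobs)
    if not blobs:
        return 0.0
    return sum(looks_encrypted(blob, threshold) for blob in blobs) / len(blobs)


if __name__ == "__main__":
    plain = b"quarterly sales report, draft two\n" * 500
    random_like = os.urandom(64 * 1024)  # stands in for ransomware-encrypted data
    print(f"plain text:  {shannon_entropy(plain):.2f} bits/byte")
    print(f"random data: {shannon_entropy(random_like):.2f} bits/byte")
```

Tracking encrypted_share from one backup to the next, and alerting when it jumps well above its usual level, is one plausible way to surface the kind of anomaly described above.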

5. Failure to follow media management best practices

One of the most common reasons a seemingly perfect backup cannot be recovered is the mishandling of backup or archive media – tapes, removable hard drives, etc. According to a recent vendor survey, 57 percent of enterprises are still using tape (and will continue to do so) for data retention in cases where fast recovery time is not a concern.

While tape and removable disk media are relatively low tech, they are highly manual and require disciplined adherence to best practices. Simple human errors such as mislabeling a tape or an archive drive can make recovery impossible.

Document the specific steps required for best practices in media handling for your environment and ensure documentation is accessible to everyone who may be involved in disaster recovery. This documentation needs to cover the entire process, from backup schedules to media rotation schedules to labeling conventions to storage and transportation.
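A documented labeling convention is easier to enforce when labels can be checked mechanically before media leaves the building. The sketch below assumes a purely hypothetical convention of the form SITE-YYYYMMDD-TYPE-NN (for example, NYC-20191023-FULL-01); the pattern and field names are illustrative and should be replaced with whatever your own documentation specifies.

```python
import re
from datetime import datetime

# Hypothetical convention: three-letter site, backup date, type, two-digit sequence.
LABEL_PATTERN = re.compile(
    r"(?P<site>[A-Z]{3})-(?P<date>\d{8})-(?P<kind>FULL|INCR)-(?P<seq>\d{2})"
)


def parse_label(label: str) -> dict:
    """Split a media label into its fields, rejecting anything off-convention."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"label does not follow the convention: {label!r}")
    try:
        backup_date = datetime.strptime(match["date"], "%Y%m%d").date()
    except ValueError as exc:
        raise ValueError(f"label contains an invalid date: {label!r}") from exc
    return {
        "site": match["site"],
        "date": backup_date,
        "kind": match["kind"],
        "sequence": int(match["seq"]),
    }


if __name__ == "__main__":
    print(parse_label("NYC-20191023-FULL-01"))
```

Rejecting a malformed label at the moment it is written is far cheaper than discovering during a recovery that nobody can tell which tape holds the last full backup.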

Conclusion

Today’s IT infrastructures are increasingly complex combinations of on-premises, SaaS, cloud, and virtual environments. Protecting critical operations running in these environments requires a fundamental change in the paradigm – from server-level backup to holistic recovery-centric planning. By defining and working from an understanding of the disaster recovery needs of the entire infrastructure, IT can save time, reduce critical risks and eliminate the nightmare of recovery failures.

Joe Noonan has spent more than 17 years delivering hardware and software technology solutions for virtualization, cloud, data protection and disaster recovery. He has worked for Unitrends since 2010 driving its software product strategy for data protection, recovery automation, and cloud disaster recovery and migration. Noonan has also held roles in developing technology alliances and now is VP of product management and marketing for Unitrends and Spanning, both Kaseya companies. Noonan holds a Bachelor of Science in electrical engineering and an MBA, both from Villanova University.