Figure 1: The requirements of the cloud generation are more complex and difficult to meet than those of previous-generation data centers.
My colleague Jack Pouchet has been posing an interesting question lately: have data centers become so focused on driving down PUE that availability has been compromised as an unintended consequence? It certainly seems that high-profile data center outages are more common now than they were 10 years ago. Is PUE the culprit?
First, let me clarify. Pouchet isn’t saying that high availability and low PUEs are mutually exclusive. His message is that in our quest to drive down PUE we’ve moved away from some proven architectures and technologies and embraced newer approaches that increase risk in ways that are difficult to foresee. Pouchet’s point is that we need to build on the experience of the past to support the multi-faceted requirements of today’s data centers.
There has been a fundamental shift in data center requirements in the last five years as we transitioned into the cloud generation. Just as the mainframe generation once gave way to the client/server generation, the client/server generation is now being replaced by the cloud generation (Figure 1). Beginning in about 2011, we started to see the focus on availability, cooling, and flexibility, which predominated in the client/server generation, move to a focus on capacity, modularity, speed, efficiency, and integration, which mark the cloud generation. The overall message is to get the architecture right for your business requirements, whether the priority is data security, transactional availability, or operational availability. Do that, and the efficiency, sustainability, and reliability will take care of themselves.
That shift doesn’t mean we no longer need to design and manage for availability – the requirements of emerging generations of data centers build on rather than displace those that came before. Consider the results from the latest Ponemon Institute Cost of Downtime Report released earlier this year: the mean cost of a full data center outage is up seven percent compared to 2013 (Figure 2) and the average time to recover from an outage rose 10 percent (Figure 3).
We can’t afford to lose focus on availability. Instead, we have to adapt proven, high-availability designs and processes to the new requirements of the cloud generation while leveraging emerging risk-mitigating technologies that support those goals. Here are four ways to accomplish that.
1. High-Availability Power Architectures for the Cloud Generation
The 2N or 2N+1 dual-bus architecture has historically been the choice of high-availability data centers. When properly designed, these architectures eliminate single points of failure in the critical power system and allow maintenance to be performed on any component while continuing to power the load.
In today’s environment, where the need to optimize capital efficiency and resource utilization is paramount, this level of redundancy is more difficult to justify. A high-availability alternative has emerged in the form of one of the various reserve architectures pioneered in large colocation facilities.
The basic reserve architecture creates redundant power protection with fault tolerance and concurrent maintainability through the use of static transfer switches (STS) that allow a redundant UPS system to pick up the load from any one of multiple UPS systems (Figure 4). Downstream, the power distribution system can be similar in design to that of a 2N dual-bus architecture. This reduces initial costs while raising UPS utilization rates from below 50 percent in the 2N architecture to as much as N/(N+1) in a reserve architecture, where N is the number of modules used for capacity. In the configuration in Figure 4, that is four of five, or 80 percent.
Figure 4: Basic reserve architecture with four UPS systems supporting the load and one system in reserve for redundancy.
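The utilization comparison can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are our own, and it assumes identical modules, with primary modules that can be fully loaded while reserve modules carry no load in normal operation.

```python
def max_utilization_2n() -> float:
    """Upper bound on UPS utilization in a 2N dual-bus design.

    Each bus must be able to carry the entire load alone, so neither
    bus can be loaded beyond half of its capacity.
    """
    return 0.5


def max_utilization_reserve(primary_modules: int, reserve_modules: int = 1) -> float:
    """Upper bound on UPS utilization in a reserve architecture.

    Assumes identical modules: the primaries may run fully loaded while
    the reserve modules idle, so at most N of the N + R installed
    modules carry load.
    """
    if primary_modules < 1 or reserve_modules < 0:
        raise ValueError("need at least one primary and a non-negative reserve count")
    return primary_modules / (primary_modules + reserve_modules)


if __name__ == "__main__":
    print(f"2N dual bus:        {max_utilization_2n():.0%}")
    print(f"4 primary + 1 res.: {max_utilization_reserve(4):.0%}")
```

These are upper bounds. Real facilities typically run below them to leave headroom for load growth.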
The reserve architecture can be tailored to business requirements and service models. The shared reserve model shown in Figure 4 works well when the loads across a facility are of equal priority or criticality. In a facility where some clients or loads require a higher level of availability, a dedicated reserve system can be employed. In this architecture, the UPS system supporting high-priority loads has a dedicated reserve system, essentially creating a 2N system within the reserve architecture. Lower priority loads are supported through a shared reserve UPS system.
Alternatively, two reserve modules can be shared across multiple primary modules in a configuration that is commonly referred to as “eight to make six” or “ten to make eight.” With this configuration, any module can be taken offline for service while maintaining redundancy across the system. A final option is to create reserve capacity from the unused capacity of the primary UPS system using a critical power management system.
A critical power management system (CPMS) is highly recommended for any reserve system implementation. It maximizes reserve system utilization and manages transfer procedures so that no reserve system module is overloaded.
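The overload check at the heart of such a transfer can be sketched in a few lines. This is a simplified illustration, not how any particular CPMS product works: the `UpsModule` class, the kW figures, and the rule that a transfer is allowed only when the reserve module has enough headroom are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class UpsModule:
    """Hypothetical model of a UPS module for illustration."""
    name: str
    capacity_kw: float
    load_kw: float = 0.0

    @property
    def headroom_kw(self) -> float:
        return self.capacity_kw - self.load_kw


def can_transfer(primary: UpsModule, reserve: UpsModule) -> bool:
    """True if the reserve module can absorb the primary's load without overload."""
    return primary.load_kw <= reserve.headroom_kw


def transfer(primary: UpsModule, reserve: UpsModule) -> None:
    """Move the primary's load to the reserve, refusing if it would overload."""
    if not can_transfer(primary, reserve):
        raise RuntimeError(
            f"transfer of {primary.name} would overload {reserve.name}"
        )
    reserve.load_kw += primary.load_kw
    primary.load_kw = 0.0
```

With a 500 kW reserve module already carrying 400 kW from one primary, a request to transfer a second primary loaded at 300 kW would be refused rather than allowed to overload the reserve.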
The reserve system is just one example of how power system architectures are evolving to build upon the experience of the past while better meeting the needs of the cloud generation in terms of capacity, modularity, speed, and efficiency.
2. Intelligent, High-Efficiency Thermal Management
Thermal management, traditionally known as cooling or heat removal, is the first place many organizations attack when they seek to drive down PUE. There are significant opportunities in many data centers. They just have to be addressed wisely.
Economization, which has been used in data centers since before the turn of the century, enables the use of outside air to increase cooling system efficiency. Interest in economization has grown significantly in recent years as energy costs and consumption continue to rise. However, some organizations eager to push the limits of economization have adopted comfort-level economization systems for the data center, repeating the mistakes of the past when comfort cooling systems were used in some data centers and proved incapable of delivering the capacity, precision or ratio of sensible cooling required to support dense clusters of electronics.
Today, there are multiple data center economization options available that can safely maximize the use of outside air to support thermal management, including direct and indirect systems and precision cooling units that integrate economization into their design. These systems are increasingly being used to replace traditional mechanical cooling, with the use of outside air and water evaporation systems providing the desired supply temperature throughout most, if not all, hours of the year. These technologies can improve cooling system efficiency by up to 50 percent while providing a safe environment for data center systems.
Perhaps the greatest change in thermal management in the cloud generation is the use of intelligent thermal controls. These controls enable machine-to-machine communication so thermal units across a facility can work as a team to further increase efficiency. They also automate cooling system operational routines, such as temperature and airflow management, valve auto-tuning, lead/lag, and others that enhance overall system operation. In addition, they provide centralized visibility into unit operation that can be used to guide maintenance and help ensure any failure doesn’t affect IT systems.
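As one example of an automated routine, lead/lag control decides which units run and which stand by. The sketch below shows one plausible policy, under our own assumptions: units are rotated by run hours so wear is spread evenly, and failed units are skipped. The `CoolingUnit` class and unit names are hypothetical, and real controllers use considerably richer logic.

```python
from dataclasses import dataclass


@dataclass
class CoolingUnit:
    """Hypothetical thermal unit as seen by a teamwork controller."""
    name: str
    run_hours: float
    healthy: bool = True


def assign_lead_lag(
    units: list[CoolingUnit], units_needed: int
) -> tuple[list[str], list[str]]:
    """Pick the healthy units with the fewest run hours as lead.

    Remaining healthy units become lag (standby). Unhealthy units are
    excluded from both lists.
    """
    available = sorted(
        (u for u in units if u.healthy), key=lambda u: u.run_hours
    )
    lead = [u.name for u in available[:units_needed]]
    lag = [u.name for u in available[units_needed:]]
    return lead, lag
```

If a lead unit later reports a fault, rerunning the assignment promotes a lag unit automatically, which is the behavior that keeps a single failure from reaching the IT load.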
By bringing increased intelligence and greater integration to proven technologies, these developments meet the needs of capacity, efficiency, and integration in cloud data centers.
3. Advancing Operational Maturity
The data center represents a complex network of interdependent systems managed and maintained by humans. If there is one thing our history (and Murphy’s Law) has taught us, it is that complex human-machine systems fail, usually more frequently than we expect. The key to minimizing these failures and managing them when they occur is the combination of organizational and personnel experience.
One of the challenges we often face on a power system upgrade is that the one-line diagram no longer reflects the current state of the data center, which has evolved since the original one-line was created. It’s essential to have a clear, up-to-date picture of what’s in the data center and how it is configured to respond efficiently to an outage.
Equally important is documenting the tasks needed to respond effectively to outages and establishing a schedule to practice for outage events. Two best practice options: schedule regular “pull-the-plug” tests to ensure people and equipment react appropriately during an event; or schedule less extreme simulations such as automated battery tests.
UPS system failure can also be addressed through a disciplined approach to startup and maintenance. Factory witness tests provide a controlled environment for ensuring the components within a power system work together as designed, while ongoing preventive maintenance has been proven to reduce the likelihood of equipment failure.
A study of 5,000 three-phase UPS units with more than 185 million combined operating hours found that more frequent preventive maintenance visits correlated with a higher mean time between failures (MTBF) (Figure 5). Most organizations can optimize their maintenance investment through two preventive maintenance visits annually.
Preventive maintenance is a common target when budget cuts are mandated but it is important to recognize that there is a cost associated with these cuts in the form of increased risk. As the 2016 Cost of Data Center Outages Report documents, that cost is growing, and short-term savings created by cutting preventive maintenance could result in a large, unanticipated expense.
Another factor that enhances organizational maturity and minimizes potential failures is battery monitoring. Batteries remain the most cost-effective approach to providing short-term ride-through power. Although battery performance continues to advance, batteries are still the weak link in the critical power system. Integrated battery monitoring strengthens this weak link by providing continuous visibility into battery health (including cell voltage, resistance, current, and temperature) without requiring a full discharge and recharge cycle. This allows batteries to be utilized fully while preventing unanticipated failure. These systems also support predictive analysis, which can optimize replacement cycles.
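To give a feel for what a monitoring system evaluates, here is a minimal sketch of per-cell alarm logic. Everything in it is an assumption for illustration: the `CellReading` class is hypothetical, the limits are placeholders rather than manufacturer values, and current is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class CellReading:
    """Hypothetical snapshot of one battery cell."""
    cell_id: str
    voltage_v: float
    resistance_mohm: float
    temperature_c: float


# Placeholder limits for illustration only; real limits come from the
# battery manufacturer and the site's own baseline measurements.
MIN_VOLTAGE_V = 2.20
MAX_TEMPERATURE_C = 30.0
MAX_RESISTANCE_RISE = 0.25  # fractional rise over the cell's baseline


def cell_alarms(reading: CellReading, baseline_resistance_mohm: float) -> list[str]:
    """Return the alarm conditions triggered by a single cell reading."""
    alarms = []
    if reading.voltage_v < MIN_VOLTAGE_V:
        alarms.append("low voltage")
    if reading.temperature_c > MAX_TEMPERATURE_C:
        alarms.append("high temperature")
    rise = (
        reading.resistance_mohm - baseline_resistance_mohm
    ) / baseline_resistance_mohm
    if rise > MAX_RESISTANCE_RISE:
        alarms.append("resistance rise")
    return alarms
```

Tracking resistance against each cell's own baseline, rather than against a fixed number, is what lets the trend feed the kind of predictive replacement analysis described above.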
We simply can’t neglect the human factor in today’s data center. Clearly defined policies and procedures, practiced regularly and supported by key technologies such as battery monitoring, remain as important today as they have ever been. The key is to balance the level of risk you are willing to absorb with the need to accurately simulate real-world conditions, performing these tests frequently enough to allow personnel to get comfortable acting under the pressure of an outage.
4. Centralizing Infrastructure Management
Data center infrastructure management (DCIM) is the final piece of the availability puzzle and one that has “grown up” with the cloud generation. It has become a valuable tool for organizations seeking to maximize availability.
Two capabilities, in particular, can help prevent downtime. First is the ability to consolidate monitoring data across all systems to highlight potential infrastructure issues before they impact operations. The other is the ability to better understand the interdependencies between data center systems. This is especially important as data center capacity management becomes more dynamic. As loads are shifted to available resources, it’s critical to know whether the infrastructure supporting those resources has the capacity to support the new load, to prevent problems such as exceeding UPS capacity, or creating hot spots that can damage equipment.
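The capacity question in that last scenario reduces to a simple pre-check before any load is moved. The sketch below is a hedged illustration rather than the behavior of any specific DCIM product: the `Zone` class and its fields are hypothetical, and it assumes the IT load in kW is a reasonable proxy for the heat the cooling system must remove.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Hypothetical group of racks sharing UPS and cooling capacity."""
    name: str
    ups_capacity_kw: float
    cooling_capacity_kw: float
    it_load_kw: float


def shift_problems(target: Zone, added_load_kw: float) -> list[str]:
    """List the infrastructure limits a proposed load shift would exceed."""
    new_load_kw = target.it_load_kw + added_load_kw
    problems = []
    if new_load_kw > target.ups_capacity_kw:
        problems.append("exceeds UPS capacity")
    # Assumption: nearly all IT power ends up as heat in the zone.
    if new_load_kw > target.cooling_capacity_kw:
        problems.append("exceeds cooling capacity (hot-spot risk)")
    return problems
```

An empty list means the shift fits within both limits. Anything else is a reason to pick a different target or to rebalance first.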
The job of managing a data center is increasingly complex, and requirements are more varied and difficult to achieve than at any time in history. At the same time, businesses are more dependent on their data centers than ever. The cost of downtime continues to rise, with costs for some facilities exceeding $2 million per incident. By adapting proven architectures and technologies to the new requirements while deploying new technologies that increase visibility into system operation, we can design and manage high-availability data centers that deliver the capacity, efficiency, speed, and integration required in the cloud generation.