Data Center Power Quality Challenges Put Businesses at Risk; It’s Time to Fix That
While disaster recovery and risk management conversations frequently center on cloud backup solutions or network route redundancy, it is likely no surprise that data centers themselves can be a source of risk to the organizations that depend on them, small and large businesses alike. In fact, data center downtime can cost hundreds of thousands or even millions of dollars. Those costs fall not just on the data center but also on its customers.
The biggest threats to data center uptime are power quality and other power management issues. Power-related problems cause 43 percent of data center outages according to the Uptime Institute. As a result, risk management and business continuity protection actually begin with improving resiliency for data centers.
Power is the lifeblood of a data center. It is critical in enabling every single aspect of data center operations. When power quality or reliability in a data center suffers, so does everything else. Multiple power sources are often required for a data center to achieve both the availability and the efficiency it requires. Even after the power sourced from the utilities is conditioned and the required redundancy is achieved, the equipment operating within the facility, such as variable-frequency drives, computer power supplies and other electronics, often creates harmful harmonics.
Data center managers have three primary power quality pain points that have the potential to disrupt their operations and create risks for their end users. In this article, we explain how these challenges can negatively impact data centers and, by extension, the businesses relying on them. We also discuss the most powerful step data center facility managers can take to mitigate downtime risks right now.
Pain Point 1: Lack of Visibility into Granular Power Quality Data
Historically, power quality assessments are completed as a one-time event during the bring-up phase of a data center and are only undertaken again when a significant server overhaul is made. However, this approach vastly underestimates the ongoing impact of power quality issues within the data center.
Power quality issues are regularly caused by the equipment inside of a data center that is critical to its operation. A linear load (common in household appliances) draws current in a sinusoidal waveform, which creates no waveform distortion or harmonics. However, most equipment in a data center is actually a non-linear load, which draws current in short, high-amplitude pulses that introduce harmonic distortion. Examples of non-linear load equipment include switch-mode power supply units, variable-speed drives, computers and uninterruptible power supplies (UPS).
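To make the difference between linear and non-linear loads concrete, here is a minimal Python sketch that estimates total harmonic distortion (THD) from one cycle of sampled current. It is an illustration only: the function names, the synthetic waveforms and the choice to sum harmonics up to the 25th order are assumptions, not details from any particular meter or PDU.

```python
import math

def harmonic_magnitude(samples, harmonic):
    """Amplitude of one harmonic, computed with a single-bin DFT.

    Assumes `samples` spans exactly one fundamental cycle.
    """
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * harmonic * k / n) for k, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * harmonic * k / n) for k, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n

def total_harmonic_distortion(samples, max_harmonic=25):
    """THD = sqrt(sum of squared harmonic amplitudes, orders 2..max) / fundamental."""
    fundamental = harmonic_magnitude(samples, 1)
    harmonics = [harmonic_magnitude(samples, h) for h in range(2, max_harmonic + 1)]
    return math.sqrt(sum(m * m for m in harmonics)) / fundamental

N = 256  # samples per cycle (assumed)

# Linear load: a pure sinusoid.
linear = [math.sin(2 * math.pi * k / N) for k in range(N)]

# Synthetic non-linear load: fundamental plus 3rd and 5th harmonics.
nonlinear = [
    math.sin(2 * math.pi * k / N)
    + 0.3 * math.sin(2 * math.pi * 3 * k / N)
    + 0.2 * math.sin(2 * math.pi * 5 * k / N)
    for k in range(N)
]

thd_linear = total_harmonic_distortion(linear)
thd_nonlinear = total_harmonic_distortion(nonlinear)
print(f"linear THD: {thd_linear:.1%}, non-linear THD: {thd_nonlinear:.1%}")
```

The pure sinusoid yields a THD of essentially zero, while the synthetic non-linear waveform yields roughly 36 percent. Real measurements would need windowing and synchronization to the line frequency, which this sketch leaves out.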
Harmonic distortion can cause mis-operation of equipment, unwanted current, cable overheating, vibration or buzzing, false tripping of protection devices, increased energy loss and a variety of other equipment malfunctions or failures. When occurrences such as blinks, flashes, glitches, automatic resets and even downtime occur, they are the results of several kinds of power quality issues which include, but are not limited to:
- Transients.
- Interruptions.
- Sags.
- Undervoltage.
- Swell/overvoltage.
- Waveform distortion.
- Voltage fluctuations.
- Frequency variations.
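Several of the voltage-related categories above are commonly distinguished by how far the RMS voltage departs from nominal and for how long. The following Python sketch shows one way such a classification could look. The thresholds (0.1, 0.9 and 1.1 per unit, and a one-minute boundary between short and sustained events) are assumptions loosely modeled on ranges commonly associated with IEEE Std 1159; check the standard itself before relying on them.

```python
def classify_voltage_event(rms_pu, duration_s):
    """Classify an RMS voltage deviation.

    rms_pu: RMS voltage as a fraction of nominal (1.0 = nominal).
    duration_s: how long the deviation lasted, in seconds.
    All thresholds here are illustrative assumptions.
    """
    sustained = duration_s > 60.0  # assumed boundary for long-duration events
    if rms_pu < 0.1:
        return "interruption"
    if rms_pu < 0.9:
        return "undervoltage" if sustained else "sag"
    if rms_pu > 1.1:
        return "overvoltage" if sustained else "swell"
    return "normal"

# Hypothetical events: (RMS voltage in per unit, duration in seconds)
events = [(0.05, 0.2), (0.7, 0.1), (0.85, 120.0), (1.2, 0.5), (1.15, 300.0)]
labels = [classify_voltage_event(v, d) for v, d in events]
print(labels)
```

Transients, waveform distortion and frequency variations cannot be identified from RMS magnitude alone, so they are outside the scope of this sketch.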
Lack of visibility into power quality metrics can severely limit failover planning and increase device failure rates, requiring additional troubleshooting by data center managers and opening up these mission-critical facilities and the businesses which rely on them to greater risk of downtime. Since the majority of power used in data centers is consumed directly by the IT devices plugged into rack power distribution units (PDUs), the PDU is the most logical place to gain 24×7 visibility into changes in power quality and power load. Real-time monitoring and improved visibility into these metrics can make it possible to proactively plan for and prevent a range of issues and to build in redundancy and automation to keep systems running.
Pain Point 2: Inability to Optimize Efficiency and Mission-Critical Infrastructure Configurations
Without added visibility into power metrics at the PDU, data center operators will face a host of other challenges around optimizing operational efficiency and scaling their mission-critical infrastructure deployments to meet the increasingly high-density requirements of modern businesses.
When it comes to operational efficiency, stranded power and data center capacity planning go hand-in-hand. Stranded capacity is installed power which is available but not used to support a data center’s critical IT load, which is the amount of power the facility’s IT equipment (servers, network equipment, etc.) draws. It can occur when available power is overprovisioned or when, even with cooling available, there is no space to install additional infrastructure. Stranded capacity has been called the data center industry’s “dirty little secret” and can be a major obstacle to deploying mission-critical infrastructure optimally, in a way that is both sustainable and geared towards high-density deployments. Being able to identify this capacity in a data center and make it available can help data center operators maximize power and space usage efficiency. This, in turn, can make life simpler for modern businesses which rely on data center providers to help them scale up their IT infrastructure quickly and reliably. These same tools can also be used to identify zombie servers that are consuming power (often drawing 50-60 percent of their load while sitting idle) but providing no useful work or computing. The catch? It won’t happen without better power metric measurement and monitoring at the rack PDU.
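As a rough illustration of how PDU measurements could feed this kind of analysis, the Python sketch below estimates stranded power per rack and flags likely zombie servers. Everything in it is hypothetical: the record layout, the 20 percent headroom, the 50-watt and 5 percent utilization thresholds, and the assumption that CPU utilization is available from some other monitoring source alongside outlet-level power.

```python
from dataclasses import dataclass

@dataclass
class RackReading:
    name: str
    provisioned_kw: float  # power allocated to the rack
    peak_kw: float         # highest measured draw over the review period

def stranded_kw(rack, headroom=0.2):
    """Provisioned power left unused after reserving headroom above the measured peak."""
    needed = rack.peak_kw * (1 + headroom)
    return max(0.0, rack.provisioned_kw - needed)

def find_zombies(outlets, cpu_threshold=0.05, min_watts=50.0):
    """Outlets drawing real power while the attached server shows near-zero utilization.

    outlets: iterable of (outlet_id, avg_watts, avg_cpu_utilization) tuples.
    """
    return [oid for oid, watts, cpu in outlets
            if watts >= min_watts and cpu <= cpu_threshold]

# Hypothetical readings for two racks and three outlets.
racks = [RackReading("A1", 10.0, 4.0), RackReading("A2", 8.0, 7.5)]
stranded = {r.name: round(stranded_kw(r), 2) for r in racks}

zombies = find_zombies([
    ("A1-03", 180.0, 0.01),  # drawing power, doing almost nothing
    ("A1-04", 220.0, 0.45),  # busy server
    ("A1-05", 5.0, 0.0),     # effectively off
])
print(stranded, zombies)
```

In this made-up data, rack A1 has about 5.2 kW that could be reallocated, rack A2 has none, and outlet A1-03 is the only zombie candidate.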
Additionally, data center managers regularly encounter outlet issues when embarking on high-density deployments. Currently, facility managers tend to gravitate to off-the-shelf power strips to meet their density requirements, but this equipment often does not give them access to the right number, type or configuration of outlets. Overcurrent events on a PDU’s branch circuit can cause the PDU’s overcurrent protection device (OCPD) to trip and shut off all outlets on that branch circuit. If a coordination study has not been done with the upstream circuit breaker, then this event can trigger a total loss of power to the whole PDU. Often, the cause of an overcurrent event is the failure of a server’s power supply. Unfortunately, identifying the outlet responsible for the trips can be a difficult and time-consuming process for facility managers using the intelligent PDUs available on the market today. Using a PDU that can identify the failed device or outlet can save countless hours.
The problem for today’s businesses which rely on data center operators is this: as with all power quality issues, server power supply failures, breaker trips and time-intensive troubleshooting processes all raise the risk of downtime. While data center facility managers do employ all manner of backups and redundancy to keep this from happening, the risk is still present. Having monitoring capabilities at the outlet level can make it easier to identify the source of problems and solve issues quickly, saving time (and money) by getting systems back online faster when problems arise.
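The sketch below shows, in Python, one simple way outlet-level current history could narrow down the source of a branch trip: compare each outlet's last reading before the trip with its own earlier baseline. The data layout, the function name and the heuristic itself are assumptions for illustration, not a description of how any specific PDU works.

```python
from statistics import mean

def suspect_outlet(history, breaker_rating_a):
    """Return (outlet_id, surge_amps) for the outlet whose final sample rose
    the most above its own baseline, or None if the branch total at the end
    of the history did not exceed the breaker rating.

    history: dict mapping outlet_id -> list of current samples in amps,
             oldest first, ending at the trip (at least two samples each).
    """
    total_at_trip = sum(samples[-1] for samples in history.values())
    if total_at_trip <= breaker_rating_a:
        return None
    surges = {oid: samples[-1] - mean(samples[:-1])
              for oid, samples in history.items()}
    worst = max(surges, key=surges.get)
    return worst, surges[worst]

# Hypothetical samples for a 20 A branch with three outlets.
history = {
    "outlet-1": [2.0, 2.0, 2.0, 2.0],
    "outlet-2": [3.0, 3.0, 3.0, 18.0],  # sudden jump just before the trip
    "outlet-3": [4.0, 4.0, 4.0, 4.0],
}
result = suspect_outlet(history, breaker_rating_a=20.0)
print(result)
```

A heuristic like this only works if the PDU samples outlet current often enough to catch the surge before the OCPD opens, which is one reason outlet-level metering matters.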
Pain Point 3: Difficulties Integrating Power Quality Monitoring Within Existing Infrastructure Management Tools
Power quality monitoring can be a tricky subject for data center providers. This is because not all of the intelligent PDUs available on the market today offer these features, and some are not fully capable of integrating with data center infrastructure management (DCIM) or building management systems (BMS), causing a disconnect between the monitoring and operating tools used for the facility and information on power quality. Supporting the variety of infrastructure management tools used in data centers requires integration capabilities with a variety of platforms. Being able to integrate the added monitoring features of an advanced, intelligent PDU with a data center operator’s current management systems can streamline processes by helping operational teams detect and resolve issues that could cause downtime early on.
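One common integration pattern is to normalize PDU readings into a neutral format that a DCIM or BMS platform can ingest. The Python sketch below serializes a single outlet reading to JSON. The field names, identifiers and sample values are entirely hypothetical; a real integration would follow the schema and transport (for example SNMP, Modbus or a vendor REST API) that the target platform actually documents.

```python
import json

def to_dcim_record(pdu_id, outlet, volts, amps, thd_percent, timestamp):
    """Flatten one outlet reading into a plain dict (field names are hypothetical)."""
    return {
        "source": pdu_id,
        "outlet": outlet,
        "timestamp": timestamp,
        "metrics": {
            "voltage_v": volts,
            "current_a": amps,
            "apparent_power_va": round(volts * amps, 1),
            "current_thd_percent": thd_percent,
        },
    }

# Made-up reading for illustration only.
record = to_dcim_record("pdu-rack-12", 7, 208.0, 3.5, 12.4, "2024-01-01T00:00:00Z")
payload = json.dumps(record, sort_keys=True)
print(payload)
```

Keeping the record flat and self-describing makes it easier to map onto whichever management system is already in place.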
Solving the Pain Points Also Solves the Risk
The modern data center faces all kinds of internal risks to its uptime, especially power quality issues. These risks become more prevalent as demand pushes IT equipment into other locations such as colocation and edge sites, warehouses, storefronts and other facilities. From transients and sags to waveform distortion and voltage fluctuations, any number of power quality problems can directly threaten a data center’s uptime and reliability. That’s not good for the businesses that rely on these mission-critical facilities.
As discussed above, the rack PDU is the best place to significantly improve uptime in data center facilities. While there are many intelligent PDU options available today, consider as a necessity a more advanced PDU design that offers greater intelligence and visibility, down to the outlet level. The optimal intelligent PDU design should incorporate high-quality digital meters to measure power consumption, identify stranded capacity and provide critical information about the IT load. It should monitor both the PDU’s infeed, to confirm the quality of power going to the devices, and the PDU’s outlets, to identify potentially harmful power events. It should capture waveforms and other power quality information to support a full analysis of the problem.