Getting the Most Reliability from Your Assets with Remote Monitoring

by Max Hamner, Research and Development Engineer
Ray Daugherty, Senior Services Consultant

As a data center operator, the success of your business model depends on uptime and SLA compliance, both of which are strongly affected by the reliability and longevity of your assets. Reliability correlates directly with your ROI.

Power distribution and cooling gear for data centers (generators, PDUs, UPSs, CRACs, CRAHs) is costly to purchase and maintain, but downtime and lost contracts are far more expensive.

To maximize availability and SLA compliance, two of the top factors (above vendor quality) are the environmental conditions in which your equipment operates and the quality of the power supplied to it.

ASHRAE, ITIC, and other organizations have long documented the direct effect of power quality and operating environment on hardware uptime and reliability. They have produced many standards that define the challenges and goals for environmental and power conditions. This awareness has led major hyperscalers to include environmental and power compliance as part of their colocation contracts.

How Power Affects Your Infrastructure Gear

Power quality directly affects the life of your hardware. The damage comes not just from significant power events, but also from ongoing variations in power quality that accumulate as degraded reliability in your infrastructure and IT gear. The many small sags, surges, and spikes slowly erode the components in an electrical device. This effect has been well researched and documented by multiple organizations, which have produced reports and created standards that define the optimum operating conditions to maximize equipment life. The most prevalent industry standard is provided by ITIC (Information Technology Industry Council). They provide and maintain the ITIC curve (shown below), which lets you assess the risk of damage to a device from an individual power event, based on the magnitude and duration of that event.

ITIC Curve (from itic.org)
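
To make the idea concrete, here is a minimal Python sketch that classifies a single voltage event against a stepwise approximation of the ITIC envelope. The breakpoints are approximations of the commonly published curve, deliberately simplified into conservative steps (the real curve slopes between some of these points), and the function name and return labels are illustrative assumptions. Check the published curve before relying on any threshold.

```python
def classify_event(voltage_pct: float, duration_s: float) -> str:
    """Classify one RMS voltage event against a simplified ITIC-style envelope.

    voltage_pct: event magnitude as a percentage of nominal voltage.
    duration_s:  event duration in seconds.
    All breakpoints are approximate, stepwise stand-ins for the real curve.
    """
    # Upper limit: short events may go higher before damage becomes a risk.
    if duration_s <= 0.001:
        upper = 200.0   # conservative: the real curve allows more for shorter events
    elif duration_s <= 0.003:
        upper = 140.0
    elif duration_s <= 0.5:
        upper = 120.0
    else:
        upper = 110.0   # steady-state tolerance

    # Lower limit: short dropouts are ridden through by the power supply.
    if duration_s <= 0.020:
        lower = 0.0     # complete dropout tolerated for roughly one cycle
    elif duration_s <= 0.5:
        lower = 70.0
    elif duration_s <= 10.0:
        lower = 80.0
    else:
        lower = 90.0    # steady-state tolerance

    if voltage_pct > upper:
        return "prohibited"       # overvoltage region: risk of equipment damage
    if voltage_pct < lower:
        return "no_damage"        # undervoltage region: malfunction possible, damage not expected
    return "no_interruption"      # inside the envelope


if __name__ == "__main__":
    print(classify_event(130.0, 0.1))   # swell lasting 100 ms
    print(classify_event(75.0, 5.0))    # sag lasting 5 s
```

Counting how many events fall into each region over a month gives a rough picture of how much stress your gear is absorbing, even when nothing has visibly failed.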

How Environmental Conditions Affect Your Infrastructure Gear

Humidity, air pressure, air quality, and temperature combine to affect condensation and oxidation on surfaces, as well as airflow, which in turn affects cooling efficiency.

High humidity can lead to condensation and moisture buildup inside IT equipment, causing electrical short circuits and corrosion.  Low humidity can lead to static electricity buildup, causing electrostatic discharge events that can damage electronic components.

Dust and other particulate matter can accumulate inside IT equipment, blocking airflow and causing overheating. It can also interfere with moving parts, such as fans and drives, reducing their efficiency and potentially leading to equipment failure.

Excessive heat can cause IT equipment to overheat, leading to reduced performance and potentially causing hardware failures. Extremely cold conditions can also affect IT equipment, causing it to become less efficient and possibly leading to condensation and moisture-related problems.  Most IT equipment has recommended operating temperature ranges, and it’s important to ensure that the operating environment stays within these ranges.

The above factors can be controlled by climate control systems (HVAC) to maintain proper temperature and humidity levels, and by dust filters and clean rooms to reduce particulate contamination. 

ASHRAE (The American Society of Heating, Refrigerating and Air-Conditioning Engineers) has provided research results from studies into the impact of environmental conditions on hardware. One example is provided here:

https://www.ashrae.org/news/esociety/completed-research-rp-1755-february-2020

From this research they have provided a recommended standard (Equipment Thermal Guidelines for Data Processing Environments) specifically for data center operations. This specifies optimal environmental conditions to maximize equipment reliability and life.

Psychrometric chart used to measure ASHRAE compliance.

ASHRAE's thermal guidelines for the data center are described in the 2016 ASHRAE TC9.9 white paper Data Center Power Equipment Thermal Guidelines and Best Practices. The guidelines were last updated in 2021, and a PDF reference card is available for the 2021 Equipment Thermal Guidelines for Data Processing Environments.
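
As a rough illustration of what a compliance check involves, the Python sketch below converts a temperature and relative humidity reading into a dew point (using the Magnus approximation) and compares the result against an envelope. The default limits in `Envelope` are placeholder assumptions, not values quoted from the guidelines; take the real limits for your equipment class from the current ASHRAE reference card. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Envelope:
    # Placeholder limits (assumptions): replace with the values for your
    # equipment class from the current ASHRAE reference card.
    min_temp_c: float = 18.0
    max_temp_c: float = 27.0
    min_dew_point_c: float = -9.0
    max_dew_point_c: float = 15.0
    max_rh_pct: float = 70.0


def dew_point_c(temp_c: float, rh_pct: float) -> float:
    """Approximate dew point (deg C) with the Magnus formula."""
    a, b = 17.62, 243.12  # Magnus coefficients for water vapor over liquid water
    gamma = math.log(rh_pct / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)


def envelope_violations(temp_c: float, rh_pct: float,
                        env: Envelope = Envelope()) -> List[str]:
    """Return the list of limits a single sensor reading violates."""
    violations = []
    dp = dew_point_c(temp_c, rh_pct)
    if temp_c < env.min_temp_c:
        violations.append("temperature low")
    if temp_c > env.max_temp_c:
        violations.append("temperature high")
    if dp < env.min_dew_point_c:
        violations.append("dew point low")
    if dp > env.max_dew_point_c:
        violations.append("dew point high")
    if rh_pct > env.max_rh_pct:
        violations.append("humidity high")
    return violations


if __name__ == "__main__":
    print(envelope_violations(22.0, 45.0))  # a typical cold-aisle reading
    print(envelope_violations(30.0, 45.0))  # a warm reading
```

A psychrometric chart plots the same relationship graphically: each reading becomes a point, and compliance is whether that point falls inside the envelope.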

Visibility of Your Power Quality and Impact

How do you know if you are being affected by these factors – is your critical infrastructure gear slowly degrading in reliability, or is an increase in downtime looming ahead?

A DCIM (Data Center Infrastructure Management) solution can help monitor power conditions on your gear and provide easier access to power events and waveforms logged by PQMs (Power Quality Meters). General monitoring of individual device report points (such as voltage and current) allows you to track and trend different aspects of your power quality and detect when thresholds have been exceeded:

  • Power Quality Meter monitoring – high resolution power monitoring with event capture
  • Event capture – recording very short-duration events that power gear such as UPSs, PDUs, and RPPs generally cannot detect or report. PQMs (Power Quality Meters) monitor power at high resolution and, when an event occurs, save a snapshot of the waveform around that moment so it can be reviewed afterward, often at sub-millisecond resolution and, for advanced PQMs, up to the 63rd harmonic.
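
The threshold tracking described above can be sketched in a few lines of Python. This example groups consecutive out-of-band voltage samples into excursions so they can be trended over time. The `Excursion` type, the function name, and the sample data are all hypothetical, and the ±10% band around a 208 V nominal is only an example threshold.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Excursion:
    start_s: float      # timestamp of the first out-of-band sample
    end_s: float        # timestamp of the last out-of-band sample
    worst_value: float  # sample farthest outside the band


def find_excursions(samples: Iterable[Tuple[float, float]],
                    low: float, high: float) -> List[Excursion]:
    """Group consecutive (timestamp_s, value) samples outside [low, high]."""

    def deviation(v: float) -> float:
        # Distance outside the band; zero or negative when inside it.
        return max(low - v, v - high)

    excursions: List[Excursion] = []
    current = None
    for t, v in samples:
        if v < low or v > high:
            if current is None:
                current = Excursion(t, t, v)
            else:
                current.end_s = t
                if deviation(v) > deviation(current.worst_value):
                    current.worst_value = v
        elif current is not None:
            # Back inside the band: close the open excursion.
            excursions.append(current)
            current = None
    if current is not None:
        excursions.append(current)
    return excursions


if __name__ == "__main__":
    nominal = 208.0
    readings = [(0, 208.0), (1, 186.0), (2, 180.0), (3, 207.0), (4, 230.0)]
    for e in find_excursions(readings, nominal * 0.9, nominal * 1.1):
        print(e)
```

Polled readings like these are taken seconds apart, so they only catch sustained deviations. Events shorter than the polling interval are exactly what the PQM event capture is for.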

Visibility of Your Environmental Conditions and Impact

Environmental monitoring in data centers is crucial to ensure the efficient and reliable operation of IT equipment while protecting against environmental factors that can lead to downtime or hardware failures. Usually this is accomplished with sensors and detectors:

  • Temperature/humidity sensors ensure your HVAC systems are allowing you to operate within the ASHRAE guidelines or within Service Level Agreement (SLA) guidelines imposed by your customers.
  • If you have sufficient sensor density, you can create heat maps of your data center to correct hot and cold spots and improve cooling efficiency. Monitoring the supply and return temperatures of your cooling equipment helps ensure you are not overcooling your data center and wasting energy.
  • Airflow sensors monitor the air movement in and around racks. They help verify that hot and cold aisles are well-structured.
  • Water leak detectors alert you to leaks or flooding, helping to prevent damage to hardware and electrical systems.
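
As a simple illustration of how sensor readings turn into action, the Python sketch below flags hot and cold spots from per-rack inlet temperatures and computes the temperature difference across a cooling unit. The rack names, function names, and default limits are assumptions chosen for the example; use your own ASHRAE or SLA limits.

```python
from typing import Dict


def classify_inlets(inlet_temps_c: Dict[str, float],
                    min_c: float = 18.0,
                    max_c: float = 27.0) -> Dict[str, str]:
    """Label each rack inlet as 'hot', 'cold', or 'ok' against the given limits."""
    labels = {}
    for rack, temp in inlet_temps_c.items():
        if temp > max_c:
            labels[rack] = "hot"
        elif temp < min_c:
            labels[rack] = "cold"   # colder than needed: possible overcooling
        else:
            labels[rack] = "ok"
    return labels


def cooling_delta_t(supply_c: float, return_c: float) -> float:
    """Temperature rise across the IT load as seen by one cooling unit."""
    return return_c - supply_c


if __name__ == "__main__":
    readings = {"A01": 24.5, "A02": 29.1, "B01": 16.0}  # hypothetical racks
    print(classify_inlets(readings))
    print(cooling_delta_t(supply_c=18.0, return_c=29.5))
```

With enough sensors, the same labels can be laid out by rack position to form a basic heat map. Trending the supply/return difference per cooling unit is one way to spot units that are cooling more than the load requires.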

How to Minimize the Risk

These factors all contribute to a risk of reduced reliability and downtime for your critical infrastructure gear. The risk and impact have been demonstrated through extensive research, with documented data provided by independent agencies. Tracking these risk factors should be part of your ongoing monitoring and data collection for your hardware.

A powerful tool for managing these risks is a full-featured DCIM solution capable of tracking these conditions, raising real-time alarms, and analyzing historical data. The visibility of these factors can be challenging, but a quality DCIM solution provides this visibility in addition to meeting your basic monitoring and alerting needs.

A DCIM solution that can track these aspects of your hardware also provides core data, and advanced views like thermal distribution maps and psychrometric charts, which will allow you to maximize the efficiency of your power distribution and cooling infrastructure.

How We Help

The Modius® OpenData® Remote Monitoring Module tracks and trends raw data values on power (load/voltage/current) and captures events from PQMs and other equipment.

The OpenData Environment Module provides visibility into cooling loads and their distribution, supports “what-if” scenarios, generates heat maps (thermal images), and uses psychrometric charts to measure and show ASHRAE compliance.

The OpenData Machine Learning Module detects anomalies in device operations, which can identify equipment that has degraded. This allows you to fix or replace the equipment before it fails and causes an outage.
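
To give a feel for what anomaly detection means in practice, here is a generic Python sketch that flags readings far from their recent history using a rolling z-score. This is a textbook technique shown for illustration only; it is not a description of how the OpenData Machine Learning Module works. The function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def rolling_zscore_anomalies(values: Sequence[float],
                             window: int = 5,
                             threshold: float = 3.0) -> List[int]:
    """Return indexes of readings far from the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # perfectly flat history: no scale to score against
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical fan current readings (amps) with one abnormal spike.
    current_a = [10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 14.0, 10.0]
    print(rolling_zscore_anomalies(current_a))
```

A reading that stands out this way is not proof of a fault, but it is a useful prompt to inspect the device before it fails.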

Take the Next Step

If you are looking for a next-generation DCIM solution that can help you better understand your data center’s status and opportunities for efficiency, consider Modius OpenData. OpenData provides integrated tools, including machine learning capability, to manage the assets and performance of colocation facilities, enterprise data centers, and critical infrastructure.

OpenData is a ready-to-deploy DCIM featuring an enterprise-class architecture that scales incredibly well. In addition, OpenData gives you real-time, normalized, actionable data accessible through a single sign-on and a single pane of glass.

We are passionate about helping clients run more profitable data centers and providing operators with the best possible view into a managed facility’s data. We have been delivering DCIM solutions since 2007. We are based in San Francisco and are proudly a Veteran Owned Small Business (VOSB Certified). You can reach us at sales@modius.com or 1-(888) 323.0066.
