Data Center Alerts: Effective Prioritization of DCIM Alarms Through Better Software Processes

Red Alert! Red Alert!

Breakers tripped. Overloaded backup gear. More breakers tripped. More failover states. Servers offline. Followed by meetings with “Cascade Event” as the phrase of the day.

Much of your equipment has alarm capabilities, which is handy sometimes and not so much at other times. A slew of devices vie for your attention, and each device’s hardware alarms are tied to the limits of that individual piece of gear. But your infrastructure is more than the sum of its parts: it is a complex system of hardware woven together to serve a larger goal.

One of DCIM’s core superpowers is software alarms. In a good Data Center Infrastructure Management solution, software alarms enable proactive aggregation across devices to deliver the big picture in near real time. To leverage this strength, you need a system of flexible software alarms. These alarms are not provided by a hardware vendor and are not limited by the capabilities of individual hardware; they reflect not just hardware limits but also your design and implementation choices. In addition, these alarms can factor in multiple values in a meaningful way.

Alarm severity helps you rapidly recognize the criticality and priority of ongoing conditions in the data center. Software alarms can significantly enhance this rapid recognition by aggregating cross-device status into a big-picture severity. For example, a UPS-on-battery alarm is undoubtedly a high priority, but a UPS on battery at the same time as a generator failing to start is significantly more critical. Likewise, a PDU exceeding its redundant capacity threshold is important, but it is even more important when that PDU’s partner is offline or is also exceeding its own redundant capacity threshold.
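To make the two examples above concrete, here is a minimal Python sketch of how a software alarm might fold cross-device status into a single severity. Everything in it is an assumption made for illustration: the three-level Severity scale, the Pdu fields, the function names, and the choice to rate a generator start failure on its own as high priority. It does not represent any particular DCIM product's API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Illustrative severity scale; a real deployment defines its own."""
    NORMAL = 0
    HIGH = 1
    CRITICAL = 2


def power_path_severity(ups_on_battery: bool, generator_failed_to_start: bool) -> Severity:
    """Combine UPS and generator status into one big-picture severity."""
    if ups_on_battery and generator_failed_to_start:
        return Severity.CRITICAL  # battery is draining and no backup is coming
    if ups_on_battery or generator_failed_to_start:
        return Severity.HIGH  # assumption: either condition alone is high priority
    return Severity.NORMAL


@dataclass
class Pdu:
    name: str
    load_kw: float
    redundant_capacity_kw: float  # hypothetical threshold field, in kW
    online: bool = True

    def over_redundant_capacity(self) -> bool:
        return self.online and self.load_kw > self.redundant_capacity_kw


def pdu_pair_severity(pdu: Pdu, partner: Pdu) -> Severity:
    """Rate one PDU's alarm in the context of its redundant partner."""
    if not pdu.over_redundant_capacity():
        return Severity.NORMAL
    if not partner.online or partner.over_redundant_capacity():
        return Severity.CRITICAL  # no headroom left anywhere in the pair
    return Severity.HIGH


# Made-up readings: PDU-A is over its threshold and its partner is offline.
pdu_a = Pdu("PDU-A", load_kw=6.5, redundant_capacity_kw=5.0)
pdu_b = Pdu("PDU-B", load_kw=0.0, redundant_capacity_kw=5.0, online=False)
print(pdu_pair_severity(pdu_a, pdu_b).name)  # prints CRITICAL
```

The point of the sketch is that neither rule could live inside a single device: each one needs the state of at least two pieces of gear before it can decide how loud to be.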

The bottom line: software alarms, combined with a system that supports a flexible number of severities and a flexible notification system, up your game when responding to alarms. With such a system, you can add an ultra-critical alarm severity tied to well-defined software alarms, and attach notification rules that make all hands aware immediately, to ensure a rapid response in worst-case scenarios. And it may save you from that meeting about a cascade event that took down your customer!
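As a rough illustration of what "a flexible number of severities" plus flexible notification rules could look like, consider the Python sketch below. The NotificationPolicy class, the severity names, and the notification targets are all hypothetical, and the rule that a target hears about its minimum severity and everything above it is one possible design choice, not a description of how any specific product works.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationPolicy:
    """Hypothetical policy: named severities plus 'notify at or above' rules."""
    ranks: dict[str, int] = field(default_factory=dict)
    rules: list[tuple[int, str]] = field(default_factory=list)

    def add_severity(self, name: str, rank: int) -> None:
        self.ranks[name] = rank

    def add_rule(self, min_severity: str, target: str) -> None:
        # The target is notified for min_severity and anything ranked higher.
        self.rules.append((self.ranks[min_severity], target))

    def targets_for(self, severity: str) -> list[str]:
        rank = self.ranks[severity]
        return sorted({target for min_rank, target in self.rules if min_rank <= rank})


policy = NotificationPolicy()
for name, rank in [("warning", 1), ("high", 2), ("critical", 3)]:
    policy.add_severity(name, rank)
policy.add_rule("warning", "noc-dashboard")
policy.add_rule("critical", "on-call-pager")

# Later, add a new top tier without touching the existing levels or rules.
policy.add_severity("ultra-critical", 4)
policy.add_rule("ultra-critical", "all-hands-sms")

print(policy.targets_for("high"))            # ['noc-dashboard']
print(policy.targets_for("ultra-critical"))  # ['all-hands-sms', 'noc-dashboard', 'on-call-pager']
```

Because severities are data rather than a fixed enum, the ultra-critical tier is added after the fact, and only alarms at that tier reach the all-hands target.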

And if the worst case happens and you find yourself chasing a critical cascade failure, your most potent forensics tool is a detailed history of hardware and software alarms. This data lets you walk through the event in detail, and software alarms combined with hardware alarms let you see the big picture within the event timeline.
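One simple way to picture that combined history is a single timeline that interleaves both kinds of alarm by timestamp. The Python sketch below assumes a hypothetical AlarmEvent record and uses made-up sample events; real alarm histories would come from your DCIM system's own storage and export format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AlarmEvent:
    timestamp: datetime
    source: str   # "hardware" or "software"
    origin: str   # device name or software alarm name
    message: str


def build_timeline(hardware, software):
    """Merge both alarm histories into one chronologically ordered list."""
    return sorted([*hardware, *software], key=lambda event: event.timestamp)


# Made-up sample events for illustration only.
hardware_log = [
    AlarmEvent(datetime(2024, 1, 1, 2, 14, 5), "hardware", "UPS-1", "on battery"),
    AlarmEvent(datetime(2024, 1, 1, 2, 14, 40), "hardware", "GEN-1", "failed to start"),
]
software_log = [
    AlarmEvent(datetime(2024, 1, 1, 2, 14, 41), "software", "power-path",
               "UPS on battery with no generator"),
]

for event in build_timeline(hardware_log, software_log):
    print(f"{event.timestamp:%H:%M:%S} [{event.source}] {event.origin}: {event.message}")
```

Reading the merged list top to bottom shows the individual device alarms first and then the software alarm that recognized what they meant together.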

Are you leveraging the power of software alarms to their maximum effect? Does your DCIM solution empower you to do this?

The Open Data DCIM solution can provide a superior monitoring experience by surfacing the alarms that matter. You can reach us at sales@modius.com to see how we can help you make this work in your critical infrastructure. Whether that is a captive data center, a colo, telecom networks, or distributed assets located in colos or edge data centers, we can help.