IT Infrastructure Insights

Do You Need Hot Spares in Your Array?

Jan 31, 2020

Posted in: On-Prem Storage

Recently, a customer of our Visual One Intelligence (Visual One) service began reporting that two of its arrays no longer had hot spare coverage for a specific drive technology in its environment. This customer's arrays have high capacity requirements, 500 TB to 1 PB each. The customer creates only a small number of pools per array, making each pool as large as possible.

In this situation, the array was an older NetApp array with older drives. One of the drives failed, and the hot spare automatically took over for it. As a result, the customer was unaware that the spare was in use until they reviewed all the alerts on the array. Only then could the customer see that a drive had failed and that the hot spare had taken over.

The hot spare stood in for the failed drive exactly as designed. The customer, as most do, uses a variety of drive sizes in the storage system. The problem was that this system had only one hot spare matching its largest drive size, and that spare was now in use. If another drive of this size failed, the customer could face a catastrophic data loss event.
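
To make the coverage gap concrete, here is a minimal Python sketch of the kind of check involved. It is not Visual One's implementation. The `Drive` record, its fields, and the rule that a spare can stand in for any data drive of equal or smaller size are simplifying assumptions for illustration; a real array also matches spares on drive type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    # Hypothetical record; field names are illustrative, not a vendor schema.
    name: str
    size_tb: float
    role: str  # "data" or "spare"


def uncovered_sizes(drives):
    """Return the data-drive sizes that no remaining spare is large enough to replace.

    Assumption: a spare can cover any data drive of equal or smaller size.
    """
    spare_sizes = [d.size_tb for d in drives if d.role == "spare"]
    largest_spare = max(spare_sizes, default=0.0)
    data_sizes = {d.size_tb for d in drives if d.role == "data"}
    return sorted(size for size in data_sizes if size > largest_spare)


# The single 8 TB spare has already replaced a failed drive, so it now counts as data.
drives = [
    Drive("d1", 4.0, "data"),
    Drive("d2", 8.0, "data"),
    Drive("former-spare", 8.0, "data"),
    Drive("s1", 4.0, "spare"),
]
print(uncovered_sizes(drives))  # -> [8.0]
```

An empty result would mean every drive size still has a spare that can replace it.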

The customer's reasoning was that lacking the right hot spares was not a big deal because NetApp provides RAID and RAID-DP (double parity). The customer felt that replacing the hot spare was not a high priority until the Visual One service presented the failure rates of drives based on their age. Visual One predicted that, based on the customer's drive population and the age of the drives, two more failures could occur before the end of the month. While the Visual One analysis raised the customer's awareness, they still did not feel they needed to take immediate action.
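
The article does not describe how Visual One arrives at its prediction, so the following is only a sketch of one simple way to estimate failures from drive age. It assumes each age group has a known annualized failure rate (AFR) and that failures accrue linearly over time. The function name, the AFR values, and the drive counts are all hypothetical.

```python
def expected_failures(population, afr_by_age, days):
    """Estimate the expected number of drive failures over a window of `days`.

    population: {age_in_years: drive_count}
    afr_by_age: {age_in_years: annualized failure rate, as a fraction}
    Assumption: failures are spread evenly through the year.
    """
    return sum(
        count * afr_by_age[age] * days / 365
        for age, count in population.items()
    )


# Hypothetical fleet: 400 five-year-old drives and 300 six-year-old drives.
population = {5: 400, 6: 300}
afr_by_age = {5: 0.03, 6: 0.05}  # illustrative rates, not measured data

print(round(expected_failures(population, afr_by_age, days=30), 1))  # -> 2.2
```

With these made-up inputs the estimate is roughly two failures in a month, which shows how an aging population can make a missing hot spare urgent even when each individual drive is unlikely to fail.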

The Visual One service performed one more analysis: how an unplanned outage might impact hosts and applications if more drives failed. Because Visual One discovers which pools a storage system associates with individual drives, it knew which pools the system had assigned to drives without hot spare protection. Visual One also knows which applications are associated with those pools, so the service was able to identify and report on the applications at risk.
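
The correlation described above is essentially a two-step lookup, from drives to pools and from pools to applications. The sketch below illustrates the idea with plain dictionaries. The mapping names and the sample data are hypothetical and do not reflect Visual One's actual data model.

```python
def applications_at_risk(unprotected_drives, pool_of_drive, apps_of_pool):
    """Return the applications that depend on pools containing unprotected drives.

    unprotected_drives: iterable of drive names without hot spare coverage
    pool_of_drive: {drive_name: pool_name}
    apps_of_pool: {pool_name: list of application names}
    """
    # Step 1: find every pool that contains at least one unprotected drive.
    pools = {pool_of_drive[d] for d in unprotected_drives if d in pool_of_drive}

    # Step 2: collect the applications served by those pools.
    apps = set()
    for pool in pools:
        apps.update(apps_of_pool.get(pool, ()))
    return sorted(apps)


# Hypothetical inventory for illustration only.
pool_of_drive = {"d1": "pool_a", "d2": "pool_a", "d3": "pool_b"}
apps_of_pool = {"pool_a": ["crew-scheduling", "dev-sandbox"], "pool_b": ["reporting"]}

print(applications_at_risk(["d2"], pool_of_drive, apps_of_pool))
# -> ['crew-scheduling', 'dev-sandbox']
```

Because this customer builds a few very large pools per array, a single unprotected drive can put a long list of applications on the at-risk report.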

This particular customer is a large domestic airline that had its business units mapped to Visual One, so our service could quickly build a list of the hosts and applications that an outage might affect. The customer had also classified all of its applications into separate categories:

–    Production

–    Development

–    Staging

–    REQUIRED TO FLY (A.K.A. Mission Critical)

The fact that there were servers at risk in the “required to fly” category caught the customer's attention. Several of the servers that an outage could affect supported applications the airline considered “required to fly,” which triggered the appropriate escalation of priorities. The response went from “fix it when we can” to “fix it now.”

Our Visual One service presented the airline with all of the information in a visual format, so the team could not only “see” the issue but also understand the potential impact and identify which servers and applications were at risk. By using Visual One, they were able to take appropriate action in advance and avoid an unplanned outage.

The value of Visual One is not just presenting information visually but pinpointing what the data is trying to tell you, so that IT can take corrective action proactively, before a problem causes data loss or an application outage.

After addressing the problem, we sat down with the customer for a debrief. During that meeting, the customer mentioned that before Visual One, they lacked a tool to rapidly gather telemetry data from all of their diverse storage systems. They had to collect data manually, which they never really had time to do, let alone correlate that data across storage systems. Without a consolidated view of storage system health, they were either continually dealing with emergencies or only addressing problems when they had time, which they never seemed to have.

Being in a continual state of emergency leads to staff and equipment cost overruns. It also leads to a high level of staff frustration. As a result, the organization may lose valuable and hard-to-replace IT staff.

Waiting to address problems when they had time was an even more significant challenge.

Before using the Visual One service, the airline had two outage incidents in the previous year. Both resulted in unplanned outages lasting over 24 hours, and the estimated cost of the downtime in each case was $750,000. The IT team could have prevented both incidents had they been armed with the right data. The hot spare situation is an example of Visual One providing that information and empowering IT to fix the problem before it caused an outage. As a result, the customer estimated that using Visual One will help them save $1.5M annually by reducing unplanned downtime, not to mention all of its other benefits.