IT Infrastructure Insights

Do You Need Hot Spares in Your Array?

Jan 31, 2020

Posted in: Capacity Planning

Recently, a customer of our Visual Storage Intelligence (VSI) service began reporting that two of its arrays no longer had hot spare coverage for a specific drive technology in its environment. This customer’s arrays had high capacity requirements, 500 TB to 1 PB each. The customer creates only a small number of pools per array, making each pool as large as possible.

In this situation, the array was an older NetApp array with older drives. One of the drives failed, and the hot spare took over for it automatically. As a result, the customer was unaware that the hot spare was in use until they reviewed all the alerts on the array. Once they reviewed the array data, the customer could see that a drive failure had occurred and that the hot spare had taken over.

The hot spare stood in for the failed drive, and everything worked as designed. The customer, as most do, uses a variety of drive sizes in the storage system. The problem was that this customer’s system had only one hot spare matching the largest drive size in it, and that spare was now in use. If another drive of this size failed, the customer could face a catastrophic data loss event.

The customer’s thinking was that lacking the right hot spares was not a big deal because NetApp provides RAID and RAID-DP (double parity). The customer felt that replacing the hot spare was not a high priority until the VSI service presented the failure rates of drives based on drive age. Based on the customer’s drive population and the age of the drives, VSI predicted that two more failures could occur before the end of the month. While the VSI analysis raised the customer’s awareness, they did not yet feel they needed to take immediate action.
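
To illustrate how a failure forecast of this kind can be derived, here is a minimal Python sketch. It is not VSI’s actual model: the `DriveGroup` type, the `assumed_afr` function, and the failure rates per age bucket are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveGroup:
    """A set of drives of similar age (hypothetical structure)."""
    count: int
    age_years: float


def assumed_afr(age_years: float) -> float:
    """Assumed annualized failure rate by drive age (illustrative numbers only)."""
    if age_years < 3:
        return 0.01
    if age_years < 5:
        return 0.03
    return 0.08


def expected_failures(groups, days: int) -> float:
    """Expected number of drive failures across all groups over `days` days."""
    return sum(g.count * assumed_afr(g.age_years) * days / 365 for g in groups)


if __name__ == "__main__":
    population = [DriveGroup(count=365, age_years=6.0)]
    print(f"Expected failures in 30 days: {expected_failures(population, 30):.1f}")
```

With real failure statistics in place of the assumed rates, an expected value at or above one for the coming month is a strong signal that running without a matching hot spare is a real exposure.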

The VSI service did one more analysis: how an unplanned outage might impact hosts and applications if the drives failed. Since VSI discovers the pools a storage system associates with individual drives, it knew which pools the system had assigned to drives without hot spare protection. VSI also knows which applications are associated with those pools. The service was able to identify and report on the applications at risk.
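
The chain from drives to pools to applications can be sketched as follows. This is an illustration under assumed data structures, not VSI’s implementation: the `Drive` type, the `spare_counts` mapping keyed by drive type and size, and the `pool_to_apps` mapping are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    """One data drive in an array (hypothetical structure)."""
    drive_id: str
    drive_type: str  # e.g. "SATA" or "SAS"
    size_tb: int
    pool: str


def unprotected_pools(drives, spare_counts):
    """Pools containing at least one drive whose (type, size) has no available spare."""
    return {
        d.pool
        for d in drives
        if spare_counts.get((d.drive_type, d.size_tb), 0) == 0
    }


def applications_at_risk(pools, pool_to_apps):
    """Sorted, de-duplicated applications that live on any of the given pools."""
    return sorted({app for pool in pools for app in pool_to_apps.get(pool, ())})


if __name__ == "__main__":
    drives = [
        Drive("d1", "SATA", 8, "pool_a"),
        Drive("d2", "SAS", 2, "pool_b"),
    ]
    spare_counts = {("SATA", 8): 0, ("SAS", 2): 1}  # the 8 TB spare is in use
    pool_to_apps = {"pool_a": ["crew-scheduling", "booking"], "pool_b": ["dev-wiki"]}
    print(applications_at_risk(unprotected_pools(drives, spare_counts), pool_to_apps))
```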

This particular customer is a large domestic airline that had its business units mapped to VSI. Our service can quickly build a list of potential hosts and applications that an outage might affect. The customer also had all their applications classified into separate categories:

–    Production

–    Development

–    Staging

–    REQUIRED TO FLY (A.K.A. Mission Critical)

The fact that there were servers at risk in the “required to fly” category caught their attention. Several of the servers that an outage could affect were supporting applications the airline considered “required to fly,” which caused the appropriate escalation of priorities. The response went from “fix it when we can” to “fix it now.”
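
The escalation rule described above can be expressed in a few lines. This is a hedged sketch: the function names, the `server_tiers` mapping, and the server names are hypothetical, and the rule simply mirrors the narrative (any at-risk “required to fly” server triggers the urgent response).

```python
REQUIRED_TO_FLY = "Required to Fly"


def critical_servers(server_tiers: dict[str, str]) -> list[str]:
    """At-risk servers classified as required to fly, sorted by name."""
    return sorted(s for s, tier in server_tiers.items() if tier == REQUIRED_TO_FLY)


def response_priority(server_tiers: dict[str, str]) -> str:
    """Escalate as soon as any at-risk server is required to fly."""
    return "fix it now" if critical_servers(server_tiers) else "fix it when we can"


if __name__ == "__main__":
    at_risk = {"srv-dev-01": "Development", "srv-ops-07": REQUIRED_TO_FLY}
    print(response_priority(at_risk), critical_servers(at_risk))
```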

Our VSI service presented the airline with all of this information in a visual format, so they could not only “see” the issue but also understand the potential impact and identify which servers and applications were at risk. By using VSI, they were able to take appropriate action in advance and avoid an unplanned outage.

The value of VSI is not just presenting information visually but pinpointing what the data is trying to tell you, so that IT can take corrective action proactively, before a problem causes data loss or an application outage.

After addressing the problem, we sat down with the customer for a debrief. During that meeting, the customer mentioned that before VSI, they lacked a tool to rapidly gather telemetry data from all their diverse storage systems. They had to collect data manually, which they never really had time to do, let alone correlate that data across storage systems. Without a consolidated view of storage system health, they were continually dealing with emergencies or only addressing problems when they had time, which they never seemed to have.

Being in a continual state of emergency leads to staff and equipment cost overruns. It also leads to a high level of staff frustration. As a result, the organization may lose valuable and hard-to-replace IT staff.

Waiting to address problems when they had time was an even more significant challenge.

Before using the VSI service, the airline had two outage incidents in the previous year. Both resulted in unplanned outages lasting over 24 hours, and the estimated cost of the downtime in each case was $750,000. The IT team could have prevented both incidents had they been armed with the right data. The hot spare situation is an example of VSI providing that information and empowering IT to fix a problem before it causes an outage. As a result, the customer estimated that using VSI will help them save $1.5M annually by reducing unplanned downtime, not to mention its other benefits.