Wednesday, July 27, 2016

Be smart, back up!

Smart, mtbf, mttf, afr, argh*. These are just a few of the acronyms that pertain to hard disk failure. A systems manager might be forgiven for not knowing what they really mean, but ignoring the tell-tale signs is not smart and will lead to a shorter mean time between failures (mtbf) and mean time to failure (mttf).

Manufacturers made a smart acronym for self-monitoring, analysis and reporting technology. Smart (the technology) enables the monitoring of important parameters, and, thus, early detection of a coming hard disk failure. The smart systems manager heeds the warnings.

The mtbf indicates the expected operating time, in hours, between two consecutive failures. It considers the life cycle of a device that fails repeatedly and is repaired and returned to service each time. The mttf indicates the expected operating time, in hours, until a device fails. Mtbf and mttf are almost synonymous in usage, and mtbf is sometimes read as mean time before failure. Afr (annualized failure rate), on the other hand, is the percentage of a given population of hard disks that is expected to fail, extrapolated to one year.

Manufacturers advertise server-type hard disks as having an mtbf of more than 1.2 million hours. One year is 8,760 hours, so are server-type hard disks expected to last 138 years (1.2 million hours works out to just under 137 actually, but I love the number 138)? Not really; statistics and experience do not support that. Especially not in servers that are not housed in climate-controlled environments, much less in servers that have warned you via smart, and much, much less if there was a recent crash in the same array.

So why do manufacturers claim mtbfs that big? Do they claim that their hard disks can operate for 138 years? No. Rather, they are claiming that if 138 hard drives were operated for a year, only about one failure could be expected. Much more realistic, isn't it?
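
For the curious, here is a minimal sketch of that arithmetic in Python. The function names are made up for illustration, and the whole thing assumes a constant failure rate over the drive's service life, which is the simplification behind an mtbf figure and not something real drives promise to follow.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def naive_lifetime_years(mtbf_hours):
    """The tempting (and wrong) reading: mtbf as the lifetime of one drive."""
    return mtbf_hours / HOURS_PER_YEAR


def afr_simple(mtbf_hours):
    """Approximate annualized failure rate as a fraction (hours per year / mtbf)."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours):
    """Afr assuming exponentially distributed failures (constant failure rate)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(drive_count, mtbf_hours):
    """Expected number of failures in one year across a fleet of drives."""
    return drive_count * afr_simple(mtbf_hours)


mtbf = 1_200_000  # the advertised figure, in hours

print(f"Naive lifetime: {naive_lifetime_years(mtbf):.1f} years")                 # 137.0 years
print(f"Afr (simple): {afr_simple(mtbf):.2%}")                                   # 0.73%
print(f"Afr (exponential): {afr_exponential(mtbf):.2%}")                         # 0.73%
print(f"Failures per year, 138 drives: {expected_failures_per_year(138, mtbf):.2f}")  # 1.01
```

At failure rates this small the simple and exponential estimates agree to two decimal places, so the back-of-the-envelope division is good enough for a sanity check.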

The most expensive hardware and the most redundant raid configuration won't mean a thing if you lose precious data. So be smart, back up, heed the signs, and own up if you lose data. The hard disks won't say “I told you so!” even though they actually did.

*argh - (not an acronym) expression of frustration, annoyance, dismay, embarrassment and anger