Ron Reaugh, July 15th 04, 04:31 AM

"ohaya" wrote in message ...
> Hi,
>
> I was wondering if anyone could tell me how to calculate/estimate the
> overall MTBF of a RAID 1 (mirrored) configuration? I'm just looking for
> a simple, "rule-of-thumb" type of calculation, assuming ideal
> conditions.
>
> I've been looking around for this, and I've seen a number of different
> "takes" on this, and some of them seem to be quite at odds with each
> other (and sometimes with themselves), so I thought that I'd post here
> in the hopes that someone might be able to help.


The basis will of course start with the MTBF of the HD over its rated
service life. That figure is not published by HD mfgs. The published MTBF
is a projection: an educated guess plus empirical data from drives in early
life. The problem after that is that any assumption about whether failures
are purely random, or clustered over time/usage, destroys any feasible
attempt at precise math.
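
For reference, the textbook estimate that does assume purely random
(independent, exponentially distributed) failures is, for a two-drive
mirror, MTTDL = MTBF^2 / (2 * MTTR). A minimal sketch of that calculation
in Python, with the MTBF and repair-time figures chosen purely for
illustration:

# Textbook mean-time-to-data-loss (MTTDL) estimate for a 2-drive RAID 1 set.
# Assumes independent, exponentially distributed failures -- exactly the
# idealization questioned above.  All figures below are illustrative only.

HOURS_PER_YEAR = 24 * 365

mtbf_hours = 500_000   # manufacturer-style projected MTBF (assumed figure)
mttr_hours = 8         # time to replace and rebuild a failed member (assumed)

# MTTDL for a two-way mirror: MTBF^2 / (2 * MTTR)
mttdl_hours = mtbf_hours ** 2 / (2 * mttr_hours)

print(f"single-drive MTBF : {mtbf_hours / HOURS_PER_YEAR:,.0f} years")
print(f"mirror MTTDL      : {mttdl_hours / HOURS_PER_YEAR:,.0f} years")
print(f"improvement factor: {mttdl_hours / mtbf_hours:,.0f}x")

The huge MTTDL this spits out is only as trustworthy as the independence
assumption feeding it.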

The next issue is what kinds of failures to take into consideration. Are SW,
OS, malice and external physical events like lightning, earthquakes, EMP,
power-supply failure, other HW failure or overheating to be excluded? With
such exclusions, my take is that if you replace a failing or potentially
failing (SMART warning) member of a RAID 1 set within 8 hours of the
failure/warning, during the drive's rated service life, then it'll be a VERY
cold day in hell before you lose the RAID 1 set, provided the HD model/batch
does not have a pathological failure mode that is intensely clustered.

An actual calculation would require information that is not available, and
even the mfgs may not know that information precisely until towards the end
of a model's service life, if even then.

What takes on this have you found? I'd like to see how anyone would take a
shot at this issue. The point is that, with the exclusions noted, a RAID 1
set is VASTLY more reliable than a single HD. A rough shot would be at least
5000 times more reliable, 5000 being roughly the number of 8-hour periods in
a five-year service life.
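
To make that shot concrete, a quick sanity check of the 5000 figure,
assuming the five-year service life and 8-hour replacement window used
above:

# Rough count of 8-hour replacement windows in a five-year service life,
# which is where the "at least 5000 times more reliable" shot comes from.

service_life_hours = 5 * 365 * 24   # five-year rated service life (assumed)
window_hours = 8                    # replace a flagged drive within 8 hours

windows = service_life_hours / window_hours
print(f"8-hour windows in 5 years: {windows:,.0f}")  # ~5475, rounded to 5000

# The surviving mirror only has to outlast that one window, so the chance of
# losing the set is roughly the single-drive failure rate divided by ~5000.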

So suppose you took 10K users of a given model HD, ran them all for the
rated service life, and got 500 failures. Take that same 10K group but with
everyone using a 2-drive RAID 1 set: there would be only about 1 chance in
ten of a single lost set in the whole group, or about 1 lost set if the
group were 100K.
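
The same back-of-envelope, spelled out (the 10K population, 500 failures and
5000 factor are the figures assumed above, not measured data):

# Back-of-envelope for the 10K-user comparison above.

users = 10_000
single_drive_failures = 500   # failures over the rated service life (assumed)
improvement = 5_000           # ~number of 8-hour windows in five years

p_drive = single_drive_failures / users          # ~5% per drive over its life
expected_set_losses = single_drive_failures / improvement

print(f"single-drive failure rate            : {p_drive:.1%}")
print(f"expected lost RAID 1 sets, 10K group : {expected_set_losses:.1f}")      # ~0.1
print(f"expected lost RAID 1 sets, 100K group: {expected_set_losses * 10:.1f}")  # ~1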

The accumulated threat of all the noted exclusions is VASTLY greater than
this, so this issue is really a non-issue: RAID 1 is as good as it needs to
be. Keep a good backup.