July 15th 04, 05:57 AM
Ron Reaugh

"ohaya" wrote in message ...

The basis will of course start with the MTBF of the HD over its rated
service life. That figure is not published by HD mfgs. The MTBF that is
published is a projection: an educated guess plus empirical data from
drives in early life. The problem after that is that any assumption about
the pure randomness of failures, or about whether failures might be
clustered over time/usage, destroys any feasible attempt at precise math.

The next issue is what kinds of failures to take into consideration. Are
SW, OS, malice, and external physical events like lightning, earthquakes,
EMP, PWS failure, other HW failure, or overheating to be excluded?
Excluding those, my take is that if you replace a failing or potentially
failing (SMART warning) member of a RAID 1 set within 8 hours of the
failure/warning, during the drive's rated service life, then it'll be a
VERY cold day in hell before you lose the RAID 1 set, provided the HD
model/batch does not have a pathological failure mode that is intensely
clustered.

An actual calculation would require information that is not available,
and even the mfgs may not know that information precisely until towards
the end of a model's service life, if then.

What takes on this have you found? I'd like to see how anyone would take
a shot at this issue. The point is that, with the exclusions noted, a
RAID 1 set is VASTLY more reliable than a single HD. A shot would be: at
least 5000 times more reliable. 5000 is a rough shot at the number of
8-hour periods in five years.
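
As a quick sanity check on that 5000 figure (assuming, as stated, a
5-year rated service life and an 8-hour replacement window), in Python:

# Rough check of the "5000" figure: the number of 8-hour
# replacement windows in a drive's 5-year rated service life.
HOURS_PER_YEAR = 24 * 365                  # ignoring leap days
SERVICE_LIFE_HOURS = 5 * HOURS_PER_YEAR    # 43,800 hours
REPAIR_WINDOW_HOURS = 8

windows = SERVICE_LIFE_HOURS / REPAIR_WINDOW_HOURS
print(windows)   # 5475.0, so "at least 5000" is a fair round-down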

So suppose you took 10K users of a given model HD, ran them all for the
rated service life, and got 500 failures. Then take that same 10K group,
but with everyone using 2-drive RAID 1: there would be only about 1
chance in ten of a single set failure in the whole group, or about 1
failure if the group were 100K.
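
A minimal sketch of that rough model (an assumption-laden estimate, not a
derivation: a set is lost only if the second drive fails inside the same
8-hour window as the first, making the set ~5000x more reliable):

# Ron's rough model: divide observed single-drive failures by the
# ~5000x reliability factor estimated above.
single_drive_failures = 500        # observed over the rated service life
reliability_factor = 5_000         # ~ number of 8-hour windows in 5 years

expected_mirror_losses = single_drive_failures / reliability_factor
print(expected_mirror_losses)      # 0.1 -> "1 chance in ten" for the 10K group
print(expected_mirror_losses * 10) # ~1.0 -> about 1 failure in a 100K group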

The accumulated threat of all the noted exclusions is VASTLY greater than
this, so this issue is really a non-issue: RAID 1 is as good as it needs
to be. Keep a good backup.



Ron,

Thanks for your response. I've looked at so many different sources over
the last couple of days that my eyes are blurring and my head is aching.

Before I begin: I was really looking for just a ballpark rule of thumb
for now, with as many assumptions/caveats as needed to make it simple,
e.g., assume the drives are in their useful life (the flat part of the
Weibull/bathtub curve), ignore software, etc.

Think of it like this: I just gave you two SCSI drives, and I guarantee
you their MTBF is 1.2 Mhours,


1,200,000 hours ~= 137 years.
Now, do you think that means that half of them fail in 137 years?
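
For reference, under the constant-failure-rate (exponential) model that a
quoted MTBF usually implies, the fraction failed by time t is
1 - exp(-t/MTBF), so roughly 63% of drives fail by the MTBF, not half.
A quick sketch:

import math

# Exponential model: quoted MTBF is 1/lambda; fraction failed by time t
# is 1 - exp(-t/MTBF).
MTBF_HOURS = 1_200_000
HOURS_PER_YEAR = 24 * 365

print(MTBF_HOURS / HOURS_PER_YEAR)    # ~137 years, as above
print(1 - math.exp(-1))               # ~0.63: ~63% fail BY the MTBF, not half

# What the figure actually implies over a 5-year rated service life:
t = 5 * HOURS_PER_YEAR
print(1 - math.exp(-t / MTBF_HOURS))  # ~0.036: roughly 3.6% fail in 5 years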

which won't vary over the time period that they'll be in service,


That's known to be false.

no other hardware will ever fail (i.e., don't worry about the processor
board or RAID controller), and it takes ~0 time to repair a failure.

Given something like that, and assuming I put these two drives in a
RAID 1 set, what kind of MTBF would you expect over time?


Zero repair time?

- Is it the square of the individual drive MTBF?
See: http://www.phptr.com/articles/article.asp?p=28689


All obvious, and based on assumptions known to be inaccurate.

Or: http://tech-report.com/reviews/2001q...d/index.x?pg=2 (this
one doesn't make sense: with MTTR=0 it would give MTBF=infinity?)
Or: http://www.teradataforum.com/teradat...107_214543.htm (again, I
don't know how MTTR=0 would work)

- Is it 150% of the individual drive MTBF?
See: http://www.zzyzx.com/products/whitep...ity_primer.pdf

"Industry standards have determined that redundant components increase the
MTBF by 50%." No citation supplied.

"It should be noted that in the example above, if the downtime is reduced to
zero, availability changes to 1 or 100% regardless of the MTBF."

- Is it double the individual drive MTBF? (I don't remember where I saw
this one.)
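
For what it's worth, the standard repairable-mirror approximation behind
the first ("square") answer is MTBF_pair ~= MTBF^2 / (2 * MTTR), with
steady-state availability MTBF / (MTBF + MTTR), which is why zero
downtime gives availability of exactly 1. A sketch, assuming independent,
exponentially distributed failures and repairs:

# Standard repairable-mirror approximation (the model behind the
# "square" answer in the first link above):
def mirror_mtbf(mtbf: float, mttr: float) -> float:
    """Approximate MTBF of a 2-drive RAID 1 set, in the same units."""
    return mtbf ** 2 / (2 * mttr)

def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; exactly 1 when mttr is 0."""
    return mtbf / (mtbf + mttr)

mtbf, mttr = 1_200_000, 8          # the drives above, 8-hour replacement
print(mirror_mtbf(mtbf, mttr))     # 9e10 hours: "square-ish", not 150% or 2x
print(availability(mtbf, mttr))    # ~0.9999933 for a single drive
# As MTTR -> 0 the pair MTBF diverges, which is the "MTTR=0 == MTBF=infinity"
# behavior noted above; the model simply stops being meaningful there.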


It's kind of funny, but when I first started looking, I thought that I'd
find something simple. That was this weekend ...


As I said in my prior post: failure of a maintained RAID 1 set (for the
cases included) can be ignored, as it's swamped by other failures in the
real world. It's a great academic exercise with little practical
application here.