#1
Estimating RAID 1 MTBF?
Hi,
I was wondering if anyone could tell me how to calculate or estimate the overall MTBF of a RAID 1 (mirrored) configuration? I'm just looking for a simple, "rule-of-thumb" type of calculation, assuming ideal conditions. I've been looking around, and I've seen a number of different takes on this, some of them quite at odds with each other (and sometimes with themselves), so I thought I'd post here in the hope that someone might be able to help.

Thanks,
Jim
#2
"ohaya" wrote in message ...
> I was wondering if anyone could tell me how to calculate/estimate the
> overall MTBF of a RAID 1 (mirrored) configuration? I'm just looking for a
> simple, "rule-of-thumb" type of calculation, assuming ideal conditions.

The basis will of course start with the MTBF of the HD over its rated service life. That figure is not published by HD manufacturers. The published MTBF is a projection: an educated guess plus empirical data from drives in early life.

The problem after that is that any assumption about the pure randomness of a failure, or about whether failures might be clustered over time/usage, destroys any feasible precise math attempt.

The next issue is what kinds of failure to take into consideration. Are SW, OS, malice, and external physical events like lightning, earthquakes, EMP, power-supply failure, other HW failure, or overheating to be excluded?

Excluding those, my take is that if you replace a failing or potentially failing (SMART) member of a RAID 1 set within 8 hours of failure/warning during the drive's rated service life, then it'll be a VERY cold day in hell before you lose the RAID 1 set, IF the HD model/batch does not have a pathological failure mode that is intensely clustered. An actual calculation would require information that is not available, and even the manufacturers may not know that information precisely until towards the end of a model's service life, if then.

What takes on this have you found? I'd like to see how anyone would take a shot at this issue. The point is that, with the exclusions noted, a RAID 1 set is VASTLY more reliable than a single HD. A shot would be at least 5000 times more reliable; 5000 is a rough count of the number of 8-hour periods in five years.

So if you took 10K users of a given model HD, ran them all for the rated service life, and got 500 failures, then took that same 10K group but all using 2-drive RAID 1, there would be only 1 chance in ten of a single failure in the whole group (or 1 failure if the group were 100K). The accumulated threat of all the noted exclusions is VASTLY greater than this, so this issue is really a non-issue: RAID 1 is as good as it needs to be. Keep a good backup.
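Ron's back-of-envelope argument can be sketched numerically. The figures here (500 failures per 10K drives over the service life, ~5000 eight-hour replacement windows in five years) are his illustrative assumptions, not measured data:

```python
# Sketch of Ron's back-of-envelope RAID 1 estimate (his illustrative figures).
drives = 10_000
single_failures = 500      # failures among 10K single drives over service life
windows = 5000             # Ron's rough count of 8-hour periods in 5 years

# Ron's claim: a maintained RAID 1 set is roughly 'windows' times more
# reliable than a single drive, so expected whole-group losses shrink by
# that factor.
expected_pair_losses = single_failures / windows
print(expected_pair_losses)    # 0.1 -> "1 chance in ten" across the group
```

The same arithmetic gives 1 expected loss for a 100K-user group, matching the post.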
#3
Ron,

Thanks for your response.

I've looked at so many different sources over the last couple of days that my eyes are blurring and my head is aching. Before I begin: I was really just looking for a "ballpark" rule of thumb for now, with as many assumptions/caveats as needed to make it simple, e.g. assume the drives are in their useful life (the flat part of the Weibull/bathtub curve), ignore software, etc.

Think of it like this: I just gave you two SCSI drives, I guarantee you their MTBF is 1.2 Mhours, which won't vary over the time they'll be in service, no other hardware will ever fail (i.e., don't worry about the processor board or RAID controller), and it takes ~0 time to repair a failure. Given something like that, and assuming I RAID 1 these two drives, what kind of MTBF would you expect over time?

- Is it the square of the individual drive MTBF? See:
  http://www.phptr.com/articles/article.asp?p=28689
  Or: http://tech-report.com/reviews/2001q...d/index.x?pg=2 (this one doesn't make sense: if MTTR=0, does MTBF=infinity?)
  Or: http://www.teradataforum.com/teradat...107_214543.htm (again, I don't know how MTTR=0 would work)
- Is it 150% of the individual drive MTBF? See:
  http://www.zzyzx.com/products/whitep...ity_primer.pdf
- Is it double the individual drive MTBF? (I don't remember where I saw this one.)

It's kind of funny, but when I first started looking, I thought that I'd find something simple. That was this weekend ...

Jim
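The three candidate rules of thumb Jim lists give wildly different answers, which is easy to see by plugging in his 1.2 Mhour figure. A small nonzero MTTR (4 hours, an assumption for illustration only) is needed because the repairable-mirror formula diverges as MTTR goes to zero:

```python
# Compare the rules of thumb from Jim's list, using his 1.2 Mhour drive MTBF.
# The 4-hour MTTR is assumed purely for illustration: the repairable-mirror
# formula blows up as MTTR -> 0, which is exactly the confusion in the post.
mtbf = 1.2e6      # hours, per drive (Jim's given figure)
mttr = 4.0        # hours (assumed)

repairable_pair = mtbf**2 / (2 * mttr)   # repairable RAID 1 pair formula
rule_150pct     = 1.5 * mtbf             # the zzyzx "industry standard" claim
rule_double     = 2.0 * mtbf             # the "double the MTBF" rule Jim recalls

print(f"MTBF^2/(2*MTTR): {repairable_pair:.2e} hours")
print(f"150% rule:       {rule_150pct:.2e} hours")
print(f"2x rule:         {rule_double:.2e} hours")
```

The repairable-pair formula gives a result five orders of magnitude larger than either flat multiplier, which is why the sources Jim found seem to contradict each other: they are modeling different maintenance assumptions.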
#4
"ohaya" wrote in message ...
> Think of it like this: I just gave you two SCSI drives, I guarantee you
> their MTBF is 1.2 Mhours,

1,200,000 hours ~= 137 years. Now, do you think that means that half of them fail in 137 years?

> which won't vary over the time period that they'll be in-service,

That's known to be false.

> no other hardware will ever fail (i.e., don't worry about the processor
> board or raid controller), and it takes ~0 time to repair a failure.
> Given something like that, and assuming I RAID1 these two drives, what
> kind of MTBF would you expect over time?

Zero repair time?

> - Is it the square of the individual drive MTBF? See:
> http://www.phptr.com/articles/article.asp?p=28689

All obvious, and based on known inaccurate assumptions.

> - Is it 150% the individual drive MTBF? See:
> http://www.zzyzx.com/products/whitep...ity_primer.pdf

"Industry standards have determined that redundant components increase the MTBF by 50%." No citation supplied. Also: "It should be noted that in the example above, if the downtime is reduced to zero, availability changes to 1 or 100% regardless of the MTBF."

> It's kind of funny, but when I first started looking, I thought that I'd
> find something simple. That was this weekend ...

As I said in my prior post: maintained RAID 1 failure (of the cases included) can be ignored, as it's swamped by other failures in the real world. It's a great academic exercise with little practical application here.
#5
"ohaya" wrote in message ...
> Before I begin, I was really looking for just a kind of "ballpark" kind
> of "rule of thumb" for now, with as many assumptions/caveats as needed to
> make it simple, i.e., something like assume drives are in their "life"
> (the flat part of the Weibull/bathtub curve), ignore software, etc.

The drives *have* to be in their nominal service life: once you go beyond that, you won't get any meaningful numbers (because they have no significance to the product, and thus the manufacturer won't have performed any real testing in that life range).

> Think of it like this: I just gave you two SCSI drives, I guarantee you
> their MTBF is 1.2 Mhours, which won't vary over the time period that
> they'll be in-service, no other hardware will ever fail (i.e., don't
> worry about the processor board or raid controller), and it takes ~0 time
> to repair a failure. Given something like that, and assuming I RAID1
> these two drives, what kind of MTBF would you expect over time?

Infinite.

> - Is it the square of the individual drive MTBF? See:
> http://www.phptr.com/articles/article.asp?p=28689

No. This example applies to something like an unmanned spacecraft, where no repairs or replacements can be made. Such a system has no meaningful MTBF beyond its nominal service life (which will usually be much less than the MTBF of even a single component, when that component is something as reliable as a disk drive).

> Or: http://tech-report.com/reviews/2001q...d/index.x?pg=2 (this one
> doesn't make sense if MTTR=0 == MTBF=infinity?)

That's how it works, and this is the applicable formula to use. For completeness, you'd need to factor in the fact that drives have to be replaced not only when they fail but when they reach the end of their nominal service life, unless you reserved an extra slot to use to build the new drive's contents (effectively, temporarily creating a double mirror) before taking the old drive out.

> Or: http://www.teradataforum.com/teradat...107_214543.htm (again, don't
> know how MTTR=0 would work)

The same way: though the explanation for RAID-5 MTBF is not in the usual form, it's equivalent.

> - Is it 150% the individual drive MTBF? See:
> http://www.zzyzx.com/products/whitep...ility_primer.pdf

No: the comment you saw there is just some half-assed rule of thumb that once again assumes no repairs are effected (and is still wrong even under that assumption, though the later text that explains the value of repair is qualitatively valid).

> - Is it double the individual drive MTBF? (I don't remember where I saw
> this one.)

No. The second paper that you cited has a decent explanation of why the formula is what it is. If you'd like a more detailed one, check out Transaction Processing: Concepts and Techniques by Jim Gray and Andreas Reuter.

- bill
#6
Bill Todd wrote:
> [snip]

Bill,

Thanks. This kind of goes along with some other info I've just been looking at (something like "Product of Reliabilities" on a website). If the above calculation is in fact a good estimate, and just so that I'm clear, if:

- I had a RAID 1 setup with two SCSI drives that really have an MTBF of 1.2 Mhours, and
- The drives are within their "normal" lifetime (i.e., not in infant mortality or end-of-life), and
- The processor board/hardware supported hot swap, such that if one of the drives failed it could be replaced without halting the system, and
- We estimated (for planning purposes) that, worst-case, it took someone 4 hours to detect the failure, get another identical drive, and replace it (so MTTR ~4 hours),

then a reasonable ballpark estimate for the "theoretical" MTTF (which is ~MTBF) would be:

    (1.2 Mhours)(1.2 Mhours)
    ------------------------ = MTTF(RAID1)
         2 x 4 hours

Is that correct? Wow!!! Somehow, this seems "counter-intuitive" (sorry) ...

Jim
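Jim's expression can be evaluated directly. This just plugs in the numbers from his post (the 4-hour MTTR is his planning assumption, not a measured figure):

```python
# Evaluate the repairable RAID 1 MTTF formula from Jim's post.
mtbf = 1.2e6    # hours, per-drive MTBF (Jim's given figure)
mttr = 4.0      # hours, his assumed worst-case detect-and-replace time

mttf_raid1 = mtbf**2 / (2 * mttr)
print(f"{mttf_raid1:.2e} hours")               # 1.80e+11 hours
print(f"{mttf_raid1 / (24 * 365):.0f} years")  # ~20.5 million years
```

Under these idealized assumptions the pair MTTF comes out around 1.8 x 10^11 hours, which is why the result feels counter-intuitive: it is so large that, as the thread goes on to note, other failure modes dominate long before drive-pair loss matters.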
#7
Ron wrote:
> Maintained RAID 1 failure (of the cases included) can be ignored as it's
> swamped by other failures in the real world.

Ron,

Thanks again. I'm starting to understand that sentence now.

If I'm understanding what you're saying: with a RAID 1 setup using two drives with a reasonable (e.g., 1.2 Mhours) MTBF, from a design standpoint you wouldn't be worried about failures of the drives themselves, because there are other components (e.g., the processor board, etc.) with an MTBF much lower than that of the mirrored drives. Did I get that right?

BTW, re. the "0" MTTR, see my post back to Bill Todd. I had given 4 hours as an example in that post, but after posting and thinking about it, given the scenario that I posed, it really seems like the MTTR would be more like 0 than 4 hours, since in my scenario the "system" never really fails (the drives are hot-swappable). Comments?

Jim
#8
"ohaya" wrote in message ...
> Then a reasonable ballpark estimate for the "theoretical" MTTF (which is
> ~MTBF) would be:
>
>     (1.2 Mhours)(1.2 Mhours)
>     ------------------------ = MTTF(RAID1)
>          2 x 4 hours
>
> Is that correct? Wow!!! Somehow, this seems "counter-intuitive" (sorry) ...

Hey, *single* disks are pretty damn reliable in the kind of ideal service conditions you postulate: mirrored disks are just (reliable) squared.

A 2,000,000-year RAID-1-pair MTBF sounds great, until you recognize that if you have 2,000,000 installations, about one of them will fail each year. If each site has 100 disk pairs rather than just one, then someone will lose data every 3+ days (or you'll need only 20,000 sites for about one to lose data every year). That's still really good, but not so far beyond something you'd start worrying about to be utterly ridiculous - at least if you're a manufacturer (individual customers still have almost no chance of seeing a failure, but even a single one that does is still very bad publicity).

Start including RAID-5 configurations, and system MTBF drops by roughly the square of the number of drives in a set, which starts getting significant before long (again, especially from the manufacturer's viewpoint, even if very few individual customers actually experience data loss: some of the new virtualization architectures have RAID-5-like failure characteristics - even if they're not using parity but mirroring to protect data, they're distributing it around the disk set in a manner that can cause data loss if *any* two disks fail - which users should at least be aware of).

- bill
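Bill's fleet-scale point follows from the usual constant-failure-rate assumption: a per-pair MTBF of M years means roughly 1/M expected failures per pair per year, so a fleet of N pairs sees about N/M failures per year. Using his illustrative 2,000,000-year pair MTBF:

```python
# Fleet-scale view of Bill's point, assuming a constant failure rate and
# his illustrative 2,000,000-year RAID-1-pair MTBF.
pair_mtbf_years = 2_000_000
installations   = 2_000_000
pairs_per_site  = 100

# Expected failures per year across a fleet of single-pair installations:
fleet_failures = installations / pair_mtbf_years
print(fleet_failures)        # 1.0 -> about one installation fails per year

# With 100 pairs per site, mean days between failures somewhere in the fleet:
big_fleet = installations * pairs_per_site
days_between = 365 * pair_mtbf_years / big_fleet
print(days_between)          # 3.65 -> "every 3+ days"
```

This is why the same number looks negligible to an individual customer and worrying to a manufacturer: the per-site risk is tiny, but the fleet-wide event rate is not.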
#9
"ohaya" wrote in message ...
> ... it really seems like the MTTR would be more like "0" than like 4
> hours, since with my scenario, the "system" never really fails (since the
> drives are hot-swappable).

If you've learned how to repopulate on the order of 100 GB of failed drive in zero time, especially while not seriously degrading on-going processing (so don't just assert that you can use anything like the full bandwidth of its partner to restore it), I suspect that there are many people who would be very interested in talking with you.

- bill
#10
Bill Todd wrote:
> A 2,000,000-year RAID-1-pair MTBF sounds great, until you recognize that
> if you have 2,000,000 installations, about one of them will fail each
> year.

Bill,

Thanks for the perspective. But, just so that I'm clear: if the individual drives really have a 1.2 Mhour MTBF (and I think the Atlas 15K II spec sheet actually claims 1.4 Mhours), then the "squared" MTBF would indicate that the RAID 1 pair would have something like 1+ TRILLION hours MTBF, not 1+ MILLION hours. Have I misinterpreted something?

Jim
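For scale, the two magnitudes being compared here can be checked directly. Note that squaring an MTBF in hours yields hours squared, which is why the repairable-pair formula divides by a repair time to get back to hours (the 4-hour MTTR is the assumption from earlier in the thread):

```python
# Check the magnitudes in Jim's question (assumptions from the thread).
mtbf = 1.2e6                  # hours, per-drive MTBF

naive_square = mtbf**2        # hours^2 - treated as a raw number, this is
print(f"{naive_square:.2e}")  # 1.44e+12, the "1+ trillion" Jim mentions

mttr = 4.0                          # hours, Jim's earlier worst-case figure
repairable = mtbf**2 / (2 * mttr)   # the formula Bill endorsed, in hours
print(f"{repairable:.2e} hours")    # 1.80e+11 hours
print(f"{repairable / 8760:.2e} years")
```

So under these assumptions the pair MTTF is on the order of 10^11 hours, i.e. tens of millions of years; the exact figure depends entirely on the assumed MTTR, which is why the thread's quoted numbers differ.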