#1
Estimating RAID 1 MTBF?
Hi,
I was wondering if anyone could tell me how to calculate or estimate the overall MTBF of a RAID 1 (mirrored) configuration? I'm just looking for a simple, "rule-of-thumb" type of calculation, assuming ideal conditions. I've been looking around, and I've seen a number of different takes on this, some of them quite at odds with each other (and sometimes with themselves), so I thought I'd post here in the hope that someone might be able to help.

Thanks,
Jim
#2
"ohaya" wrote in message ...
> I was wondering if anyone could tell me how to calculate/estimate the
> overall MTBF of a RAID 1 (mirrored) configuration? I'm just looking for a
> simple, "rule-of-thumb" type of calculation, assuming ideal conditions.

The basis will of course start with the MTBF of the HD over its rated service life. That figure is not published by HD manufacturers. The published MTBF is a projection: an educated guess plus empirical data from drives in early life.

The problem after that is that any assumption about the pure randomness of a failure, or about whether failures might be clustered over time/usage, destroys any feasible precise math attempt.

The next issue is what kinds of failure to take into consideration. Are SW, OS, malice, and external physical events like lightning, earthquakes, EMP, power-supply failure, other HW failure, or overheating to be excluded?

Excluding those, my take is that if you replace a failing or potentially failing (SMART) member of a RAID 1 set within 8 hours of failure/warning during the drive's rated service life, then it'll be a VERY cold day in hell before you lose the RAID 1 set, IF the HD model/batch does not have a pathological failure mode that is intensely clustered. An actual calculation would require information that is not available, and even the manufacturers may not know that information precisely until towards the end of a model's service life, if then.

What takes on this have you found? I'd like to see how anyone would take a shot at this issue. The point is that, with the exclusions noted, a RAID 1 set is VASTLY more reliable than a single HD. A shot would be at least 5000 times more reliable; 5000 is a rough count of the number of 8-hour periods in five years.

So if you took 10K users of a given model HD, ran them all for the rated service life, and got 500 failures, then took that same 10K group but all using 2-drive RAID 1, there would be only 1 chance in ten of a single failure in the whole group (or 1 failure if the group were 100K). The accumulated threat of all the noted exclusions is VASTLY greater than this, so this issue is really a non-issue: RAID 1 is as good as it needs to be. Keep a good backup.
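Ron's back-of-envelope argument can be sketched numerically. The figures here (500 failures per 10K drives over the service life, ~5000 eight-hour replacement windows in five years) are his illustrative assumptions, not measured data:

```python
# Sketch of Ron's back-of-envelope RAID 1 estimate (his illustrative figures).
drives = 10_000
single_failures = 500      # failures among 10K single drives over service life
windows = 5000             # Ron's rough count of 8-hour periods in 5 years

# Ron's claim: a maintained RAID 1 set is roughly 'windows' times more
# reliable than a single drive, so expected whole-group losses shrink by
# that factor.
expected_pair_losses = single_failures / windows
print(expected_pair_losses)    # 0.1 -> "1 chance in ten" across the group
```

The same arithmetic gives 1 expected loss for a 100K-user group, matching the post.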
#3
Ron,

Thanks for your response.

I've looked at so many different sources over the last couple of days that my eyes are blurring and my head is aching. Before I begin: I was really just looking for a "ballpark" rule of thumb for now, with as many assumptions/caveats as needed to make it simple, e.g. assume the drives are in their useful life (the flat part of the Weibull/bathtub curve), ignore software, etc.

Think of it like this: I just gave you two SCSI drives, I guarantee you their MTBF is 1.2 Mhours, which won't vary over the time they'll be in service, no other hardware will ever fail (i.e., don't worry about the processor board or RAID controller), and it takes ~0 time to repair a failure. Given something like that, and assuming I RAID 1 these two drives, what kind of MTBF would you expect over time?

- Is it the square of the individual drive MTBF? See:
  http://www.phptr.com/articles/article.asp?p=28689
  Or: http://tech-report.com/reviews/2001q...d/index.x?pg=2 (this one doesn't make sense: if MTTR=0, does MTBF=infinity?)
  Or: http://www.teradataforum.com/teradat...107_214543.htm (again, I don't know how MTTR=0 would work)
- Is it 150% of the individual drive MTBF? See:
  http://www.zzyzx.com/products/whitep...ity_primer.pdf
- Is it double the individual drive MTBF? (I don't remember where I saw this one.)

It's kind of funny, but when I first started looking, I thought that I'd find something simple. That was this weekend ...

Jim
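The three candidate rules of thumb Jim lists give wildly different answers, which is easy to see by plugging in his 1.2 Mhour figure. A small nonzero MTTR (4 hours, an assumption for illustration only) is needed because the repairable-mirror formula diverges as MTTR goes to zero:

```python
# Compare the rules of thumb from Jim's list, using his 1.2 Mhour drive MTBF.
# The 4-hour MTTR is assumed purely for illustration: the repairable-mirror
# formula blows up as MTTR -> 0, which is exactly the confusion in the post.
mtbf = 1.2e6      # hours, per drive (Jim's given figure)
mttr = 4.0        # hours (assumed)

repairable_pair = mtbf**2 / (2 * mttr)   # repairable RAID 1 pair formula
rule_150pct     = 1.5 * mtbf             # the zzyzx "industry standard" claim
rule_double     = 2.0 * mtbf             # the "double the MTBF" rule Jim recalls

print(f"MTBF^2/(2*MTTR): {repairable_pair:.2e} hours")
print(f"150% rule:       {rule_150pct:.2e} hours")
print(f"2x rule:         {rule_double:.2e} hours")
```

The repairable-pair formula gives a result five orders of magnitude larger than either flat multiplier, which is why the sources Jim found seem to contradict each other: they are modeling different maintenance assumptions.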
#4
"ohaya" wrote in message ...
> Think of it like this: I just gave you two SCSI drives, I guarantee you
> their MTBF is 1.2 Mhours,

1,200,000 hours ~= 137 years. Now, do you think that means that half of them fail in 137 years?

> which won't vary over the time period that they'll be in-service,

That's known to be false.

> no other hardware will ever fail (i.e., don't worry about the processor
> board or raid controller), and it takes ~0 time to repair a failure.
> Given something like that, and assuming I RAID1 these two drives, what
> kind of MTBF would you expect over time?

Zero repair time?

> - Is it the square of the individual drive MTBF? See:
> http://www.phptr.com/articles/article.asp?p=28689

All obvious, and based on known inaccurate assumptions.

> - Is it 150% the individual drive MTBF? See:
> http://www.zzyzx.com/products/whitep...ity_primer.pdf

"Industry standards have determined that redundant components increase the MTBF by 50%." No citation supplied. Also: "It should be noted that in the example above, if the downtime is reduced to zero, availability changes to 1 or 100% regardless of the MTBF."

> It's kind of funny, but when I first started looking, I thought that I'd
> find something simple. That was this weekend ...

As I said in my prior post: maintained RAID 1 failure (of the cases included) can be ignored, as it's swamped by other failures in the real world. It's a great academic exercise with little practical application here.
#5
"ohaya" wrote in message ...
> Before I begin, I was really looking for just a kind of "ballpark" kind
> of "rule of thumb" for now, with as many assumptions/caveats as needed to
> make it simple, i.e., something like assume drives are in their "life"
> (the flat part of the Weibull/bathtub curve), ignore software, etc.

The drives *have* to be in their nominal service life: once you go beyond that, you won't get any meaningful numbers (because they have no significance to the product, and thus the manufacturer won't have performed any real testing in that life range).

> Think of it like this: I just gave you two SCSI drives, I guarantee you
> their MTBF is 1.2 Mhours, which won't vary over the time period that
> they'll be in-service, no other hardware will ever fail (i.e., don't
> worry about the processor board or raid controller), and it takes ~0 time
> to repair a failure. Given something like that, and assuming I RAID1
> these two drives, what kind of MTBF would you expect over time?

Infinite.

> - Is it the square of the individual drive MTBF? See:
> http://www.phptr.com/articles/article.asp?p=28689

No. This example applies to something like an unmanned spacecraft, where no repairs or replacements can be made. Such a system has no meaningful MTBF beyond its nominal service life (which will usually be much less than the MTBF of even a single component, when that component is something as reliable as a disk drive).

> Or: http://tech-report.com/reviews/2001q...d/index.x?pg=2 (this one
> doesn't make sense if MTTR=0 == MTBF=infinity?)

That's how it works, and this is the applicable formula to use. For completeness, you'd need to factor in the fact that drives have to be replaced not only when they fail but when they reach the end of their nominal service life, unless you reserved an extra slot to use to build the new drive's contents (effectively, temporarily creating a double mirror) before taking the old drive out.

> Or: http://www.teradataforum.com/teradat...107_214543.htm (again, don't
> know how MTTR=0 would work)

The same way: though the explanation for RAID-5 MTBF is not in the usual form, it's equivalent.

> - Is it 150% the individual drive MTBF? See:
> http://www.zzyzx.com/products/whitep...ility_primer.pdf

No: the comment you saw there is just some half-assed rule of thumb that once again assumes no repairs are effected (and is still wrong even under that assumption, though the later text that explains the value of repair is qualitatively valid).

> - Is it double the individual drive MTBF? (I don't remember where I saw
> this one.)

No. The second paper that you cited has a decent explanation of why the formula is what it is. If you'd like a more detailed one, check out Transaction Processing: Concepts and Techniques by Jim Gray and Andreas Reuter.

- bill
#6
Bill Todd wrote:
> [snip]

Bill,

Thanks. This kind of goes along with some other info I've just been looking at (something like "Product of Reliabilities" on a website). If the above calculation is in fact a good estimate, and just so that I'm clear, if:

- I had a RAID 1 setup with two SCSI drives that really have an MTBF of 1.2 Mhours, and
- The drives are within their "normal" lifetime (i.e., not in infant mortality or end-of-life), and
- The processor board/hardware supported hot swap, such that if one of the drives failed it could be replaced without halting the system, and
- We estimated (for planning purposes) that, worst-case, it took someone 4 hours to detect the failure, get another identical drive, and replace it (so MTTR ~4 hours),

then a reasonable ballpark estimate for the "theoretical" MTTF (which is ~MTBF) would be:

    (1.2 Mhours)(1.2 Mhours)
    ------------------------ = MTTF(RAID1)
         2 x 4 hours

Is that correct? Wow!!! Somehow, this seems "counter-intuitive" (sorry) ...

Jim
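Jim's expression can be evaluated directly. This just plugs in the numbers from his post (the 4-hour MTTR is his planning assumption, not a measured figure):

```python
# Evaluate the repairable RAID 1 MTTF formula from Jim's post.
mtbf = 1.2e6    # hours, per-drive MTBF (Jim's given figure)
mttr = 4.0      # hours, his assumed worst-case detect-and-replace time

mttf_raid1 = mtbf**2 / (2 * mttr)
print(f"{mttf_raid1:.2e} hours")               # 1.80e+11 hours
print(f"{mttf_raid1 / (24 * 365):.0f} years")  # ~20.5 million years
```

Under these idealized assumptions the pair MTTF comes out around 1.8 x 10^11 hours, which is why the result feels counter-intuitive: it is so large that, as the thread goes on to note, other failure modes dominate long before drive-pair loss matters.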
#7
Ron wrote:
> Maintained RAID 1 failure (of the cases included) can be ignored as it's
> swamped by other failures in the real world.

Ron,

Thanks again. I'm starting to understand that sentence now.

If I'm understanding what you're saying: with a RAID 1 setup using two drives with a reasonable (e.g., 1.2 Mhours) MTBF, from a design standpoint you wouldn't be worried about failures of the drives themselves, because there are other components (e.g., the processor board, etc.) with an MTBF much lower than that of the mirrored drives. Did I get that right?

BTW, re. the "0" MTTR, see my post back to Bill Todd. I had given 4 hours as an example in that post, but after posting and thinking about it, given the scenario that I posed, it really seems like the MTTR would be more like 0 than 4 hours, since in my scenario the "system" never really fails (the drives are hot-swappable). Comments?

Jim
#8
"ohaya" wrote in message ...
> Then a reasonable ballpark estimate for the "theoretical" MTTF (which is
> ~MTBF) would be:
>
>     (1.2 Mhours)(1.2 Mhours)
>     ------------------------ = MTTF(RAID1)
>          2 x 4 hours
>
> Is that correct? Wow!!! Somehow, this seems "counter-intuitive" (sorry) ...

Hey, *single* disks are pretty damn reliable in the kind of ideal service conditions you postulate: mirrored disks are just (reliable) squared.

A 2,000,000-year RAID-1-pair MTBF sounds great, until you recognize that if you have 2,000,000 installations, about one of them will fail each year. If each site has 100 disk pairs rather than just one, then someone will lose data every 3+ days (or you'll need only 20,000 sites for about one to lose data every year). That's still really good, but not so far beyond something you'd start worrying about to be utterly ridiculous - at least if you're a manufacturer (individual customers still have almost no chance of seeing a failure, but even a single one that does is still very bad publicity).

Start including RAID-5 configurations, and system MTBF drops by roughly the square of the number of drives in a set, which starts getting significant before long (again, especially from the manufacturer's viewpoint, even if very few individual customers actually experience data loss: some of the new virtualization architectures have RAID-5-like failure characteristics - even if they're not using parity but mirroring to protect data, they're distributing it around the disk set in a manner that can cause data loss if *any* two disks fail - which users should at least be aware of).

- bill
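Bill's fleet-scale point follows from the usual constant-failure-rate assumption: a per-pair MTBF of M years means roughly 1/M expected failures per pair per year, so a fleet of N pairs sees about N/M failures per year. Using his illustrative 2,000,000-year pair MTBF:

```python
# Fleet-scale view of Bill's point, assuming a constant failure rate and
# his illustrative 2,000,000-year RAID-1-pair MTBF.
pair_mtbf_years = 2_000_000
installations   = 2_000_000
pairs_per_site  = 100

# Expected failures per year across a fleet of single-pair installations:
fleet_failures = installations / pair_mtbf_years
print(fleet_failures)        # 1.0 -> about one installation fails per year

# With 100 pairs per site, mean days between failures somewhere in the fleet:
big_fleet = installations * pairs_per_site
days_between = 365 * pair_mtbf_years / big_fleet
print(days_between)          # 3.65 -> "every 3+ days"
```

This is why the same number looks negligible to an individual customer and worrying to a manufacturer: the per-site risk is tiny, but the fleet-wide event rate is not.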
#9
"ohaya" wrote in message ...
> ... it really seems like the MTTR would be more like "0" than like 4
> hours, since with my scenario, the "system" never really fails (since the
> drives are hot-swappable).

If you've learned how to repopulate on the order of 100 GB of failed drive in zero time, especially while not seriously degrading on-going processing (so don't just assert that you can use anything like the full bandwidth of its partner to restore it), I suspect that there are many people who would be very interested in talking with you.

- bill
#10
Bill Todd wrote:
> A 2,000,000-year RAID-1-pair MTBF sounds great, until you recognize that
> if you have 2,000,000 installations, about one of them will fail each
> year.

Bill,

Thanks for the perspective. But, just so that I'm clear: if the individual drives really have a 1.2 Mhour MTBF (and I think the Atlas 15K II spec sheet actually claims 1.4 Mhours), then the "squared" MTBF would indicate that the RAID 1 pair would have something like 1+ TRILLION hours MTBF, not 1+ MILLION hours. Have I misinterpreted something?

Jim
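For scale, the two magnitudes being compared here can be checked directly. Note that squaring an MTBF in hours yields hours squared, which is why the repairable-pair formula divides by a repair time to get back to hours (the 4-hour MTTR is the assumption from earlier in the thread):

```python
# Check the magnitudes in Jim's question (assumptions from the thread).
mtbf = 1.2e6                  # hours, per-drive MTBF

naive_square = mtbf**2        # hours^2 - treated as a raw number, this is
print(f"{naive_square:.2e}")  # 1.44e+12, the "1+ trillion" Jim mentions

mttr = 4.0                          # hours, Jim's earlier worst-case figure
repairable = mtbf**2 / (2 * mttr)   # the formula Bill endorsed, in hours
print(f"{repairable:.2e} hours")    # 1.80e+11 hours
print(f"{repairable / 8760:.2e} years")
```

So under these assumptions the pair MTTF is on the order of 10^11 hours, i.e. tens of millions of years; the exact figure depends entirely on the assumed MTTR, which is why the thread's quoted numbers differ.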