#31
On Fri, 16 Jul 2004 20:44:14 GMT, "Ron Reaugh" wrote:

> > I wrote data to the disk. It didn't come back. Sounds like failure to me.
>
> A single sector lost does not constitute RAID 1 failure.

A single byte lost does.

> Does RAID 1 operate whereby each read is redundant and then the two read
> datasets are compared in OS buffers? NO! There is a failure rate that such
> would catch although obscure. Does that constitute a RAID 1 failure? Folks
> are grasping into obscurity and very low probabilities.

Ron is determined to try to maintain the fiction that he has a clue. He is failing.

Malc.
#32
wrote in message news:1089999993.239394@smirk...

....

> One particular worrisome trend is "off-track writes", which is rumored to be
> more common in consumer-grade disks (typically IDE disks): if mechanical
> vibration occurs during writing, the head might wander off and write the new
> data slightly off the track, without completely overwriting the data on the
> track. If you now seek away and come back to read later, you can get lucky
> and by coincidence settle on the new data, or you can get unlucky and hit the
> old track and read old data (which is still there, with perfectly valid ECCs,
> though maybe not for a whole track - perhaps only for a few sectors). You can
> see how this can be quite catastrophic, even in a non-redundant system.

Hmmm. This sounds similar (but not identical) to a couple of failure modes (which I may be recalling from Jim Gray's book): silent failure to write at all, or a 'wild write' that hits sectors other than those it was aimed at. I'm still plugging away at a file system that can tolerate (and correct) such errors without undue excess overhead. Think there's any market for it?

....

> What you might detect here is a certain mindset. We all know that individual
> disks are fallible, and we've learned to live with this (operative word here
> is "backup").

Though that won't necessarily save you from the types of errors mentioned above.

> For small RAID arrays (often based on motherboards or PCI cards, or hidden in
> the back end of NAS servers), we do a few simple steps that give you a huge
> improvement in reliability, but are still considered somewhat unreliable. I'd
> like to change that. For most personal and small business users, these small
> RAID systems give you a huge bang for the buck. But once you enter the realm
> of the big enterprise storage systems, things change, and you MUST NEVER EVER
> LOSE DATA (in all upper case), because if you do, high-level executives will
> have their busy schedules interrupted, and your engineer's behind will be on
> the line or toast.

Well, I've always taken a somewhat more idealistic view: you should never, ever lose data, because 1) it can be a major inconvenience even for the small-system user, and 2) the technology to ensure against such loss exists, even at PC-level prices (though if you're running without ECC memory or with a buggy - e.g., overclocked - processor, there's really not too much we can do at the storage end to help you out).

The reason the enterprise storage systems are so expensive (in terms of $/GB) is that they are fantastically well built, and vendors go to extraordinary lengths to stand behind them. OTOH, you can get away with far less sturdy (and thus far less expensive) boxes if you wrap end-to-end redundancy checks around multiple instances of them. And since anyone seriously interested in availability will be running at least two geographically- (or at least slightly-) separated instances, the only real additional overhead is in implementing those checks effectively.

- bill
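An aside on those end-to-end redundancy checks: the idea can be sketched in a few lines. The following is a minimal illustration in Python (the framing, field layout, and function names are my own assumptions, not from any actual product): each logical block carries a CRC plus a write-generation number that the caller tracks separately (e.g., in file-system metadata), so a stale block left behind by a lost or off-track write fails verification on read instead of being silently returned.

```python
import struct
import zlib

# Assumed on-disk framing: 4-byte CRC32, 8-byte write-generation counter.
HEADER = struct.Struct("<IQ")

def encode_block(payload: bytes, generation: int) -> bytes:
    """Prefix the payload with a CRC and a write-generation number.

    A lost or off-track write leaves the old generation (or a bad CRC)
    on the medium, so a later read can tell it got stale data back.
    """
    crc = zlib.crc32(struct.pack("<Q", generation) + payload)
    return HEADER.pack(crc, generation) + payload

def decode_block(raw: bytes, expected_generation: int) -> bytes:
    """Verify a block read back from disk against the expected generation."""
    crc, generation = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if zlib.crc32(struct.pack("<Q", generation) + payload) != crc:
        raise IOError("corrupt block (bad checksum)")
    if generation != expected_generation:
        raise IOError("stale block (lost or off-track write)")
    return payload
```

Note that the generation number must live somewhere other than the block itself (otherwise a cleanly-written stale block would verify); that is exactly the kind of overhead Bill refers to implementing "effectively".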
#33
"Bill Todd" wrote in message ...

> Well, I've always taken a somewhat more idealistic view: you should never,
> ever lose data because 1) it can be a major inconvenience even for the
> small-system user and 2) the technology to ensure against such loss exists,
> even at PC-level prices (though if you're running without ECC memory or with
> a buggy - e.g., overclocked - processor there's really not too much we can
> do at the storage end to help you out).

Oh, so you do realize the weakness. And is it only OCed systems without ECC that are fallible now??
#34
"Ron Reaugh" wrote in message ...

> > Well, I've always taken a somewhat more idealistic view: you should never,
> > ever lose data because 1) it can be a major inconvenience even for the
> > small-system user and 2) the technology to ensure against such loss exists,
> > even at PC-level prices (though if you're running without ECC memory or
> > with a buggy - e.g., overclocked - processor there's really not too much we
> > can do at the storage end to help you out).
>
> Oh so you do realize the weakness.

I realize far more than you're ever likely to be able to imagine, Ron. Don't you ever get tired of being an idiot?

- bill
#35
In article ,
Ron Reaugh wrote:

> A single sector lost does not constitute RAID 1 failure. Does RAID 1 operate
> whereby each read is redundant and then the two read datasets are compared
> in OS buffers? NO! There is a failure rate that such would catch although
> obscure. Does that constitute a RAID 1 failure? Folks are grasping into
> obscurity and very low probabilities.

If you weren't trying to avoid the loss of a single sector you could make your RAID logic a lot simpler.

--
I've seen things you people can't imagine. Chimneysweeps on fire over the roofs of London. I've watched kite-strings glitter in the sun at Hyde Park Gate. All these things will be lost in time, like chalk-paintings in the rain.
 `-_-'  Time for your nap.  |  Peter da Silva  |  Har du kramat din varg, idag?
 'U`
#36
"Peter da Silva" wrote in message ...

> In article , Ron Reaugh wrote:
> > A single sector lost does not constitute RAID 1 failure. Does RAID 1
> > operate whereby each read is redundant and then the two read datasets are
> > compared in OS buffers? NO! There is a failure rate that such would catch
> > although obscure. Does that constitute a RAID 1 failure? Folks are
> > grasping into obscurity and very low probabilities.
>
> If you weren't trying to avoid the loss of a single sector you could make
> your RAID logic a lot simpler.

No, that reverse logic doesn't follow. No one is trying to lose anything. The question is whether the unlikely but theoretically possible loss (there are other theoretically possible losses which seem to be easily ignored) of a sector in a two-drive RAID 1 configuration is necessarily catastrophic. The answer is that most often it will not stop the show. The likelihood of this loss scenario is dramatically reduced by normal regular activities like regular use, backup, and defrag. You take a low-probability vulnerability and diminish its likelihood further, and even if it happens it is most likely not to be a show stopper... what are you left with in a two-drive RAID 1 configuration? A non-issue.
#37
"Ron Reaugh" wrote in message ...
(...)
> Remember that this discussion was about two drive RAID 1.
(...)
> And what percentage of "bit error" goes undetected overall system wise?
(...)
> Two drive modest configuration RAID 1 arrays are the issue.

Sector faults used to occur at about the same order of magnitude as actual (whole) drive failures (several studies from the early/mid nineties), although it seems to have gotten rather worse over the last decade or so. This is somewhat anecdotal, but the actual total sector error rates per drive have probably gone up a small amount (perhaps a factor of two or three, which is remarkable given the much larger increase in the number of sectors on a drive), while hardware failure rates have gone down by an order of magnitude. So you lose a sector twenty or thirty times more often than you lose a whole drive.

Without scrubbing, the MTTR is very high (since the error is never detected, and thus never corrected), which seriously degrades the reliability of the array (at least for the sector in question, and those in the general vicinity).

This is a bit dated, but: "Latent Sector Faults and Reliability of Disk Arrays," by Hannu H. Kari: http://www.cs.hut.fi/~hhk/phd/phd.html
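The effect of that high MTTR can be put in rough numbers. Below is a back-of-the-envelope sketch (Python; all figures are assumed illustrative values, not vendor specs or measurements) of the standard mirrored-pair approximation: a fault on one copy becomes data loss only if the surviving copy fails before the first fault is detected and repaired, so the detection delay enters the mean-time-to-data-loss almost linearly.

```python
def mttdl_hours(fault_mtbf_h: float, mate_mtbf_h: float, mttr_h: float) -> float:
    """Mean time to data loss for a two-way mirror (exponential model).

    A fault hits either copy first at rate 2/fault_mtbf_h; it is fatal
    only if the surviving copy fails inside the repair window mttr_h.
    (Approximation assumes mttr_h << mate_mtbf_h.)
    """
    fault_rate = 2.0 / fault_mtbf_h        # either copy can fault first
    p_mate_dies = mttr_h / mate_mtbf_h     # mate fails inside the window
    return 1.0 / (fault_rate * p_mate_dies)

# Assumed illustrative numbers: sector faults ~25x as common as drive deaths,
# per the twenty-to-thirty-times figure above.
DRIVE_MTBF = 500_000.0                  # hours
SECTOR_FAULT_MTBF = DRIVE_MTBF / 25.0

# Weekly scrub: a latent fault survives half a scrub pass on average.
weekly_scrub = mttdl_hours(SECTOR_FAULT_MTBF, DRIVE_MTBF, 168.0 / 2)
# No scrub: the fault sits undetected for, say, ~3 years on average halved.
never_scrubbed = mttdl_hours(SECTOR_FAULT_MTBF, DRIVE_MTBF, 26_280.0 / 2)

print(f"weekly scrub:   {weekly_scrub:.3g} h")
print(f"never scrubbed: {never_scrubbed:.3g} h")
```

With these (made-up) inputs, scrubbing improves the latent-fault MTTDL by the ratio of the two exposure windows, i.e. two orders of magnitude - which is the whole point of the argument about scrubbing and MTBF.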
#38
In article ,
Ron Reaugh wrote:

> > Well, I've always taken a somewhat more idealistic view: you should never,
> > ever lose data because 1) it can be a major inconvenience even for the
> > small-system user and 2) the technology to ensure against such loss exists,
> > even at PC-level prices (though if you're running without ECC memory or
> > with a buggy - e.g., overclocked - processor there's really not too much we
> > can do at the storage end to help you out).
>
> Oh so you do realize the weakness. And it's only OCed systems without ECC
> that are fallible is it now??

Malcolm and Bill: You know, Ron is right - within a certain class of users and applications. For some other class of application (the high-end ones that get all the attention), he is totally and dangerously wrong.

If you are a desktop user who runs MS Windows, MS Word, Outlook, and IE as a personal desktop, all the reliability you need is provided by a single disk (use a crappy IDE from Fry's) and a real simple backup mechanism (for example, once a week copy all the user files, that is the contents of "My Documents", to a writeable CD). Even a full-blown disk failure (whether it is a complete failure to spin up, or read errors, or even data corruption) at worst requires you to buy a new disk, reinstall the OS, and copy your documents back from the most recent weekly CD. You need no RAID. One could argue that you actually don't need a real computer, but that would be cynical.

If you run a minor server, with a white-box PC, maybe running Linux and Apache, or Windows and SQL Server: the risks to this system are so huge (for example from incompetent administration or bad power supplies) that a single RAID card or RAID on the motherboard, with a pair of RAID-1 IDE disks, is more than adequate. Even with this pretty crappy disk setup, the chance that a disk failure takes you out is small compared to the other risks (let's not even start on the risk that SQL Server has a bug and corrupts the data, which is much more likely than disk failures). In this realm, Ron is right: if there is a read error during a RAID-1 rebuild, just mark the sector as bad, and pray that it wasn't an inode, or the index of the database. Matter of fact, if the guy's disk is only 70% full, chances are 30% that the victim is an unallocated sector, and the bad sector will get remapped on the next write without anyone being any wiser.

If you need a storage system to support the trading desk of Morgan Stanley, or the database of the Social Security Administration, or an e-business where failure of the computer would cause complete disruption of the revenue flow (categoric example: pornographic web site), then you have to use a disk system that is EXTREMELY HIGHLY RIDICULOUSLY reliable, and costs the big $$$. This is where you get yourself a big HP server, run IBM DB2 on it, and put the data on four Hitachi Lightnings (two of them at the local site, each internally RAIDed, mirrored across the two with an LVM or a SAN virtualization box, and each then synchronously remote-mirrored via rented dark fiber to two offsite facilities). By the way, all brand names in the preceding sentence are meant as humorous illustrations; the fact that they might be my former, current or future employers is one of these funny coincidences. This system will cost you several tens of M$, but it is unlikely to go down or lose data. If it happens to lose data (for example because the field service and support team of the vendor who put it together and manages it for you f***ed up, which does happen in real life), the CEO of the vendor will call your CIO and offer to kiss any body part the CIO wants to have kissed ... not to mention some major financial apologies. In this environment, quietly marking a sector bad would be tantamount to treason, and might even start a lawsuit.

Actually, I've heard that the single largest cause of data loss on high-end systems is wetware failure. Commonly, if a small emergency happens (typically in the middle of the night, at 3 AM, when human reasoning is at its worst), the team of super-experts tries to repair it, sometimes with disastrous consequences. The only saving grace is: more often than not it is the customer's own employees who are at fault, so from the perspective of being a vendor, you are usually OK.

By the way, what do I run at home (I have a minor server with Linux, Apache, MySQL, but it isn't used for anything that involves money making)? A single (non-RAIDed) 10K RPM Quantum SCSI disk, with nightly backups to a 200GB cheap IDE disk, and occasional backups to writeable CD or DLT tape taken offsite (whenever I feel like it). I've been thinking of getting a cheapo IDE RAID card (I could probably swipe a used 3Ware 4-port card from the office; we used a few of them in a test setup and they are now gathering dust) and putting 4 reasonable IDE disks on it (for example the 80GB 7200 RPM Seagates, which can be had for about $80 on sale). With four disks in a RAID-10 configuration, I would have more space than my current SCSI disk, it would probably be faster (4 spindles instead of one, at least for a read-intensive workload like a web server), and extra reliability to boot. Just haven't had the time and energy to deal with it. Oh, and calling the 3Ware card "cheapo" is not meant as an insult: I really like working with them; they are inexpensive, effective, and get the job done (for a reasonable definition of job).

--
The address in the header is invalid for obvious reasons. Please reconstruct the address from the information below (look for _).
Ralph Becker-Szendy  _firstname_@lr _dot_ los-gatos _dot_ ca.us
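Ralph's 70%-full observation generalizes into a small outcome model. The sketch below (Python; the metadata fraction is an assumed input for illustration, not a measured figure) splits the probability of a random bad sector found during a RAID-1 rebuild into the three cases his post distinguishes: harmless (free space, remapped on the next write), user data, and filesystem metadata (the inode-or-database-index nightmare).

```python
def rebuild_error_outcomes(fill_fraction: float, metadata_fraction: float):
    """Split the probability of a random bad sector during a RAID-1
    rebuild into (harmless, user data, filesystem metadata).

    metadata_fraction is the share of *allocated* space holding
    metadata (inodes, directories, database indices) - an assumed,
    workload-dependent input.
    """
    p_free = 1.0 - fill_fraction                    # remapped, no one notices
    p_metadata = fill_fraction * metadata_fraction  # the "pray" case
    p_user = fill_fraction * (1.0 - metadata_fraction)
    return p_free, p_user, p_metadata

# The 70%-full example from the post, assuming ~2% of allocated space
# is metadata:
p_free, p_user, p_meta = rebuild_error_outcomes(0.70, 0.02)
print(f"harmless: {p_free:.0%}, user data: {p_user:.1%}, metadata: {p_meta:.1%}")
```

The point of the split is that even among the 70% of "bad" outcomes, the truly catastrophic slice (metadata) is small - which is the gamble the mark-it-bad-and-pray strategy is taking.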
#39
wrote in message news:1090041101.742428@smirk...

> ...
> Malcolm and Bill: You know, Ron is right - within a certain class of users
> and applications.

I don't think you've been paying close enough attention. While Ron is certainly right in stating that there are a great many situations in which other potential risks far outweigh the risk of loss of RAID-1-protected data (even relatively incompetently RAID-1-protected data, as would be the case in a non-scrubbing array), that issue is not one of those being debated.

Where he went completely off the rails was in suggesting that fatal sector deterioration is not 'failure', and is thus irrelevant (as scrubbing would then also be) to the calculation of the MTBF of the RAID-1 pair - independent of what other external (non-RAID-1-pair-related) risks may or may not exist in the environment in question. That RAID-1-specific MTBF is what the original poster expressed interest in. And scrubbing (or lack thereof) is beyond any shadow of a doubt of major importance in evaluating it.

- bill
#40
In article ,
Bill Todd wrote:

> > Malcolm and Bill: You know, Ron is right - within a certain class of users
> > and applications.
>
> I don't think you've been paying close enough attention. While Ron is
> certainly right in stating that there are a great many situations in which
> other potential risks far outweigh the risk of loss of RAID-1-protected
> data (even relatively incompetently RAID-1-protected data, as would be the
> case in a non-scrubbing array), that issue is not one of those being
> debated. Where he went completely off the rails was in suggesting that
> fatal sector deterioration is not 'failure', and is thus irrelevant (as
> scrubbing would then also be) to the calculation of the MTBF of the RAID-1
> pair - independent of what other external (non-RAID-1-pair-related) risks
> may or may not exist in the environment in question. That RAID-1-specific
> MTBF is what the original poster expressed interest in. And scrubbing (or
> lack thereof) is beyond any shadow of a doubt of major importance in
> evaluating it.

OK, now I understand. You are completely correct. The MTBF of a RAID array is the time until the first bit, byte or sector fails (or is quietly corrupted), not the time until the whole thing falls over dead.

--
The address in the header is invalid for obvious reasons. Please reconstruct the address from the information below (look for _).
Ralph Becker-Szendy  _firstname_@lr _dot_ los-gatos _dot_ ca.us