Posted to comp.arch.storage, January 25th 08, 05:46 PM
_firstname_@lr_dot_los-gatos_dot_ca_dot_us

RAID 5 corruption, RAID 1 more stable?

In article ,
wrote:
> On several occasions I have seen situations where faulty UPSes
> caused servers with RAID 5 arrays to reboot continuously, which
> caused corruption to either the RAID array itself or the file
> system.


OK, let's analyze this. Did the continuous reboots cause:

A. The disk array to suffer so many errors (for example, media
errors on the actual spinning platters, or hardware errors such as
in the disk array's memory cache) that it cannot correct for them,
because it is designed to handle only one error at a time.

B. The disk array to corrupt data on disk.

C. The attached host and its filesystem to become "confused", and
write incorrect data to the disk array, which the disk array correctly
stored, but which now causes corruption?

From your description, we cannot tell exactly which of these you
observed. Let's analyze those three scenarios in reverse order:

C. There is nothing the disk array can do if the host is broken and
writes incorrect data to it. If you have a broken host, or a broken
file system running on the host, it doesn't matter whether your
recording media is the world's most reliable disk array or some
junk from the surplus store. Fix your host.

B. If your disk array is so badly built that it corrupts data on disk
(meaning deliberately writing wrong data to disk, or losing the
ability to correct disk errors), then it is a piece of crap, and
you need to either replace it with a quality product, or have the
vendor fix it.

HOWEVER, it is true that RAID1 is simpler to implement than RAID5,
in particular if you do not require stable and serialized reads
after a failure (meaning that after a failure, the returned data is
still data that was previously written, but not necessarily the
data that was most recently written, nor necessarily always the
same data). If you happen to have a really crappy disk array with
sort-of broken firmware, running it in RAID1 is likely to stress it
much less, and you may be able to live with flaws in the RAID
implementation in that scenario.
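
To make that concrete, here is a toy sketch in Python (my own
illustration, not anybody's actual firmware) of why the RAID5 write
path gives buggy firmware more chances to hurt you, especially
around power loss:

BLOCK = 4  # bytes per block; tiny, so the example stays readable

def xor_blocks(a: bytes, b: bytes) -> bytes:
    # XOR two equal-sized blocks: the parity primitive RAID5 uses.
    return bytes(x ^ y for x, y in zip(a, b))

def raid1_write(mirrors, offset, data):
    # RAID1: duplicate the block to every mirror. One simple step.
    for disk in mirrors:
        disk[offset:offset + BLOCK] = data

def raid5_small_write(disks, data_idx, parity_idx, offset, data):
    # RAID5: read old data and old parity, recompute parity, then do
    # TWO dependent writes. A crash between them leaves parity stale
    # for the whole stripe (the classic "write hole").
    old_data = bytes(disks[data_idx][offset:offset + BLOCK])
    old_parity = bytes(disks[parity_idx][offset:offset + BLOCK])
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), data)
    disks[data_idx][offset:offset + BLOCK] = data          # write #1
    # -- a power loss right here leaves data and parity inconsistent --
    disks[parity_idx][offset:offset + BLOCK] = new_parity  # write #2

# Demo: 3-disk RAID5 (disks 0 and 1 data, disk 2 parity) vs a mirror.
mirror = [bytearray(BLOCK), bytearray(BLOCK)]
raid1_write(mirror, 0, b"ABCD")
array5 = [bytearray(BLOCK) for _ in range(3)]
raid5_small_write(array5, 0, 2, 0, b"ABCD")
assert xor_blocks(array5[1], array5[2]) == b"ABCD"  # parity rebuilds it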

A. If power cycling causes hardware errors, you need a better quality
disk array. Spinning disks will always be somewhat vulnerable to
errors, but a well-built power distribution system in the disk
array should largely shield the disks from errors induced by power
cycling.

BUT: Disks will always fail, and power cycling will increase the
rate of disk failure. And it turns out that RAID5 is actually less
resilient against disk failure than RAID10 (note that I did not
write RAID1 here). Here's why. For a concrete example, imagine
that you have 10 disks, each 1TB in size (I picked round numbers,
not because those are completely realistic, but to make the math
easier). If you configure those 10 disks as a 9+P parity RAID5
array, you will get 9TB of usable capacity, but ANY failure of 2
disks will cause data loss, and the probability that two sector or
track errors on separate disks collaborate to cause data loss is
pretty high. If on the other hand you configure those 10 disks as
a RAID10 array (mirrored and striped), then 8 out of 9 times you
can actually survive a double disk fault, as long as the two failed
disks are not "next" to each other (meaning the two halves of the
same mirror pair): once the first disk has failed, only one of the
remaining nine disks is its mirror partner. By the same counting,
the probability of two sector or track errors causing data loss is
also roughly nine times lower than for a similar RAID5 setup. The
price you pay is that the usable capacity is only 5TB.
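
If you want to check that counting yourself, a few lines of Python
(my sketch; the disk numbering and pairing are arbitrary) enumerate
all 45 ways two of the ten disks can fail:

from itertools import combinations

DISKS = 10
double_faults = list(combinations(range(DISKS), 2))  # 45 possibilities

# RAID5 (9+P): losing ANY two whole disks loses data.
raid5_fatal = len(double_faults)                     # 45 of 45

# RAID10: disks (0,1), (2,3), ... form mirror pairs; data is lost
# only when both failures hit the two halves of one mirror pair.
mirror_pairs = {(i, i + 1) for i in range(0, DISKS, 2)}
raid10_fatal = sum(p in mirror_pairs for p in double_faults)  # 5 of 45

print(f"RAID5 : {raid5_fatal}/45 double faults fatal")
print(f"RAID10: {raid10_fatal}/45 double faults fatal, "
      f"so it survives 40/45 = 8 out of 9")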

So it is indeed true that RAID1 (in the guise of the real-world
RAID10 implementation) is statistically more tolerant of disk
errors than RAID5, even though its worst case is the same.

BUT: For a well-designed commercial-use disk array, with sufficient
spares, good disk state monitoring, and good power distribution and
batteries for clean shutdown, the difference described above should
be infinitesimally small. If your disk array, workload, and
reliability expectations are such that you need to handle double
disk faults, don't run RAID10 because it can handle them "most" of
the time; get a dual-fault tolerant array.

> I am considering recommending RAID1 whenever possible because I
> suspect that it would be more resilient under the same conditions,
> because I have two separate copies of the system and I do not
> suspect that mirroring would mirror NTFS corruption or suffer from
> the problems of RAID 5 array corruption. I would like to hear your
> opinions on this.


I, for one, would not agree with that statement until we can figure
out what is really causing your problems. But as mentioned above,
using RAID1 will make things easier in several respects, and might
be a workable band-aid to reduce the incidence of such problems to a
tolerable level. You might also mask a much graver problem, so when
it eventually comes back to bite you, it hurts even more.

Can you tell us: What type of disk array, what type of host, what type
of connection, what type of workload? What are all the details of the
corruption? Did the disk array management software (you are running
management software, right?) report data errors? Can you check the
log files of your host to see whether disk errors were logged?
What have the
vendors of your host, OS and disk array contributed to solving the
problem (you have full support contracts, right?).

Good luck! If you really care about your data, chisel it on stone
tablets. Make two copies. Remember what happened to Moses when he
dropped commandments 11 through 15.

--
Ralph Becker-Szendy _firstname_@lr_dot_los-gatos_dot_ca_dot_us
735 Sunset Ridge Road; Los Gatos, CA 95033