#1
RAID 5 corruption, RAID 1 more stable?
On several occasions I have seen situations where faulty UPSs caused servers with RAID 5 arrays to reboot continuously, which corrupted either the RAID array itself or the file system. I am considering recommending RAID 1 whenever possible, because I suspect it would be more resilient under the same conditions: I would have two separate copies of the system, and I do not suspect that mirroring would mirror NTFS corruption or suffer from the problems of RAID 5 array corruption. I would like to hear your opinions on this. Thanks
#4
RAID 5 corruption, RAID 1 more stable?
In article ,
wrote:

> On several occasions I have seen situations where faulty UPSs caused servers with RAID 5 arrays to reboot continuously, which corrupted either the RAID array itself or the file system.

OK, let's analyze this. Did the continuous reboots cause:

A. The disk array to suffer so many errors (for example, disk errors on the actual spinning platters, or hardware errors, such as in the memory cache of your disk array) that it cannot correct for them, because it is designed to handle only one error at a time?

B. The disk array to corrupt data on disk?

C. The attached host and its filesystem to become "confused" and write incorrect data to the disk array, which the disk array correctly stored, but which now shows up as corruption?

From your description, we cannot distinguish exactly what you have observed. Let's analyze those three scenarios in reverse order:

C. There is nothing the disk array can do if the host is broken and writes incorrect data to it. If you have a broken host, or a broken file system running on the host, it doesn't matter whether your recording medium is the world's most reliable disk array or some junk from the surplus store. Fix your host.

B. If your disk array is so badly built that it corrupts data on disk (meaning it deliberately writes wrong data to disk, or loses the ability to correct disk errors), then it is a piece of crap, and you need to either replace it with a quality product or have the vendor fix it. HOWEVER, it is true that RAID 1 is simpler to implement than RAID 5, in particular if you do not require stable and serialized reads after a failure (meaning that after a failure, the returned data is still data that was previously written, but not necessarily the data that was most recently written, nor necessarily always the same data).
If you happen to have a really crappy disk array with sort-of broken firmware, running it in RAID 1 is likely to stress it much less, and you may be able to live with flaws in the RAID implementation in that scenario.

A. If power cycling causes hardware errors, you need a better-quality disk array. To some extent, spinning disks will always be vulnerable to errors, but a well-built power distribution system in the disk array should to some extent protect the disks from errors caused by power cycling. BUT: disks will always fail, and power cycling will increase the rate of disk failure. And it turns out that RAID 5 is actually less resilient against disk failure than RAID 10 (note that I did not write RAID 1 here). Here's why.

For a concrete example, imagine that you have 10 disks, each 1 TB in size (I picked round numbers, not because those are completely realistic, but to make the math easier). If you configure those 10 disks as a 9+P parity RAID 5 array, you will get 9 TB of usable capacity, but ANY failure of 2 disks will cause data loss, and the probability that two sector or track errors on separate disks collaborate to cause data loss is pretty high. If on the other hand you configure those 10 disks as a RAID 10 array (mirrored and striped), then in 40 of the 45 possible two-disk failure combinations (roughly 8 times out of 9) you can actually survive a double disk fault, as long as the two failed disks are not "next" to each other (meaning part of the mirror pair for the same stripe). Similarly, the probability of two sector or track errors causing data loss is also several times lower than for a similar RAID 5 setup. The price you pay is that the usable capacity is only 5 TB. So it is indeed true that RAID 1 (in the guise of the real-world RAID 10 implementation) is statistically more tolerant of disk errors than RAID 5, even though its worst case is the same.

BUT: For a well-designed commercial-use disk array, with sufficient spares, good disk state monitoring, and good power distribution and batteries for clean shutdown, the difference described above should be infinitesimally small. If your disk array, workload, and reliability expectations are such that you need to handle double disk faults, don't run RAID 10 because it can handle them "most" of the time; get a dual-fault-tolerant array.

> I am considering recommending RAID 1 whenever possible because I suspect that it would be more resilient under the same conditions, because I have two separate copies of the system, and I do not suspect that mirroring would mirror NTFS corruption or suffer from the problems of RAID 5 array corruption. I would like to hear your opinions on this.

I, for one, would not agree with that statement until we can figure out what is really causing your problems. But as mentioned above, using RAID 1 will make things easier in several respects, and might be a workable band-aid to reduce the incidence of such problems to a tolerable level. You might also mask a much graver problem, so when it eventually comes back to bite you, it hurts even more.

Can you tell us: What type of disk array, what type of host, what type of connection, what type of workload? What are all the details of the corruption? Did the disk array management software (you are running management software, right?) report data errors? Can you look in the log files of your host to see whether disk errors were logged? What have the vendors of your host, OS, and disk array contributed to solving the problem (you have full support contracts, right?)?

Good luck! If you really care about your data, chisel it on stone tablets. Make two copies. Remember what happened to Moses when he dropped commandments 11 through 15.

-- Ralph Becker-Szendy _firstname_@lr_dot_los-gatos_dot_ca_dot_us 735 Sunset Ridge Road; Los Gatos, CA 95033
#5
RAID 5 corruption, RAID 1 more stable?
In article ,
wrote:

> mirroring would mirror NTFS corruption or suffer from the problems of RAID 5 array corruption. I would like to hear your opinions on this.

Well, unless you start with very big discs, you're still going to need to involve RAID 5 even if you mirror. That being said, I've definitely seen instances where file system corruption on disc A was happily mirrored to disc B before it was discovered. Maybe put the disk array on a UPS that's on the UPS? This is why Ghu, the great, has given us LTO drives.
#6
RAID 5 corruption, RAID 1 more stable?
Dan Rumney wrote:

> There's nothing inherent in RAID-5 that makes it susceptible to corruption due to continuous rebooting of the controller.

Equally, there are more disk writes per write from the host. If it's a controller with volatile cache memory, and you're always losing power, this could be a problem. If you write/change 1 bit on a RAID 5 volume, the controller has to read in 64 kB (just a typical value), recalculate parity across the drives, then write it all back. Plain mirroring might still use 64 kB stripes, but only across two disks, not 3 or more.

> there's nothing inherent in RAID-1 that makes it more robust in these scenarios.

There are also fewer drives to possibly have lost writes to with a mirror than with RAID 5. A crappy RAID 1 controller may not even notice that the data across both disks doesn't match, though.

> If the corruption is being caused by the controller it doesn't matter if you have mirrored copies of your data; the controller will just write the corruption to one or both copies. Also, if the corruption is truly at the NTFS level, then you should be looking at your filesystem and not the storage controller.

True, but if a controller is writing garbage to disk, NTFS will notice.
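The small-write penalty described here comes from RAID 5's XOR parity update rule: to change one chunk, the controller reads the old data and old parity, recomputes, and writes both back. A minimal sketch, with chunk sizes and values chosen purely for illustration:

```python
# Sketch: RAID 5 small-write (read-modify-write) parity update.
# new_parity = old_parity XOR old_data XOR new_data
# Even a 1-bit change costs two reads plus two writes, where a
# plain mirror would just write the new data to both disks.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = bytes([0x00] * 8)           # chunk being modified
old_parity = bytes([0xFF] * 8)           # parity over the whole stripe
new_data   = bytes([0x01] + [0x00] * 7)  # one bit flipped

new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
print(new_parity.hex())  # prints: feffffffffffffff
```

This is also why volatile write cache plus power loss is dangerous for RAID 5: if power dies between the data write and the parity write, the stripe's parity is stale (the classic RAID 5 "write hole"), which is what battery-backed or non-volatile controller cache is meant to prevent.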
#8
RAID 5 corruption, RAID 1 more stable?
Cydrome Leader wrote:

>> There's nothing inherent in RAID-5 that makes it susceptible to corruption due to continuous rebooting of the controller.
>
> Equally, there are more disk writes per write from the host. If it's a controller with volatile cache memory, and you're always losing power, this could be a problem.

If you are using a RAID controller or array with no safe memory, that is the first mistake, presuming you care enough about the data to spend for the 1/n overhead and write-cycle overhead of RAID.

> If you write/change 1 bit on a RAID 5 volume, the controller has to read in 64 kB (just a typical value), recalculate parity across the drives, then write it all back.

64 kB for a modestly low-end RAID controller.

> Plain mirroring might still use 64 kB stripes, but only across two disks, not 3 or more.

Mirroring used to be the norm on the midlevel Unix boxen. I saw far more propagation of bad data than recovery from errors, unless there was some additional software/hardware in between that could reasonably unambiguously spot the good copy in reasonable scenarios. More often it was used as a quick way to take a snapshot or clone a file system by making and breaking mirrors.

>> there's nothing inherent in RAID-1 that makes it more robust in these scenarios.
>
> There are also fewer drives to possibly have lost writes to with a mirror than with RAID 5. A crappy RAID 1 controller may not even notice that the data across both disks doesn't match, though.

Yup.

>> If the corruption is being caused by the controller it doesn't matter if you have mirrored copies of your data; the controller will just write the corruption to one or both copies. Also, if the corruption is truly at the NTFS level, then you should be looking at your filesystem and not the storage controller.
>
> True, but if a controller is writing garbage to disk, NTFS will notice.

Not until it reads it. :-)
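The "spot the good copy" problem is easy to see in a toy scrub routine. A hedged sketch: the block contents and the out-of-band CRC here are illustrative assumptions, since plain RAID 1 stores no such checksum, which is exactly the problem being described:

```python
# Sketch: a RAID 1 "scrub" can detect that the two copies differ,
# but mirroring alone carries no information about WHICH copy is
# correct. An out-of-band checksum (assumed here, not part of
# plain RAID 1) is the extra software/hardware that can arbitrate.
import zlib

def scrub(copy_a: bytes, copy_b: bytes, expected_crc: int) -> str:
    if copy_a == copy_b:
        return "consistent"
    # Copies disagree: without a checksum, this is a coin flip.
    if zlib.crc32(copy_a) == expected_crc:
        return "copy_a good, rewrite copy_b"
    if zlib.crc32(copy_b) == expected_crc:
        return "copy_b good, rewrite copy_a"
    return "both copies bad"

good = b"filesystem block"
bad  = b"filesystem blocc"   # a lost or garbled write on one side
crc  = zlib.crc32(good)

print(scrub(good, good, crc))  # prints: consistent
print(scrub(good, bad, crc))   # prints: copy_a good, rewrite copy_b
```

A controller that never scrubs never even reaches the "copies disagree" branch, which is the "may not even notice" case above; and one that scrubs without a checksum can only flag the mismatch, not repair it safely.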