Unimpressive performance of large MD raid



 
 
  #1  
April 22nd 09, 08:29 PM posted to comp.os.linux.development.system,comp.arch.storage
kkkk

Hi there,
we have a big "storage" machine: dual Xeon (8 cores total) with lots of
disks, some connected to a 3ware 9650SE controller and some to the
SATA/SAS controllers on the mobo.
The hard disks are Western Digital WD7500AYYS 750 GB drives.

We are using an ext3 filesystem, mounted with default options, on top of LVM + MD
raid 6. The raid 6 spans 12 disks (hence 10 disks' worth of data and 2 of
parity). 6 of those disks go through the mobo controller, the others
through the 3ware.

I hoped I would get something like 1 GB/sec sequential write on 10 disks
:-P but instead I see MUCH lower performance.

I can't understand where the bottleneck is!?

In sequential read with separate instances of "dd", one for each drive
(reading directly from the block device), I can reach at least 800 MB/sec no
problem (I can probably go much higher, I just have not tried). So I
would rule out a bus bandwidth problem (it's PCI Express in
any case, and the 3ware sits in an x8 slot).
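
(A minimal sketch of that kind of parallel per-drive read test -- the device names and the amount read are just placeholders:)

for d in /dev/sd[a-l]; do dd if=$d of=/dev/null bs=1M count=4096 & done; wait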

Here are my write performance numbers:

I am writing a sequential 14GB file with dd
time dd if=/dev/zero of=zerofile count=28160000 conv=notrunc ; time sync
(the throughput I report is not the one printed by dd: I adjust it
by hand after also counting the time sync takes, so it's close to the real
throughput. I confirm the drives' LEDs are off after sync finishes.)
There is no other I/O activity. Disk scheduler is deadline for all drives.
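
(An equivalent measurement that folds the final flush into dd's own timing -- assuming a GNU dd recent enough to support conv=fdatasync -- would be:)

time dd if=/dev/zero of=zerofile bs=1M count=14336 conv=fdatasync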

All caches enabled, on both the 3ware and the disks attached to the mobo:
first write = 111 MB/sec
overwrite = 194 MB/sec

Cache enabled only on the disks connected to the mobo (6 of the 12):
first write = 95 MB/sec
overwrite = 120 MB/sec

Cache disabled everywhere (this takes an incredibly long time to do the
final flush):
first write = 63 MB/sec
overwrite = 75 MB/sec
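
(For reference -- not necessarily how it was done here -- the on-drive write cache of the mobo-attached disks can be toggled with hdparm; the 3ware's controller cache is set through its own tw_cli utility, whose exact syntax depends on the controller/unit numbering. The device name below is just an example:)

hdparm -W 0 /dev/sdb   # disable the drive's write cache
hdparm -W 1 /dev/sdb   # re-enable it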


I have looked at what happens in top and htop. htop reports LOTS of red
bars (iowait?), practically 50% red bars on every core (8 cores).
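
(A quick way to separate CPU saturation from time spent waiting on the devices -- a sketch; both tools are standard, though iostat may require installing the sysstat package:)

iostat -x 1   # per-device utilization, queue size and service times, every second
vmstat 1      # system-wide run queue vs. iowait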

Here is what happens in a few of those situations:
- Cache all enabled, overwrite:
dd is constantly at 100% CPU (question: shouldn't it be at ~0% CPU,
always waiting on blocking I/O??). Depending on the moment, either
kjournald or pdflush is at about 75%; most of the time it is kjournald.
md1_raid5 (raid 6 in fact) is around 35%.

- Cache all enabled, first write:
Like above, but there are often moments in which neither kjournald nor
pdflush is running, hence the speed difference. dd is always at nearly
100% CPU.

- cache only on the disks attached to the mobo, overwrite:
Similar to "cache all enabled, overwrite", except that in this case dd
never reaches 100%; it hovers around 40%, and the other processes are down
accordingly, hence the lower speed. htop shows more red bars,
on all cores.

- cache only on the disks attached to the mobo, first write:
dd reaches 100%, but kjournald peaks at 40% and pdflush at 15%.
md1_raid5 is down to about 15%.

- cache all disabled, overwrite:
dd reaches about 30%, kjournald at most 20%, and md1_raid5 at most
10%. Actually dd alone even reaches 100%, but only in the first 20
seconds or so, and during that time kjournald and md1_raid5 are still at 20%
and 10%.

- cache all disabled, first write:
similar to above.


So I don't understand how the thing works here.
I don't understand why dd's CPU usage is at 100% (caches on) instead of ~0%.
I don't understand why kjournald doesn't go to 100%, and I don't understand
what kjournald even has to do in the overwrite case (there is no
significant journal traffic on overwrites, right? I am using the defaults, so it
should be data=ordered).
I don't understand why the caches change the performance so much for
sequential writes...
Also, a question: if I had the most powerful hardware RAID, would
performance still be limited to 200 MB/sec by kjournald??
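
(One way to test whether the journalling mode matters -- a sketch; the device and mount point names are hypothetical, and since data= cannot be changed on a remount the filesystem has to be unmounted first:)

umount /storage
mount -o data=writeback /dev/vg0/storage /storage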


Then I have another question: "sync" from bash really seems to work,
in the sense that it takes time, and afterwards I can confirm that the
drives' activity LEDs are really off. But I have MD raid 6 + LVM here!
Weren't both MD raid 5/6 AND LVM supposed NOT to pass write barriers
downstream to the disks?? Doesn't sync use exactly those barriers
(implemented with device cache flushes)? Yet sync here seems to work!
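
(One way to see whether barriers actually survive the stack -- a sketch; ext3 does not request barriers by default, the device/mount names are hypothetical, and the exact kernel message wording varies by version:)

umount /storage
mount -o barrier=1 /dev/vg0/storage /storage
dmesg | grep -i barrier   # if a lower layer rejects barriers, jbd logs something like "barrier-based sync failed ... - disabling barriers"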

Thanks for your help
  #2  
April 23rd 09, 02:48 AM posted to comp.os.linux.development.system,comp.arch.storage
David Schwartz

On Apr 22, 12:29 pm, kkkk wrote:

we have a big "storage" computer: dual Xeon (8 cores
total) with lots of disks partly connected to a 3ware 9650SE controller
and partly to the SATA/SAS controllers in the mobo.
The hard disks are Western Digital WD7500AYYS 750GB

We are using an ext3 filesystem with defaults mount on top of LVM + MD
raid 6. The raid-6 is on 12 disks (hence it is 10 disks for data, 2 for
parity). 6 of those disks are through the mobo controller, the others
are through the 3ware.


How mind-blowingly awful. I'm sure there must be some set of
circumstances that justify such a ridiculously poor setup, but I don't
know what they could possibly be. How do you justify buying a high-end
RAID 6 controller and then not using its RAID capabilities?

DS
  #3  
April 23rd 09, 01:45 PM posted to comp.os.linux.development.system,comp.arch.storage
kkkk

David Schwartz wrote:
How mind-blowingly awful. I'm sure there must be some set of
circumstances that justify such a ridiculously poor setup, but I don't
know what they could possibly be.


Read my other reply

How do you justify buying a high-end
RAID 6 controller and then not using its RAID capabilities?


Had we found a cheap non-RAID 16-drive SATA controller for PCI Express,
we would have bought it. If you know of any, please tell me.

Thank you
  #4  
April 23rd 09, 03:58 PM posted to comp.os.linux.development.system,comp.arch.storage
Ed Wilts

Had we found a cheap no-raid 16 drives SATA controller for PCI-Express
we would have bought it. If you know of any, please tell me.


The standard rule of thumb is "good, fast, cheap" - pick any 2. If
you want reasonably good and cheap, you're not going to get fast.

I've seen cheap and fast implementations but they weren't any good - I
had the pleasure of recovering a corrupt 25TB file system over my
Christmas break. We've since replaced it with a more expensive but
good solution.
  #5  
April 23rd 09, 02:45 PM posted to comp.os.linux.development.system,comp.arch.storage
David Brown[_2_]

kkkk wrote:
Hi there,
we have a big "storage" computer: dual Xeon (8 cores
total) with lots of disks partly connected to a 3ware 9650SE controller
and partly to the SATA/SAS controllers in the mobo.
The hard disks are Western Digital WD7500AYYS 750GB

We are using an ext3 filesystem with defaults mount on top of LVM + MD
raid 6. The raid-6 is on 12 disks (hence it is 10 disks for data, 2 for
parity). 6 of those disks are through the mobo controller, the others
are through the 3ware.


I understand entirely your reasons for wanting to use Linux software
raid rather than a hardware raid. But I've a couple of other points or
questions - as much for my own learning as anything else.

If you have so many disks connected, did you consider having at least
one as a hot spare? If one of your disks dies and it takes time to
replace it, the system will be very slow while running degraded.

Secondly, did you consider raid 10 as an alternative? Obviously it is
less efficient in terms of disk space, but it should be much faster. It
may also be safer (depending on the likely rates of different kinds of
failures) since there is no "raid 5 write hole". Raid 6, on the other
hand, is probably the slowest raid choice. Any writes that don't cover
a complete stripe will need reads from several of the disks, followed by
parity calculations - and the more disks you have, the higher the
chances of hitting such incomplete stripe writes.

http://www.enterprisenetworkingplanet.com/nethub/article.php/10950_3730176_1
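
(To put a number on "complete stripe": a full stripe is the chunk size times the number of data disks, so with md's common 64K chunk and 10 data disks that is 64K * 10 = 640K, and anything smaller triggers a read-modify-write. A sketch, assuming the array is /dev/md1 as the process name md1_raid5 suggests:)

mdadm --detail /dev/md1 | grep -i chunk   # reports the chunk size actually in use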




  #6  
April 23rd 09, 03:37 PM posted to comp.os.linux.development.system,comp.arch.storage
kkkk

David Brown wrote:
If you have so many disks connected, did you consider having at least
one as a hot spare?


Of course! We have 4 spares shared among all the arrays.

Secondly, did you consider raid 10 as an alternative?


I wouldn't expect the performance of raid 10 via MD to be higher than the
raid 6 of my original post (it might even be much slower at the same
number of drives) because, as I mentioned, the "md1_raid5" (raid 6
actually) process never goes above 35% CPU usage. As for
the read+checksum+write problem of raid 5/6 for small writes, there
shouldn't be any in this case because I am doing a sequential write.

Any writes that don't cover
a complete stripe will need reads from several of the disks,


Not the case here because I am doing sequential write.

Also, the overhead you mention only arises if the stripe is not in cache,
but with large amounts of RAM I expect the stripe to be in cache
(especially the stripes holding the file/directory metadata should
be... while the rest doesn't matter, as the write is sequential). Yesterday
during the tests the free RAM on that machine was 33GB out of a
total of 48GB...
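
(Note that md's raid 5/6 code also keeps its own stripe cache, separate from the page cache, and it is quite small by default; it can be inspected and enlarged through sysfs. A sketch, assuming the array is md1; the 8192 figure is just an example:)

cat /sys/block/md1/md/stripe_cache_size      # default is 256 entries; memory used is roughly size * 4KB * number of member disks
echo 8192 > /sys/block/md1/md/stripe_cache_size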
  #7  
April 24th 09, 07:59 AM posted to comp.os.linux.development.system,comp.arch.storage
David Brown[_2_]

kkkk wrote:
David Brown wrote:
If you have so many disks connected, did you consider having at least
one as a hot spare?


Of course! We have 4 spares shared among all the arrays.


You didn't mention it, so I thought I'd check, since I don't know your
background or experience. I've heard of people using raid 6 because
then they don't need hot spares - the array will effectively run as raid
5 until they replace the dud drive...

Secondly, did you consider raid 10 as an alternative?


I wouldn't expect performances of raid 10 via MD to be higher than the
raid-6 of my original post (and might even be much slower at the same
number of drives) because, as I mentioned, the "md1_raid5" (raid-6
actually) process never goes higher than 35% CPU occupation. Regarding
the read+checksum+write problem of raid5/6 for small writes, there
shouldn't be any in this case because I am doing a sequential write.


Linux raid 10 with "far" layout normally gives sequential read
performance around equal to a pure striped array. It will be a little
faster than raid 6 for the same number of drives, but not a huge
difference (with "near" layout raid 10, you'll get much slower
sequential reads). Sequential write for raid 10 will be a little slower
than for raid 6 (since you are not CPU bound at all). But random
writes, especially of small sizes, will be much better, as will the
performance of multiple simultaneous reads (sequential or random). Of
course this depends highly on your workload, and is based on how the
data is laid out on the disk.
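
(A minimal sketch of creating such an array with the "far 2" layout -- the device names and array name are placeholders:)

mdadm --create /dev/md2 --level=10 --layout=f2 --raid-devices=12 /dev/sd[a-l]1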

Where you will really see the difference is when a disk fails and
you are running in degraded mode and rebuilding. Replacing a disk and
rebuilding takes about a tenth of the disk activity with raid 10 compared with
raid 6 - it only needs to read through a single disk to do the copy.
With raid 6, the rebuild involves reading *all* the data off *all* the
other disks. And according to some articles I've read, the chance of
hitting an unrecoverable read error on some sector during such a rebuild with many
large disks is very high, effectively leading to a second disk failure. This is, of
course, totally independent of whether you are using software raid or
(as others suggest) hardware raid.

It looks quite likely that your performance issues are some sort of IO
bottleneck, but I don't have the experience to help here.

Any writes that don't cover a complete stripe will need reads from
several of the disks,


Not the case here because I am doing sequential write.

Also, the overhead you mention is present if the stripe is not in cache,
but with large amounts of RAM I expect the stripe should be in cache
(especially the stripe related to the file/directory metadata should
be... while the rest doesn't matter as it is sequential). Yesterday
during the tests the free amount of RAM was 33GB on that machine over a
total of 48GB...


You're right here - caching the data will make a very big difference.
And this could be an area where software raid on Linux will do much
better than hardware raid on the card - the software raid can use all of
that 48 GB for such caching, not just the memory on the raid card.

Thanks for your comments - as I said, I'm learning about this myself
(currently mostly theory - when I get the time, I can put it into practice).
  #8  
April 24th 09, 02:34 AM posted to comp.os.linux.development.system,comp.arch.storage
Bill Todd

kkkk wrote:

....

Some interesting questions. Perhaps you're getting slammed by people
for your configuration choices (which do not seem unreasonable given the
also-not-unreasonable goals that you say drove them) because they're
embarrassed to admit that they have no idea what the answers to those
questions are (which is too bad, because people like me would find
correct answers to them interesting).

Calypso seems especially ignorant when talking about optimal RAID group
sizes. Perhaps he's confusing RAID-5/6 with RAID-3 - but even then he'd
be wrong, since what you really want with RAID-3 is for the total *data*
content (excluding parity) of a stripe to be a convenient value, meaning
that you tend to favor group sizes like 5 or 9 (not counting any spares
that may be present). And given that you've got both processing power
and probably system/memory bus bandwidth to burn, there's no reason why
a software RAID-6 implementation shouldn't perform fairly competitively
with a hardware one.

That said, it's possible that the Linux system file cache interacts
poorly with md in terms of how it destages data to the array - e.g., if
it hands data to md in chunks that don't correspond to a full stripe set
of data to write out (I'm assuming without looking at the code that md
implements RAID-6 such that it can write out a full group of stripes
without having to read in anything) *and* doesn't tell md that the write
is lazy (allowing md to accumulate data in its own buffers until a
convenient amount has arrived - assuming that they're large enough) then
even sequential writes could get pretty expensive (as you seem to be
seeing). A smart implementation might couple the file cache with md
such that no such copy operation was necessary at all, but that would
tend to complicate the layering interface.

Or it's conceivable that ext3's journaling is messing you up,
particularly if you've got the journal on the RAID-6 LUN. If you don't
need the protection of journaling, try using ext2; if you do need it,
make sure that the journal isn't parity-protected (e.g., mirror it
instead of just adding it to the RAID-6 LUN).
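
(A rough sketch of that second option -- all device names are hypothetical; the filesystem must be unmounted, the internal journal is dropped first, and the journal then lives on a small raid 1 mirror instead of the raid 6 LUN:)

mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdm1 /dev/sdn1
mke2fs -O journal_dev /dev/md3               # format the mirror as an external journal device
tune2fs -O ^has_journal /dev/vg0/storage     # remove the internal journal
tune2fs -J device=/dev/md3 /dev/vg0/storage  # attach the external one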

I did spend a few minutes in Google trying to find detailed information
about md's RAID-6 implementation but got nowhere. Perhaps its
developers think that no one who isn't willing to read the code has any
business trying to understand its internals - though that attitude would
be difficult to justify in the current case given that they didn't seem
to do a very good job of providing the performance that one might
reasonably expect from a default set-up.

- bill
  #9  
April 24th 09, 08:16 AM posted to comp.os.linux.development.system,comp.arch.storage
David Brown[_2_]

Bill Todd wrote:
kkkk wrote:

...

Some interesting questions. Perhaps you're getting slammed by people
for your configuration choices (which do not seem unreasonable given the
also-not-unreasonable goals that you say drove them) because they're
embarrassed to admit that they have no idea what the answers to those
questions are (which is too bad, because people like me would find
correct answers to them interesting).

Calypso seems especially ignorant when talking about optimal RAID group
sizes. Perhaps he's confusing RAID-5/6 with RAID-3 - but even then he'd
be wrong, since what you really want with RAID-3 is for the total *data*
content (excluding parity) of a stripe to be a convenient value, meaning
that you tend to favor group sizes like 5 or 9 (not counting any spares
that may be present). And given that you've got both processing power
and probably system/memory bus bandwidth to burn, there's no reason why
a software RAID-6 implementation shouldn't perform fairly competitively
with a hardware one.

That said, it's possible that the Linux system file cache interacts
poorly with md in terms of how it destages data to the array - e.g., if
it hands data to md in chunks that don't correspond to a full stripe set
of data to write out (I'm assuming without looking at the code that md
implements RAID-6 such that it can write out a full group of stripes
without having to read in anything) *and* doesn't tell md that the write
is lazy (allowing md to accumulate data in its own buffers until a
convenient amount has arrived - assuming that they're large enough) then
even sequential writes could get pretty expensive (as you seem to be
seeing). A smart implementation might couple the file cache with md
such that no such copy operation was necessary at all, but that would
tend to complicate the layering interface.

Or it's conceivable that ext3's journaling is messing you up,
particularly if you've got the journal on the RAID-6 LUN. If you don't
need the protection of journaling, try using ext2; if you do need it,
make sure that the journal isn't parity-protected (e.g., mirror it
instead of just adding it to the RAID-6 LUN).


An alternative to consider, especially if you are working mainly with
large files, is xfs rather than ext3. xfs works better with large files
(mainly due to its support for extents), and has good support for
working with raid (it aligns its data and structures with the raid
stripes).
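
(A sketch of creating it with the stripe geometry made explicit -- assuming a 64K chunk and 10 data disks; adjust su/sw to the real geometry, and the device name is hypothetical:)

mkfs.xfs -d su=64k,sw=10 /dev/vg0/storage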

I did spend a few minutes in Google trying to find detailed information
about md's RAID-6 implementation but got nowhere. Perhaps its
developers think that no one who isn't willing to read the code has any
business trying to understand its internals - though that attitude would
be difficult to justify in the current case given that they didn't seem
to do a very good job of providing the performance that one might
reasonably expect from a default set-up.


There is a lot more information about linux raid5 than raid6. I think
that reflects usage. Raid 6 is typically used when you have a larger
number of drives - say, 8 or more. People using such large arrays are
much more likely to be looking for higher-end solutions with strong
support contracts, and are thus more likely to be using something with
high-end hardware raid cards. Raid 5 needs only 3 disks, and is a very
common solution for small servers. If you search around for
configuration how-tos, benchmarks, etc., you'll find relatively few that
have more than 4 disks, and therefore few that use raid 6. There's also
a trend (so I've read) towards raid 10 (whether it be linux raid10, or
standard raid 1 + 0) rather than raid 5/6 because of better recovery.


- bill

  #10  
April 24th 09, 10:23 AM posted to comp.os.linux.development.system,comp.arch.storage
kkkk

David Brown wrote:

I did spend a few minutes in Google trying to find detailed
information about md's RAID-6 implementation but got nowhere.

...
There is a lot more information about linux raid5 than raid6.


You mean on *Linux MD* raid5? That could be good. Where?

Raid 6 algorithms are practically equivalent to raid 5, except for the parity
computation, obviously.
 



