#1
Unimpressive performance of large MD raid
Hi there,
we have a big "storage" computer: dual Xeon (8 cores total) with lots of disks, partly connected to a 3ware 9650SE controller and partly to the SATA/SAS controllers on the motherboard. The hard disks are Western Digital WD7500AYYS 750GB. We are using an ext3 filesystem with default mount options on top of LVM + MD RAID 6. The RAID 6 spans 12 disks (so 10 disks of data, 2 of parity). 6 of those disks go through the mobo controller, the others through the 3ware.

I hoped I would get something like 1 GB/sec sequential write on 10 data disks :-P Instead I see MUCH lower performance and I can't understand where the bottleneck is!

In sequential read, with separate instances of "dd" (one per drive, reading directly from the block devices), I can reach at least 800 MB/sec with no problem (I can probably go much higher, I just have not tried). So I would exclude a bus bandwidth problem (it's PCI-Express in any case, and the 3ware is on an 8x slot).

Here are my write results. I am writing a sequential 14GB file with dd:

    time dd if=/dev/zero of=zerofile count=28160000 conv=notrunc ; time sync

(The throughput I report is not the one printed by dd: it is adjusted by hand after also timing the sync, so it is close to the real throughput. I confirm the drive LEDs are off after sync finishes.) There is no other I/O activity. The disk scheduler is deadline for all drives.

All caches enabled, on both the 3ware and the disks attached to the mobo:
    first write = 111 MB/sec
    overwrite   = 194 MB/sec

Cache enabled only on the disks connected to the mobo (6 of 12):
    first write = 95 MB/sec
    overwrite   = 120 MB/sec

Cache disabled everywhere (this takes an incredibly long time to do the final flush):
    first write = 63 MB/sec
    overwrite   = 75 MB/sec

I have watched what happens in top and htop. Htop reports LOTS of red bars (iowait?) - practically 50% red bars on every core (8 cores). Here is what happens in a few of those situations:

- Caches all enabled, overwrite: dd is constantly at 100% CPU (question: shouldn't it be at ~0% CPU, always waiting on blocking I/O?). Depending on the moment, either kjournald or pdflush is at about 75%; most of the time it is kjournald. md1_raid5 (RAID 6 in fact) is around 35%.

- Caches all enabled, first write: like above, but there are often moments in which neither kjournald nor pdflush is running, hence the speed difference. dd is always near 100% CPU.

- Cache only on disks attached to the mobo, overwrite: similar to "caches all enabled, overwrite", except that dd never reaches 100%; it is around 40%, and the other processes are down accordingly, hence the lower speed. There are more red bars shown in htop, on all cores.

- Cache only on disks attached to the mobo, first write: dd reaches 100% but kjournald reaches 40% max. pdflush reaches 15% max. md1_raid5 is down to about 15%.

- Caches all disabled, overwrite: dd reaches about 30%, kjournald is 20% max and md1_raid5 reaches 10% max. Actually dd alone even reaches 100%, but only in the first 20 seconds or so, and at that time kjournald and md1_raid5 are still at 20% and 10%.

- Caches all disabled, first write: similar to the above.

So I don't understand how this works. I don't understand why dd's CPU is at 100% (caches on) instead of ~0%. I don't understand why kjournald doesn't go to 100%. I don't understand what kjournald has to do in the overwrite case (there is no significant journal traffic on overwrites, right? I am using defaults, which should be data=ordered). And I don't understand why the caches change sequential-write performance so much.
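For what it's worth, a benchmark that forces the data to disk before dd exits (so the number doesn't need adjusting by hand afterwards) might look like this - the block size and count here are just assumptions, not what I actually ran:

    # sequential write; dd flushes before reporting throughput (sketch)
    dd if=/dev/zero of=zerofile bs=1M count=14000 conv=fdatasync

    # or bypass the page cache entirely with direct I/O
    dd if=/dev/zero of=zerofile bs=1M count=14000 oflag=direct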
Also, a question: if I had the most powerful hardware RAID, would performance be limited to ~200 MB/sec anyway because of kjournald?

Then I have another question: "sync" from bash really seems to work, in the sense that it takes time, and after that time I confirm that the drive activity LEDs are really off. But I have MD RAID 6 + LVM here! Weren't both MD RAID 5/6 AND LVM supposed NOT to pass write barriers downstream to the disks? Doesn't sync use exactly those barriers (implemented with device cache flushes)? Yet sync here seems to work!

Thanks for your help
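P.S. One way to check whether barriers are actually in effect on this ext3-on-LVM-on-md stack is to look at the mount options and the kernel log. Exact option names and messages vary by kernel version, so treat this as a sketch only:

    # is the filesystem mounted with barriers? (the option may not be
    # listed at all if the default is in use)
    grep barrier /proc/mounts

    # did the kernel report barriers being refused by md or LVM?
    dmesg | grep -i barrier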
#2
Unimpressive performance of large MD raid
On Apr 22, 12:29 pm, kkkk wrote:
> we have a big "storage" computer: dual Xeon (8 cores total) with lots
> of disks partly connected to a 3ware 9650SE controller and partly to
> the SATA/SAS controllers in the mobo. The hard disks are Western
> Digital WD7500AYYS 750GB. We are using an ext3 filesystem with defaults
> mount on top of LVM + MD raid 6. The raid-6 is on 12 disks (hence it is
> 10 disks for data, 2 for parity). 6 of those disks are through the mobo
> controller, the others are through the 3ware.

How mind-blowingly awful. I'm sure there must be some set of circumstances that justifies such a ridiculously poor setup, but I don't know what they could possibly be. How do you justify buying a high-end RAID 6 controller and then not using its RAID capabilities?

DS
#3
Unimpressive performance of large MD raid
#4
Unimpressive performance of large MD raid
David Schwartz wrote:
> How mind-blowingly awful. I'm sure there must be some set of
> circumstances that justifies such a ridiculously poor setup, but I
> don't know what they could possibly be.

Read my other reply.

> How do you justify buying a high-end RAID 6 controller and then not
> using its RAID capabilities?

Had we found a cheap non-RAID 16-drive SATA controller for PCI-Express, we would have bought it. If you know of any, please tell me.

Thank you
#5
Unimpressive performance of large MD raid
kkkk wrote:
> Hi there, we have a big "storage" computer: dual Xeon (8 cores total)
> with lots of disks partly connected to a 3ware 9650SE controller and
> partly to the SATA/SAS controllers in the mobo. The hard disks are
> Western Digital WD7500AYYS 750GB. We are using an ext3 filesystem with
> defaults mount on top of LVM + MD raid 6. The raid-6 is on 12 disks
> (hence it is 10 disks for data, 2 for parity). 6 of those disks are
> through the mobo controller, the others are through the 3ware.

I understand entirely your reasons for wanting to use Linux software raid rather than a hardware raid. But I've a couple of other points or questions - as much for my own learning as anything else.

If you have so many disks connected, did you consider having at least one as a hot spare? If one of your disks dies and it takes time to replace it, the system will be very slow while running degraded.

Secondly, did you consider raid 10 as an alternative? Obviously it is less efficient in terms of disk space, but it should be much faster. It may also be safer (depending on the likely rates of different kinds of failures) since there is no "raid 5 write hole".

Raid 6, on the other hand, is probably the slowest raid choice. Any writes that don't cover a complete stripe will need reads from several of the disks, followed by parity calculations - and the more disks you have, the higher the chance of hitting such incomplete stripe writes.

http://www.enterprisenetworkingplanet.com/nethub/article.php/10950_3730176_1
#6
Unimpressive performance of large MD raid
David Brown wrote:
> If you have so many disks connected, did you consider having at least
> one as a hot spare?

Of course! We have 4 spares shared among all the arrays.

> Secondly, did you consider raid 10 as an alternative?

I wouldn't expect the performance of raid 10 via MD to be higher than the raid-6 of my original post (it might even be much slower at the same number of drives) because, as I mentioned, the "md1_raid5" (raid-6 actually) process never goes above 35% CPU occupation. Regarding the read+checksum+write problem of raid5/6 for small writes, there shouldn't be any in this case because I am doing a sequential write.

> Any writes that don't cover a complete stripe will need reads from
> several of the disks,

Not the case here, because I am doing a sequential write. Also, the overhead you mention is only present if the stripe is not in cache, but with large amounts of RAM I expect the stripe to be in cache (especially the stripes related to the file/directory metadata... the rest doesn't matter as it is sequential). Yesterday during the tests the free RAM on that machine was 33GB out of a total of 48GB...
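On the subject of stripe caching: md's raid5/raid6 personality has its own stripe cache, and the default is small. I have not measured the effect here, and the device name is an assumption (the md1_raid5 thread name suggests the array is /dev/md1), so this is only a sketch:

    # current stripe cache size, in entries per array (default is often 256)
    cat /sys/block/md1/md/stripe_cache_size

    # try a larger cache, e.g. 8192 entries; memory used is roughly
    # entries * 4KB * number of member disks, so ~384MB for 12 disks
    echo 8192 > /sys/block/md1/md/stripe_cache_size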
#7
Unimpressive performance of large MD raid
> Had we found a cheap no-raid 16 drives SATA controller for PCI-Express
> we would have bought it. If you know of any, please tell me.

The standard rule of thumb is "good, fast, cheap" - pick any two. If you want reasonably good and cheap, you're not going to get fast. I've seen cheap and fast implementations, but they weren't any good - I had the pleasure of recovering a corrupt 25TB file system over my Christmas break. We've since replaced it with a more expensive but good solution.
#8
Unimpressive performance of large MD raid
kkkk wrote:
....

Some interesting questions. Perhaps you're getting slammed by people for your configuration choices (which do not seem unreasonable given the also-not-unreasonable goals that you say drove them) because they're embarrassed to admit that they have no idea what the answers to those questions are (which is too bad, because people like me would find correct answers to them interesting).

Calypso seems especially ignorant when talking about optimal RAID group sizes. Perhaps he's confusing RAID-5/6 with RAID-3 - but even then he'd be wrong, since what you really want with RAID-3 is for the total *data* content (excluding parity) of a stripe to be a convenient value, meaning that you tend to favor group sizes like 5 or 9 (not counting any spares that may be present).

And given that you've got both processing power and probably system/memory bus bandwidth to burn, there's no reason why a software RAID-6 implementation shouldn't perform fairly competitively with a hardware one.

That said, it's possible that the Linux system file cache interacts poorly with md in terms of how it destages data to the array. E.g., if it hands data to md in chunks that don't correspond to a full stripe set of data to write out (I'm assuming, without looking at the code, that md implements RAID-6 such that it can write out a full group of stripes without having to read anything in) *and* doesn't tell md that the write is lazy (which would allow md to accumulate data in its own buffers until a convenient amount has arrived - assuming those buffers are large enough), then even sequential writes could get pretty expensive (as you seem to be seeing). A smart implementation might couple the file cache with md such that no such copy operation was necessary at all, but that would tend to complicate the layering interface.

Or it's conceivable that ext3's journaling is messing you up, particularly if you've got the journal on the RAID-6 LUN. If you don't need the protection of journaling, try using ext2; if you do need it, make sure that the journal isn't parity-protected (e.g., mirror it instead of just adding it to the RAID-6 LUN).

I did spend a few minutes in Google trying to find detailed information about md's RAID-6 implementation, but got nowhere. Perhaps its developers think that no one who isn't willing to read the code has any business trying to understand its internals - though that attitude would be difficult to justify in the current case, given that they didn't seem to do a very good job of providing the performance that one might reasonably expect from a default set-up.

- bill
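P.S. For concreteness, moving the ext3 journal off the RAID-6 LUN and onto a small mirror might look roughly like this. Device and volume names are made up, the filesystem must be unmounted, and this is a sketch rather than a tested recipe:

    # create a small RAID-1 to hold the journal (hypothetical partitions)
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdm1 /dev/sdn1
    mke2fs -O journal_dev /dev/md2

    # detach the internal journal, then attach the external one
    tune2fs -O ^has_journal /dev/vg0/lv_storage
    tune2fs -j -J device=/dev/md2 /dev/vg0/lv_storage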
#9
Unimpressive performance of large MD raid
On Apr 23, 3:51 am, kkkk wrote:
> Anyway since linux MD raid never occupies more than 35% CPU (of a
> single core!) in any test, I don't think it is the bottleneck. But
> this is part of my question.

It is the bottleneck - it's just not a CPU bottleneck, it's an I/O bottleneck. The problem is simply the number of I/Os the system has to issue. With a 12-disk RAID 6 array implemented in software, a write of a single byte (admittedly the worst case) will require 10 reads followed by 12 writes, and the writes cannot start until all 10 reads complete. Each of these operations has to be started and completed by the MD driver.

I understand the reasoning behind your configuration choices; they just utterly sacrifice performance.

DS
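A practical follow-on to that point: in principle md can skip those pre-reads when a write covers a whole stripe, so issuing writes in full-stripe multiples should avoid the read-modify-write penalty. Whether ext3 and the page cache actually hand data to md that way is a separate question, the chunk size below is only an assumed example, and this is a sketch:

    # report the array's chunk size (suppose it prints 64K)
    mdadm --detail /dev/md1 | grep -i chunk

    # with 10 data disks and 64K chunks, one full stripe holds 640K of data;
    # write in full-stripe multiples, bypassing the page cache
    dd if=/dev/zero of=zerofile bs=640k count=20480 oflag=direct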
#10
Unimpressive performance of large MD raid
kkkk wrote:
> David Brown wrote:
>> If you have so many disks connected, did you consider having at least
>> one as a hot spare?
>
> Of course! We have 4 spares shared among all the arrays.

You didn't mention it, so I thought I'd check, since I don't know your background or experience. I've heard of people using raid 6 because then they don't need hot spares - the array will effectively run as raid 5 until they replace the dud drive...

>> Secondly, did you consider raid 10 as an alternative?
>
> I wouldn't expect performances of raid 10 via MD to be higher than the
> raid-6 of my original post (and might even be much slower at the same
> number of drives) because, as I mentioned, the "md1_raid5" (raid-6
> actually) process never goes higher than 35% CPU occupation. Regarding
> the read+checksum+write problem of raid5/6 for small writes, there
> shouldn't be any in this case because I am doing a sequential write.

Linux raid 10 with "far" layout normally gives sequential read performance about equal to a pure striped array. It will be a little faster than raid 6 for the same number of drives, but not a huge difference (with "near" layout raid 10, you'll get much slower sequential reads). Sequential write for raid 10 will be a little slower than for raid 6 (since you are not CPU-bound at all). But random writes, especially small ones, will be much better, as will the performance of multiple simultaneous reads (sequential or random). Of course this depends highly on your workload, and on how the data is laid out on the disk.

Where you will really see the difference is when a disk fails and you are running in degraded mode and rebuilding. Replacing a disk and rebuilding takes roughly a tenth of the disk activity with raid 10 compared to raid 6 - it only needs to read through a single disk to do the copy. With raid 6, the rebuild involves reading *all* the data off *all* the other disks. And according to some articles I've read, the chance of hitting an unrecoverable sector read error during such a rebuild with many large disks is very high, leading to a second disk failure. This is, of course, totally independent of whether you are using software raid or (as others suggest) hardware raid.

It looks quite likely that your performance issues are some sort of I/O bottleneck, but I don't have the experience to help here.

>> Any writes that don't cover a complete stripe will need reads from
>> several of the disks,
>
> Not the case here because I am doing sequential write. Also, the
> overhead you mention is present if the stripe is not in cache, but with
> large amounts of RAM I expect the stripe should be in cache (especially
> the stripe related to the file/directory metadata should be... while
> the rest doesn't matter as it is sequential). Yesterday during the
> tests the free amount of RAM was 33GB on that machine over a total of
> 48GB...

You're right here - caching the data will make a very big difference. And this could be an area where software raid on Linux does much better than hardware raid on the card - the software raid can use all of that 48 GB for caching, not just the memory on the raid card.

Thanks for your comments - as I said, I'm learning about this myself (currently mostly theory - when I get the time, I'll put it into practice).
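For reference, creating a far-layout raid 10 with mdadm looks roughly like this. Device names and chunk size are placeholders rather than a recommendation for this particular box, so treat it as a sketch:

    # 12-disk raid 10 with "far 2" layout (hypothetical device names)
    mdadm --create /dev/md3 --level=10 --layout=f2 --chunk=256 \
          --raid-devices=12 /dev/sd[b-m]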