#21
Unimpressive performance of large MD raid
kkkk wrote:
> In my case dd pushes 5 seconds of data before the disks start writing (dirty_writeback_centisecs = 500). dd always stays at least 5 seconds ahead of the writes. This should fill all stripes completely, causing no reads. I even tried raising dirty_writeback_centisecs, with no measurable performance benefit. Where is this 5 seconds of data stored? Is it at the ext3 layer, or at the LVM layer (I doubt this one; I also notice there is no LVM kernel thread running), or at the MD layer?

Most likely it's in the page cache, above both layers.

> Why do you think dd stays at 100% CPU (with the disk/3ware caches enabled)? Shouldn't that be 0%? Do you think the CPU is high due to a memory-copy operation? If it were that, I would expect dd from /dev/zero to /dev/null to go at 200MB/sec; instead it goes at 1.1GB/sec (with 100% CPU occupation indeed, 65% of which is in kernel mode). That would mean the number of copies performed by dd while copying to the ext3 RAID is five times greater than when copying from /dev/zero to /dev/null. Hmmm... a bit difficult to believe. There must be other work performed in the ext3 case that hogs the CPU. Is the ext3 code running within the dd process when dd writes?

Copying from /dev/zero to /dev/null is a special case, as it doesn't have to do any filesystem work; it's basically measuring memory bandwidth. When copying to an actual file there will be work to arrange the filesystem, allocate disk blocks, etc. I wouldn't have expected it to happen within the context of the dd process, but I'm not a filesystem guy.

> Hmm, probably not, because kjournald had significant CPU occupation. What is the role of the journal during file overwrites?

I suspect the journal will be involved on any filesystem access. Just curious: how is your ext3 filesystem configured for data journalling (journal/ordered/writeback)? Have you tried mounting it with "noatime"?

Lastly, in your original email you asked about "sync".
When run from the command line, that command simply flushes all filesystem changes out to disk and waits for that process to complete. Depending on the disk, the data may or may not have actually hit the platters by the time sync returns.

Chris
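The 5-second figure kkkk mentions comes from the kernel's writeback tunables, which live under /proc/sys/vm on Linux. A quick sketch of how to inspect the knobs discussed above (values shown will vary per system):

```shell
# Inspect the page-cache writeback tunables (Linux).
cat /proc/sys/vm/dirty_writeback_centisecs   # how often the flusher threads wake up
cat /proc/sys/vm/dirty_expire_centisecs      # how old dirty data must be before writeback
cat /proc/sys/vm/dirty_ratio                 # % of memory dirty before writers are throttled
```

Raising dirty_writeback_centisecs only delays when flushing starts; it does not change the array's sustained write throughput, which matches kkkk's observation that tuning it made no measurable difference.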
#22
Unimpressive performance of large MD raid
In comp.os.linux.development.system kkkk wrote:
> Why do you think dd stays at 100% CPU (with the disk/3ware caches enabled)? Shouldn't that be 0%? Do you think the CPU is high due to a memory-copy operation? If it were that, I would expect dd from /dev/zero to /dev/null to go at 200MB/sec; instead it goes at 1.1GB/sec (with 100% CPU occupation indeed, 65% of which is in kernel mode). That would mean the number of copies performed by dd while copying to the ext3 RAID is five times greater than when copying from /dev/zero to /dev/null. Hmmm... a bit difficult to believe. There must be other work performed in the ext3 case that hogs the CPU. Is the ext3 code running within the dd process when dd writes?

I did not check the kernel code, but logically, writing to /dev/null you do not need to copy data, so normally I would expect about 2 times more copying. I would try the bs parameter to dd. For example, on my machine

dd if=/dev/zero of=/dev/null count=1000000

needs 0.560571s, while

dd if=/dev/zero of=/dev/null count=100000 bs=10240

(which copies twice as much data) needs 0.109896s. By default dd uses a 512-byte block, which means you do a lot of system calls (each block is copied using a separate call to read and write). And yes, when dd is doing a system call, work done in the kernel is accounted as work done by dd. That includes many operations done by ext3 (some work is done by kernel threads, and some is done from interrupts and accounted to whatever process is running at the given time).

Coming back to dd CPU usage: as long as there is enough space to buffer the write, dd should have 100% CPU utilization. Simply put, dd is copying data to kernel buffers as fast as it can. Once the kernel buffers are full, dd should block; however, what you wrote suggests that you have enough memory to buffer the whole write. Using large blocks dd should be faster than the disks, but for small blocks the cost of system calls may be high (and it does not help that you have many cores, because dd is single-threaded and much of the kernel work is done in the same thread).
-- Waldek Hebisch
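Waldek's point about system-call overhead is easy to reproduce: copying the same order of data null-to-null with two block sizes shows the per-call cost dominating at bs=512. A minimal sketch (the count values are arbitrary):

```shell
# Same data volume, two block sizes: many small syscalls vs. few large ones.
time dd if=/dev/zero of=/dev/null bs=512 count=2000000   # ~1 GB in 512 B chunks
time dd if=/dev/zero of=/dev/null bs=1M  count=1000      # ~1 GB in 1 MiB chunks
```

On most machines the second command finishes several times faster, even though it moves the same amount of data, because each read/write pair crosses the kernel boundary once per block.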
#27
Unimpressive performance of large MD raid
kkkk wrote:
> I am writing a sequential 14GB file with dd: time dd if=/dev/zero of=zerofile count=28160000 conv=notrunc ; time sync

What's your block size for dd? I'm guessing it's the default 512 bytes from your figures above, so you're doing lots of little writes. What happens with a much bigger block size, such as 1MB or more?

Guy
--
Guy Dawson   I.T. Manager   Crossflight Ltd
#28
Unimpressive performance of large MD raid
kkkk wrote:
> This guy http://lists.freebsd.org/pipermail/f...er/005170.html is doing basically the same as I am doing, with software RAID done with ZFS in FreeBSD (RAID-Z2 is basically RAID-6), writing and reading 10GB files. His results are a heck of a lot better than mine with default settings, and not very distant from the bare hard disks' throughput (he seems to get about 50MB/sec per non-parity disk). This shows that software RAID is indeed capable of doing good stuff in theory. Just Linux MD + ext3 seems to have some performance problems :-(

The key line in that link is "dd bs=1m" for a 10GB file. Note the 1MB block size setting for his test. Waldek Hebisch's post makes the same point about block size too.

Guy
--
Guy Dawson   I.T. Manager   Crossflight Ltd
#29
Unimpressive performance of large MD raid
In comp.os.linux.development.system kkkk wrote:
> We are using an ext3 filesystem with default mount options on top of LVM + MD RAID 6. I am writing a sequential 14GB file with dd: time dd if=/dev/zero of=zerofile count=28160000 conv=notrunc ; time sync

Try using /dev/md/X directly as the target for dd, to keep filesystem overhead out of your measurement. (Please note that your filesystem will be destroyed by this.)
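The raw-device test would look something like the sketch below; /dev/md0 is a hypothetical name (check /proc/mdstat for the real one). A non-destructive variant writes an ordinary file with conv=fdatasync, so the flush is included in the measured time instead of being hidden in the page cache:

```shell
# DESTRUCTIVE: writes directly over the array, bypassing LVM and the filesystem.
# /dev/md0 is a hypothetical device name -- check /proc/mdstat for yours.
# dd if=/dev/zero of=/dev/md0 bs=1M count=10000

# Non-destructive alternative: write a file and make dd flush it before exiting,
# so the reported rate reflects the disks rather than the page cache.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256 conv=fdatasync
rm -f /tmp/ddtest
```

conv=fdatasync is a GNU dd option; on older dd builds, "time dd ... ; time sync" (as kkkk is already doing) achieves the same end.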
#30
Unimpressive performance of large MD raid
Hi everybody,
Thanks for your suggestions. I have seen the suggestions by Guy and Patrick to raise the bs for dd. I had already tried various values for this, up to a very large value, and I even tried the exact bs value that would fill one complete RAID stripe in one write: no measurable performance improvement.

Regarding remounting the partition with data=writeback, I will try this ASAP (possibly tomorrow: I need to find a moment when nobody is using the machine).

Regarding dd directly to the raw block device, I will also try this ASAP. Luckily I have an unused LVM device located on the same MD RAID 6.

Stay tuned... check back in 1-2 days. Thanks everybody for your help.
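For reference, the full-stripe bs kkkk mentions is just the MD chunk size times the number of data disks. The geometry below is hypothetical (64 KiB chunks, 14 data disks, i.e. a 16-drive RAID-6); substitute the values from mdadm --detail for the real array:

```shell
# Hypothetical geometry: 64 KiB chunk, 16-drive RAID-6 => 14 data disks per stripe.
chunk_bytes=$((64 * 1024))
data_disks=14
stripe_bytes=$((chunk_bytes * data_disks))
echo "$stripe_bytes"   # prints 917504: bytes in one full stripe, a candidate dd bs
```

Writes sized and aligned to a full stripe let RAID-6 compute parity without first reading the old data and parity blocks, which is why stripe-sized bs is worth testing even when, as here, it turns out not to be the bottleneck.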