From: Bill Todd
Newsgroups: comp.os.linux.development.system, comp.arch.storage
Date: April 26th 09, 03:37 AM
Subject: Unimpressive performance of large MD raid

lid wrote:
> In comp.arch.storage Bill Todd writes:
> So basically, if I say that the stripe segment is 64 KB, does it mean that
> when I write 512 KB of data and have 12 drives, I simply use 8 drives at a
> time, and the remaining 4 drives are not used (forgetting about the parity
> drives for now)?


In a conventional RAID-5 you can allocate and write to space at the same
granularity that you can on a disk (currently a single 512-byte sector).
The sectors are numbered consecutively within each stripe segment,
continuing across each stripe (leaving out the parity segment), and then
continuing on to the next stripe. So if you read or write a 512 KB
request (aligned to the start of a stripe segment, rather than just to an
arbitrary single-sector boundary) it will indeed hit 8 logically
consecutive 64 KB stripe segments.
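
To make that numbering concrete, here is a minimal C sketch (purely
illustrative, not the MD driver's actual layout code) that maps a logical
sector onto a 12-drive RAID-5 with 64 KB segments, assuming a simple
rotating placement of the parity segment; the drive count, segment size,
and function names are just stand-ins:

/* Purely illustrative sketch of conventional RAID-5 sector numbering;
 * not the MD driver's actual code.  Parity rotates by one drive per
 * stripe; data segments fill the remaining drives in order. */
#include <stdio.h>

#define N_DISKS         12      /* total drives in the array             */
#define SEGMENT_SECTORS 128     /* 64 KB stripe segment / 512 B sectors  */

struct raid5_loc {
    unsigned disk;              /* which drive the sector lives on       */
    unsigned stripe;            /* which stripe (row) across the drives  */
    unsigned offset;            /* sector offset within that segment     */
};

static struct raid5_loc map_sector(unsigned long long logical_sector)
{
    const unsigned data_disks = N_DISKS - 1;   /* one segment per stripe is parity */
    unsigned long long chunk  = logical_sector / SEGMENT_SECTORS;
    unsigned offset           = (unsigned)(logical_sector % SEGMENT_SECTORS);
    unsigned long long stripe = chunk / data_disks;
    unsigned index            = (unsigned)(chunk % data_disks);
    unsigned parity_disk      = (unsigned)(stripe % N_DISKS);

    struct raid5_loc loc;
    loc.stripe = (unsigned)stripe;
    loc.offset = offset;
    /* data segments skip over whichever drive holds this stripe's parity */
    loc.disk   = (index >= parity_disk) ? index + 1 : index;
    return loc;
}

int main(void)
{
    /* A 512 KB request aligned to the start of a stripe spans 8
     * consecutive 64 KB segments, i.e. 8 different data drives. */
    for (unsigned long long s = 0; s < 8ULL * SEGMENT_SECTORS; s += SEGMENT_SECTORS) {
        struct raid5_loc loc = map_sector(s);
        printf("sector %6llu -> disk %2u, stripe %u, offset %u\n",
               s, loc.disk, loc.stripe, loc.offset);
    }
    return 0;
}

Run as-is it shows the first eight 64 KB-aligned chunks landing on eight
different data drives of stripe 0, which is the 8-drives-at-a-time
behaviour asked about above.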


> So if you align the cache page size with the stripe size, can you benefit
> from it or not?


Maybe - mostly depending upon how much concurrency exists in your
workload (as seen by the disks).

If there's no concurrency at all (i.e., no request is submitted until
the previous one completes) you're best off using full-stripe writes
(and for that matter RAID-3 rather than RAID-5) - because there's no way
you can get more than a single disk's worth of IOPS out of the array and
you should ensure that nothing causes it to be even worse than that.
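
As a back-of-the-envelope sketch of that ceiling (the per-drive IOPS figure
below is an assumption for illustration, not a measurement):

/* Illustrative only: assumed per-drive IOPS, 12-drive RAID-5, 64 KB segments. */
#include <stdio.h>

int main(void)
{
    const double disk_iops  = 150.0;             /* assumed small-I/O rate of one drive */
    const int    data_disks = 11;                /* 12 drives minus one parity segment  */
    const double stripe_kb  = data_disks * 64.0; /* payload of one full stripe          */

    /* With only one request outstanding, every full-stripe write moves all
     * the actuators in lock-step, so the array completes about one disk's
     * worth of requests per second - but each request moves a whole stripe. */
    const double stripes_per_sec = disk_iops;
    const double throughput_mb   = stripes_per_sec * stripe_kb / 1024.0;

    printf("serialized full-stripe writes: ~%.0f req/s, ~%.0f MB/s\n",
           stripes_per_sec, throughput_mb);
    return 0;
}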

If there's a lot of concurrency, you're best off minimizing the
resources that each operation uses, which usually means large stripe
segment sizes - so large that even a single segment will tend to be
larger than most requests (thus the entire stripe will be *much* larger
than almost any request): that way, each read typically is satisfied by
one disk access and each write by 4 disk accesses spread across 2 disks
(3 accesses if the old data still happens to be in cache), so N reads or N/2
writes can proceed in parallel with virtually no worse latency for reads
and only about twice the latency for writes as would happen with
full-stripe RAID-3 accesses.

So you wind up with N times the potential read throughput with minimal
latency penalty and N/4 times the write throughput without dramatic
latency penalty - compared with the *no-load* full-stripe case. But if
having only effectively one disk's worth of IOPS for full-stripe
accesses would have resulted in significant request queuing delays, you
may have a lot *better* latency (as well as throughput) for both reads
and writes when you minimize the number of disks involved in each one by
using a large stripe segment size.
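
The same arithmetic for the concurrent, large-segment case, again with an
assumed per-drive figure purely for illustration:

/* Illustrative only: with a stripe segment much larger than a typical
 * request, each small read costs 1 disk access and each small write
 * costs 4 accesses on 2 disks (read old data, read old parity, write
 * new data, write new parity). */
#include <stdio.h>

int main(void)
{
    const int    n_disks   = 12;
    const double disk_iops = 150.0;                   /* assumed per-drive small-I/O rate */
    const double array_accesses = n_disks * disk_iops;

    const double reads_per_sec  = array_accesses / 1.0;  /* 1 access per read    */
    const double writes_per_sec = array_accesses / 4.0;  /* 4 accesses per write */

    printf("large segments, concurrent load: ~%.0f reads/s or ~%.0f writes/s\n",
           reads_per_sec, writes_per_sec);
    printf("serialized full-stripe baseline: ~%.0f requests/s\n", disk_iops);
    return 0;
}

With 12 drives this works out to roughly 12x the read rate and 3x the write
rate of the one-disk's-worth-of-IOPS baseline, i.e. the N and N/4 scaling
described above.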

With NVRAM for a stable write-back cache in your array controller it can
perform other optimizations - e.g., even if your writes aren't optimally
aligned it can in many situations gather them up in its NVRAM and later
issue them to the disks with more optimal alignment. The same can
happen with a software array controller and lazy writes if your
file-level cache is sufficiently intelligent to do that kind of thing
when presenting data to the array.
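
That gathering step can be sketched roughly as follows (a hypothetical
illustration, not any real controller's or MD's code); the idea is simply
to track which data segments of a stripe have been dirtied in NVRAM and
flush the stripe as one aligned full-stripe write once it is complete:

/* Hypothetical write-gathering sketch, not any real controller's code:
 * remember which data segments of a stripe have been dirtied in NVRAM
 * and flush the stripe as one aligned full-stripe write once complete. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DATA_SEGS 11            /* data segments per stripe (12 drives minus parity) */

struct stripe_buf {
    unsigned long long stripe;  /* which stripe these buffered segments belong to */
    bool dirty[DATA_SEGS];      /* which data segments have new data in NVRAM     */
};

/* Record a dirtied segment; return true once the whole stripe can be
 * written out in a single aligned pass (parity computed from the buffer). */
static bool note_write(struct stripe_buf *b, unsigned seg)
{
    b->dirty[seg] = true;
    for (int i = 0; i < DATA_SEGS; i++)
        if (!b->dirty[i])
            return false;
    return true;
}

int main(void)
{
    struct stripe_buf b = { .stripe = 7 };

    for (unsigned seg = 0; seg < DATA_SEGS; seg++) {
        if (note_write(&b, seg)) {
            printf("stripe %llu complete: issue one full-stripe write\n", b.stripe);
            memset(b.dirty, 0, sizeof b.dirty);
        }
    }
    return 0;
}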

> Let's say that you've got a 16 KB cache page size and 8 drives with a
> 2 KB stripe segment size... If you flush the cache, you basically write to
> all drives at once, right? But this situation can slow everything down,
> since you've got how many IOPS per write operation (8)?


Exactly.

....

In that example it would be more efficient to use a far larger stripe
segment. If, for example, you used a 512 KB stripe segment and the
access was aligned to that granularity you'd only need to read in the
old data and parity, XOR the old data with the new data, XOR the result
with the old parity, and write out the new data and the new parity: 4
disk accesses. Though each access would be 512 KB instead of 64 KB, that
would less than double its duration, resulting in less than 2/3 of the
total array overhead that the 12 accesses in the optimized (and aligned)
initial case above required. In both situations, unaligned
accesses increase the overhead, but for comparable lack of alignment the
large stripe segment still usually comes out ahead (and is at least as
efficient at servicing small requests, as explained below).
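
The read-modify-write path above boils down to the identity
new_parity = old_parity XOR old_data XOR new_data, costing two reads and
two writes regardless of segment size. A tiny self-contained C illustration
(segment size shrunk to a few bytes for readability):

/* Worked sketch of the small-update parity path: 2 reads + 2 writes,
 * whatever the segment size.  Buffers stand in for one segment each. */
#include <stdio.h>
#include <string.h>

#define SEG_BYTES 8             /* tiny stand-in for a 512 KB segment */

static void xor_into(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < SEG_BYTES; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    unsigned char old_data[SEG_BYTES]   = { 1, 2, 3, 4, 5, 6, 7, 8 };  /* disk read #1 */
    unsigned char old_parity[SEG_BYTES] = { 9, 9, 9, 9, 9, 9, 9, 9 };  /* disk read #2 */
    unsigned char new_data[SEG_BYTES]   = { 8, 7, 6, 5, 4, 3, 2, 1 };

    unsigned char new_parity[SEG_BYTES];
    memcpy(new_parity, old_parity, SEG_BYTES);

    /* new_parity = old_parity XOR old_data XOR new_data */
    xor_into(new_parity, old_data);
    xor_into(new_parity, new_data);

    /* then write new_data (disk write #1) and new_parity (disk write #2) */
    printf("new parity bytes:");
    for (int i = 0; i < SEG_BYTES; i++)
        printf(" %d", new_parity[i]);
    printf("\n");
    return 0;
}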


> Thinking..... So, you need to optimize the RAID controller's cache to
> gather changed data so that it can be written in one dump (utilizing all
> of the actuators at once)?


That would certainly be a useful optimization (just mentioned that above).

- bill