#31
Unimpressive performance of large MD raid
Hi all,
based on your suggestions I have been testing lots of things: xfs, ext2, raw LVM device reads and writes, the effect of bs in dd, disk schedulers, noatime... Interesting results are coming out, and not all of them good :-( I will post the details tomorrow. Thank you.
#32
Unimpressive performance of large MD raid (LONG!)
Hi all
here are the details. Actually most of the news is not good, but I appear to have found one bottleneck: LVM, see below. I confirm the numbers given previously. The following benchmarks were all measured with caches on, both for the disks on the motherboard controllers and for the disks behind the 3ware controller.

- Ext2 is not measurably faster than ext3 for this sequential write.
- Xfs is faster on the first write (147MB/sec vs 111MB/sec) but about the same speed on file rewrite (~185MB/sec).
- Writes directly to the raw LVM device located on the same MD raid-6 are AS FAST AS FILE REWRITES at ~183MB/sec!! So this maximum speed is not an overhead of the filesystem.

During the direct LVM write the md1_raid5 process runs at 35-50% CPU, the pdflush process at 40-80% CPU, and dd itself at ~80% CPU; in addition all cores are about 1/3 busy in kernel code (which gets accounted to these 3 and 5 more running processes), which I guess means they are servicing disk interrupts.

Mounting ext3 with noatime does not improve performance for this sequential write (quite reasonable). The default was relatime anyway, which is already well optimized.

Regarding the bs parameter in dd: for the first benchmarks I posted one week ago, I noticed it made no difference whether it was left unset (default = 512), set to the stripe size (160KB), or set very high (I usually used 5120000). That is why I didn't mention it; I supposed the elevator and/or the page cache was compensating for the small bs. However, in more recent tests it did make a difference SOMETIMES, and this "sometimes" is the strangest thing I met in my tests.

Sometimes with bs=512 write performance was really bad, around 35-50MB/sec; this happened on file overwrite. In these cases I could confirm with iotop that the dd speed was very variable, sometimes as low as 3MB/sec with brief spikes at 380MB/sec, averaging 35-50MB/sec over the whole 14GB write. I repeated this test many times while changing the scheduler for all disks, trying both deadline and anticipatory, and the speed stayed consistently that low. Htop showed dd at 5% CPU, pdflush usually at 0% with brief spikes at 70%, and md1_raid5 at about 0.6%. Then I tried bs=5120000 for the file rewrite, and the write speed was back to normal at ~185MB/sec. After that I tried bs=512 again, rewriting the same file, and the speed was STILL HIGH at ~185MB/sec!! From that point on I could not reproduce the slow speed at all, whatever the bs. Something got unstuck in the kernel. This looks like a bug somewhere in Linux to me.

Later in my tests the slowness happened again, and this time it was on the raw LVM device write, exactly the thing I would never have expected to be slow. Writing to the LVM device was even slower than the file rewrite: speed was down to 13MB/sec. With bs=5120000 the write speed to the raw device was high at ~185MB/sec. In this case, however, the "bug" was not cleared by writing once with a high bs: every time I wrote to the LVM device with bs=512 it was unbelievably slow at 12-13MB/sec, and every time I wrote with a high bs the speed was normal at ~185MB/sec. I alternated the two types of write a few times, and the problem was always reproducible and also independent of the disk scheduler. It is still reproducible today: writing to the LVM device with bs=512 is unbelievably slow at 12MB/sec. Still looks like a bug to me...
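In case it helps anyone reproduce this, the tests above were basically of this shape; the file, volume and disk names below are placeholders, not my exact ones:

  # sequential write, then rewrite, of a ~14GB file with small vs large bs
  dd if=/dev/zero of=/mnt/test/bigfile bs=512 count=28000000 conv=fsync
  dd if=/dev/zero of=/mnt/test/bigfile bs=5120000 count=2800 conv=fsync

  # write straight to the raw LVM logical volume (wipes whatever is on it)
  dd if=/dev/zero of=/dev/vg0/testlv bs=5120000 count=2800 conv=fsync

  # read tests against the LV and against the underlying MD device
  dd if=/dev/vg0/testlv of=/dev/null bs=5120000 count=2800
  dd if=/dev/md1 of=/dev/null bs=5120000 count=2800

  # switching the I/O scheduler per disk between runs
  echo deadline > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/queue/scheduler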
Ok, maybe it is not a bug after all: writing to the raw MD device (the 4-disk raid-5, see below) with 512-byte writes causes the same performance problem. Maybe it's the read-modify-write ("write hole") overhead. The MD code is probably not capable of using the page cache for caching the stripes then...? (See the question at the end of this post.) Or maybe it is, but it cannot put into the page cache the data it just read for the read-modify-write, so that the next 512-byte write on the same stripe causes yet another read-modify-write?

Now some bad news: read speed. Oh man, I had not noticed that the read speed was so slow on this computer. Read speed is around 50MB/sec with bs=512, and around 75MB/sec with bs=5120000. I found no way to make it faster: I tried ext3 and xfs, I tried AS and deadline... no way. Reads from the raw LVM device are about 80MB/sec (no bs) to 90MB/sec (high bs). I am positive that the array is not degraded. I checked the read speed of a single physical disk: it is 95MB/sec. Reading from the md1 raid-6 device (hence not going through LVM) is 285MB/sec with bs=512 and 320MB/sec with bs=5120000!!! Heck, it is the LVM layer that slows everything down so badly then!! I also have a raid-5 MD with 4 disks: reading from that one gives 200MB/sec with bs=512 and 220MB/sec with bs=5120000. I am now retrying the read from the LVM device... yes, I confirm, it's bad. Hmm, I shouldn't have used LVM then!! Instead of creating 5 logical volumes on LVM I should have made 5 partitions on each of the 12 disks and then built 5 separate MD raid-6 devices over those. I will investigate this further. Ok, ok... reading on the Internet it seems I have not aligned the beginning of the LVM volumes to the MD stripes, hence the performance degradation (see the alignment check sketched at the end of this post).

By comparison, here is the write speed on the MD device: on the 4-disk raid-5 I can write at 103MB/sec (all controller caches enabled and bs=5120000, or it would be much slower), so writing is much slower than reading from the same device, which is 220MB/sec as mentioned. I would really like to check the sustained write speed of a single drive too, but I cannot do that now. Also, unfortunately, I cannot check the write speed on the raw 12-disk raid-6 MD device because it is full of data; the raid-5 was still empty.

Regarding the disk scheduler: AS vs deadline makes no significant difference (consider that the machine is not doing any other significant I/O). NOOP is the fastest of the three when bs is high, about 10% faster than the other two, but noop is very sensitive to bs: if bs is low (such as 512) performance usually suffers, so AS is probably better. I have not checked CFQ, but I remember a few months ago I was not impressed by its speed (roughly half of normal on concurrent access, on this ubuntu kernel 2.6.24-22-openvz), and CFQ is also not recommended for RAID.

I have one question for you: if one makes a very small write, say 1 byte, to an MD raid 5/6 array, and then issues a sync to force the write out, do you think Linux MD raid 5/6 can use the page cache to get the rest of the stripe (if present), so as to skip the reads needed for the read-modify-write and directly perform just the writes? In other words, is the MD code in a position to fetch data on demand from the page cache? And after reading a stripe for the first read-modify-write, can it put that stripe into the page cache so that a further write on the same stripe would not cause any more reads?
(I suspect not, considering my tests in the dd/LVM paragraph above.)

Thanks for your suggestions
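For the two follow-ups above (the LVM alignment and the read-modify-write question), this is roughly what I plan to try. The device names, the alignment value and the lvm2 option are assumptions on my part, so they need checking against the actual setup and lvm2 version:

  # where does LVM start placing data on the MD physical volume?
  # if pe_start is not a multiple of the full stripe width
  # (chunk size x number of data disks), writes straddle stripes
  pvs -o +pe_start --units k /dev/md1

  # recreating the PV aligned to the stripe width would look roughly like
  # this (needs an lvm2 recent enough to have --dataalignment, and it means
  # recreating the VG, i.e. losing its contents); 160k is just the stripe
  # figure mentioned above, the real full stripe width should go here
  pvcreate --dataalignment 160k /dev/md1

  # for the page-cache question: watch the member disks for reads while
  # doing a tiny synchronous write to the still-empty raid-5 MD device
  # (this overwrites the start of the device, so only on the empty array!)
  iostat -x 1 &
  dd if=/dev/zero of=/dev/mdX bs=512 count=1 oflag=sync
  # reads showing up on the member disks would mean MD really performs a
  # read-modify-write and does not get the rest of the stripe from the page cache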
#33
Unimpressive performance of large MD raid (LONG!)
kkkk wrote:
Hi all, here are details. Actually most news are not good, but I appear to have found one bottleneck: LVM, see below.

Have you communicated this information to the linux filesystem developers? According to the MAINTAINERS document in the kernel source, the current maintainer is Alasdair Kergon and the contact address is ". Since there are people getting excellent speed out of raid on linux, I can only assume that your configuration is suboptimal.

Chris