#31
Unimpressive performance of large MD raid
Hi all,
based on your suggestions I have been testing lots of things: xfs, ext2, raw LVM device reads and writes, the effect of bs in dd, disk schedulers, noatime... Interesting results are coming out, and not all of them good :-( I will post the details tomorrow. Thank you.
#32
Unimpressive performance of large MD raid (LONG!)
Hi all
here are the details. Actually most of the news is not good, but I appear to have found one bottleneck: LVM, see below. I confirm the numbers given previously. The following benchmarks were all measured with caches on, both for the disks on the motherboard controllers and for the disks behind the 3ware controller.

- Ext2 is not measurably faster than ext3 for this sequential write.
- Xfs is faster on the first write (147MB/sec vs 111MB/sec) but about the same speed on file rewrite (~185MB/sec).
- Writes directly to the raw LVM device located on the same MD raid-6 are AS FAST AS FILE REWRITES at ~183MB/sec!! So this maximum speed is not an overhead of the filesystem.

During the direct LVM write the md1_raid5 process runs at 35-50% CPU, the pdflush process at 40-80% CPU, and dd itself at ~80% CPU; in addition all cores are about 1/3 busy in kernel code (which gets accounted to these 3 and 5 more running processes), which I guess means they are servicing disk interrupts.

Mounting ext3 with noatime does not improve performance for this sequential write (quite reasonable). The default was relatime anyway, which is already well optimized.

Regarding the bs parameter in dd: for the first benchmarks I posted one week ago, I noticed it made no difference whether it was left unset (default = 512), set to the stripe size (160KB), or set very high (I usually used 5120000). That is why I didn't mention it; I supposed the elevator and/or the page cache was compensating for the small bs. However, in more recent tests it did make a difference SOMETIMES, and this "sometimes" is the strangest thing I met in my tests.

Sometimes with bs=512 write performance was really bad, around 35-50MB/sec; this happened on file overwrite. In these cases I could confirm with iotop that the dd speed was very variable, sometimes as low as 3MB/sec with brief spikes at 380MB/sec, averaging 35-50MB/sec over the whole 14GB write. I repeated this test many times while changing the scheduler for all disks, trying both deadline and anticipatory, and the speed stayed consistently that low. Htop showed dd at 5% CPU, pdflush usually at 0% with brief spikes at 70%, and md1_raid5 at about 0.6%. Then I tried bs=5120000 for the file rewrite, and the write speed was back to normal at ~185MB/sec. After that I tried bs=512 again, rewriting the same file, and the speed was STILL HIGH at ~185MB/sec!! From that point on I could not reproduce the slow speed at all, whatever the bs. Something got unstuck in the kernel. This looks like a bug somewhere in Linux to me.

Later in my tests the slowness happened again, and this time it was on the raw LVM device write, exactly the thing I would never have expected to be slow. Writing to the LVM device was even slower than the file rewrite: speed was down to 13MB/sec. With bs=5120000 the write speed to the raw device was high at ~185MB/sec. In this case, however, the "bug" was not cleared by writing once with a high bs: every time I wrote to the LVM device with bs=512 it was unbelievably slow at 12-13MB/sec, and every time I wrote with a high bs the speed was normal at ~185MB/sec. I alternated the two types of write a few times, and the problem was always reproducible and also independent of the disk scheduler. It is still reproducible today: writing to the LVM device with bs=512 is unbelievably slow at 12MB/sec. Still looks like a bug to me...
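In case it helps anyone reproduce this, the tests above were basically of this shape; the file, volume and disk names below are placeholders, not my exact ones:

  # sequential write, then rewrite, of a ~14GB file with small vs large bs
  dd if=/dev/zero of=/mnt/test/bigfile bs=512 count=28000000 conv=fsync
  dd if=/dev/zero of=/mnt/test/bigfile bs=5120000 count=2800 conv=fsync

  # write straight to the raw LVM logical volume (wipes whatever is on it)
  dd if=/dev/zero of=/dev/vg0/testlv bs=5120000 count=2800 conv=fsync

  # read tests against the LV and against the underlying MD device
  dd if=/dev/vg0/testlv of=/dev/null bs=5120000 count=2800
  dd if=/dev/md1 of=/dev/null bs=5120000 count=2800

  # switching the I/O scheduler per disk between runs
  echo deadline > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/queue/scheduler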
Ok, maybe it is not a bug after all: writing to the raw MD device (the 4-disk raid-5, see below) with 512-byte writes causes the same performance problem. Maybe it's the read-modify-write ("write hole") overhead. The MD code is probably not capable of using the page cache for caching the stripes then...? (See the question at the end of this post.) Or maybe it is, but it cannot put into the page cache the data it just read for the read-modify-write, so that the next 512-byte write on the same stripe causes yet another read-modify-write?

Now some bad news: read speed. Oh man, I had not noticed that the read speed was so slow on this computer. Read speed is around 50MB/sec with bs=512, and around 75MB/sec with bs=5120000. I found no way to make it faster: I tried ext3 and xfs, I tried AS and deadline... no way. Reads from the raw LVM device are about 80MB/sec (no bs) to 90MB/sec (high bs). I am positive that the array is not degraded. I checked the read speed of a single physical disk: it is 95MB/sec. Reading from the md1 raid-6 device (hence not going through LVM) is 285MB/sec with bs=512 and 320MB/sec with bs=5120000!!! Heck, it is the LVM layer that slows everything down so badly then!! I also have a raid-5 MD with 4 disks: reading from that one gives 200MB/sec with bs=512 and 220MB/sec with bs=5120000. I am now retrying the read from the LVM device... yes, I confirm, it's bad. Hmm, I shouldn't have used LVM then!! Instead of creating 5 logical volumes on LVM I should have made 5 partitions on each of the 12 disks and then built 5 separate MD raid-6 devices over those. I will investigate this further. Ok, ok... reading on the Internet it seems I have not aligned the beginning of the LVM volumes to the MD stripes, hence the performance degradation (see the alignment check sketched at the end of this post).

By comparison, here is the write speed on the MD device: on the 4-disk raid-5 I can write at 103MB/sec (all controller caches enabled and bs=5120000, or it would be much slower), so writing is much slower than reading from the same device, which is 220MB/sec as mentioned. I would really like to check the sustained write speed of a single drive too, but I cannot do that now. Also, unfortunately, I cannot check the write speed on the raw 12-disk raid-6 MD device because it is full of data; the raid-5 was still empty.

Regarding the disk scheduler: AS vs deadline makes no significant difference (consider that the machine is not doing any other significant I/O). NOOP is the fastest of the three when bs is high, about 10% faster than the other two, but noop is very sensitive to bs: if bs is low (such as 512) performance usually suffers, so AS is probably better. I have not checked CFQ, but I remember a few months ago I was not impressed by its speed (roughly half of normal on concurrent access, on this ubuntu kernel 2.6.24-22-openvz), and CFQ is also not recommended for RAID.

I have one question for you: if one makes a very small write, say 1 byte, to an MD raid 5/6 array, and then issues a sync to force the write out, do you think Linux MD raid 5/6 can use the page cache to get the rest of the stripe (if present), so as to skip the reads needed for the read-modify-write and directly perform just the writes? In other words, is the MD code in a position to fetch data on demand from the page cache? And after reading a stripe for the first read-modify-write, can it put that stripe into the page cache so that a further write on the same stripe would not cause any more reads?
(I suspect not, considering my tests in the dd/LVM paragraph above.)

Thanks for your suggestions
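For the two follow-ups above (the LVM alignment and the read-modify-write question), this is roughly what I plan to try. The device names, the alignment value and the lvm2 option are assumptions on my part, so they need checking against the actual setup and lvm2 version:

  # where does LVM start placing data on the MD physical volume?
  # if pe_start is not a multiple of the full stripe width
  # (chunk size x number of data disks), writes straddle stripes
  pvs -o +pe_start --units k /dev/md1

  # recreating the PV aligned to the stripe width would look roughly like
  # this (needs an lvm2 recent enough to have --dataalignment, and it means
  # recreating the VG, i.e. losing its contents); 160k is just the stripe
  # figure mentioned above, the real full stripe width should go here
  pvcreate --dataalignment 160k /dev/md1

  # for the page-cache question: watch the member disks for reads while
  # doing a tiny synchronous write to the still-empty raid-5 MD device
  # (this overwrites the start of the device, so only on the empty array!)
  iostat -x 1 &
  dd if=/dev/zero of=/dev/mdX bs=512 count=1 oflag=sync
  # reads showing up on the member disks would mean MD really performs a
  # read-modify-write and does not get the rest of the stripe from the page cache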
#33
Unimpressive performance of large MD raid (LONG!)
kkkk wrote:
Hi all, here are details. Actually most news are not good, but I appear to have found one bottleneck: LVM, see below.

Have you communicated this information to the linux filesystem developers? According to the MAINTAINERS document in the kernel source, the current maintainer is Alasdair Kergon and the contact address is ". Since there are people getting excellent speed out of raid on linux, I can only assume that your configuration is suboptimal.

Chris