#1
Unimpressive performance of large MD raid
Hi there,
we have a big "storage" computer: dual Xeon (8 cores total) with lots of disks, partly connected to a 3ware 9650SE controller and partly to the SATA/SAS controllers on the motherboard. The hard disks are Western Digital WD7500AYYS 750GB. We are using an ext3 filesystem with default mount options on top of LVM + MD RAID 6. The RAID 6 spans 12 disks (so 10 disks of data, 2 of parity). 6 of those disks go through the mobo controller, the others through the 3ware.

I hoped I would get something like 1 GB/sec sequential write on 10 data disks :-P Instead I see MUCH lower performance and I can't understand where the bottleneck is!

In sequential read, with separate instances of "dd" (one per drive, reading directly from the block devices), I can reach at least 800 MB/sec with no problem (I can probably go much higher, I just have not tried). So I would exclude a bus bandwidth problem (it's PCI-Express in any case, and the 3ware is on an 8x slot).

Here are my write results. I am writing a sequential 14GB file with dd:

    time dd if=/dev/zero of=zerofile count=28160000 conv=notrunc ; time sync

(The throughput I report is not the one printed by dd: it is adjusted by hand after also timing the sync, so it is close to the real throughput. I confirm the drive LEDs are off after sync finishes.) There is no other I/O activity. The disk scheduler is deadline for all drives.

All caches enabled, on both the 3ware and the disks attached to the mobo:
    first write = 111 MB/sec
    overwrite   = 194 MB/sec

Cache enabled only on the disks connected to the mobo (6 of 12):
    first write = 95 MB/sec
    overwrite   = 120 MB/sec

Cache disabled everywhere (this takes an incredibly long time to do the final flush):
    first write = 63 MB/sec
    overwrite   = 75 MB/sec

I have watched what happens in top and htop. Htop reports LOTS of red bars (iowait?) - practically 50% red bars on every core (8 cores). Here is what happens in a few of those situations:

- Caches all enabled, overwrite: dd is constantly at 100% CPU (question: shouldn't it be at ~0% CPU, always waiting on blocking I/O?). Depending on the moment, either kjournald or pdflush is at about 75%; most of the time it is kjournald. md1_raid5 (RAID 6 in fact) is around 35%.

- Caches all enabled, first write: like above, but there are often moments in which neither kjournald nor pdflush is running, hence the speed difference. dd is always near 100% CPU.

- Cache only on disks attached to the mobo, overwrite: similar to "caches all enabled, overwrite", except that dd never reaches 100%; it is around 40%, and the other processes are down accordingly, hence the lower speed. There are more red bars shown in htop, on all cores.

- Cache only on disks attached to the mobo, first write: dd reaches 100% but kjournald reaches 40% max. pdflush reaches 15% max. md1_raid5 is down to about 15%.

- Caches all disabled, overwrite: dd reaches about 30%, kjournald is 20% max and md1_raid5 reaches 10% max. Actually dd alone even reaches 100%, but only in the first 20 seconds or so, and at that time kjournald and md1_raid5 are still at 20% and 10%.

- Caches all disabled, first write: similar to the above.

So I don't understand how this works. I don't understand why dd's CPU is at 100% (caches on) instead of ~0%. I don't understand why kjournald doesn't go to 100%. I don't understand what kjournald has to do in the overwrite case (there is no significant journal traffic on overwrites, right? I am using defaults, which should be data=ordered). And I don't understand why the caches change sequential-write performance so much.
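For what it's worth, a benchmark that forces the data to disk before dd exits (so the number doesn't need adjusting by hand afterwards) might look like this - the block size and count here are just assumptions, not what I actually ran:

    # sequential write; dd flushes before reporting throughput (sketch)
    dd if=/dev/zero of=zerofile bs=1M count=14000 conv=fdatasync

    # or bypass the page cache entirely with direct I/O
    dd if=/dev/zero of=zerofile bs=1M count=14000 oflag=direct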
Also, a question: if I had the most powerful hardware RAID, would performance be limited to ~200 MB/sec anyway because of kjournald?

Then I have another question: "sync" from bash really seems to work, in the sense that it takes time, and after that time I confirm that the drive activity LEDs are really off. But I have MD RAID 6 + LVM here! Weren't both MD RAID 5/6 AND LVM supposed NOT to pass write barriers downstream to the disks? Doesn't sync use exactly those barriers (implemented with device cache flushes)? Yet sync here seems to work!

Thanks for your help
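P.S. One way to check whether barriers are actually in effect on this ext3-on-LVM-on-md stack is to look at the mount options and the kernel log. Exact option names and messages vary by kernel version, so treat this as a sketch only:

    # is the filesystem mounted with barriers? (the option may not be
    # listed at all if the default is in use)
    grep barrier /proc/mounts

    # did the kernel report barriers being refused by md or LVM?
    dmesg | grep -i barrier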
#2
Unimpressive performance of large MD raid
On Apr 22, 12:29 pm, kkkk wrote:
> we have a big "storage" computer: dual Xeon (8 cores total) with lots
> of disks partly connected to a 3ware 9650SE controller and partly to
> the SATA/SAS controllers in the mobo. The hard disks are Western
> Digital WD7500AYYS 750GB. We are using an ext3 filesystem with defaults
> mount on top of LVM + MD raid 6. The raid-6 is on 12 disks (hence it is
> 10 disks for data, 2 for parity). 6 of those disks are through the mobo
> controller, the others are through the 3ware.

How mind-blowingly awful. I'm sure there must be some set of circumstances that justifies such a ridiculously poor setup, but I don't know what they could possibly be. How do you justify buying a high-end RAID 6 controller and then not using its RAID capabilities?

DS
#3
Unimpressive performance of large MD raid
#4
Unimpressive performance of large MD raid
David Schwartz wrote:
> How mind-blowingly awful. I'm sure there must be some set of
> circumstances that justifies such a ridiculously poor setup, but I
> don't know what they could possibly be.

Read my other reply.

> How do you justify buying a high-end RAID 6 controller and then not
> using its RAID capabilities?

Had we found a cheap non-RAID 16-drive SATA controller for PCI-Express, we would have bought it. If you know of any, please tell me.

Thank you
#5
Unimpressive performance of large MD raid
kkkk wrote:
> Hi there, we have a big "storage" computer: dual Xeon (8 cores total)
> with lots of disks partly connected to a 3ware 9650SE controller and
> partly to the SATA/SAS controllers in the mobo. The hard disks are
> Western Digital WD7500AYYS 750GB. We are using an ext3 filesystem with
> defaults mount on top of LVM + MD raid 6. The raid-6 is on 12 disks
> (hence it is 10 disks for data, 2 for parity). 6 of those disks are
> through the mobo controller, the others are through the 3ware.

I understand entirely your reasons for wanting to use Linux software raid rather than a hardware raid. But I've a couple of other points or questions - as much for my own learning as anything else.

If you have so many disks connected, did you consider having at least one as a hot spare? If one of your disks dies and it takes time to replace it, the system will be very slow while running degraded.

Secondly, did you consider raid 10 as an alternative? Obviously it is less efficient in terms of disk space, but it should be much faster. It may also be safer (depending on the likely rates of different kinds of failures) since there is no "raid 5 write hole".

Raid 6, on the other hand, is probably the slowest raid choice. Any writes that don't cover a complete stripe will need reads from several of the disks, followed by parity calculations - and the more disks you have, the higher the chance of hitting such incomplete stripe writes.

http://www.enterprisenetworkingplanet.com/nethub/article.php/10950_3730176_1
#6
Unimpressive performance of large MD raid
David Brown wrote:
> If you have so many disks connected, did you consider having at least
> one as a hot spare?

Of course! We have 4 spares shared among all the arrays.

> Secondly, did you consider raid 10 as an alternative?

I wouldn't expect the performance of raid 10 via MD to be higher than the raid-6 of my original post (it might even be much slower at the same number of drives) because, as I mentioned, the "md1_raid5" (raid-6 actually) process never goes above 35% CPU occupation. Regarding the read+checksum+write problem of raid5/6 for small writes, there shouldn't be any in this case because I am doing a sequential write.

> Any writes that don't cover a complete stripe will need reads from
> several of the disks,

Not the case here, because I am doing a sequential write. Also, the overhead you mention is only present if the stripe is not in cache, but with large amounts of RAM I expect the stripe to be in cache (especially the stripes related to the file/directory metadata... the rest doesn't matter as it is sequential). Yesterday during the tests the free RAM on that machine was 33GB out of a total of 48GB...
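On the subject of stripe caching: md's raid5/raid6 personality has its own stripe cache, and the default is small. I have not measured the effect here, and the device name is an assumption (the md1_raid5 thread name suggests the array is /dev/md1), so this is only a sketch:

    # current stripe cache size, in entries per array (default is often 256)
    cat /sys/block/md1/md/stripe_cache_size

    # try a larger cache, e.g. 8192 entries; memory used is roughly
    # entries * 4KB * number of member disks, so ~384MB for 12 disks
    echo 8192 > /sys/block/md1/md/stripe_cache_size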
#7
Unimpressive performance of large MD raid
> Had we found a cheap no-raid 16 drives SATA controller for PCI-Express
> we would have bought it. If you know of any, please tell me.

The standard rule of thumb is "good, fast, cheap" - pick any two. If you want reasonably good and cheap, you're not going to get fast. I've seen cheap and fast implementations, but they weren't any good - I had the pleasure of recovering a corrupt 25TB file system over my Christmas break. We've since replaced it with a more expensive but good solution.
#8
Unimpressive performance of large MD raid
kkkk wrote:
....

Some interesting questions. Perhaps you're getting slammed by people for your configuration choices (which do not seem unreasonable given the also-not-unreasonable goals that you say drove them) because they're embarrassed to admit that they have no idea what the answers to those questions are (which is too bad, because people like me would find correct answers to them interesting).

Calypso seems especially ignorant when talking about optimal RAID group sizes. Perhaps he's confusing RAID-5/6 with RAID-3 - but even then he'd be wrong, since what you really want with RAID-3 is for the total *data* content (excluding parity) of a stripe to be a convenient value, meaning that you tend to favor group sizes like 5 or 9 (not counting any spares that may be present).

And given that you've got both processing power and probably system/memory bus bandwidth to burn, there's no reason why a software RAID-6 implementation shouldn't perform fairly competitively with a hardware one.

That said, it's possible that the Linux system file cache interacts poorly with md in terms of how it destages data to the array. E.g., if it hands data to md in chunks that don't correspond to a full stripe set of data to write out (I'm assuming, without looking at the code, that md implements RAID-6 such that it can write out a full group of stripes without having to read anything in) *and* doesn't tell md that the write is lazy (which would allow md to accumulate data in its own buffers until a convenient amount has arrived - assuming those buffers are large enough), then even sequential writes could get pretty expensive (as you seem to be seeing). A smart implementation might couple the file cache with md such that no such copy operation was necessary at all, but that would tend to complicate the layering interface.

Or it's conceivable that ext3's journaling is messing you up, particularly if you've got the journal on the RAID-6 LUN. If you don't need the protection of journaling, try using ext2; if you do need it, make sure that the journal isn't parity-protected (e.g., mirror it instead of just adding it to the RAID-6 LUN).

I did spend a few minutes in Google trying to find detailed information about md's RAID-6 implementation, but got nowhere. Perhaps its developers think that no one who isn't willing to read the code has any business trying to understand its internals - though that attitude would be difficult to justify in the current case, given that they didn't seem to do a very good job of providing the performance that one might reasonably expect from a default set-up.

- bill
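P.S. For concreteness, moving the ext3 journal off the RAID-6 LUN and onto a small mirror might look roughly like this. Device and volume names are made up, the filesystem must be unmounted, and this is a sketch rather than a tested recipe:

    # create a small RAID-1 to hold the journal (hypothetical partitions)
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdm1 /dev/sdn1
    mke2fs -O journal_dev /dev/md2

    # detach the internal journal, then attach the external one
    tune2fs -O ^has_journal /dev/vg0/lv_storage
    tune2fs -j -J device=/dev/md2 /dev/vg0/lv_storage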
#9
Unimpressive performance of large MD raid
On Apr 23, 3:51 am, kkkk wrote:
> Anyway since linux MD raid never occupies more than 35% CPU (of a
> single core!) in any test, I don't think it is the bottleneck. But
> this is part of my question.

It is the bottleneck - it's just not a CPU bottleneck, it's an I/O bottleneck. The problem is simply the number of I/Os the system has to issue. With a 12-disk RAID 6 array implemented in software, a write of a single byte (admittedly the worst case) will require 10 reads followed by 12 writes, and the writes cannot start until all 10 reads complete. Each of these operations has to be started and completed by the MD driver.

I understand the reasoning behind your configuration choices; they just utterly sacrifice performance.

DS
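A practical follow-on to that point: in principle md can skip those pre-reads when a write covers a whole stripe, so issuing writes in full-stripe multiples should avoid the read-modify-write penalty. Whether ext3 and the page cache actually hand data to md that way is a separate question, the chunk size below is only an assumed example, and this is a sketch:

    # report the array's chunk size (suppose it prints 64K)
    mdadm --detail /dev/md1 | grep -i chunk

    # with 10 data disks and 64K chunks, one full stripe holds 640K of data;
    # write in full-stripe multiples, bypassing the page cache
    dd if=/dev/zero of=zerofile bs=640k count=20480 oflag=direct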
#10
Unimpressive performance of large MD raid
kkkk wrote:
> David Brown wrote:
>> If you have so many disks connected, did you consider having at least
>> one as a hot spare?
>
> Of course! We have 4 spares shared among all the arrays.

You didn't mention it, so I thought I'd check, since I don't know your background or experience. I've heard of people using raid 6 because then they don't need hot spares - the array will effectively run as raid 5 until they replace the dud drive...

>> Secondly, did you consider raid 10 as an alternative?
>
> I wouldn't expect performances of raid 10 via MD to be higher than the
> raid-6 of my original post (and might even be much slower at the same
> number of drives) because, as I mentioned, the "md1_raid5" (raid-6
> actually) process never goes higher than 35% CPU occupation. Regarding
> the read+checksum+write problem of raid5/6 for small writes, there
> shouldn't be any in this case because I am doing a sequential write.

Linux raid 10 with "far" layout normally gives sequential read performance about equal to a pure striped array. It will be a little faster than raid 6 for the same number of drives, but not a huge difference (with "near" layout raid 10, you'll get much slower sequential reads). Sequential write for raid 10 will be a little slower than for raid 6 (since you are not CPU-bound at all). But random writes, especially small ones, will be much better, as will the performance of multiple simultaneous reads (sequential or random). Of course this depends highly on your workload, and on how the data is laid out on the disk.

Where you will really see the difference is when a disk fails and you are running in degraded mode and rebuilding. Replacing a disk and rebuilding takes roughly a tenth of the disk activity with raid 10 compared to raid 6 - it only needs to read through a single disk to do the copy. With raid 6, the rebuild involves reading *all* the data off *all* the other disks. And according to some articles I've read, the chance of hitting an unrecoverable sector read error during such a rebuild with many large disks is very high, leading to a second disk failure. This is, of course, totally independent of whether you are using software raid or (as others suggest) hardware raid.

It looks quite likely that your performance issues are some sort of I/O bottleneck, but I don't have the experience to help here.

>> Any writes that don't cover a complete stripe will need reads from
>> several of the disks,
>
> Not the case here because I am doing sequential write. Also, the
> overhead you mention is present if the stripe is not in cache, but with
> large amounts of RAM I expect the stripe should be in cache (especially
> the stripe related to the file/directory metadata should be... while
> the rest doesn't matter as it is sequential). Yesterday during the
> tests the free amount of RAM was 33GB on that machine over a total of
> 48GB...

You're right here - caching the data will make a very big difference. And this could be an area where software raid on Linux does much better than hardware raid on the card - the software raid can use all of that 48 GB for caching, not just the memory on the raid card.

Thanks for your comments - as I said, I'm learning about this myself (currently mostly theory - when I get the time, I'll put it into practice).
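For reference, creating a far-layout raid 10 with mdadm looks roughly like this. Device names and chunk size are placeholders rather than a recommendation for this particular box, so treat it as a sketch:

    # 12-disk raid 10 with "far 2" layout (hypothetical device names)
    mdadm --create /dev/md3 --level=10 --layout=f2 --chunk=256 \
          --raid-devices=12 /dev/sd[b-m]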