#11
Unimpressive performance of large MD raid
Bill Todd wrote:
> kkkk wrote:
>> ...
>
> Some interesting questions. Perhaps you're getting slammed by people for your configuration choices (which do not seem unreasonable given the also-not-unreasonable goals that you say drove them) because they're embarrassed to admit that they have no idea what the answers to those questions are (which is too bad, because people like me would find correct answers to them interesting). Calypso seems especially ignorant when talking about optimal RAID group sizes. Perhaps he's confusing RAID-5/6 with RAID-3 - but even then he'd be wrong, since what you really want with RAID-3 is for the total *data* content (excluding parity) of a stripe to be a convenient value, meaning that you tend to favor group sizes like 5 or 9 (not counting any spares that may be present).
>
> And given that you've got both processing power and probably system/memory bus bandwidth to burn, there's no reason why a software RAID-6 implementation shouldn't perform fairly competitively with a hardware one. That said, it's possible that the Linux system file cache interacts poorly with md in terms of how it destages data to the array - e.g., if it hands data to md in chunks that don't correspond to a full stripe set of data to write out (I'm assuming without looking at the code that md implements RAID-6 such that it can write out a full group of stripes without having to read in anything) *and* doesn't tell md that the write is lazy (allowing md to accumulate data in its own buffers until a convenient amount has arrived - assuming that they're large enough), then even sequential writes could get pretty expensive (as you seem to be seeing). A smart implementation might couple the file cache with md such that no such copy operation was necessary at all, but that would tend to complicate the layering interface.
>
> Or it's conceivable that ext3's journaling is messing you up, particularly if you've got the journal on the RAID-6 LUN. If you don't need the protection of journaling, try using ext2; if you do need it, make sure that the journal isn't parity-protected (e.g., mirror it instead of just adding it to the RAID-6 LUN).

An alternative to consider, especially if you are working mainly with large files, is xfs rather than ext3. xfs works better with large files (mainly due to its support of extents), and has good support for working with raid (it matches its data structures to the raid stripes).

> I did spend a few minutes in Google trying to find detailed information about md's RAID-6 implementation but got nowhere. Perhaps its developers think that no one who isn't willing to read the code has any business trying to understand its internals - though that attitude would be difficult to justify in the current case given that they didn't seem to do a very good job of providing the performance that one might reasonably expect from a default set-up.
>
> - bill

There is a lot more information about linux raid5 than raid6. I think that reflects usage. Raid 6 is typically used when you have a larger number of drives - say, 8 or more. People using such large arrays are much more likely to be looking for higher-end solutions with strong support contracts, and are thus more likely to be using something with high-end hardware raid cards. Raid 5 needs only 3 disks, and is a very common solution for small servers. If you search around for configuration how-tos, benchmarks, etc., you'll find relatively few that have more than 4 disks, and therefore few that use raid 6. There's also a trend (so I've read) towards raid 10 (whether it be linux raid10, or standard raid 1 + 0) rather than raid 5/6 because of better recovery.
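The group-sizing point about RAID-3 can be checked with a little arithmetic. A quick sketch (the 64 KiB chunk size is an assumed figure for illustration, not from the thread):

```python
# With one parity disk per group, group sizes of 5 or 9 leave 4 or 8
# data disks, so the data content of a stripe is a power of two,
# which divides evenly into file-system block and I/O sizes.
chunk_kib = 64  # assumed md chunk size; not stated in the thread

for group_size in (5, 9):
    data_disks = group_size - 1            # exclude the parity disk
    stripe_data_kib = data_disks * chunk_kib
    assert stripe_data_kib & (stripe_data_kib - 1) == 0  # power of two
    print(group_size, stripe_data_kib)     # 5 -> 256 KiB, 9 -> 512 KiB
```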
#12
David Schwartz wrote:
> It is the bottleneck, it's just not a CPU bottleneck, it's an I/O bottleneck.

With an 8x PCI-e bus there should be space for 2 GB/sec transfer...

> The problem is simply the number of I/Os the system has to issue. With a 12-disk RAID 6 array implemented in software, a write of a single byte (admittedly the worst case) will require 10 reads followed by 12 writes that cannot be started until all 10 reads complete. Each of these operations has to be started and completed by the MD driver.

This is true only for non-sequential writes. In my case the system starts writing 5 seconds after dd starts pushing data out (dirty_writeback_centisecs = 500). At that time there is so much sequential data to write that it will fill many stripes completely.
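The two cases can be put side by side in a quick sketch. This is a simplified model of the worst case described above (a reconstruct-write of the whole stripe); the real md driver also has a cheaper read-modify-write path for small updates:

```python
# Per-stripe I/O counts for a 12-disk RAID-6 (10 data + 2 parity).
DATA_DISKS, PARITY_DISKS = 10, 2

def io_ops(full_stripe: bool) -> tuple[int, int]:
    """Return (reads, writes) needed to commit one stripe."""
    if full_stripe:
        # All data is in memory: compute P/Q and write everything out.
        return 0, DATA_DISKS + PARITY_DISKS
    # Sub-stripe update, modeled as reconstruct-write: read the data
    # chunks first, then rewrite the whole stripe including parity.
    return DATA_DISKS, DATA_DISKS + PARITY_DISKS

print(io_ops(False))  # (10, 12) - the 10 reads + 12 writes above
print(io_ops(True))   # (0, 12)  - full-stripe writes avoid the reads
```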
#13
David Brown wrote:
>> I did spend a few minutes in Google trying to find detailed information about md's RAID-6 implementation but got nowhere. ...
>
> There is a lot more information about linux raid5 than raid6.

You mean on *Linux MD* raid5? That could be good. Where?

Raid-6 algorithms are practically equivalent to raid-5, except for the parity computation, obviously.
#14
Bill Todd wrote:
> That said, it's possible that the Linux system file cache interacts poorly with md in terms of how it destages data to the array - e.g., if it hands data to md in chunks that don't correspond to a full stripe set of data to write out (I'm assuming without looking at the code that md implements RAID-6 such that it can write out a full group of stripes without having to read in anything) *and* doesn't tell md that the write is lazy (allowing md to accumulate data in its own buffers until a convenient amount has arrived - assuming that they're large enough), then even sequential writes could get pretty expensive (as you seem to be seeing). A smart implementation might couple the file cache with md such that no such copy operation was necessary at all, but that would tend to complicate the layering interface.

In my case dd pushes 5 seconds of data before the disks start writing (dirty_writeback_centisecs = 500). dd always stays at least 5 seconds ahead of the writes. This should fill all stripes completely, causing no reads. I even tried raising dirty_writeback_centisecs, with no measurable performance benefit.

Where are these 5 seconds of data stored? At the ext3 layer, at the LVM layer (I doubt this one; I also notice there is no LVM kernel thread running), or at the MD layer?

Why do you think dd stays at 100% CPU (with disk/3ware caches enabled)? Shouldn't that be 0%? Do you think the CPU usage is high due to a memory-copy operation? If it were that, I suppose dd from /dev/zero to /dev/null should go at 200MB/sec; instead it goes at 1.1GB/sec (with 100% CPU occupation indeed, 65% of which is in kernel mode). That would mean the number of copies performed by dd while copying to the ext3-raid is 5 times greater than that for copying from /dev/zero to /dev/null. Hmmm... a bit difficult to believe. There must be other work performed in the ext3 case that hogs the CPU. Is the ext3 code running within the dd process when dd writes?
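The amount of data sitting in the cache before writeback kicks in is easy to estimate. A rough sketch (only dirty_writeback_centisecs = 500 comes from the thread; the 100 MB/s dd rate and 64 KiB chunk size are assumed figures, and MB is treated as MiB for simplicity):

```python
# Estimate how much sequential data is buffered before the first
# writeback, and how many complete RAID-6 stripes that represents.
centisecs = 500        # vm.dirty_writeback_centisecs (from the thread)
rate_mb_s = 100        # assumed sustained dd throughput
chunk_kib = 64         # assumed md chunk size
data_disks = 10        # 12-disk RAID-6: 10 data + 2 parity

buffered_mib = rate_mb_s * centisecs / 100       # 500 MiB in 5 seconds
stripe_data_kib = data_disks * chunk_kib         # 640 KiB of data/stripe
full_stripes = int(buffered_mib * 1024 // stripe_data_kib)
print(buffered_mib, full_stripes)  # 500.0 -> 800 complete stripes queued
```

With hundreds of full stripes queued, md should have no reason to read anything back for parity, which is the point being made above.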
> Or it's conceivable that ext3's journaling is messing you up, particularly if you've got the journal on the RAID-6 LUN. If you don't need the protection of journaling, try using ext2; if you do need it, make sure that the journal isn't parity-protected (e.g., mirror it instead of just adding it to the RAID-6 LUN).

I think this overhead should affect first-writes but not rewrite performance with the default ext3 mount options (the default should be data=ordered, which I think means no journal writes for rewrites - correct?). Am I right? Hmm, probably not, because kjournald had significant CPU occupation. What is the role of the journal during file overwrites?

> I did spend a few minutes in Google trying to find detailed information about md's RAID-6 implementation but got nowhere. Perhaps its developers think that no one who isn't willing to read the code has any business trying to understand its internals - though that attitude would be difficult to justify in the current case given that they didn't seem to do a very good job of providing the performance that one might reasonably expect from a default set-up.

Agreed. Thanks for your answer.
#17
On Apr 24, 2:10 am, kkkk wrote:
> David Schwartz wrote:
>> It is the bottleneck, it's just not a CPU bottleneck, it's an I/O bottleneck.
>
> With an 8x PCI-e bus there should be space for 2 GB/sec transfer... Yeah, I agree with you. It looks like an MD issue. On the bright side, I heard from a reliable source that: "Furthermore we trust visible, open, old/tested, linux MD code more than any embedded RAID code which nobody knows except 3ware. What if there was a bug in the 9650SE code? It was a recent controller when we bought it, and we would have found out only later, maybe years after setting up our array. Also, we were already proficient with linux MD."

The flipside is, you have an untested configuration and nobody specific who is obligated to provide you with support. You're probably ahead of the curve, so you may hit every problem before anyone else does.

DS
#18
NTFS unsafe in case of power loss?
> User data is not protected by the journaling.
>
> You missed something, we're not talking about FAT here (which is faster than NTFS)...

Depends on the scenario. With 2000 files per directory, things do change - FAT uses linear directories, while NTFS uses B-trees similar to database indices.

--
Maxim S. Shatskih
Windows DDK MVP
http://www.storagecraft.com
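The directory-scaling point can be illustrated with a toy comparison. This models only the lookup strategies (linear scan vs. sorted index), not the actual on-disk formats, and the file names and counts are made up:

```python
import bisect

# 2000 zero-padded names: lexicographic order == numeric order here.
names = sorted(f"file{i:04}.dat" for i in range(2000))
target = "file1500.dat"

# FAT-style linear directory: scan entries until the name matches.
linear_compares = names.index(target) + 1      # 1501 compares

# NTFS-style B-tree index, modeled as binary search on a sorted list:
# O(log n) compares instead of O(n).
i = bisect.bisect_left(names, target)
assert names[i] == target
print(linear_compares)                         # 1501
```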
#19
kkkk wrote:
David Brown wrote: I did spend a few minutes in Google trying to find detailed information about md's RAID-6 implementation but got nowhere. ... There is a lot more information about linux raid5 than raid6. You mean on *Linux MD* raid5? That could be good. Where? Google for "linux raid 5" - there are a few million hits, most of which are for software raid (i.e., MD raid). Googling for "linux raid 6" only gets you a few hundred thousand hits. Raid-6 algorithms are practically equivalent to raid-5, except parity computation obviously . Here is a link that might be useful, if you want to know the details of Linux raid 6: http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf |
#20
This guy http://lists.freebsd.org/pipermail/f...er/005170.html is doing basically the same thing I am doing, with software raid done with ZFS on FreeBSD (raid-Z2 is basically raid-6), writing and reading 10GB files. His results are a heck of a lot better than mine with default settings, and not very distant from the bare hard-disk throughput (he seems to get about 50MB/sec per non-parity disk). This shows that software raid is indeed capable of doing good stuff in theory. Just linux MD + ext3 seems to have some performance problems :-(
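As a sanity check on that figure (the 12-disk layout is the one from this thread; the per-disk rate is the one reported in the FreeBSD post):

```python
# If each non-parity disk sustains ~50 MB/s, a 12-disk RAID-6
# (10 data + 2 parity) should stream on the order of:
per_disk_mb_s = 50
data_disks = 12 - 2
expected_mb_s = per_disk_mb_s * data_disks
print(expected_mb_s)   # 500 MB/s aggregate sequential throughput
```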