#1
Newbie storage questions... (RAID5, SANs, SCSI)
Hi,
I'm reading a book that describes how to plan an SQL Server installation. The book warns that one should never use RAID5 unless the volume receives less than 10% writes (i.e. 90% reads). Apparently the performance penalty for data writes is quite high with RAID5, but I'm having trouble understanding exactly what the penalty is. Consider the following example:

- Let's say it takes x seconds to write a chunk of data to a single hard drive in a single I/O operation.
- Now let's say that I have 3 of these drives in a RAID5 array and I want to write the same chunk of data. Instead of using a single I/O operation, four operations are now involved, because to write a bit of data to a drive, the RAID controller must:

1) read the preexisting bit of data on the drive
2) read the preexisting bit of data on the parity drive
...calculate the new parity bit and then...
3) write the new bit of data to the drive
4) write the new parity bit to the parity drive.

Although there are now 4 operations, the operations are spread over 3 drives. So the time to perform this operation is [4x/3] seconds, that is to say, 1.33x seconds, or 33% longer than it would take to write to a single drive. Using this logic, if I had 4 drives the write speed would be identical to writing to a single drive. Only when I have more than four drives is the write time of the RAID5 volume superior to writing to a single drive.

Is this logic correct? This is how my book describes it, but in practice RAID5 doesn't seem to be as slow as this. Could someone please confirm?

Now let me ask you a question about bandwidth for data transfer from hard drives. Consider the following device that Dell sells, which appears to be a stand-alone rack-mountable RAID device:

http://www1.us.dell.com/content/prod...555&l=en&s=biz

This thing costs nearly $12,000, and yet its bandwidth of 200MB/sec is inferior to standard SCSI, which is 320MB/sec. My question is: what is the advantage of this device over traditional SCSI RAID?

One last question: the SQL Server seems to be a single point of failure. If the motherboard or power supply in the SQL Server machine goes down, my entire application will go down. Is it possible to set things up in such a way that TWO machines running SQL Server can be attached to the same hard disk, so that if one SQL machine dies, the other machine automatically comes online? In this scenario, where is the hard drive located? It can't be in one of the machines, because if that machine were the one that crashed, the other SQL Server machine would be unable to access that drive! That's why I'm wondering if that Dell external RAID solution (see above) might be appropriate for me. What do you think?

Thanks,

David
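For what it's worth, the book's arithmetic is easy to play with in a few lines of Python. This is only a sketch of the timing model David describes (the function name is made up, and the 4-I/O cost spreading evenly across the drives is the book's simplifying assumption); as the replies below explain, real controllers with write-back cache behave quite differently.

    # Sketch of the book's RAID5 small-write timing model: each small write
    # costs four disk I/Os (read old data, read old parity, write new data,
    # write new parity), assumed to spread evenly across the n drives.
    def raid5_small_write_time(x, n_drives):
        """Time for one small write, where x = time for one I/O on one drive."""
        return 4 * x / n_drives

    for n in (3, 4, 5, 8):
        print(f"{n} drives: {raid5_small_write_time(1.0, n):.2f}x")
    # 3 drives: 1.33x / 4 drives: 1.00x / 5 drives: 0.80x / 8 drives: 0.50x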
#2
David Sworder wrote:
> I'm reading a book that describes how to plan an SQL Server
> installation. The book warns that one should never use RAID5 unless the
> volume receives less than 10% writes (i.e. 90% reads). Apparently the
> performance penalty for data writes is quite high with RAID5, but I'm
> having trouble understanding exactly what the penalty is.

A lot of the "don't use RAID5 for databases" advice relates to software RAID implementations on the host, where instead of performing a single I/O operation the host has to perform the multiple operations and also calculate the parity. If you offload the RAID functions to an external subsystem (especially a relatively modern one with plenty of write-back cache), the write performance of RAID5 will almost certainly be more than adequate for your needs. The caveat of "almost" is there because you don't mention any specific performance requirements.

> This is how my book describes it, but in practice RAID5 doesn't seem to
> be as slow as this. Could someone please confirm?

RAID5 will certainly be slower than, say, RAID 0+1; it's just the nature of the beast. But its performance is more than adequate for an awful lot of applications, otherwise it wouldn't be so popular. A common approach with databases is to put the more write-intensive portions (like transaction logs) onto a less write-sensitive RAID device (say, mirror a couple of drives and use those for the transaction log device) and use RAID5 for the less write-intensive data tables.

> This thing costs nearly $12,000, and yet its bandwidth of 200MB/sec is
> inferior to standard SCSI, which is 320MB/sec. My question is: what is
> the advantage of this device over traditional SCSI RAID?

First, that 200MB/s is misleading: FC is a duplex protocol, it can do 200MB/s in each direction simultaneously, so it's really more like 400MB/s. Second, in a database application you'll almost certainly never see 200MB/s of throughput, let alone 320MB/s, so raw throughput is a poor measure. What you are really interested in is I/O operations/sec, and this is where the Fibre Channel protocol is much more efficient in how it uses the available bandwidth: it can set up and tear down an I/O transaction much more quickly than typical SCSI, allowing it to handle more transactions/sec for a given amount of bandwidth. There are other benefits to using FC in terms of clustering and other advanced functions, which brings me to...

> Is it possible to set things up in such a way that TWO machines running
> SQL Server can be attached to the same hard disk, so that if one SQL
> machine dies, the other machine automatically comes online?

Microsoft has a multinode clustering function for its operating systems and applications to address just this problem; do a search on Microsoft's website for clustering and you'll find plenty of information. But in order to cluster, you'll need a shareable subsystem, and today that pretty much means a Fibre Channel attached subsystem like the one you mentioned earlier. Theoretically you can do the same with SCSI, but it's usually ugly, and I don't know if MS still supports any external SCSI subsystems for clustering applications.

--
Nik Simpson
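The parity calculation Nik mentions is plain XOR, which is also why the small-write penalty involves two reads: the new parity can be computed from the old parity, the old data, and the new data, without touching the other data drives. A minimal Python sketch (the byte values are made up for illustration, and real block sizes are of course much larger):

    # RAID5 parity is the XOR of the data blocks in a stripe row.
    from functools import reduce

    def xor_blocks(*blocks: bytes) -> bytes:
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    d0, d1 = b"\x0f\x0f", b"\xf0\x33"      # data blocks on two drives
    parity = xor_blocks(d0, d1)            # block written to the parity drive

    # If the drive holding d1 fails, its contents are rebuilt from the rest:
    assert xor_blocks(d0, parity) == d1

    # A small write updates parity without reading the other data drives:
    new_d0 = b"\xaa\xbb"
    new_parity = xor_blocks(parity, d0, new_d0)   # old parity ^ old ^ new
    assert new_parity == xor_blocks(new_d0, d1)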
#3
> First, that 200MB/s is misleading: FC is a duplex protocol, it can do
> 200MB/s in each direction simultaneously, so it's really more like
> 400MB/s. Second, in a database application you'll almost certainly
> never see 200MB/s of throughput, let alone 320MB/s, so raw throughput
> is a poor measure. What you are really interested in is I/O
> operations/sec...

Thanks Nik,

That was a very helpful post. After reading some of your other posts on this newsgroup (via Google), it's clear to me that I/O operations/sec are more important to me than total bandwidth. What's odd is that SCSI manufacturers such as Adaptec don't list I/Os/sec on their spec sheets, which makes it difficult to compare apples to apples. How would one go about finding the I/Os/sec for a SCSI RAID card? That Fibre Channel unit that I mentioned in my original post handles 40,000 I/Os/sec according to the Dell site. I'd like to see how high-end SCSI RAID cards compare.

The next logical question in this little learning exercise of mine is: what exactly defines an I/O operation? Maybe an example would clarify my question. Let's say I have a machine with one hard drive. SQL Server runs on that machine and attempts to read 4,000 bytes of data sequentially from the drive. This would count as one I/O operation, correct? Now let's change the situation so that instead of one hard drive, I have that cool Fibre Channel device that I mentioned in my previous post. It has 10 drives configured as RAID 1+0. In this situation, when SQL Server attempts to read 4,000 bytes, Windows still thinks of this as 1 I/O operation, but does the FC device consider this to be 1 operation? In order to read 4,000 bytes, the FC device is accessing all 10 disks simultaneously. So is this operation considered as *ten* I/Os, or only one?

I'm asking this question because my book states that I should closely monitor the I/Os/second in 'perfmon' and make sure that it does not exceed 85% of the maximum throughput. So if the FC RAID device above supports a maximum of 40,000 I/Os/second, I'd want to make sure that I don't exceed 34,000 I/Os/second on a regular basis. Tracking I/Os/second is easy using the Windows Performance Monitor, but I'm not sure if PerfMon is tracking the correct value, since Windows has no way of knowing that each I/O read/write will trigger multiple reads/writes across the various drives in the array. Do you see what I mean? So what exactly defines an "I/O" on a RAID device, be it SCSI RAID or FC RAID?

Thanks,

David
#4
"David Sworder" wrote in message ...
> That was a very helpful post. [...] How would one go about finding the
> I/Os/sec for a SCSI RAID card? That Fibre Channel unit that I mentioned
> in my original post handles 40,000 I/Os/sec according to the Dell site.
> I'd like to see how high-end SCSI RAID cards compare.

It's usually a non-issue. You're grossly limited by the drive subsystem, which will rarely allow more than a few hundred random I/Os per second. Sequential I/O can run significantly faster, but only rarely do you do long sequential writes on a database. On sequential reads (for instance, during a table scan), most SCSI controllers (RAID or otherwise) can pretty well keep up with the disk drives.

Just to put a number on things, let's say you've got a RAID 5 array of five drives with 5ms typical access time (pretty optimistic). Each drive can then do about 200 I/Os per second, so you could sustain something like 1000 random reads per second (with zero writes), or 250 writes (with zero reads). Or, at something like 80% reads, a total of 625 (mixed) I/Os per second. With extensive caching at the RAID controller you can get higher numbers, but on modest sized systems like the one you're discussing, it's almost always a better idea to put extra cache into the server instead of the disk subsystem.

FC HBAs are often attached to very large disk arrays (sometimes thousands of drives), where the number of I/Os per second the *HBA* can sustain can become a limiting factor. It's rare to see a SCSI RAID subsystem that can support more than a couple of dozen drives, which puts a pretty low upper limit on the number of random I/Os per second. In any event, vendors quote raw I/Os per second for FC HBAs because that's all they can measure, as there's no disk involved. The vendor of the disk subsystem that you attach to your FC HBA will have a set of performance numbers, and those can be compared to the SCSI RAID controller performance figures.

On a small system, the FC HBA will simply not be the bottleneck for random I/Os; the drive array *will* be the bottleneck. It doesn't matter that your HBA can do 40,000 I/Os per second if it's only talking to a drive subsystem with a dozen disks that can hit 1500 I/Os per second only with a tailwind. OTOH, hang 30 of those subsystems on your SAN, and you're going to be limited by the HBA (assuming you've only got the one host).

> Let's say I have a machine with one hard drive. SQL Server runs on that
> machine and attempts to read 4,000 bytes of data sequentially

SQL Server (and most other DBMSs) will typically format its datasets in blocks that are some power-of-2 multiple of 4KB. It will then read or write those blocks as units (or in groups of blocks if it can).

> from the drive. This would count as one I/O operation, correct? [...]
> So what exactly defines an "I/O" on a RAID device, be it SCSI RAID or
> FC RAID?

Loosely, an I/O is a single read or write operation. The size is context-dependent, but in the case of a DBMS it's typically going to be the page size for the table or table space. If you've got hardware RAID, the host will see a single I/O for either a read or a write; the RAID controller will issue multiple I/Os to the attached devices as necessary.

For example, let's say you have an FC HBA in the host connected to a RAID disk subsystem that's got a bunch of disk drives on an internal SCSI channel. Your database writes a disk page. There will be a single I/O across the fiber from the host to the RAID controller, and then (assuming no fortuitous caching) *four* I/Os across the internal SCSI channel from the RAID controller to the disk drives.
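The arithmetic above fits in a few lines, and it recurs throughout the rest of the thread, so here it is as a back-of-the-envelope Python sketch (not a sizing tool; the 5 ms access time and the four back-end I/Os per RAID5 write are the assumptions stated above):

    # Mixed-workload IOPS estimate for a small RAID5 array.
    def raid5_mixed_iops(n_drives, access_ms, read_fraction, write_penalty=4):
        per_drive = 1000 / access_ms            # ~200 back-end I/Os per second
        backend_budget = n_drives * per_drive   # ~1000 for the 5-drive array
        # Each front-end read costs 1 back-end I/O; each write costs 4.
        cost = read_fraction * 1 + (1 - read_fraction) * write_penalty
        return backend_budget / cost

    print(raid5_mixed_iops(5, 5, 1.0))   # ~1000: pure random reads
    print(raid5_mixed_iops(5, 5, 0.0))   # ~250:  pure writes
    print(raid5_mixed_iops(5, 5, 0.8))   # ~625:  80% reads, 20% writes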
#5
This is really great information. I apologize for the basic questions, but I've only been examining this stuff for the better part of one day. Let me ask you a few follow-ups...

> Just to put a number on things, let's say you've got a RAID 5 array of
> five drives with 5ms typical access time (pretty optimistic). Each
> drive can then do about 200 I/Os per second, so you could sustain
> something like 1000 random reads per second (with zero writes)....

I don't quite understand this concept. You've got five drives, each of which can handle 200 I/Os per second, and you're multiplying 5*200 to get 1000 IOPs for the array. I understand your calculation, but I'm not sure why it works as you state. In a trivial example, let's say the RAID controller is instructed to read 5 bytes of data. This is considered one I/O by the RAID controller, but doesn't the RAID controller then have to issue *5* read commands, one to each disk? My understanding of RAID (as it applies to reading data) is that the 5 disks would always be accessed simultaneously in order to speed up the read process. So for each I/O read-request that the RAID controller receives, it has to issue 5 I/O requests, one to each drive. It seems that the RAID controller would *still* be limited to 200 IOPs, regardless of how many drives are in the array. Why is it that you say the RAID controller can actually handle 1000 IOPs? I don't understand.

> With extensive caching at the RAID controller you can get higher
> numbers, but on modest sized systems like the one you're discussing,
> it's almost always a better idea to put extra cache into the server
> instead of the disk subsystem.

When you say that the extra cache should be put in the server but not on the RAID controller or disk subsystem, what do you mean exactly? Where in the server would I want to increase the cache?

> FC HBAs are often attached to very large disk arrays (sometimes
> thousands of drives), where the number of I/Os per second the *HBA* can
> sustain can become a limiting factor. [...] The vendor of the disk
> subsystem that you attach to your FC HBA will have a set of performance
> numbers, and those can be compared to the SCSI RAID controller
> performance figures.

Ah, ok... this clarifies things a bit. I think I finally have a basic understanding of what an HBA is. So an HBA is a "host bus adapter." It lives in the server [in a PCI slot, I assume]. The HBA has no idea how many drives are in the array; it just passes I/O requests over a 2Gb/s fibre cable, and it can pass up to 40,000 of these requests/second [using the example from the Dell site in my previous post]. At the other end of the cable is that rack-mountable box containing all of the drives. Are you saying that the real brains of the RAID lie within that box instead of the HBA card? So I really need to be asking myself "how many IOPs can those drives handle?", because it's the IOPs limitation of the DRIVES, not the HBA card, that is my bottleneck. Is this correct?

> Loosely, an I/O is a single read or write operation. [...]

I think I understand your explanation, but again, see my first question above. In the simpler case of doing a single *read* operation, is the single I/O request actually morphed into X requests, where X is the number of drives in the array, since each drive will have to be touched in order to perform the read?

David
#6
On Thu, 27 Nov 2003 16:20:23 GMT, David Sworder wrote:
> I don't quite understand this concept. You've got five drives, each of
> which can handle 200 I/Os per second, and you're multiplying 5*200 to
> get 1000 IOPs for the array. [...] So for each I/O read-request that
> the RAID controller receives, it has to issue 5 I/O requests, one to
> each drive. It seems that the RAID controller would *still* be limited
> to 200 IOPs, regardless of how many drives are in the array. Why is it
> that you say the RAID controller can actually handle 1000 IOPs? I don't
> understand.

Here comes the term "stripe size". This is the number of consecutive bytes allocated on the same disk, so a small read touches only one drive, leaving the other drives free to serve other requests. Depending on your performance requirements you will choose a small or large stripe size (8k-64k, or even much larger).

/hjj
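In other words, a read that fits within one stripe lands on a single member disk, which is why five disks can serve roughly five independent small reads at once. A sketch of the mapping (RAID0-style layout for simplicity, since RAID5 additionally rotates one parity block per stripe row; the 8 KB stripe and 5-disk count are just example values):

    # Map a logical byte offset to (member disk, offset within that disk).
    STRIPE_SIZE = 8 * 1024   # 8 KB, from the range mentioned above
    N_DISKS = 5

    def locate(offset: int) -> tuple[int, int]:
        stripe_no = offset // STRIPE_SIZE
        disk = stripe_no % N_DISKS
        disk_offset = (stripe_no // N_DISKS) * STRIPE_SIZE + offset % STRIPE_SIZE
        return disk, disk_offset

    assert locate(0) == (0, 0)      # the first stripe lives on disk 0...
    assert locate(8191)[0] == 0     # ...as does every byte within it
    assert locate(8192)[0] == 1     # the next stripe moves to the next disk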
#7
> Here comes the term "stripe size". This is the number of consecutive
> bytes allocated on the same disk. Depending on your performance
> requirements you will choose a small or large stripe size (8k-64k, or
> even much larger).

Ha! Just when I think I'm beginning to get a handle on things, a new term/concept comes along that reveals just how ignorant I really was (am). Ok... "stripe size"... So in a RAID array of 5 disks with a stripe size of 8k, if I submit a request to the RAID controller to write 5,000 bytes, these bytes will not be scattered equally across all drives? Since the size of the data being written is less than the stripe size, all of the data could conceivably be written to one disk?
#8
"David Sworder" wrote in message ... This is really great information. I apologize for the basic questions, but I've only been examining this stuff for the better part of one day. Let me ask you a few follow ups... Just to put a number on things, let's say you've got a RAID 5 array of five drives with 5ms typical access time (pretty optimistic). So each drive can do about 200 I/Os per second. So you could sustain something like 1000 random reads per second (with zero writes).... Just to be a bit more complete: As Robert noted, 5 ms. for an average single random access is a bit optimistic: the fastest current 15Krpm drives take about 5.5 ms., 10Krpm drives take more like 7 - 8 ms., and 7200 rpm ATA drives take 12 - 13 ms. However, that's for requests submitted serially, such that one request is satisfied before the next is submitted. If the workload performs many tasks in parallel such that multiple requests can be submitted without waiting for any to complete (as FC and SCSI disks allow but most ATA disk to not - yet), the average latency goes up (because all but the first one satisfied is waiting in a queue) but the throughput does as well (because the disk can pick an optimal order in which to satisfy them that minimizes the latency betweeen them): if your request stream has sufficient parallelism, the throughput of an individual disk can easily double - though each request will sustain on average much more latency than it would in a serial stream, so if individual response times are critical spreading the requests across a larger array will improve it even though the per-disk throughput will decrease. I don't quite understand this concept. You've got five drives, each of which can handle 200 I/Os per second. You're multiplying 5*200 to get 1000 IOPs for the array. I understand your calculation but I'm not sure why it works as you state. In a trivial example, let's say the RAID controller is instrutcted to read 5 bytes of data. This is considered one IO by the RAID controller, but doesn't the RAID controller then have to issue *5* read commands, one to each disk? My understanding of RAID (as it applies to reading data) is that the 5 disks would always be accessed simultaneously in order to speed up the read process. So for each IO read-request that the RAID controller receives, it has to issue 5 IO requests, one to each drive. So it seems that the RAID controller would *still* be limited to 200 IOPs, regardless of how many drives on are on the array. Why is it that you say the reality of the situation is that the RAID controller can actually handle 1000 IOPs? I don't understand. As already noted, most RAID implementations do not work this way: instead, data is spread across the disks in the array in coarser chunks - usually no smaller than 4 KB per disk, often 64 KB per disk, and there are good reasons in most workloads to make them even larger. Some early implementations of RAID-3 distributed the data at finer grain (much as you describe above), but I've never heard of RAID-0, -1, -4, or -5 doing so. With extensive caching at the RAID controller you can get higher numbers, but on modest sized systems like the one you're discussing, it's almost always a better idea to put extra cache into the server instead of the disk subsystem. When you say that the extra cache should be put in the server but not on the RAID controller or disk subsystem, what do you mean exactly? Where in the server would I want to increase the cache? 
Just adding server RAM will normally suffice: the operating system should put it to good use caching data for most workloads, though a few (those that perform lots of small writes and require that each complete before the next is submitted) might better benefit from cache in the array controller. Having *some* cache in the controller that allows it to defer disk writes until a convenient opportunity (and hence significantly decrease their overhead) is desirable, though. It must be non-volatile (such that its contents aren't lost if power fails: some people trust a simple external UPS to suffice here, but having a back-up battery right on the array cache card tends to be safer), and to provide safety equivalent to the RAID array behind it it really needs to be duplicated (otherwise, it becomes a single point of failure). - bill |
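Bill's serial-versus-queued point can be put in toy numbers (the doubling factor is his "can easily double" figure rather than a measurement, and the queue depth is an arbitrary example; average latency then follows from Little's law):

    # Serial vs. queued random I/O on one disk, with illustrative numbers.
    serial_access_ms = 7.0                    # a 10Krpm-class random access
    serial_iops = 1000 / serial_access_ms     # ~143 I/Os/sec, one at a time

    queued_iops = 2 * serial_iops             # optimistic gain from reordering
    queue_depth = 8                           # requests kept in flight
    # Little's law: average latency = requests in flight / throughput.
    queued_latency_ms = queue_depth / queued_iops * 1000

    print(f"serial: {serial_iops:.0f} IO/s at {serial_access_ms:.0f} ms each")
    print(f"queued: {queued_iops:.0f} IO/s, but ~{queued_latency_ms:.0f} ms per request")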
#9
David Sworder wrote:
> > Here comes the term "stripe size". This is the number of consecutive
> > bytes allocated on the same disk. Depending on your performance
> > requirements you will choose a small or large stripe size (8k-64k, or
> > even much larger).
>
> Ok... "stripe size"... So in a RAID array of 5 disks with a stripe size
> of 8k, if I submit a request to the RAID controller to write 5,000
> bytes, these bytes will not be scattered equally across all drives?
> Since the size of the data being written is less than the stripe size,
> all of the data could conceivably be written to one disk?

You nailed it. Stripe size is the minimum size of a write to a physical disk in the array. Trying to allocate evenly at the byte level to each disk would be insane in terms of the effect on performance.

--
Nik Simpson
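A quick way to sanity-check this is to count how many member disks a given write touches (assuming 8 KB stripes and 5 disks as in the example, and ignoring parity; note that alignment matters, since even a smaller-than-stripe write can straddle a stripe boundary):

    # How many member disks does a write of `length` bytes at `offset` touch?
    STRIPE = 8 * 1024

    def disks_touched(offset: int, length: int, n_disks: int = 5) -> int:
        first = offset // STRIPE
        last = (offset + length - 1) // STRIPE
        return min(last - first + 1, n_disks)

    print(disks_touched(0, 5000))        # 1: the whole write fits in one stripe
    print(disks_touched(6000, 5000))     # 2: it straddles a stripe boundary
    print(disks_touched(0, 64 * 1024))   # 5: a large write spans every disk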
#10
"Nik Simpson" wrote in
: Ok... "stripe size"... So in a RAID array of 5 disks with a stripe size of 8k, if I submit a request to the RAID controller to write 5,000 bytes, these bytes will not be scattered equally across all drives? Since the size of the data being written is less than the stripe size, all of the data could conceivably written to one disk? You nailed it. Stripe size is the minimum size of a write to a physical disk in the array. Trying allocate evenly at the byte level to each disk would be insane in terms of the effect on performance. Unless you're using RAID3 where the stripe size is basically one bit -- /Jesper Monsted |