#11
Jesper Monsted wrote:
> "Nik Simpson" wrote:
>>> Ok... "stripe size"... So in a RAID array of 5 disks with a stripe size of 8k, if I submit a request to the RAID controller to write 5,000 bytes, these bytes will not be scattered equally across all drives? Since the size of the data being written is less than the stripe size, all of the data could conceivably be written to one disk?
>> You nailed it. Stripe size is the minimum size of a write to a physical disk in the array. Trying to allocate evenly at the byte level to each disk would be insane in terms of the effect on performance.
> Unless you're using RAID3, where the stripe size is basically one bit.

IIRC, the original RAID definition for RAID3 is striping at the byte level, not the bit level; perhaps you are thinking of RAID2.

-- Nik Simpson
#12
> As already noted, most RAID implementations do not work this way: instead, data is spread across the disks in the array in coarser chunks - usually no smaller than 4 KB per disk, often 64 KB per disk, and there are good reasons in most workloads to make them even larger. Some early implementations of RAID-3 distributed the data at finer grain (much as you describe above), but I've never heard of RAID-0, -1, -4, or -5 doing so.

Bill / Nik / Robert,

Thanks, guys. This is really great information. I need to keep an eye on the total number of IOs/second that my SQL Server is generating. I've learned from this thread that if I have a reasonably small number of disks in my RAID, I can estimate my maximum number of IOPs by multiplying the IOPs rating of an individual disk by the number of disks in the array.

I've also learned from our discussion of "stripe size" that if SQL Server decides to read some data whose size is LESS THAN the stripe size, it may well read this data from a single disk as opposed to reading all of the disks in parallel. Fair enough... but let's say SQL Server needs to read 65k of data [perhaps it's doing a table scan], let's say the stripe size is 64k, and let's say I have a RAID0 array (just to keep the example simple) with 10 disks. What happens in this scenario? Here's what I'm thinking:

- SQL Server sends a single IO request to the HBA. Windows registers this in its Performance Monitor as one IO request.
- The HBA realizes that the first 64k of data that it needs to read is on Disk#0 and the final 1k is on Disk#1. It generates two IO requests, one for each disk, and submits them in parallel.

So the bottom line is: each IO request of 64k or less will generate 1 IO. Each read request that is asking for *more* than 64k will very likely generate MORE than one IO request, since more than one disk needs to be touched. So when trying to estimate the number of IOs/sec that my application will require, I need to consider the number of reads/writes that will exceed the stripe size, since these operations, which Windows perceives as single IOs, are in reality generating multiple IOs.

Is this correct?

David
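A rough back-of-the-envelope sketch of the fan-out David describes, in Python: one logical read becomes one physical IO per disk it touches. The chunk size, disk count, and per-disk IOPS figure are assumed round numbers for illustration, and cache effects are ignored.

```python
# Estimate how many physical disk IOs a single logical request generates,
# assuming a RAID-0 layout and a request smaller than a full stripe (so each
# touched disk is hit once).

CHUNK_SIZE = 64 * 1024    # assumed per-disk chunk size
NUM_DISKS = 10            # assumed RAID-0 array
PER_DISK_IOPS = 150       # assumed rating of one spindle (made-up figure)

def physical_ios(offset, length):
    """Number of per-disk IOs one logical request generates (ignoring cache)."""
    first_chunk = offset // CHUNK_SIZE
    last_chunk = (offset + length - 1) // CHUNK_SIZE
    return last_chunk - first_chunk + 1

# A 65k read starting at a chunk boundary spans two chunks -> two disks.
fan_out = physical_ios(0, 65 * 1024)
print(fan_out)                                # 2

# Logical requests/sec the array could sustain if every request behaved like this:
print(NUM_DISKS * PER_DISK_IOPS // fan_out)   # ~750
```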
#13
David Sworder wrote:
> [...]
> So when trying to estimate the number of IOs/sec that my application will require, I need to consider the number of reads/writes that will exceed the stripe size, since these operations, which Windows perceives as single IOs, are in reality generating multiple IOs. Is this correct?

Yup, you pretty much got it.

-- Nik Simpson
#14
"David Sworder" wrote in message ... .... So the bottom line is: Each IO request of 64k or less will generate 1 IO. Only if the request happens to be perfectly aligned with the stripe layout. Otherwise, the best you can guarantee is the each (read) request equal to or smaller than the per-disk 'chunk' size will require no more than 2 (parallel) disk accesses. That's one reason that larger chunk sizes are desirable: they minimize the likelihood that a single request will span multiple disks (at least in situations where request sizes vary: if you can control the environment such that request sizes are always appropriately aligned and never exceed your chunk size, then there's no reason to increase that chunk size). Accessing data from multiple disks in parallel does improve response time for large requests, but even if that's all you're interested in there's little reason to use a size any less than 64 KB - 128 KB; if large-request latency is not a very important aspect of your workload, then chunk sizes in the multi-megabyte range may be appropriate, since they'll minimize multiple-disk seeks over the widest range of request sizes and hence maximize throughput if the workload has a good deal of parallelism in it. Each read-request that is asking for *more* than 64k will very likely generate MORE than one IO request since more than one disk needs to be touched. Unless part of the request hits in cache, it *will* generate more than a single request if it exceeds the chunk size - no 'very likely' about it. So when trying to estimate the number of IOs/sec that my application will require, I need to consider the number of reads/writes that will exceed the stripe size since these operations, which Windows perceives as single IOs, are in reality generating multiple IOs. Is this correct? And whatever multiple disk I/Os are generated by write activity. Large writes (that span an entire array stripe) are relatively more efficient: a full-stripe write doesn't have to perform any reads at all, it just plunks down the data on the n-1 data disks in the stripe and calculates parity directly from that data for the parity chunk. Reasonably smart arrays perform intermediate optimizations, such that the number of disk accesses is minimized (e.g., if you're writing to all but one data disk in the stripe, it's cheaper to read the remaining unwritten chunk than it would be to read all the chunks on the disks that you're modifying). - bill |
#15
"Nik Simpson" wrote in message ...
IIRC, the original RAID definition for RAID3 is striping at the byte level, not the bit level, perhaps you are thinking of RAID2. Both RAID2 and RAID3 are (effectively) striped at the bit level - the smallest addressable unit ("sector" or "block") of the array is split across all the drives, thus reading (or writing) that unit requires hitting all those drives (hopefully in parallel). What's different is how the error correction works. In RAID2 an EC scheme is used on a bit-by-bit basis, in RAID3 you've got a block parity scheme just like in RAID4/5. I've never actually seen a RAID2 implementation, but it's possible someone has one somewhere. The point of RAID2/3 is to improve *sequential* I/O performance. Random I/O performance is that of a single drive, but sequential performance is improved proportionally to the number of data disks in the array. Typically RAID3 arrays have the spindles synchronized for best performance. Mostly used by the HPC folks. |
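A quick arithmetic sketch of that RAID2/3 trade-off in Python; the per-drive numbers are made-up round figures for illustration, not measurements.

```python
# Every sector is split across all data disks, so every request occupies every
# spindle: random IOPS stays at a single drive's rate, while sequential
# bandwidth scales with the number of data disks.

data_disks = 4
drive_iops = 150    # assumed random IOPS of one spindle
drive_mb_s = 50     # assumed sequential MB/s of one spindle

print("random IOPS        :", drive_iops)               # same as one drive
print("sequential MB/s    :", drive_mb_s * data_disks)  # scales with data disks
```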
#16
"Nik Simpson" wrote in message . ..
David Sworder wrote: Here comes the term "stripe size". This is the number of consequtive bytes allocated on the same disc. Depending on your performance requirement you will chose a small or large stripe size. (8k-64k or even much larger) Ha! Just when I think I'm beginning to get a handle on things, a new term/concept comes along that reveals just how ignorant I really was (am). Ok... "stripe size"... So in a RAID array of 5 disks with a stripe size of 8k, if I submit a request to the RAID controller to write 5,000 bytes, these bytes will not be scattered equally across all drives? Since the size of the data being written is less than the stripe size, all of the data could conceivably written to one disk? You nailed it. Stripe size is the minimum size of a write to a physical disk in the array. Trying allocate evenly at the byte level to each disk would be insane in terms of the effect on performance. That's not correct. The minimum size of a write on all RAID5 arrays remains a single sector. However any write smaller than a complete stripe requires a read-modify-write cycle for the appropriate data and parity blocks. Often you get to define the stripe size indirectly by specifying a per-disk block size (often that's the 64KB number that's tossed around). The strip size for RAID4/5 is then (n-1) times the block size (for the five drive array being discussed, that would result in a 256KB stripe, a six drive array would have a 320KB stripe). Vendor usage of the terms block and stripe are often more than a bit confusing. Ideally, you'd like all your (random) reads to fit within a single block (which would allow the read to be satisfied by hitting only a single disk), and all your writes to cover an entire stripe (which would allow the stripe to be written without the read-modify-write cycle). Obviously those two goals conflict. For most database workloads, reads dominate, and the writes that do occur tend to be tiny in relation to practical stripe sizes (so you almost never get a full stripe write). So most database applications just set up a nice large stripe. |
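A minimal Python sketch of the block/stripe arithmetic above, assuming the 64KB per-disk block size used as the example in this thread (the constant and helper name are illustrative, not a recommendation).

```python
# RAID-4/5 terminology: the per-disk "block" (chunk) is what you configure;
# a stripe is (n - 1) data chunks plus one chunk of parity, so usable data
# per stripe is (n - 1) * block_size.

BLOCK_SIZE = 64 * 1024   # assumed per-disk block size

def raid5_stripe_size(total_disks, block_size=BLOCK_SIZE):
    """Usable data bytes per stripe: (n - 1) data chunks, one chunk of parity."""
    return (total_disks - 1) * block_size

print(raid5_stripe_size(5) // 1024)   # 256 (KB) for the five-drive array discussed
print(raid5_stripe_size(6) // 1024)   # 320 (KB) for a six-drive array
```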
#18
Malcolm Weir wrote:
> On 28 Nov 2003 23:04:02 -0800, (Robert Wessel) wrote:
> Someone did have a RAID2 system, using 37 disks: 32 data plus 5 ECC, and a 16KB "native" block size. It was one of the HPC manufacturers (Thinking Machines, perhaps).

That's what my admittedly foggy memory says as well ;-)

-- Nik Simpson