#11
Bill Todd wrote:
> wrote in message ...
>
> > A bit of googling led to:
> >
> > http://www.specbench.org/sfs97r1/res...30623-00158.html
> >
> > which reports 1.5 ms average response time at 25K NFS ops per second
> > for a NetApp GF960 backed by a 9980V with 32GB of cache. Since that's
> > way faster than disk access time, this outstanding performance is
> > visibly due to the large cache.
>
> My vague recollection is that a fair amount of the LADDIS benchmark (of
> which the SPEC benchmark that you refer to may be a descendant) consists
> of metadata operations (directory look-ups, stats, etc.) which benefit
> from caching out of proportion to the amount of space that the target
> metadata consumes on disk. Depending upon how closely your workload
> resembles the NFS benchmark it may or may not be representative for you.

I guess it's a matter of worst-case working-set size. I should try to
find out how many GB of distinct data get accessed in any five-minute
period (worst case) over the course of a week.

> > My specific interest is the possibility of a 0.5TB NFS/CIFS server
> > based on dual Opterons with 16GB of RAM and four 250GB SATA drives
> > using RAID1+0 -- or, better yet, using a RAID0 scheme where each write
> > is replicated on its mod4-successor drive.
>
> The latter is called 'chained declustering'. I'm not aware of any
> commercial RAIDs that support it, and for an array of only 4 disks it
> would provide relatively little advantage over striping across two
> mirrored pairs (for larger arrays, it offers at least the potential to
> distribute Read loads from a failed disk fairly evenly around the
> survivors, but if you have any significant percentage of Writes then the
> imbalance under failure with the more conventional approach won't be all
> that much worse - and under normal operation if the RAID algorithm is
> competent there'll be virtually no difference at all between the two
> designs).

If the caching works well, the read load should be relatively small.

> > The possible OSes include:
> > - Linux, whose NFS service consistently breaks down under load.
> > - *BSD, whose file systems lack some features we need.
> > - Windows2003 Storage Server, which is suspect for largely
> >   historical reasons.
> > - SolarisX86, the currently favorite option.
> > The workload is software development from roughly 250 Linux-based
> > workstations, a timeshared Linux server with an average of 100
> > logged-in users, and five Windows2003TS systems each with an average
> > of 20 logged-in users. Obviously, four is a very small number of
> > spindles for such a load,
>
> That's something of an understatement: it's a *ridiculously* small
> number, unless your workload is extremely atypical.
>
> > and the issue is the effectiveness of the cache in diminishing the
> > per-spindle load.
>
> Whatever 'the issue' may be, it's probably not that. Studies of this
> kind of central server system have consistently shown that until the
> server cache size starts to approach the *aggregate* cache size of the
> clients, the server cache isn't all that much use - and with 250 client
> workstations plus another 100 users on a timesharing system the
> aggregate effective client cache likely exceeds 16 GB by a considerable
> margin.

So, if something's not in the client's cache, it's not likely to be in
the server's cache either.

> You'd likely be better off
>
> 1) trading a good chunk of that server memory for more disks,
>
> 2) perhaps figuring out where to scrounge up the money for yet more
> disks, and/or
>
> 3) substituting a larger number of less expensive spindles for the
> somewhat more costly 250 GB devices (if you only need a total of 0.5 TB,
> using something like twelve 80 GB drives - which seems like about the
> bottom of the 'sweet spot' in cost/GB today, though it continues to
> climb as we speak - in mirrored pairs might be reasonable, though more
> would be better).

IDE-class drives don't hold up well under our load. My hope is/was that
a reasonably large cache would, in addition to improving access times,
significantly diminish the load on the disks and thus diminish their
failure rate.

Thanks,
Tom Payne
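The "mod4-successor" layout Bill identifies as chained declustering can be
sketched in a few lines. This is a hypothetical illustration of the
placement rule only, not any shipping RAID implementation; the drive count
and function names are mine:

```python
# Hypothetical sketch of the "mod4-successor" (chained declustering)
# placement rule discussed above: each logical block's primary copy goes
# on one drive and its mirror on the next drive around the ring.

N_DRIVES = 4  # the four-drive array from the post

def placement(block, n=N_DRIVES):
    """Return (primary_drive, mirror_drive) for a logical block."""
    primary = block % n
    mirror = (primary + 1) % n  # the mod-n successor holds the replica
    return primary, mirror

def read_drive(block, failed=None, n=N_DRIVES):
    """Pick a drive to read from; the mirror covers a failed primary."""
    primary, mirror = placement(block, n)
    return mirror if primary == failed else primary
```

With a larger n this is where the read-rebalancing advantage Bill mentions
shows up: blocks whose primary was on the failed drive fall to its
successor, while blocks the failed drive merely mirrored keep being read
from their primaries elsewhere in the ring.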
#12
wrote in message ...

> Bill Todd wrote:
>
> ...
>
> > You'd likely be better off
> >
> > 1) trading a good chunk of that server memory for more disks,
> >
> > 2) perhaps figuring out where to scrounge up the money for yet more
> > disks, and/or
> >
> > 3) substituting a larger number of less expensive spindles for the
> > somewhat more costly 250 GB devices (if you only need a total of
> > 0.5 TB, using something like twelve 80 GB drives - which seems like
> > about the bottom of the 'sweet spot' in cost/GB today, though it
> > continues to climb as we speak - in mirrored pairs might be
> > reasonable, though more would be better).
>
> IDE-class drives don't hold up well under our load. My hope is/was that
> a reasonably large cache would, in addition to improving access times,
> significantly diminish the load on the disks and thus diminish their
> failure rate.

Spreading the load across a significantly larger number of drives will
likely decrease the load on individual drives considerably more than 16 GB
of server cache would - as long as the stripe unit is coarse enough to
make it unlikely that most accesses will hit more than a single drive
(1 - 4 MB per disk chunk gets you about as much as can be gotten in this
area: once the per-drive transfer time becomes close to an order of
magnitude larger than the average seek time, disk utilization is close to
maximum - though unless you've got a lot of very large, streaming-type
data accesses you may not need to get that coarse).

Last I knew WD made a (single-platter?) 80 GB ATA drive that they were
willing to warranty for 3 years and (I think) spec for at least limited
server-style use. Of course, that won't get you tagged-command queuing
(SATA may not either until SATA II becomes common), which can create
large-write performance problems (unless you enable write-back caching at
the disk, which has dubious impacts on data integrity in many situations,
or have stable write-back caching in your controller that's mirrored to
achieve the same level of redundancy that you have at the disks) - but
them's the breaks in the econo-server biz.

- bill
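Bill's rule of thumb about chunk size can be put into a toy model: time
per access = (seek + rotational latency) + media transfer of one chunk.
The 8 ms overhead and 50 MB/s media rate below are illustrative
assumptions of mine, not figures from the thread:

```python
# Back-of-the-envelope model of stripe-unit coarseness vs. drive
# utilization: a chunk-sized access costs a fixed positioning overhead
# plus the media transfer, so utilization = transfer / (overhead + transfer).

OVERHEAD_S = 0.008   # assumed avg seek + rotational latency, seconds
MEDIA_RATE = 50e6    # assumed sustained media transfer rate, bytes/second

def transfer_utilization(chunk_bytes):
    """Fraction of drive time spent moving data for one chunk-sized access."""
    xfer = chunk_bytes / MEDIA_RATE
    return xfer / (OVERHEAD_S + xfer)

for chunk in (64e3, 1e6, 4e6):
    print(f"{int(chunk / 1e3):>5} KB chunk -> "
          f"{transfer_utilization(chunk):.0%} utilization")
```

With these assumed numbers a 64 KB chunk keeps the drive below 15%
utilization, while 1 - 4 MB chunks land in roughly the 70 - 90% range,
i.e. transfer time approaching an order of magnitude beyond the
positioning time, as Bill describes.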
#13
John S. wrote:
> Even at 1GB, that's not a lot of write cache... even my old Sun T3
> trays have 1GB of write cache... ;-)

Nope, T3s only have 256MB of cache. The T3+s have 1GB, but only a
maximum of 256MB of that will ever be used for write cache. If you've
got a partner pair then mirroring will eat into that somewhat too.

Scott.
#14
Bill Todd wrote:
> wrote in message ...
>
> > Bill Todd wrote:
> >
> > ...
> >
> > > You'd likely be better off
> > >
> > > [snip: suggestions 1-3, quoted in full earlier in the thread]
> >
> > IDE-class drives don't hold up well under our load. My hope is/was
> > that a reasonably large cache would, in addition to improving access
> > times, significantly diminish the load on the disks and thus diminish
> > their failure rate.
>
> Spreading the load across a significantly larger number of drives will
> likely decrease the load on individual drives considerably more than
> 16 GB of server cache would - as long as the stripe unit is coarse
> enough to make it unlikely that most accesses will hit more than a
> single drive (1 - 4 MB per disk chunk gets you about as much as can be
> gotten in this area: once the per-drive transfer time becomes close to
> an order of magnitude larger than the average seek time, disk
> utilization is close to maximum - though unless you've got a lot of
> very large, streaming-type data accesses you may not need to get that
> coarse).

We've been using two RAID5 banks of seven ATA drives. They're less than
two years old and have been dropping like flies, so we recently replaced
all of them, rather than chance a two-drive failure.

> Last I knew WD made a (single-platter?) 80 GB ATA drive that they were
> willing to warranty for 3 years and (I think) spec for at least limited
> server-style use. Of course, that won't get you tagged-command queuing
> (SATA may not either until SATA II becomes common), which can create
> large-write performance problems (unless you enable write-back caching
> at the disk, which has dubious impacts on data integrity in many
> situations, or have stable write-back caching in your controller that's
> mirrored to achieve the same level of redundancy that you have at the
> disks) - but them's the breaks in the econo-server biz.

It's my impression that a large on-disk cache (e.g., 8MB) can recover
most of the write performance lost from lack of command queuing. But
that raises the question of the cost and availability of suitable NVRAMs
that plug into a PCI bus, or possibly even USB2, to stabilize the write
data.

Tom Payne
#15
wrote in message ...

> ...
>
> We've been using two RAID5 banks of seven ATA drives. They're less than
> two years old and have been dropping like flies, so we recently
> replaced all of them, rather than chance a two-drive failure.

The obvious question to ask is *why* they've been dropping like flies.
E.g., some low-end RAID enclosures pay little attention to things like
proper heat dissipation, and more don't pay any attention at all to
mechanical coupling between drives that can drastically increase the
amount of re-seeking they may have to do. Another question is what you
were using for a stripe size: as I noted, larger is usually better for
improving disk utilization/reducing seek activity. And if your workload
is at all write-intensive, using RAID-5 rather than per-disk mirroring
significantly drives up the access rate. Of course ATA drive quality
itself can vary as well - though I trust you weren't using the flakey IBM
units of recent memory. All that doesn't guarantee that there's an ATA
RAID that would work for your application, but unless you've explored
these areas there might be.

> > Last I knew WD made a (single-platter?) 80 GB ATA drive that they
> > were willing to warranty for 3 years and (I think) spec for at least
> > limited server-style use. Of course, that won't get you
> > tagged-command queuing (SATA may not either until SATA II becomes
> > common), which can create large-write performance problems (unless
> > you enable write-back caching at the disk, which has dubious impacts
> > on data integrity in many situations, or have stable write-back
> > caching in your controller that's mirrored to achieve the same level
> > of redundancy that you have at the disks) - but them's the breaks in
> > the econo-server biz.
>
> It's my impression that a large on-disk cache (e.g., 8MB) can recover
> most of the write performance lost from lack of command queuing.

It can't help at all for random Read loads (where CTQing allows the
drive to reorder independent requests for optimal throughput) or for
large Reads (where you'll miss a rev between each segment into which
they must be broken up due to ATA request-size limits - unless you use a
per-disk stripe unit that doesn't exceed the max supported request size,
which is too small for good overall disk utilization if large requests
are common). And while it can largely compensate for the similar size
limits on Write requests and allow some reordering of small random
Writes, enabling the disk's write-back cache has significant impacts on
data integrity for any application that depends on data being on the
platters when the disk returns status (e.g., any log-protected file
system like NTFS): some people are comfortable with assuming that a
server UPS suffices to guarantee that any write submitted to such a
drive will eventually complete, but make sure your configuration will
always flush the dirty data to disk before a drive reset is issued on,
e.g., a reboot following a server OS crash.

> But that raises the question of the cost and availability of suitable
> NVRAMs that plug into a PCI bus, or possibly even USB2, to stabilize
> the write data.

If you use the disks' own write-back caches it doesn't (with the caveats
noted above).

- bill
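The cost of missing a revolution between segments of a large read can be
made concrete with a toy model (the follow-up post notes that drive
read-ahead often hides this for Reads, though not for Writes). The
7200 RPM spindle, 50 MB/s media rate, and 128 KB (256-sector) request
limit are my assumptions for a drive of that era, not figures from the
thread:

```python
# Toy model of "miss a rev between each segment": a large request is
# split into MAX_REQ-sized commands, and each command boundary costs one
# full revolution of lost time before the next transfer can start.

RPM = 7200
REV_S = 60.0 / RPM      # one revolution, ~8.3 ms
MEDIA_RATE = 50e6       # assumed media rate, bytes/second
MAX_REQ = 128 * 1024    # assumed ATA request-size limit (256 sectors)

def large_read_throughput(total_bytes):
    """Effective bytes/sec when every segment costs one missed revolution."""
    segments = -(-total_bytes // MAX_REQ)    # ceiling division
    xfer_time = total_bytes / MEDIA_RATE
    return total_bytes / (xfer_time + segments * REV_S)

mb4 = 4 * 1024 * 1024
print(f"media rate {MEDIA_RATE / 1e6:.0f} MB/s, "
      f"effective {large_read_throughput(mb4) / 1e6:.1f} MB/s for a 4 MB read")
```

Under these assumptions a 4 MB read drops from a 50 MB/s media rate to
roughly 12 MB/s effective, which is why the reordering (or read-ahead)
matters so much.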
#16
"Bill Todd" wrote in message ...

> wrote in message ...
>
> ...
>
> > It's my impression that a large on-disk cache (e.g., 8MB) can recover
> > most of the write performance lost from lack of command queuing.
>
> It can't help at all for random Read loads (where CTQing allows the
> drive to reorder independent requests for optimal throughput) or for
> large Reads (where you'll miss a rev between each segment into which
> they must be broken up due to ATA request-size limits

Whoops - must not have quite woken up yet. The disk's normal read-ahead
facilities will often handle such large Reads at least fairly well, by
prefetching additional data into the disk's cache after the initial
request is satisfied so that it's present and ready when the next
sequential segment request comes along. The same is not true for large
Writes, however.

- bill
#17
Bill Todd wrote:
> wrote in message ...
>
> > ...
> >
> > We've been using two RAID5 banks of seven ATA drives. They're less
> > than two years old and have been dropping like flies, so we recently
> > replaced all of them, rather than chance a two-drive failure.
>
> The obvious question to ask is *why* they've been dropping like flies.
> E.g., some low-end RAID enclosures pay little attention to things like
> proper heat dissipation, and more don't pay any attention at all to
> mechanical coupling between drives that can drastically increase the
> amount of re-seeking they may have to do.

It's an ACNC array.

> Another question is what you were using for a stripe size: as I noted,
> larger is usually better for improving disk utilization/reducing seek
> activity.

Good question. I'll have to check.

> And if your workload is at all write-intensive, using RAID-5 rather
> than per-disk mirroring significantly drives up the access rate.

With the new drives, we'll have sufficient capacity to cover mirroring
and then some.

> Of course ATA drive quality itself can vary as well - though I trust
> you weren't using the flakey IBM units of recent memory.

One bank was DeathStars, but the other was Maxtors.

> All that doesn't guarantee that there's an ATA RAID that would work for
> your application, but unless you've explored these areas there might
> be.

Thanks for the perspective.

> > > [snip: the WD 80 GB drive / tagged-command queuing discussion,
> > > quoted in full earlier in the thread]
> >
> > It's my impression that a large on-disk cache (e.g., 8MB) can recover
> > most of the write performance lost from lack of command queuing.
>
> It can't help at all for random Read loads [snip] ... some people are
> comfortable with assuming that a server UPS suffices to guarantee that
> any write submitted to such a drive will eventually complete, but make
> sure your configuration will always flush the dirty data to disk before
> a drive reset is issued on, e.g., a reboot following a server OS crash.

Good advice. I'll check. (We aren't doing sensitive transaction
processing.)

> > But that raises the question of the cost and availability of suitable
> > NVRAMs that plug into a PCI bus, or possibly even USB2, to stabilize
> > the write data.
>
> If you use the disks' own write-back caches it doesn't (with the
> caveats noted above).

Good point.

Tom Payne
#19
wrote:
> A bit of googling led to:
>
> http://www.specbench.org/sfs97r1/res...623-00158.html
>
> which reports 1.5 ms average response time at 25K NFS ops per second
> for a NetApp GF960 backed by a 9980V with 32GB of cache. Since that's
> way faster than disk access time, this outstanding performance is
> visibly due to the large cache.

Oops! A bit more googling, this time at

http://www.spec.org/osg/sfs97r1/results/sfs97r1.html

showed that my interpretation of that data is completely wrong.
Specifically, another test of the GF960 without the 32GB cache (i.e.,
using only the 6GB of memory in the NetApp) yielded only slightly better
overall performance and poorer per-spindle performance. (NetApp seems to
be able to get at least 250-to-300 NFSops/sec per spindle almost
regardless of memory size.) To compare data from systems of vastly
different sizes, it appears that the key is to focus on NFSops/sec per
spindle vs. memory or cache size as a percentage of working-set size
(which according to SPEC documents is 1MB per NFSop/sec).

Tom Payne
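The normalization Tom proposes is easy to sketch. The two result rows
below are hypothetical placeholders, not published SPEC numbers; only
the 1 MB-per-NFSop/sec working-set rule comes from the text:

```python
# Normalize SPEC SFS97-style results for cross-system comparison:
# ops/sec per spindle, and cache as a fraction of the working set
# (SPEC sizes the working set at roughly 1 MB per requested NFSop/sec).

def normalize(ops_per_sec, spindles, cache_gb):
    """Return (ops/sec per spindle, cache as fraction of working set)."""
    ops_per_spindle = ops_per_sec / spindles
    working_set_gb = ops_per_sec * 1.0 / 1024   # 1 MB per op/sec
    return ops_per_spindle, cache_gb / working_set_gb

# (ops/sec, spindles, cache GB) - hypothetical configurations
for ops, spindles, cache in [(25000, 96, 32.0), (20000, 72, 6.0)]:
    per_spindle, frac = normalize(ops, spindles, cache)
    print(f"{ops} ops/s on {spindles} spindles: "
          f"{per_spindle:.0f} ops/s per spindle, "
          f"cache = {frac:.0%} of working set")
```

Plotting per-spindle throughput against that cache fraction is what lets
a 6 GB box and a 32 GB box be compared on the same axes.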