#11
Bill Todd wrote:
> wrote in message ...
>
> > A bit of googling led to:
> >
> > http://www.specbench.org/sfs97r1/res...30623-00158.html
> >
> > which reports 1.5 ms average response time at 25K NFS ops per second
> > for a NetApp GF960 backed by a 9980V with 32GB of cache. Since that's
> > way faster than disk access time, this outstanding performance is
> > visibly due to the large cache.
>
> My vague recollection is that a fair amount of the LADDIS benchmark (of
> which the SPEC benchmark that you refer to may be a descendant) consists
> of metadata operations (directory look-ups, stats, etc.) which benefit
> from caching out of proportion to the amount of space that the target
> metadata consumes on disk. Depending upon how closely your workload
> resembles the NFS benchmark it may or may not be representative for you.

I guess it's a matter of worst-case working-set size. I should try to
find out how many GB of distinct data get accessed in any five-minute
period (worst case) over the course of a week.

> > My specific interest is the possibility of a 0.5TB NFS/CIFS server
> > based on dual Opterons with 16GB of RAM and four 250GB SATA drives
> > using RAID1+0 -- or, better yet, using a RAID0 scheme where each write
> > is replicated on its mod4-successor drive.
>
> The latter is called 'chained declustering'. I'm not aware of any
> commercial RAIDs that support it, and for an array of only 4 disks it
> would provide relatively little advantage over striping across two
> mirrored pairs (for larger arrays, it offers at least the potential to
> distribute Read loads from a failed disk fairly evenly around the
> survivors, but if you have any significant percentage of Writes then the
> imbalance under failure with the more conventional approach won't be all
> that much worse - and under normal operation if the RAID algorithm is
> competent there'll be virtually no difference at all between the two
> designs).

If the caching works well, the read load should be relatively small.

> > The possible OSes include:
> > - Linux, whose NFS service consistently breaks down under load.
> > - *BSD, whose file systems lack some features we need.
> > - Windows2003 Storage Server, which is suspect for largely
> >   historical reasons.
> > - SolarisX86, the currently favorite option.
> > The workload is software development from roughly 250 Linux-based
> > workstations, a timeshared Linux server with an average of 100
> > logged-in users, and five Windows2003TS systems each with an average
> > of 20 logged-in users. Obviously, four is a very small number of
> > spindles for such a load,
>
> That's something of an understatement: it's a *ridiculously* small
> number, unless your workload is extremely atypical.
>
> > and the issue is the effectiveness of the cache in diminishing the
> > per-spindle load.
>
> Whatever 'the issue' may be, it's probably not that. Studies of this
> kind of central server system have consistently shown that until the
> server cache size starts to approach the *aggregate* cache size of the
> clients, the server cache isn't all that much use - and with 250 client
> workstations plus another 100 users on a timesharing system the
> aggregate effective client cache likely exceeds 16 GB by a considerable
> margin.

So, if something's not in the client's cache, it's not likely to be in
the server's cache either.

> You'd likely be better off
>
> 1) trading a good chunk of that server memory for more disks,
>
> 2) perhaps figuring out where to scrounge up the money for yet more
> disks, and/or
>
> 3) substituting a larger number of less expensive spindles for the
> somewhat more costly 250 GB devices (if you only need a total of 0.5 TB,
> using something like twelve 80 GB drives - which seems like about the
> bottom of the 'sweet spot' in cost/GB today, though it continues to
> climb as we speak - in mirrored pairs might be reasonable, though more
> would be better).

IDE-class drives don't hold up well under our load. My hope is/was that
a reasonably large cache would, in addition to improving access times,
significantly diminish the load on the disks and thus diminish their
failure rate.

Thanks,
Tom Payne
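The "mod4-successor" layout Bill identifies as chained declustering can be
sketched in a few lines. This is a hypothetical illustration of the
placement rule only, not any shipping RAID implementation; the drive count
and function names are mine:

```python
# Hypothetical sketch of the "mod4-successor" (chained declustering)
# placement rule discussed above: each logical block's primary copy goes
# on one drive and its mirror on the next drive around the ring.

N_DRIVES = 4  # the four-drive array from the post

def placement(block, n=N_DRIVES):
    """Return (primary_drive, mirror_drive) for a logical block."""
    primary = block % n
    mirror = (primary + 1) % n  # the mod-n successor holds the replica
    return primary, mirror

def read_drive(block, failed=None, n=N_DRIVES):
    """Pick a drive to read from; the mirror covers a failed primary."""
    primary, mirror = placement(block, n)
    return mirror if primary == failed else primary
```

With a larger n this is where the read-rebalancing advantage Bill mentions
shows up: blocks whose primary was on the failed drive fall to its
successor, while blocks the failed drive merely mirrored keep being read
from their primaries elsewhere in the ring.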
#12
wrote in message ...

> Bill Todd wrote:
>
> ...
>
> > You'd likely be better off
> >
> > 1) trading a good chunk of that server memory for more disks,
> >
> > 2) perhaps figuring out where to scrounge up the money for yet more
> > disks, and/or
> >
> > 3) substituting a larger number of less expensive spindles for the
> > somewhat more costly 250 GB devices (if you only need a total of
> > 0.5 TB, using something like twelve 80 GB drives - which seems like
> > about the bottom of the 'sweet spot' in cost/GB today, though it
> > continues to climb as we speak - in mirrored pairs might be
> > reasonable, though more would be better).
>
> IDE-class drives don't hold up well under our load. My hope is/was that
> a reasonably large cache would, in addition to improving access times,
> significantly diminish the load on the disks and thus diminish their
> failure rate.

Spreading the load across a significantly larger number of drives will
likely decrease the load on individual drives considerably more than 16 GB
of server cache would - as long as the stripe unit is coarse enough to
make it unlikely that most accesses will hit more than a single drive
(1 - 4 MB per disk chunk gets you about as much as can be gotten in this
area: once the per-drive transfer time becomes close to an order of
magnitude larger than the average seek time, disk utilization is close to
maximum - though unless you've got a lot of very large, streaming-type
data accesses you may not need to get that coarse).

Last I knew WD made a (single-platter?) 80 GB ATA drive that they were
willing to warranty for 3 years and (I think) spec for at least limited
server-style use. Of course, that won't get you tagged-command queuing
(SATA may not either until SATA II becomes common), which can create
large-write performance problems (unless you enable write-back caching at
the disk, which has dubious impacts on data integrity in many situations,
or have stable write-back caching in your controller that's mirrored to
achieve the same level of redundancy that you have at the disks) - but
them's the breaks in the econo-server biz.

- bill
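Bill's rule of thumb about chunk size can be put into a toy model: time
per access = (seek + rotational latency) + media transfer of one chunk.
The 8 ms overhead and 50 MB/s media rate below are illustrative
assumptions of mine, not figures from the thread:

```python
# Back-of-the-envelope model of stripe-unit coarseness vs. drive
# utilization: a chunk-sized access costs a fixed positioning overhead
# plus the media transfer, so utilization = transfer / (overhead + transfer).

OVERHEAD_S = 0.008   # assumed avg seek + rotational latency, seconds
MEDIA_RATE = 50e6    # assumed sustained media transfer rate, bytes/second

def transfer_utilization(chunk_bytes):
    """Fraction of drive time spent moving data for one chunk-sized access."""
    xfer = chunk_bytes / MEDIA_RATE
    return xfer / (OVERHEAD_S + xfer)

for chunk in (64e3, 1e6, 4e6):
    print(f"{int(chunk / 1e3):>5} KB chunk -> "
          f"{transfer_utilization(chunk):.0%} utilization")
```

With these assumed numbers a 64 KB chunk keeps the drive below 15%
utilization, while 1 - 4 MB chunks land in roughly the 70 - 90% range,
i.e. transfer time approaching an order of magnitude beyond the
positioning time, as Bill describes.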
#13
John S. wrote:
> Even at 1GB, that's not a lot of write cache... even my old Sun T3
> trays have 1GB of write cache... ;-)

Nope, T3s only have 256MB of cache. The T3+s have 1GB, but only a
maximum of 256MB of that will ever be used for write cache. If you've
got a partner pair then mirroring will eat into that somewhat too.

Scott.
#14
Bill Todd wrote:
> wrote in message ...
>
> > Bill Todd wrote:
> >
> > ...
> >
> > > You'd likely be better off
> > >
> > > [snip: suggestions 1-3, quoted in full earlier in the thread]
> >
> > IDE-class drives don't hold up well under our load. My hope is/was
> > that a reasonably large cache would, in addition to improving access
> > times, significantly diminish the load on the disks and thus diminish
> > their failure rate.
>
> Spreading the load across a significantly larger number of drives will
> likely decrease the load on individual drives considerably more than
> 16 GB of server cache would - as long as the stripe unit is coarse
> enough to make it unlikely that most accesses will hit more than a
> single drive (1 - 4 MB per disk chunk gets you about as much as can be
> gotten in this area: once the per-drive transfer time becomes close to
> an order of magnitude larger than the average seek time, disk
> utilization is close to maximum - though unless you've got a lot of
> very large, streaming-type data accesses you may not need to get that
> coarse).

We've been using two RAID5 banks of seven ATA drives. They're less than
two years old and have been dropping like flies, so we recently replaced
all of them, rather than chance a two-drive failure.

> Last I knew WD made a (single-platter?) 80 GB ATA drive that they were
> willing to warranty for 3 years and (I think) spec for at least limited
> server-style use. Of course, that won't get you tagged-command queuing
> (SATA may not either until SATA II becomes common), which can create
> large-write performance problems (unless you enable write-back caching
> at the disk, which has dubious impacts on data integrity in many
> situations, or have stable write-back caching in your controller that's
> mirrored to achieve the same level of redundancy that you have at the
> disks) - but them's the breaks in the econo-server biz.

It's my impression that a large on-disk cache (e.g., 8MB) can recover
most of the write performance lost from lack of command queuing. But
that raises the question of the cost and availability of suitable NVRAMs
that plug into a PCI bus, or possibly even USB2, to stabilize the write
data.

Tom Payne
#15
wrote in message ...

> ...
>
> We've been using two RAID5 banks of seven ATA drives. They're less than
> two years old and have been dropping like flies, so we recently
> replaced all of them, rather than chance a two-drive failure.

The obvious question to ask is *why* they've been dropping like flies.
E.g., some low-end RAID enclosures pay little attention to things like
proper heat dissipation, and more don't pay any attention at all to
mechanical coupling between drives that can drastically increase the
amount of re-seeking they may have to do. Another question is what you
were using for a stripe size: as I noted, larger is usually better for
improving disk utilization/reducing seek activity. And if your workload
is at all write-intensive, using RAID-5 rather than per-disk mirroring
significantly drives up the access rate. Of course ATA drive quality
itself can vary as well - though I trust you weren't using the flakey IBM
units of recent memory. All that doesn't guarantee that there's an ATA
RAID that would work for your application, but unless you've explored
these areas there might be.

> > Last I knew WD made a (single-platter?) 80 GB ATA drive that they
> > were willing to warranty for 3 years and (I think) spec for at least
> > limited server-style use. Of course, that won't get you
> > tagged-command queuing (SATA may not either until SATA II becomes
> > common), which can create large-write performance problems (unless
> > you enable write-back caching at the disk, which has dubious impacts
> > on data integrity in many situations, or have stable write-back
> > caching in your controller that's mirrored to achieve the same level
> > of redundancy that you have at the disks) - but them's the breaks in
> > the econo-server biz.
>
> It's my impression that a large on-disk cache (e.g., 8MB) can recover
> most of the write performance lost from lack of command queuing.

It can't help at all for random Read loads (where CTQing allows the
drive to reorder independent requests for optimal throughput) or for
large Reads (where you'll miss a rev between each segment into which
they must be broken up due to ATA request-size limits - unless you use a
per-disk stripe unit that doesn't exceed the max supported request size,
which is too small for good overall disk utilization if large requests
are common). And while it can largely compensate for the similar size
limits on Write requests and allow some reordering of small random
Writes, enabling the disk's write-back cache has significant impacts on
data integrity for any application that depends on data being on the
platters when the disk returns status (e.g., any log-protected file
system like NTFS): some people are comfortable with assuming that a
server UPS suffices to guarantee that any write submitted to such a
drive will eventually complete, but make sure your configuration will
always flush the dirty data to disk before a drive reset is issued on,
e.g., a reboot following a server OS crash.

> But that raises the question of the cost and availability of suitable
> NVRAMs that plug into a PCI bus, or possibly even USB2, to stabilize
> the write data.

If you use the disks' own write-back caches it doesn't (with the caveats
noted above).

- bill
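The cost of missing a revolution between segments of a large read can be
made concrete with a toy model (the follow-up post notes that drive
read-ahead often hides this for Reads, though not for Writes). The
7200 RPM spindle, 50 MB/s media rate, and 128 KB (256-sector) request
limit are my assumptions for a drive of that era, not figures from the
thread:

```python
# Toy model of "miss a rev between each segment": a large request is
# split into MAX_REQ-sized commands, and each command boundary costs one
# full revolution of lost time before the next transfer can start.

RPM = 7200
REV_S = 60.0 / RPM      # one revolution, ~8.3 ms
MEDIA_RATE = 50e6       # assumed media rate, bytes/second
MAX_REQ = 128 * 1024    # assumed ATA request-size limit (256 sectors)

def large_read_throughput(total_bytes):
    """Effective bytes/sec when every segment costs one missed revolution."""
    segments = -(-total_bytes // MAX_REQ)    # ceiling division
    xfer_time = total_bytes / MEDIA_RATE
    return total_bytes / (xfer_time + segments * REV_S)

mb4 = 4 * 1024 * 1024
print(f"media rate {MEDIA_RATE / 1e6:.0f} MB/s, "
      f"effective {large_read_throughput(mb4) / 1e6:.1f} MB/s for a 4 MB read")
```

Under these assumptions a 4 MB read drops from a 50 MB/s media rate to
roughly 12 MB/s effective, which is why the reordering (or read-ahead)
matters so much.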
#16
"Bill Todd" wrote in message ...

> wrote in message ...
>
> ...
>
> > It's my impression that a large on-disk cache (e.g., 8MB) can recover
> > most of the write performance lost from lack of command queuing.
>
> It can't help at all for random Read loads (where CTQing allows the
> drive to reorder independent requests for optimal throughput) or for
> large Reads (where you'll miss a rev between each segment into which
> they must be broken up due to ATA request-size limits

Whoops - must not have quite woken up yet. The disk's normal read-ahead
facilities will often handle such large Reads at least fairly well, by
prefetching additional data into the disk's cache after the initial
request is satisfied so that it's present and ready when the next
sequential segment request comes along. The same is not true for large
Writes, however.

- bill
#17
Bill Todd wrote:
> wrote in message ...
>
> > ...
> >
> > We've been using two RAID5 banks of seven ATA drives. They're less
> > than two years old and have been dropping like flies, so we recently
> > replaced all of them, rather than chance a two-drive failure.
>
> The obvious question to ask is *why* they've been dropping like flies.
> E.g., some low-end RAID enclosures pay little attention to things like
> proper heat dissipation, and more don't pay any attention at all to
> mechanical coupling between drives that can drastically increase the
> amount of re-seeking they may have to do.

It's an ACNC array.

> Another question is what you were using for a stripe size: as I noted,
> larger is usually better for improving disk utilization/reducing seek
> activity.

Good question. I'll have to check.

> And if your workload is at all write-intensive, using RAID-5 rather
> than per-disk mirroring significantly drives up the access rate.

With the new drives, we'll have sufficient capacity to cover mirroring
and then some.

> Of course ATA drive quality itself can vary as well - though I trust
> you weren't using the flakey IBM units of recent memory.

One bank was DeathStars, but the other was Maxtors.

> All that doesn't guarantee that there's an ATA RAID that would work for
> your application, but unless you've explored these areas there might
> be.

Thanks for the perspective.

> > > [snip: the WD 80 GB drive / tagged-command queuing discussion,
> > > quoted in full earlier in the thread]
> >
> > It's my impression that a large on-disk cache (e.g., 8MB) can recover
> > most of the write performance lost from lack of command queuing.
>
> It can't help at all for random Read loads [snip] ... some people are
> comfortable with assuming that a server UPS suffices to guarantee that
> any write submitted to such a drive will eventually complete, but make
> sure your configuration will always flush the dirty data to disk before
> a drive reset is issued on, e.g., a reboot following a server OS crash.

Good advice. I'll check. (We aren't doing sensitive transaction
processing.)

> > But that raises the question of the cost and availability of suitable
> > NVRAMs that plug into a PCI bus, or possibly even USB2, to stabilize
> > the write data.
>
> If you use the disks' own write-back caches it doesn't (with the
> caveats noted above).

Good point.

Tom Payne
#19
wrote:
> A bit of googling led to:
>
> http://www.specbench.org/sfs97r1/res...623-00158.html
>
> which reports 1.5 ms average response time at 25K NFS ops per second
> for a NetApp GF960 backed by a 9980V with 32GB of cache. Since that's
> way faster than disk access time, this outstanding performance is
> visibly due to the large cache.

Oops! A bit more googling, this time at

http://www.spec.org/osg/sfs97r1/results/sfs97r1.html

showed that my interpretation of that data is completely wrong.
Specifically, another test of the GF960 without the 32GB cache (i.e.,
using only the 6GB of memory in the NetApp) yielded only slightly better
overall performance and poorer per-spindle performance. (NetApp seems to
be able to get at least 250-to-300 NFSops/sec per spindle almost
regardless of memory size.) To compare data from systems of vastly
different sizes, it appears that the key is to focus on NFSops/sec per
spindle vs. memory or cache size as a percentage of working-set size
(which according to SPEC documents is 1MB per NFSop/sec).

Tom Payne
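The normalization Tom proposes is easy to sketch. The two result rows
below are hypothetical placeholders, not published SPEC numbers; only
the 1 MB-per-NFSop/sec working-set rule comes from the text:

```python
# Normalize SPEC SFS97-style results for cross-system comparison:
# ops/sec per spindle, and cache as a fraction of the working set
# (SPEC sizes the working set at roughly 1 MB per requested NFSop/sec).

def normalize(ops_per_sec, spindles, cache_gb):
    """Return (ops/sec per spindle, cache as fraction of working set)."""
    ops_per_spindle = ops_per_sec / spindles
    working_set_gb = ops_per_sec * 1.0 / 1024   # 1 MB per op/sec
    return ops_per_spindle, cache_gb / working_set_gb

# (ops/sec, spindles, cache GB) - hypothetical configurations
for ops, spindles, cache in [(25000, 96, 32.0), (20000, 72, 6.0)]:
    per_spindle, frac = normalize(ops, spindles, cache)
    print(f"{ops} ops/s on {spindles} spindles: "
          f"{per_spindle:.0f} ops/s per spindle, "
          f"cache = {frac:.0%} of working set")
```

Plotting per-spindle throughput against that cache fraction is what lets
a 6 GB box and a 32 GB box be compared on the same axes.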