#1
Very fast file system access
I'm looking for information on fast file system access and how to approach it and achieve it.

Many supercomputer centers use GPFS, as do some commercial sites, though you hear of it mostly in the edu space. Then you have the standard players: QFS, CXFS, PolyServe, Ibrix, GPFS, GFS, SANFS, etc.

My goal is to have N Unix hosts accessing the same data set, where N is probably 30. Access has to be fast, transfer has to be fast, and data integrity is an absolute.

Thanks.

~F
#2
Very fast file system access
On 2007-05-21, Faeandar wrote:

> I'm looking for information on fast file system access and how to
> approach it and achieve it.

I'm looking for any questions in your posting :-)

> Many super computer centers use GPFS as do some commercial, though you
> hear of it mostly in the edu space.

AFAIK HPC centres mostly use GPFS in a setup where you have a few I/O servers which access the SAN directly, and many compute nodes that stripe their I/O over these I/O servers, probably utilizing some kind of fast-and-wide network (InfiniBand or similar, but GbE works for smaller clusters too).

For commercial use, my experience is that it's mostly used as a SAN fs, where all nodes are FC-attached to the same storage. All I/O goes directly to the SAN, and only cluster/locking information is sent over TCP/IP between the nodes.

-jf
#3
Very fast file system access
On Tue, 22 May 2007 10:14:12 +0200, Jan-Frode Myklebust wrote:

> AFAIK HPC-centres mostly use GPFS in a setup where you have few
> I/O-servers which access the SAN directly, and many compute nodes that
> stripe their I/O over these I/O-servers [...] For commercial, my
> experience is that it's mostly used as a SAN-fs, where all nodes are
> FC-attached to the same storage.

Well, my goal would have been more of a problem statement for which I hoped others would have a solution statement. GPFS was only one option I mentioned, and primarily because it's so prominent in the HPC space.

So, for the fastest shared access for 30 or fewer nodes, what are people's experiences or recommendations, and why?

Thanks.

~F
#4
Very fast file system access
Faeandar wrote:

> I'm looking for information on fast file system access and how to
> approach it and achieve it.

OK.

> Many super computer centers use GPFS as do some commercial, though you
> hear of it mostly in the edu space.

Its acronym has the word 'parallel' in it for good reason, but it's difficult to tell whether its specific strengths would be important for the workload that you have barely sketched out.

> Then you have the standard players: QFS, CXFS, Polyserve, Ibrix, GPFS,
> GFS, SANFs, etc...

Kind of an eclectic lot there - e.g., a mix of direct-to-disk/central-metadata-server designs with some more-fully-distributed ones. One thing they have in common, though, is the ability to serve data from more than a single server node: do you think that will be critical (i.e., that your bandwidth requirements will exceed that available from any single server)?

> My goal is to have N number of Unix hosts accessing the same data set
> where N is probably 30.

I suspect that the value of N may be less important than the value of N multiplied by the average host load (both in terms of requests per second and aggregate bandwidth).

> Access has to be fast,

How so? A network hop or the passage of data through server RAM is a couple of orders of magnitude faster than an actual disk access, so any competent implementation should make such considerations irrelevant (e.g., a direct-to-data model should offer little advantage per se). About the only thing you can do to reduce the cost of disk reads is to cache aggressively. Direct-to-disk implementations may not do this well, at least for the data itself (they'd have to support a cooperative - not just invalidating - cache distributed among the clients, and my impression is that most do not; besides, any direct-to-disk design requires that all the hosts trust each other completely, which in many environments may be a deal-breaker).

Well, there's one more way to reduce the impact of metadata disk reads: use an internal file system structure that doesn't squander an extra disk access at every directory level (i.e., one that embeds inode-style information in the parent directory; extent-based allocation is important too in terms of minimizing mapping indirection for large files). ReiserFS may qualify here (perhaps ZFS does as well). Synchronous writes are more amenable to expediting, via stable write-back cache or logging of one kind or another.

> transfer has to be fast,

I'm guessing you're talking about high bandwidth here, since transfer latency per se should be fairly well down in the noise compared with disk access latency.

> data integrity is an absolute.

If you really mean that, it may reduce your server file system options to two: ZFS (unless you feel that its immaturity compromises its data-integrity guarantees) and WAFL (which AFAICT offers similar end-to-end guarantees in a far more mature implementation). NetApp boxes also scale up to fairly high levels and handle write-intense loads well (but you knew that...): is cost what's holding you back?

- bill
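The point about squandering an extra disk access at every directory level lends itself to a back-of-envelope model. The sketch below assumes a cold cache and the classic layout (one read for the directory block, one for the separate inode block, per path component); the exact counts vary by file system and are illustrative only.

```python
def lookup_accesses(depth, inode_embedded):
    """Rough count of uncached disk reads to resolve a path with `depth`
    components. A traditional layout reads a directory block and then a
    separate inode block for each component; embedding inode-style info
    in the parent directory halves that."""
    per_component = 1 if inode_embedded else 2
    return depth * per_component

# Resolving /a/b/c/d cold: 8 reads traditionally, 4 with embedded inode info.
print(lookup_accesses(4, False), lookup_accesses(4, True))  # 8 4
```

With deep trees and many clients, that factor of two in metadata reads is exactly the kind of overhead aggressive caching is meant to hide.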
#5
Very fast file system access
> If you really mean that, it may reduce your server file system options
> to two: ZFS (unless you feel that its immaturity compromises its
> data-integrity guarantees) and WAFL (which AFAICT offers similar
> end-to-end guarantees in a far more mature implementation). NetApp
> boxes also scale up to fairly high levels and handle write-intense
> loads well (but you knew that...): is cost what's holding you back?

WAFL is Network Appliance's proprietary filesystem, and largely irrelevant in this discussion. If the discussion is about LUNs shared across FC to multiple hosts, WAFL doesn't enter into it. NetApp can share LUNs among FC- or iSCSI-attached hosts, but that doesn't address client-side lock management and concurrent access...
#6
Very fast file system access
On Tue, 22 May 2007 20:35:51 -0400, Bill Todd wrote:

> Its acronym has the word 'parallel' in it for good reason, but it's
> difficult to tell whether its specific strengths would be important
> for the workload that you have barely sketched out.

I realize I'm underwhelming you with details, but regrettably I'm not in a position to disclose too much, so I'm erring on the side of disclosing nothing. But if I want some input, I guess I need to draw a better picture.

> Kind of an eclectic lot there [...] One thing they have in common
> though is the ability to serve data from more than a single server
> node: do you think that will be critical (i.e., that your bandwidth
> requirements will exceed that available from any single server)?

All of these allow multi-writer access to data in some fashion (though Ibrix is arguably an oddball). How they do it is of less concern than how fast they do it and how stable it is. The I/O patterns are somewhat random but can be categorized as primarily large files, 1 GB to 30 GB, though there is the occasional random small-file access. In both cases, small and large files, access varies between streaming and offset locking.

It's not the bandwidth of a single server that concerns me; I'm not intending to use this as a file-serving backend. It is primarily to share the same data set among multiple nodes and avoid NFS latency.

> I suspect that the value of N may be less important than the value of
> N multiplied by the average host load (both in terms of
> requests-per-second and aggregate bandwidth).

I am concerned about the value of N primarily because I think metadata and lock coherence will be a bottleneck for any more than that. PolyServe has a limit of 24 nodes, but it slows down long before that. QFS allows its metadata to be placed anywhere you want, so it could all be stored on uber-fast drives or even a RAM disk for extreme performance (I suspect it would be extreme, anyway). I don't know about CXFS but suspect it's similar to QFS. GFS, SANFS, and GPFS are fairly unknown to me other than the glossies.

The nodes will be doing their own thing and not acting in concert for anything other than cache/lock/metadata coherency.

> How so? A network hop or the passage of data through server RAM is a
> couple of orders of magnitude faster than an actual disk access, so
> any competent implementation should make such considerations
> irrelevant [...]

Fast access meaning I'm not noticeably limited in access speeds by metadata or cache or lock issues. With multi-node access to the data via FC, I expect there to be no noticeable difference from single-node access, all else being equal. Noticeable being 100us or more. Trust is not an issue. Cache will only get you so far, as it simply moves the bottleneck about 30 seconds into the future.

> I'm guessing you're talking about high bandwidth here, since transfer
> latency per se should be fairly well down in the noise compared with
> disk access latency.

Yes and no. With onboard drive cache these days you could see a difference in "disk access" if the stripe was sufficiently wide. This is primarily for writes, of course; I don't know how well it handles read-ahead at that level. And the difference between 4 Gb and 2 Gb latency is very noticeable even to disk, assuming you compare 4 Gb drives to 2 Gb drives as well. High bandwidth is fairly easy to get with 4 Gb, as is lower latency.

Again, my concern is with the architecture of multi-writer access solutions: where do they work and where do they not? I have some experience with a few of the products I listed, and I have found that reality and marketing can be ...... out of sync.

> If you really mean that, it may reduce your server file system options
> to two: ZFS (unless you feel that its immaturity compromises its
> data-integrity guarantees) and WAFL [...]

I did not realize ZFS was multi-writer capable. And it would not be a consideration atm because it is so new. Data integrity comes from time, not a vendor. WAFL is not multi-writer capable other than through NFS, which I'm trying to architect to avoid.

~F
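The "offset locking" access pattern mentioned above is what POSIX byte-range locks provide; on a cluster file system the same call is arbitrated across nodes by the lock manager. A minimal single-host sketch using Python's `fcntl.lockf` (Unix only; the file name and sizes are illustrative assumptions):

```python
import fcntl
import os
import tempfile

# Create a sparse 64 KiB "shared" data file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "wb") as f:
    f.truncate(1 << 16)

with open(path, "r+b") as f:
    # Exclusively lock bytes 4096..8191 only; other ranges of the file
    # remain available to other writers.
    fcntl.lockf(f, fcntl.LOCK_EX, 4096, 4096)
    f.seek(4096)
    f.write(b"x" * 4096)
    f.flush()
    fcntl.lockf(f, fcntl.LOCK_UN, 4096, 4096)
```

Because only the touched range is locked, many writers can stream into disjoint regions of the same large file concurrently, which is why this pattern coexists well with the 1-30 GB streaming workload described.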
#7
Very fast file system access
Faeandar wrote:

> It's not the bandwidth of a single server that concerns me, I'm not
> intending to use this as a file serving backend. It is primarily to
> share the same data set among multiple nodes and avoid NFS latency.

Are you absolutely certain that some - perhaps many - of the options you've said you're considering don't have *more* 'latency' (in whatever specific sense you're worried about it) than a good NFS implementation does?

> I am concerned about the value of N primarily because I think metadata
> and lock coherence will be a bottleneck for any more than that.

It might, with a distributed implementation. That's one reason that a good, centralized NFS implementation might be attractive.

> Polyserve has a limit of 24 nodes, but it slows down long before that.

Then they screwed up their design.

> Fast access meaning I'm not noticeably limited in access speeds by
> metadata or cache or lock issues.

You're always going to be limited by metadata unless all the metadata you need is cached (with stable cache to hold any updates: if you really need bullet-proof data integrity, you can't defer metadata persistence) - another good argument for a centralized implementation, at least for the metadata (because building a bullet-proof distributed-update metadata facility is *hard*: VMS did it in a somewhat limited and special-case fashion when it developed clusters, and IBM's C. Mohan at least described how to build a more general mechanism back in the '90s, but I'd be a bit cautious about trusting any recently-developed products in this area until they've had a decade or so to wring the bugs out). You're always going to be limited by cache unless it's sufficient to hold everything you need. You're always going to be limited by locks unless all locks are held only instantaneously (at least with a competent lock-management implementation that doesn't choke as it scales up).

> With multi-node access to the data via FC I expect there to be no
> noticeable difference than if it were single-node access, all else
> being equal. Noticeable being 100us or more.

100 us is *not* noticeable where disk access is concerned: it's *way* down in the noise compared with random variations in seek and rotational latency. And that's for small, random accesses: for larger streaming accesses it's hardly measurable, let alone noticeable.

> Trust is not an issue. Cache will only get you so far as it simply
> moves the bottleneck about 30 seconds into the future.

I doubt that: cache should be critical in avoiding unnecessary metadata read latency (if large files constitute the bulk of the data in your system, metadata should be small enough to be mostly cacheable). Where you can afford to do lazy writes, cache allows disk reordering optimizations as well (and stable cache helps keep metadata updates from sopping up disk bandwidth).

> Yes and no. With onboard drive cache these days you could see a
> difference in "disk access" if the stripe was sufficiently wide.

I have no idea what you mean by that. Onboard drive caches are nowhere nearly large enough to cache a useful amount of data (though they do help buffer it intelligently - especially if write-back reordering can be enabled): caching is what system caches are for.

> This is primarily for writes of course, I don't know how well it
> handles read-ahead at that level.

Drive-level read-ahead should be irrelevant as well: the system should be detecting and handling this.

> And the difference between 4gb and 2gb latency is very noticeable even
> to disk, assuming you compare 4gb drives to 2gb drives as well.

I simply don't believe that - unless something is seriously misconfigured.

> High bandwidth is fairly easy to get with 4gb, as is lower latency.

High bandwidth and low latency are fairly easy to get with Gigabit Ethernet, for heaven's sake. And with multiple GigE pipes at low cost, if your drivers and server aren't brain-dead. If you were talking tens of GB/sec that might be different, but you said crushing bandwidth per se was not an issue.

You sound a bit like some CxO-level weenie with a checklist of buzzwords - but since I'm reasonably sure that's not what you are, I'm wondering whether this 'architect' thing has you somehow feeling inadequate and thus trying to cover more bases than you need to. But it's always possible that this project you can't describe has truly unusual characteristics, I guess.

- bill
#8
Very fast file system access
On 2007-05-24, Faeandar wrote:

> I don't know about CXFS but suspect it's similar to QFS. GFS, SANFs,
> and GPFS are fairly unknown to me other than the glossies.

You can probably rule out SANFS, as it seems to have been replaced by GPFS: ftp://ftp.software.ibm.com/common/ss...W03003USEN.PDF

> Fast access meaning I'm not noticeably limited in access speeds by
> metadata or cache or lock issues.

I believe GPFS has some per-directory locking when creating new files, which has caused problems for me when flooding the same Maildir with new messages (tens of thousands of new messages), but I wouldn't expect this to be a problem for your large files.

> With multi-node access to the data via FC I expect there to be no
> noticeable difference than if it were single-node access, all else
> being equal.

Agree.

> Trust is not an issue. Cache will only get you so far as it simply
> moves the bottleneck about 30 seconds into the future.

For streaming large files, yes. It's quite common to only give each node 64-128 MB of local page cache with GPFS, which works quite well for streaming. But my boxes do more random I/O, and then I normally give the cache ~50% of the available memory.

> Again my concern is with the architecture of multi-writer access
> solutions.

AFAIK this is where GPFS is king. Google for "project fastball": 100 GB/s of sustained read and write performance to a single file...

-jf
#9
Very fast file system access
On Fri, 25 May 2007 00:47:36 +0200, Jan-Frode Myklebust wrote:

> You can probably rule out SANFS, as it seems to be replaced by GPFS:
> ftp://ftp.software.ibm.com/common/ss...W03003USEN.PDF

Thank you, that PDF is a worthwhile read. My path now is probably to compare GPFS and QFS. I pretty much ruled out GFS and PolyServe in the beginning, and the rest just seem more "catch-phrase" to me than solid, high-performing file systems.

~F