#1
Very fast file system access
I'm looking for information on fast file system access and how to approach it and achieve it.

Many supercomputer centers use GPFS, as do some commercial sites, though you hear of it mostly in the edu space. Then you have the standard players: QFS, CXFS, PolyServe, Ibrix, GPFS, GFS, SANFS, etc.

My goal is to have N Unix hosts accessing the same data set, where N is probably 30. Access has to be fast, transfer has to be fast, and data integrity is an absolute.

Thanks.

~F
#2
Very fast file system access
On 2007-05-21, Faeandar wrote:

> I'm looking for information on fast file system access and how to
> approach it and achieve it.

I'm looking for any questions in your posting :-)

> Many super computer centers use GPFS as do some commercial, though you
> hear of it mostly in the edu space.

AFAIK HPC centres mostly use GPFS in a setup where you have a few I/O servers which access the SAN directly, and many compute nodes that stripe their I/O over these I/O servers, probably utilizing some kind of fast-and-wide network (InfiniBand or similar, but GbE works for smaller clusters too).

For commercial use, my experience is that it's mostly used as a SAN fs, where all nodes are FC-attached to the same storage. All I/O goes directly to the SAN, and only cluster/locking information is sent over TCP/IP between the nodes.

-jf
#3
Very fast file system access
On Tue, 22 May 2007 10:14:12 +0200, Jan-Frode Myklebust wrote:

> AFAIK HPC-centres mostly use GPFS in a setup where you have few
> I/O-servers which access the SAN directly, and many compute nodes that
> stripe their I/O over these I/O-servers [...] For commercial, my
> experience is that it's mostly used as a SAN-fs, where all nodes are
> FC-attached to the same storage.

Well, my goal would have been more of a problem statement for which I hoped others would have a solution statement. GPFS was only one option I mentioned, and primarily because it's so prominent in the HPC space.

So, for the fastest shared access for 30 or fewer nodes, what are people's experiences or recommendations, and why?

Thanks.

~F
#4
Very fast file system access
Faeandar wrote:

> I'm looking for information on fast file system access and how to
> approach it and achieve it.

OK.

> Many super computer centers use GPFS as do some commercial, though you
> hear of it mostly in the edu space.

Its acronym has the word 'parallel' in it for good reason, but it's difficult to tell whether its specific strengths would be important for the workload that you have barely sketched out.

> Then you have the standard players: QFS, CXFS, Polyserve, Ibrix, GPFS,
> GFS, SANFs, etc...

Kind of an eclectic lot there - e.g., a mix of direct-to-disk/central-metadata-server designs with some more-fully-distributed ones. One thing they have in common, though, is the ability to serve data from more than a single server node: do you think that will be critical (i.e., that your bandwidth requirements will exceed that available from any single server)?

> My goal is to have N number of Unix hosts accessing the same data set
> where N is probably 30.

I suspect that the value of N may be less important than the value of N multiplied by the average host load (both in terms of requests per second and aggregate bandwidth).

> Access has to be fast,

How so? A network hop or the passage of data through server RAM is a couple of orders of magnitude faster than an actual disk access, so any competent implementation should make such considerations irrelevant (e.g., a direct-to-data model should offer little advantage per se). About the only thing you can do to reduce the cost of disk reads is to cache aggressively. Direct-to-disk implementations may not do this well, at least for the data itself (they'd have to support a cooperative - not just invalidating - cache distributed among the clients, and my impression is that most do not; besides, any direct-to-disk design requires that all the hosts trust each other completely, which in many environments may be a deal-breaker).

Well, there's one more way to reduce the impact of metadata disk reads: use an internal file system structure that doesn't squander an extra disk access at every directory level (i.e., one that embeds inode-style information in the parent directory; extent-based allocation is important too in terms of minimizing mapping indirection for large files). ReiserFS may qualify here (perhaps ZFS does as well). Synchronous writes are more amenable to expediting, via stable write-back cache or logging of one kind or another.

> transfer has to be fast,

I'm guessing you're talking about high bandwidth here, since transfer latency per se should be fairly well down in the noise compared with disk access latency.

> data integrity is an absolute.

If you really mean that, it may reduce your server file system options to two: ZFS (unless you feel that its immaturity compromises its data-integrity guarantees) and WAFL (which AFAICT offers similar end-to-end guarantees in a far more mature implementation). NetApp boxes also scale up to fairly high levels and handle write-intense loads well (but you knew that...): is cost what's holding you back?

- bill
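The point about squandering an extra disk access at every directory level lends itself to a back-of-envelope model. The sketch below assumes a cold cache and the classic layout (one read for the directory block, one for the separate inode block, per path component); the exact counts vary by file system and are illustrative only.

```python
def lookup_accesses(depth, inode_embedded):
    """Rough count of uncached disk reads to resolve a path with `depth`
    components. A traditional layout reads a directory block and then a
    separate inode block for each component; embedding inode-style info
    in the parent directory halves that."""
    per_component = 1 if inode_embedded else 2
    return depth * per_component

# Resolving /a/b/c/d cold: 8 reads traditionally, 4 with embedded inode info.
print(lookup_accesses(4, False), lookup_accesses(4, True))  # 8 4
```

With deep trees and many clients, that factor of two in metadata reads is exactly the kind of overhead aggressive caching is meant to hide.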
#5
Very fast file system access
> If you really mean that, it may reduce your server file system options
> to two: ZFS (unless you feel that its immaturity compromises its
> data-integrity guarantees) and WAFL (which AFAICT offers similar
> end-to-end guarantees in a far more mature implementation). NetApp
> boxes also scale up to fairly high levels and handle write-intense
> loads well (but you knew that...): is cost what's holding you back?

WAFL is Network Appliance's proprietary filesystem, and largely irrelevant in this discussion. If the discussion is about LUNs shared across FC to multiple hosts, WAFL doesn't enter into it. NetApp can share LUNs among FC- or iSCSI-attached hosts, but that doesn't address client-side lock management and concurrent access...
#6
Very fast file system access
On Tue, 22 May 2007 20:35:51 -0400, Bill Todd wrote:

> Its acronym has the word 'parallel' in it for good reason, but it's
> difficult to tell whether its specific strengths would be important
> for the workload that you have barely sketched out.

I realize I'm underwhelming you with details, but regrettably I'm not in a position to disclose too much, so I'm erring on the side of disclosing nothing. But if I want some input, I guess I need to draw a better picture.

> Kind of an eclectic lot there [...] One thing they have in common
> though is the ability to serve data from more than a single server
> node: do you think that will be critical (i.e., that your bandwidth
> requirements will exceed that available from any single server)?

All of these allow multi-writer access to data in some fashion (though Ibrix is arguably an oddball). How they do it is of less concern than how fast they do it and how stable it is. The I/O patterns are somewhat random but can be categorized as primarily large files, 1 GB to 30 GB, though there is the occasional random small-file access. In both cases, small and large files, access varies between streaming and offset locking.

It's not the bandwidth of a single server that concerns me; I'm not intending to use this as a file-serving backend. It is primarily to share the same data set among multiple nodes and avoid NFS latency.

> I suspect that the value of N may be less important than the value of
> N multiplied by the average host load (both in terms of
> requests-per-second and aggregate bandwidth).

I am concerned about the value of N primarily because I think metadata and lock coherence will be a bottleneck for any more than that. PolyServe has a limit of 24 nodes, but it slows down long before that. QFS allows its metadata to be placed anywhere you want, so it could all be stored on uber-fast drives or even a RAM disk for extreme performance (I suspect it would be extreme, anyway). I don't know about CXFS but suspect it's similar to QFS. GFS, SANFS, and GPFS are fairly unknown to me other than the glossies.

The nodes will be doing their own thing and not acting in concert for anything other than cache/lock/metadata coherency.

> How so? A network hop or the passage of data through server RAM is a
> couple of orders of magnitude faster than an actual disk access, so
> any competent implementation should make such considerations
> irrelevant [...]

Fast access meaning I'm not noticeably limited in access speeds by metadata or cache or lock issues. With multi-node access to the data via FC, I expect there to be no noticeable difference from single-node access, all else being equal. Noticeable being 100us or more. Trust is not an issue. Cache will only get you so far, as it simply moves the bottleneck about 30 seconds into the future.

> I'm guessing you're talking about high bandwidth here, since transfer
> latency per se should be fairly well down in the noise compared with
> disk access latency.

Yes and no. With onboard drive cache these days you could see a difference in "disk access" if the stripe was sufficiently wide. This is primarily for writes, of course; I don't know how well it handles read-ahead at that level. And the difference between 4 Gb and 2 Gb latency is very noticeable even to disk, assuming you compare 4 Gb drives to 2 Gb drives as well. High bandwidth is fairly easy to get with 4 Gb, as is lower latency.

Again, my concern is with the architecture of multi-writer access solutions: where do they work and where do they not? I have some experience with a few of the products I listed, and I have found that reality and marketing can be ...... out of sync.

> If you really mean that, it may reduce your server file system options
> to two: ZFS (unless you feel that its immaturity compromises its
> data-integrity guarantees) and WAFL [...]

I did not realize ZFS was multi-writer capable. And it would not be a consideration atm because it is so new. Data integrity comes from time, not a vendor. WAFL is not multi-writer capable other than through NFS, which I'm trying to architect to avoid.

~F
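The "offset locking" access pattern mentioned above is what POSIX byte-range locks provide; on a cluster file system the same call is arbitrated across nodes by the lock manager. A minimal single-host sketch using Python's `fcntl.lockf` (Unix only; the file name and sizes are illustrative assumptions):

```python
import fcntl
import os
import tempfile

# Create a sparse 64 KiB "shared" data file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "wb") as f:
    f.truncate(1 << 16)

with open(path, "r+b") as f:
    # Exclusively lock bytes 4096..8191 only; other ranges of the file
    # remain available to other writers.
    fcntl.lockf(f, fcntl.LOCK_EX, 4096, 4096)
    f.seek(4096)
    f.write(b"x" * 4096)
    f.flush()
    fcntl.lockf(f, fcntl.LOCK_UN, 4096, 4096)
```

Because only the touched range is locked, many writers can stream into disjoint regions of the same large file concurrently, which is why this pattern coexists well with the 1-30 GB streaming workload described.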
#7
Very fast file system access
Faeandar wrote:

> It's not the bandwidth of a single server that concerns me, I'm not
> intending to use this as a file serving backend. It is primarily to
> share the same data set among multiple nodes and avoid NFS latency.

Are you absolutely certain that some - perhaps many - of the options you've said you're considering don't have *more* 'latency' (in whatever specific sense you're worried about it) than a good NFS implementation does?

> I am concerned about the value of N primarily because I think metadata
> and lock coherence will be a bottleneck for any more than that.

It might, with a distributed implementation. That's one reason that a good, centralized NFS implementation might be attractive.

> Polyserve has a limit of 24 nodes, but it slows down long before that.

Then they screwed up their design.

> Fast access meaning I'm not noticeably limited in access speeds by
> metadata or cache or lock issues.

You're always going to be limited by metadata unless all the metadata you need is cached (with stable cache to hold any updates: if you really need bullet-proof data integrity, you can't defer metadata persistence) - another good argument for a centralized implementation, at least for the metadata (because building a bullet-proof distributed-update metadata facility is *hard*: VMS did it in a somewhat limited and special-case fashion when it developed clusters, and IBM's C. Mohan at least described how to build a more general mechanism back in the '90s, but I'd be a bit cautious about trusting any recently-developed products in this area until they've had a decade or so to wring the bugs out). You're always going to be limited by cache unless it's sufficient to hold everything you need. You're always going to be limited by locks unless all locks are held only instantaneously (at least with a competent lock-management implementation that doesn't choke as it scales up).

> With multi-node access to the data via FC I expect there to be no
> noticeable difference than if it were single-node access, all else
> being equal. Noticeable being 100us or more.

100 us is *not* noticeable where disk access is concerned: it's *way* down in the noise compared with random variations in seek and rotational latency. And that's for small, random accesses: for larger streaming accesses it's hardly measurable, let alone noticeable.

> Trust is not an issue. Cache will only get you so far as it simply
> moves the bottleneck about 30 seconds into the future.

I doubt that: cache should be critical in avoiding unnecessary metadata read latency (if large files constitute the bulk of the data in your system, metadata should be small enough to be mostly cacheable). Where you can afford to do lazy writes, cache allows disk reordering optimizations as well (and stable cache helps keep metadata updates from sopping up disk bandwidth).

> Yes and no. With onboard drive cache these days you could see a
> difference in "disk access" if the stripe was sufficiently wide.

I have no idea what you mean by that. Onboard drive caches are nowhere nearly large enough to cache a useful amount of data (though they do help buffer it intelligently - especially if write-back reordering can be enabled): caching is what system caches are for.

> This is primarily for writes of course, I don't know how well it
> handles read-ahead at that level.

Drive-level read-ahead should be irrelevant as well: the system should be detecting and handling this.

> And the difference between 4gb and 2gb latency is very noticeable even
> to disk, assuming you compare 4gb drives to 2gb drives as well.

I simply don't believe that - unless something is seriously misconfigured.

> High bandwidth is fairly easy to get with 4gb, as is lower latency.

High bandwidth and low latency are fairly easy to get with Gigabit Ethernet, for heaven's sake. And with multiple GigE pipes at low cost, if your drivers and server aren't brain-dead. If you were talking tens of GB/sec that might be different, but you said crushing bandwidth per se was not an issue.

You sound a bit like some CxO-level weenie with a checklist of buzzwords - but since I'm reasonably sure that's not what you are, I'm wondering whether this 'architect' thing has you somehow feeling inadequate and thus trying to cover more bases than you need to. But it's always possible that this project you can't describe has truly unusual characteristics, I guess.

- bill
#8
Very fast file system access
On 2007-05-24, Faeandar wrote:

> I don't know about CXFS but suspect it's similar to QFS. GFS, SANFs,
> and GPFS are fairly unknown to me other than the glossies.

You can probably rule out SANFS, as it seems to have been replaced by GPFS: ftp://ftp.software.ibm.com/common/ss...W03003USEN.PDF

> Fast access meaning I'm not noticeably limited in access speeds by
> metadata or cache or lock issues.

I believe GPFS has some per-directory locking when creating new files, which has caused problems for me when flooding the same Maildir with new messages (tens of thousands of new messages), but I wouldn't expect this to be a problem for your large files.

> With multi-node access to the data via FC I expect there to be no
> noticeable difference than if it were single-node access, all else
> being equal.

Agree.

> Trust is not an issue. Cache will only get you so far as it simply
> moves the bottleneck about 30 seconds into the future.

For streaming large files, yes. It's quite common to only give each node 64-128 MB of local page cache with GPFS, which works quite well for streaming. But my boxes do more random I/O, and then I normally give the cache ~50% of the available memory.

> Again my concern is with the architecture of multi-writer access
> solutions.

AFAIK this is where GPFS is king. Google for "project fastball": 100 GB/s of sustained read and write performance to a single file...

-jf
#9
Very fast file system access
On Fri, 25 May 2007 00:47:36 +0200, Jan-Frode Myklebust wrote:

> You can probably rule out SANFS, as it seems to be replaced by GPFS:
> ftp://ftp.software.ibm.com/common/ss...W03003USEN.PDF

Thank you, that PDF is a worthwhile read. My path now is probably to compare GPFS and QFS. I pretty much ruled out GFS and PolyServe in the beginning, and the rest just seem more "catch-phrase" to me than solid, high-performing file systems.

~F