#11
In article ,
Faeandar wrote:

>> In implementations like Lustre, it refers to low-level local metadata
>> stored on all storage units, plus file-level metadata stored in a
>> coordinating central metadata server (or metadata server cluster). The
>> low-level local metadata is completely partitioned (i.e., of only local
>> significance) and hence has no scaling problem - and offloading that
>> portion of metadata management from the central metadata server helps it
>> scale too.
>
> Help me understand this. If file-level metadata is coordinated by a
> single server, how is that not a large bottleneck? For reads I can see
> why it's not, but for writes (in the 10,000-node cluster they talk about)
> this would be like trying to stuff the Atlantic through a hose. At least
> in my understanding.

Depends. On the definition of "metadata", on the exact mode of caching, on
how the consistency locking is implemented, on the workload, and most
importantly, on whether the actual workload matches the expectations of the
implementors of the shared file system. There are existence proofs of
clustered / distributed file systems that centralize metadata handling, yet
work extremely well, even when writing.

This topic is very well studied. There are at least dozens of papers on it
in the open research literature. The number of published patents about it
must be in the hundreds. Many of the experts in the field could hold talks
lasting several hours apiece on the subject. I suggest that you review the
literature. I don't offhand know a good overview or introductory textbook;
just do a Google search for things like GFS (Minnesota), Lustre, Panasas or
PanFS (also look for CMU NASD, its predecessor), IBM StorageTank or SAN-FS,
GFS (the Google file system), IBM GPFS, Digital Petal/Frangipani, Berkeley
xFS, SGI CXFS, parallel NFSv4, and many academic predecessors (Sprite,
Slice, and so on).

> My guess is we will see something like NVRAM cards inside the hosts, all
> connected via IB or some other high-bandwidth, low-latency interconnect.
> For now though, gigabit seems to do the trick, primarily for the reason
> you mentioned. Although I gotta say, it is extremely appealing to use
> NVRAM in those hosts anyway just for the added write-performance boost.
> If only they could be clustered by themselves and not require host-based
> clustering.... Ah well, someday.

If you had NVRAM, and IB (or even better, something like SCI, which has
cache coherency built in), and the NVRAM were fault-tolerant (meaning: it
remains accessible even if the host has failed), then implementing a
clustered file system would become trivial. Or to put it differently: in an
environment where the hardware is plentiful, even really bad architectural
choices will perform adequately.

But today we don't have hardware NVRAM in every host, much less
fault-tolerant NVRAM, and we have high-latency networking. Or to be exact:
cost considerations prevent us from deploying NVRAM/IB/... universally
(they certainly exist, but they are expensive and not available in
commodity computers). This means that implementing a cluster or distributed
file system becomes an art form, and architectural choices must be well
thought out and matched to the intended usage pattern and to the customers'
expectations.

--
Ralph Becker-Szendy
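To make the partitioning concrete: a minimal sketch in Python (the names
are hypothetical, not Lustre's actual interfaces) of the division of labor
described above. Block-level bookkeeping stays on each storage target,
while the central server keeps only one small file-level record per file.

```python
class StorageTarget:
    """One storage unit; its allocation metadata is purely local."""

    def __init__(self, ost_id):
        self.ost_id = ost_id
        self.next_block = 0   # local allocator: needs no coordination
        self.objects = {}     # object_id -> list of local block numbers

    def allocate(self, object_id, nblocks):
        # Block-level bookkeeping never leaves this node, so adding more
        # targets adds allocation capacity with zero central overhead.
        blocks = list(range(self.next_block, self.next_block + nblocks))
        self.next_block += nblocks
        self.objects.setdefault(object_id, []).extend(blocks)
        return blocks


class MetadataServer:
    """Central server: knows only which objects make up which file."""

    def __init__(self):
        self.namespace = {}   # path -> [(ost_id, object_id), ...]

    def create(self, path, layout):
        self.namespace[path] = layout   # one small record per file

    def lookup(self, path):
        return self.namespace[path]


# Usage: the central server is touched once per file, not once per block.
osts = [StorageTarget(i) for i in range(4)]
mds = MetadataServer()
mds.create("/data/run42", [(ost.ost_id, 0) for ost in osts])
for ost in osts:
    ost.allocate(object_id=0, nblocks=16)   # purely local bookkeeping
```

This is the scaling argument in miniature: the per-block work grows with
the number of targets, but none of it lands on the central server.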
#14
"Faeandar" wrote in message ... On Fri, 3 Dec 2004 14:29:21 -0500, "Bill Todd" wrote: .... In implementations like Lustre, it refers to low-level local metadata stored on all storage units, plus file-level metadata stored in a coordinating central metadata server (or metadata server cluster). The low-level local metadata is completely partitioned (i.e., of only local significance) and hence has no scaling problem - but by offloading that portion of metadata management from the central metadata server helps it scale too. Help me understand this. If file level metadata is coordinated by a single server how is that not a large bottleneck? By making the server suitably hefty. Or, as noted above, by using a metadata server cluster to divvy up the load (though even a relatively modest SMP system can handle impressive levels of metadata management before clustering it is necessary for anything save to increase availability via fail-over mechanisms). For reads I can see why it's not, but for writes (in the 10,00 node cluster they talk about) this would be like trying to stuff the Atlantic through a hose. At least in my understanding. The actual user data doesn't pass through the metadata server, just the metadata operations. So in a product like Lustre (or, IIUC, Storage Tank), the metadata server merely need be made aware of the write operation to update things like last-modified date and EOF mark, while the grunt-level data servers handle the actual movement to disk (and at least potentially any interlocking involved, though the interaction of byte-range locks with full-file locks on a file large enough to span many data servers becomes an interesting challenge then). Directory operations may often place a heavier load on metadata servers than even write-intensive file activity. .... Until SAN bandwidths and latencies approach those of local RAM much more closely than they do today, there will likely be little reason to use hardware management of distributed file caches - especially since the most effective location for such caching is very often (save in cases of intense contention) at the client itself, where special hardware is least likely to be found. My guess is we will see something like NVRAM cards inside the hosts all connected via IB or some other high bandwidth low latency interconnect. For now though gigabit seems to do the trick primarily for the reason you mentioned. Although I gotta say, it is extremely appealing to use NVRAM in thos hosts anyway just for the added write performance boost. If only they could be clustered by themselves and not require host-based clustering.... Ah well, someday. Use of NVRAM is eminently feasible today, and has (or at least with suitable design need have) relatively little connection to distributed caching (let alone with hardware support for same) - especially if clients aren't completely trustworthy. - bill |
#16
On Fri, 3 Dec 2004 18:25:44 -0500, "Bill Todd" wrote:

...

> The actual user data doesn't pass through the metadata server, just the
> metadata operations. So in a product like Lustre (or, IIUC, Storage
> Tank), the metadata server merely needs to be made aware of the write
> operation to update things like the last-modified date and EOF mark,
> while the grunt-level data servers handle the actual movement to disk.
> Directory operations may often place a heavier load on metadata servers
> than even write-intensive file activity.

I understand only metadata ops are passed through a metadata server and
not the data itself, but in the case of high-performance computing that
load is sizeable. And the more node servers you add to the cluster, the
more that load increases. And locking, though handled at the node server,
still has to be distributed so the other nodes know what's locked and
what's not. This also can be large for compute farms.

I see how you thought I was an idiot with the Atlantic analogy. I was just
illustrating a point, exaggerated though it was.

...

> Use of NVRAM is eminently feasible today, and has (or at least with
> suitable design need have) relatively little connection to distributed
> caching (let alone to hardware support for same) - especially if clients
> aren't completely trustworthy.

It certainly is feasible, but having battery-backed NVRAM does little good
if it isn't sync'd with a failover partner or pool. In the case of a
critical host failure lasting a long period of time, that data is
essentially lost.

~F
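A minimal sketch of the sync'd-NVRAM idea (a hypothetical design, not any
particular product): a write is acknowledged only once its log record also
exists on a failover partner, so a surviving partner can replay a dead
host's dirty data instead of it being stranded in the failed chassis.

```python
class NVLog:
    """Stand-in for one host's battery-backed write log."""

    def __init__(self):
        self.records = []   # pretend appends here survive power loss

    def append(self, record):
        self.records.append(record)


class MirroredWriteCache:
    def __init__(self, local: NVLog, partner: NVLog):
        self.local = local
        self.partner = partner   # reached over the interconnect in reality

    def write(self, block, data):
        record = (block, data)
        self.local.append(record)
        self.partner.append(record)   # mirror BEFORE acknowledging
        return "ack"                  # now safe: two copies exist

    @staticmethod
    def replay(surviving_log: NVLog, disk):
        # After a host failure, the partner flushes the mirrored records
        # to stable storage, so no acknowledged write is lost.
        for block, data in surviving_log.records:
            disk[block] = data
```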
#17
"Faeandar" wrote in message ... .... I understand only metadata ops are passed through a metadata server and not the data itself, but in the case of high performance computing that load is sizeable. My impression is that HPC applications don't typically care all that much about consistent last-modified dates. And for files too large to be handled locally at a single data-server node, it's certainly possible for the metadata server just to track which data server currently contains the file's EOF and off-load end-of-file-mark maintenance to that server (thus the metadata server is only bothered each time the file grows or shrinks enough for its EOF mark to move to another data server - or not at all if the file is confined to a single data server, if that's where individual file metadata is managed). The secret of off-loading centralized metadata servers is to minimize the need to talk with them. 'Object-oriented' storage takes the first step by isolating the metadata servers from the details of file placement below the node level (even if the metadata server tracks placement explicitly at the node level rather than uses some algorithm to determine distribution across the data servers, that's still a major reduction in metadata server workload). The next step is to isolate the metadata servers from metadata management associated with individual files, farming that off to the data server on which the file resides (or the first data server of a set over which the file is spread). The final step is to distribute directories as well, eliminating the metadata servers entirely and making the system a group of peers, but this requires some ingenuity to avoid significantly increasing the overhead of path look-ups. And the more node servers you add to the cluster the more that load increases. And locking, though handled at the node server, still has to be distributed so the other nodes know what's locked and what's not. The only node that needs to know what's locked is the node controlling the data that's locked (and perhaps a mirroring partner, for availability): everyone has to go there for that portion of the file anyway, in the kind of system I was describing (where clients have sufficient intelligence to target data servers directly in most cases after acquiring the file map from the metadata server). It's also nice for that data server to be able to delegate some locking to a specific client in the absense of contention, but that doesn't change the basic approach. .... Use of NVRAM is eminently feasible today, and has (or at least with suitable design need have) relatively little connection to distributed caching (let alone with hardware support for same) - especially if clients aren't completely trustworthy. It certainly is feasible, but having battery backed NVRAM does little good if it isn't sync'd with a failover partner or pool. In the case of a critical host failure for long periods of time that data is essentially lost. NVRAM works most efficiently when in the service of a single server. Since you have to allow for the failure of that server anyway, you just arrange partnering at the server level without any special considerations for NVRAM that may be present for performance reasons. 
In fact, if you're sufficiently confident in your UPSs and have the two partners on separate ones, you can dispense with NVRAM entirely (the UPS-backed system RAM will suffice, as long as you can be sure that at least one of the mirror partners will continue running long enough to dump its contents to disk and revert to more conservative strategies if its partner fails). In general, when trying to avoid single points of failure, using coarse-granularity partnering with near-commodity-level components is far less expensive than any attempt to avoid failure points within any single node. The only real exception to this is when you're not shooting merely for availability but absolute reliability, and must therefore use redundant everything (with hardware comparators) within each single server to guard against the possibility that it will fail without noticing the problem and proceed to corrupt its portion of the database (even then, you still need geographically-separated mirror partners to guard against site-level failures, so the cost can easily reach an order of magnitude greater than simple commodity server mirroring). - bill |
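A sketch of the lock-locality point above (hypothetical names): the
byte-range lock table lives only on the data server that owns the range,
and absent contention the server can delegate the range to the one active
client, whose subsequent lock traffic then stays entirely local.

```python
class RangeLockServer:
    """Runs on the data server that stores the range - nowhere else."""

    def __init__(self):
        self.locks = []        # [(start, end, owner), ...]
        self.delegations = {}  # client -> (start, end) handed out

    def _conflicts(self, start, end, owner):
        return any(s < end and start < e and o != owner
                   for s, e, o in self.locks)

    def lock(self, client, start, end):
        if self._conflicts(start, end, client):
            # Contention: recall any outstanding delegation before the
            # competing request can be considered again.
            for other in list(self.delegations):
                if other != client:
                    self.recall(other)
            return False   # caller waits and retries
        self.locks.append((start, end, client))
        if {o for _, _, o in self.locks} == {client}:
            # Sole user: hand the range to the client, which can then
            # grant itself sub-locks without any server round-trips.
            self.delegations[client] = (start, end)
        return True

    def recall(self, client):
        # Take back the delegation; the client must flush and re-request.
        self.delegations.pop(client, None)
```

No other node in the cluster ever hears about these locks, which is why
adding data servers adds lock-handling capacity rather than lock traffic.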
#18
In article ,
Faeandar wrote:

...

> Can it scale to 10,000 nodes? Well, not currently (read-only
> notwithstanding),

I've heard of 1024-node clusters today. Ask some of the 1st-tier vendors.
And everyone in the field is working on improving scalability. Clearly,
object storage is one of the tools in the arsenal. As is NVRAM, but that
has problems (which Bill explained well).

> but 10 years ago scaling this type of file system to anything beyond 1
> server was voodoo in the open-systems world.

What was there in the open-systems world 10 years ago? 10 years ago you
could run Linux version 0.99.14 (which IMHO had ext but not yet ext2 file
systems; it was my first Linux) if you wanted open source, or dream of
Linux V1, or you could buy BSDI. I think none of the fully free *BSDs was
available yet: I remember considering the Jolitzes' 386BSD in 1995 (or am
I one year late?). In the commercial world, there was Ultrix on Alpha,
SunOS, HP-UX V8, AIX 3, and some IRIX version (and NeXTSTEP, if you were
into masochism). If you wanted cluster computing, you could buy a big SGI,
or an IBM SP2 (were they available in 1994? I don't remember). I refuse to
count Windows as "open systems", even though MSCS = Wolfpack (the cluster
version of WinNT) was already on people's minds then.

On the other hand, 10 years ago a very-high-quality cluster file system
had already been available for a decade, in the form of VAXcluster. I've
personally used VAXclusters of up to 90 machines, in the late 80s. And if
you wanted multi-node fault-tolerant NVRAM, it was already available in
the form of "ServerNet", Tandem's backplane technology.

It's actually very sad how little we have accomplished since then. The
destruction of Digital and of Tandem contributed much to the lack of
progress.

--
Ralph Becker-Szendy
#19
wrote in message
news:1102137471.806300@smirk...

> I've heard of 1024-node clusters today. Ask some of the 1st-tier
> vendors. And everyone in the field is working on improving scalability.
> Clearly, object storage is one of the tools in the arsenal. As is NVRAM,
> but that has problems (which Bill explained well).

We have ~1800 compute nodes and 150 TB of disk space. Scalability is a
major problem. For instance, we push NFS further than it was probably
intended to go. So I see a GFS as a solution for helping not only our
performance, but our scalability also.

> In the commercial world, there was Ultrix on Alpha, SunOS, HP-UX V8, AIX
> 3, and some IRIX version (and NeXTSTEP, if you were into masochism). If
> you wanted cluster computing, you could buy a big SGI, or an IBM SP2
> (were they available in 1994? I don't remember).

Don't forget Cray and a bunch of multi-proc proprietary solutions. But
they were all very expensive.