#1
PanFS?
Anyone know much about the solutions from Panasas? The PanFS looks
similar to clustered filesystems like CxFS, GFS, StorNext, Lustre, etc.,
except it is object-based. Their solution seems to include the backend
storage, control blades, and some NFS/CIFS heads for interop. What is the
benefit of using them versus getting Lustre and a regular SAN? Any ideas?

Regards,
Ernest
#3
"Faeandar" wrote in message
... On 1 Dec 2004 10:20:06 -0800, (Ernest Siu) wrote: Any know know much about the solutions from Panasas? The PanFS looks similar to clustered filesystem like CxFS, GFS, StorNext, Lustre...etc. except it is object-based. Their solution seems to include the backend storage, control blades, and some NFS/CIFS heads for interop. What is the benefit of using them versus getting Lustre and a regular SAN? Any idea? Regards, Ernest Hey Guys. I run a large cluster with a lot of storage so that is my perspective. Well first thing is don't think about Lustre, it's a beast still and not ready for primetime. Lustre is a lot of work-there is no doubt about it. There are no real tools (GUIs etc) to help. But they make their money in customization I figure. What is really a good thing about Lustre is that it is very scaleable. Most of the GFSes I have seen like SANs and most of our boxes are NAS boxes. Lustre will let us manage all our boxes (OSTs). The others you mentioned are viable but the leaders today seem to be Polyserve, Ibrix, GPFS, and GFS. Personally I like Polyserve personally but Ibrix and GPFS both look good as well. We have a small CX500 w/ 18TB and IBM 2.5GHZ dual Opteron/QLogic heads for testing. We will be testing IBrix (GPFS and Lustre. I am going to install RedHat's next week if all goes well with my other work. IBrix has some really good features to help us manage our disks-including a nice GUI. There are some sophisticated technical features for fault-tolerance and recovery you may want to look at. PanFS is very similar to Polyserve except they do it turnkey and in a box. The box is proprietary (not necessarily bad) but from what I understand also fairly expensive. Otherwise I've only heard good things about them. We have a fair bit of experience with the Panases product. It does not work for us and we are looking elsewhere-but they are making serious in-roads. One of my issues is getting as much of my disk servers off NFS as I can. With it is that I have 150TB (we just ordered another 30+) I won't be able to get of NFS with. Performance was really good. I would go for something a little more open. Any of the software solutions plus my pick of hardware. I like the options. So if I want to use HDS disk with qlogic controllers and Dell blades, I can. Not so with PanFS as I understand it. Pretty much, but then if it works who cares? But if you want something all-in-one then they are likely the best in town right now. I would recommend you test, test, test before plucking down $$$. Also you may want to look at Terragrid from Terrascale. Performance is also really good. Anyway, that's my two cents. -- Wolf ---------------------------------------------------------------- Please post all responses to UseNet. All email cheerfully and automagically routed to Dave Null |
#6
On Thu, 02 Dec 2004 12:38:17 GMT, "Wolf" wrote:

> What is really good about Lustre is that it is very scalable. Most of
> the GFSes I have seen like SANs, and most of our boxes are NAS boxes.
> Lustre will let us manage all our boxes (OSTs).

It's scalable on the nodes but not on performance. You can have 1000
nodes or more, but performance will begin to suffer seriously after about
30 or 40. Read performance is scalable, but not write. And my guess is
most people looking at these types of solutions need write performance as
well as read.

> We have a small CX500 with 18 TB and IBM 2.5 GHz dual-Opteron/QLogic
> heads for testing. We will be testing Ibrix, GPFS, and Lustre, and I am
> going to install RedHat's next week if all goes well with my other work.

Polyserve beat out GFS by about 30% for us. Plus it was immensely simpler
to install and configure. I have yet to get a real low-down on GPFS, but
it sounds about the same as the others; not sure what they may have that
the others don't. Ibrix is extremely attractive if you already have a lot
of DAS, since each box can be its own segment server and be part of the
namespace as well. Nice re-use of existing hardware and storage; no one
else offers this in software (Acopia can take advantage of it, but that's
a hardware solution).

> Pretty much, but then if it works, who cares?

I'm thinking in terms of expansion. Depending on the company, you may or
may not get a serious discount on enterprise-grade hardware. If you do,
then you'd want to take advantage of that, since Panasas is pretty
expensive.

> Also you may want to look at Terragrid from Terrascale. Performance is
> also really good.

I'll have to look at Terragrid. Heard of it but that's it.

~F
#7
In article ..., Faeandar writes:

| There is something I don't understand about these distributed
| filesystems: how is the namespace mapped to the devices in a fully
| distributed environment? If there is complete independence between the
| name layer and the storage layer then there is an n-squared locking
| problem; if devices are tied to the namespace then there is a
| load-leveling/utilization problem.
|
| Yes, and yes (though maybe not squared). In most cases the products use
| a VIP that clients mount. In cases like Polyserve and GFS the client
| requests are round-robined among the node servers. Not exactly load
| balancing, but over time and with enough node servers the balance finds
| itself.
|
| Products like Acopia and Panasas have internal algorithms that they use
| for latency testing, and then write to the node server that returns the
| fastest. Good to some extent, but in the case of Acopia, not being a
| clustered file system, it means the potential for one system to house a
| lot more data than the others. Maybe a problem, maybe not.
|
| The only time you have a potential locking issue is with a clustered
| file system and a dedicated metadata server. This can become a serious
| bottleneck. Some products use distributed metadata which, while not
| alleviating the problem completely, greatly reduces it.
|
| ~F

If all nodes are equal (the distributed-metadata case) then there needs
to be a mapping from filesystem semantics to the underlying consistency
model implemented by the storage system. E.g., if the filesystem
guarantees strict write ordering (reads always return the latest write)
then the storage system must implement equally strong cache coherency
between all nodes, which doesn't scale. So there's a tradeoff between
performance and utilization. I was just wondering how modern DFSs deal
with this fundamental limitation. SMPs have hardware support for cache
management; do we need similar functionality in the SAN?
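To make that coherency cost concrete, here is a toy model in Python. It
is a sketch assuming round-robin VIP dispatch (as described above) and a
simple write-invalidate scheme, not any shipping product's protocol: each
write lands on one caching node server and must invalidate the block in
every other server's cache to preserve strict write ordering.

from itertools import cycle

def coherency_messages(num_servers: int, writes: int) -> int:
    """Count invalidation messages needed to keep strict write ordering
    when writes are spread round-robin over caching node servers."""
    servers = cycle(range(num_servers))  # round-robin dispatch behind the VIP
    messages = 0
    for _ in range(writes):
        next(servers)                    # server that accepts this write
        messages += num_servers - 1      # must invalidate every other cache
    return messages

# Per-write cost grows linearly with cluster size; if write volume also
# grows with the cluster, total coherency traffic grows roughly
# quadratically.
for n in (4, 8, 16, 32):
    print(n, coherency_messages(n, writes=100 * n))

Running this prints 1200, 5600, 24000, and 99200 messages: quadrupling
the cluster multiplies coherency traffic by more than ten, which is the
kind of n-squared growth the quoted question worries about when write
sharing is all-to-all.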
#8
"Keith Michaels" wrote in message ... In article , Faeandar writes: ... | There is something I don't understand about these distributed | filesystems: how is the namespace mapped to the devices in a fully | distributed environment? If there is complete independence between | the name layer and the storage layer then there is an n-squared | locking problem, if devices are tied to the namespace then there | is a load-leveling/utilization problem. | | Yes, and yes (though maybe not squared). In most cases the products | use a VIP that clients mount. In cases like Polyserve and GFS the | client requests are round-robin'd among the node servers. Not exactly | load balancing but over time and enough node servers the balance finds | itself. | | Products like Acopia and Panasas have internal algorithms that they | user for latency testing and then write to the node server that | returns the fastest. Good to some extent but in the case of Acopia, | not being a clustered file system, it means a potential for one system | to house alot more data than the others. Maybe a problem, maybe not. | | The only time you have a potential locking issue is with a clustered | file system and a dedicated metadata server. This can become a | serious bottleneck. Some products use distributed metadata which, | while not alleviating the problem completely, greatly reduces it. | | ~F If all nodes are equal (distributed metadata case) then there needs to be a mapping from filesystem semantics to the underlying consistency model implemented by the storage system e.g., if filesystems guarantee strict write ordering (reads always return the latest write) then the storage system must implement equally strong cache coherency between all nodes, which doesn't scale. So there's a tradeoff between performance and utilization. I was just wondering how modern DFSs deal with this fundamental limitation. SMPs have hardware support for cache management; do we need similar functionality in the SAN? 'Distributed metadata' means different things in different contexts. On VMS clusters, it refers to distributed management of metadata stored on shared devices by cooperating hosts. But even there only a single host manages a given metadatum at any given time (using the distributed locking facility). This approach scales to hundreds of host nodes (the nominal supported limit on cluster size is 96, but sizes close to 200 have been used in practice), though contention varies with the degree of access locality to a given datum (the worst case obviously being if all hosts access it equally, with intense update activity). In implementations like Lustre, it refers to low-level local metadata stored on all storage units, plus file-level metadata stored in a coordinating central metadata server (or metadata server cluster). The low-level local metadata is completely partitioned (i.e., of only local significance) and hence has no scaling problem - but by offloading that portion of metadata management from the central metadata server helps it scale too. Another alternative is to partition higher-level metadata as well (e.g., localizing the management of metadata for any given file - or even directory - to a single server (plus perhaps a mirror partner), even if the file data may be spread more widely). 
This certainly qualifies as distributed metadata, even though management of any single metadatum is not distributed, and scales well at the file system level as long as the partitioning granularity manages to spread out metadata loads reasonably (e.g., you'd have a problem if all system activity concentrated on one humongous database file - though with some ingenuity subdividing at least some of the metadata management even within a single file is possible). Until SAN bandwidths and latencies approach those of local RAM much more closely than they do today, there will likely be little reason to use hardware management of distributed file caches - especially since the most effective location for such caching is very often (save in cases of intense contention) at the client itself, where special hardware is least likely to be found. - bill |
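A minimal Python sketch of that partitioning alternative (the server
names and hash scheme are hypothetical, not any product's layout): each
file's parent directory maps to exactly one metadata server, so no
cross-server locking is needed for a single metadatum, even though the
file data may be striped anywhere.

import hashlib
import posixpath

METADATA_SERVERS = ["mds0", "mds1", "mds2", "mds3"]  # hypothetical servers

def metadata_owner(path: str) -> str:
    """Map a file's parent directory to the single server that manages
    its metadata, localizing directory operations to one node."""
    parent = posixpath.dirname(path)
    digest = hashlib.sha1(parent.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(METADATA_SERVERS)
    return METADATA_SERVERS[index]

print(metadata_owner("/home/alice/results.dat"))  # one owner per directory
print(metadata_owner("/home/alice/notes.txt"))    # same directory, same owner
print(metadata_owner("/scratch/job42/out.bin"))   # usually a different owner

This scales only as well as the hash spreads the load, which is exactly
the caveat above: one humongous database file (or one hot directory)
concentrates all of its metadata traffic on a single owner.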
#9
On Fri, 3 Dec 2004 14:29:21 -0500, "Bill Todd" wrote:

> In implementations like Lustre, it refers to low-level local metadata
> stored on all storage units, plus file-level metadata stored in a
> coordinating central metadata server (or metadata server cluster). The
> low-level local metadata is completely partitioned (i.e., of only local
> significance) and hence has no scaling problem - and offloading that
> portion of metadata management from the central metadata server helps
> it scale too.

Help me understand this. If file-level metadata is coordinated by a
single server, how is that not a large bottleneck? For reads I can see
why it's not, but for writes (in the 10,000-node clusters they talk
about) this would be like trying to stuff the Atlantic through a hose. At
least in my understanding.

> Until SAN bandwidths and latencies approach those of local RAM much
> more closely than they do today, there will likely be little reason to
> use hardware management of distributed file caches - especially since
> the most effective location for such caching is very often (save in
> cases of intense contention) at the client itself, where special
> hardware is least likely to be found.

My guess is we will see something like NVRAM cards inside the hosts, all
connected via IB or some other high-bandwidth, low-latency interconnect.
For now, though, gigabit seems to do the trick, primarily for the reason
you mentioned. Although I gotta say, it is extremely appealing to use
NVRAM in those hosts anyway just for the added write-performance boost.
If only they could be clustered by themselves and not require host-based
clustering.... Ah well, someday.

~F
#10
In article ..., "Bill Todd" writes:

| 'Distributed metadata' means different things in different contexts.
| ...
| Until SAN bandwidths and latencies approach those of local RAM much
| more closely than they do today, there will likely be little reason to
| use hardware management of distributed file caches...

Thanks Bill, that's a very interesting assessment. It seems to me that
the promise of grid-based/object storage is scalability; if we lose that,
we're back where we started. Lustre cleverly exploits the difference
between local and global attributes, but in the end all of these DFSs
lose scalability when trying to emulate strictly ordered filesystem
semantics and centralized resource management without making assumptions
about sharing and access patterns. Maybe there's a perfect technology
coming that will do it all; in the meantime, what is the business case
for DFS deployment? Maybe a better question is: how do we move beyond the
filesystem model so that these technologies fit our applications better?