#1
PanFS?
Anyone know much about the solutions from Panasas? The PanFS looks
similar to clustered filesystems like CxFS, GFS, StorNext, Lustre, etc.,
except it is object-based. Their solution seems to include the backend
storage, control blades, and some NFS/CIFS heads for interop. What is the
benefit of using them versus getting Lustre and a regular SAN? Any ideas?

Regards,
Ernest
#3
"Faeandar" wrote in message
... On 1 Dec 2004 10:20:06 -0800, (Ernest Siu) wrote: Any know know much about the solutions from Panasas? The PanFS looks similar to clustered filesystem like CxFS, GFS, StorNext, Lustre...etc. except it is object-based. Their solution seems to include the backend storage, control blades, and some NFS/CIFS heads for interop. What is the benefit of using them versus getting Lustre and a regular SAN? Any idea? Regards, Ernest Hey Guys. I run a large cluster with a lot of storage so that is my perspective. Well first thing is don't think about Lustre, it's a beast still and not ready for primetime. Lustre is a lot of work-there is no doubt about it. There are no real tools (GUIs etc) to help. But they make their money in customization I figure. What is really a good thing about Lustre is that it is very scaleable. Most of the GFSes I have seen like SANs and most of our boxes are NAS boxes. Lustre will let us manage all our boxes (OSTs). The others you mentioned are viable but the leaders today seem to be Polyserve, Ibrix, GPFS, and GFS. Personally I like Polyserve personally but Ibrix and GPFS both look good as well. We have a small CX500 w/ 18TB and IBM 2.5GHZ dual Opteron/QLogic heads for testing. We will be testing IBrix (GPFS and Lustre. I am going to install RedHat's next week if all goes well with my other work. IBrix has some really good features to help us manage our disks-including a nice GUI. There are some sophisticated technical features for fault-tolerance and recovery you may want to look at. PanFS is very similar to Polyserve except they do it turnkey and in a box. The box is proprietary (not necessarily bad) but from what I understand also fairly expensive. Otherwise I've only heard good things about them. We have a fair bit of experience with the Panases product. It does not work for us and we are looking elsewhere-but they are making serious in-roads. One of my issues is getting as much of my disk servers off NFS as I can. With it is that I have 150TB (we just ordered another 30+) I won't be able to get of NFS with. Performance was really good. I would go for something a little more open. Any of the software solutions plus my pick of hardware. I like the options. So if I want to use HDS disk with qlogic controllers and Dell blades, I can. Not so with PanFS as I understand it. Pretty much, but then if it works who cares? But if you want something all-in-one then they are likely the best in town right now. I would recommend you test, test, test before plucking down $$$. Also you may want to look at Terragrid from Terrascale. Performance is also really good. Anyway, that's my two cents. -- Wolf ---------------------------------------------------------------- Please post all responses to UseNet. All email cheerfully and automagically routed to Dave Null |
#6
On Thu, 02 Dec 2004 12:38:17 GMT, "Wolf" wrote:

> What is really good about Lustre is that it is very scalable. Most of
> the GFSes I have seen like SANs, and most of our boxes are NAS boxes.
> Lustre will let us manage all our boxes (OSTs).

It's scalable on the nodes but not on performance. You can have 1000
nodes or more, but performance will begin to suffer seriously after about
30 or 40. Read performance is scalable, but not write. And my guess is
most people looking at these types of solutions need write performance as
well as read.

> We have a small CX500 with 18 TB and IBM 2.5 GHz dual-Opteron/QLogic
> heads for testing. We will be testing Ibrix, GPFS, and Lustre, and I am
> going to install RedHat's next week if all goes well with my other work.

Polyserve beat out GFS by about 30% for us. Plus it was immensely simpler
to install and configure. I have yet to get a real low-down on GPFS, but
it sounds about the same as the others; not sure what they may have that
the others don't. Ibrix is extremely attractive if you already have a lot
of DAS, since each box can be its own segment server and be part of the
namespace as well. Nice re-use of existing hardware and storage; no one
else offers this in software (Acopia can take advantage of it, but that's
a hardware solution).

> Pretty much, but then if it works, who cares?

I'm thinking in terms of expansion. Depending on the company, you may or
may not get a serious discount on enterprise-grade hardware. If you do,
then you'd want to take advantage of that, since Panasas is pretty
expensive.

> Also you may want to look at Terragrid from Terrascale. Performance is
> also really good.

I'll have to look at Terragrid. Heard of it but that's it.

~F
#7
In article ..., Faeandar writes:

| There is something I don't understand about these distributed
| filesystems: how is the namespace mapped to the devices in a fully
| distributed environment? If there is complete independence between the
| name layer and the storage layer then there is an n-squared locking
| problem; if devices are tied to the namespace then there is a
| load-leveling/utilization problem.
|
| Yes, and yes (though maybe not squared). In most cases the products use
| a VIP that clients mount. In cases like Polyserve and GFS the client
| requests are round-robined among the node servers. Not exactly load
| balancing, but over time and with enough node servers the balance finds
| itself.
|
| Products like Acopia and Panasas have internal algorithms that they use
| for latency testing, and then write to the node server that returns the
| fastest. Good to some extent, but in the case of Acopia, not being a
| clustered file system, it means the potential for one system to house a
| lot more data than the others. Maybe a problem, maybe not.
|
| The only time you have a potential locking issue is with a clustered
| file system and a dedicated metadata server. This can become a serious
| bottleneck. Some products use distributed metadata which, while not
| alleviating the problem completely, greatly reduces it.
|
| ~F

If all nodes are equal (the distributed-metadata case) then there needs
to be a mapping from filesystem semantics to the underlying consistency
model implemented by the storage system. E.g., if the filesystem
guarantees strict write ordering (reads always return the latest write)
then the storage system must implement equally strong cache coherency
between all nodes, which doesn't scale. So there's a tradeoff between
performance and utilization. I was just wondering how modern DFSs deal
with this fundamental limitation. SMPs have hardware support for cache
management; do we need similar functionality in the SAN?
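To make that coherency cost concrete, here is a toy model in Python. It
is a sketch assuming round-robin VIP dispatch (as described above) and a
simple write-invalidate scheme, not any shipping product's protocol: each
write lands on one caching node server and must invalidate the block in
every other server's cache to preserve strict write ordering.

from itertools import cycle

def coherency_messages(num_servers: int, writes: int) -> int:
    """Count invalidation messages needed to keep strict write ordering
    when writes are spread round-robin over caching node servers."""
    servers = cycle(range(num_servers))  # round-robin dispatch behind the VIP
    messages = 0
    for _ in range(writes):
        next(servers)                    # server that accepts this write
        messages += num_servers - 1      # must invalidate every other cache
    return messages

# Per-write cost grows linearly with cluster size; if write volume also
# grows with the cluster, total coherency traffic grows roughly
# quadratically.
for n in (4, 8, 16, 32):
    print(n, coherency_messages(n, writes=100 * n))

Running this prints 1200, 5600, 24000, and 99200 messages: quadrupling
the cluster multiplies coherency traffic by more than ten, which is the
kind of n-squared growth the quoted question worries about when write
sharing is all-to-all.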
#8
"Keith Michaels" wrote in message ... In article , Faeandar writes: ... | There is something I don't understand about these distributed | filesystems: how is the namespace mapped to the devices in a fully | distributed environment? If there is complete independence between | the name layer and the storage layer then there is an n-squared | locking problem, if devices are tied to the namespace then there | is a load-leveling/utilization problem. | | Yes, and yes (though maybe not squared). In most cases the products | use a VIP that clients mount. In cases like Polyserve and GFS the | client requests are round-robin'd among the node servers. Not exactly | load balancing but over time and enough node servers the balance finds | itself. | | Products like Acopia and Panasas have internal algorithms that they | user for latency testing and then write to the node server that | returns the fastest. Good to some extent but in the case of Acopia, | not being a clustered file system, it means a potential for one system | to house alot more data than the others. Maybe a problem, maybe not. | | The only time you have a potential locking issue is with a clustered | file system and a dedicated metadata server. This can become a | serious bottleneck. Some products use distributed metadata which, | while not alleviating the problem completely, greatly reduces it. | | ~F If all nodes are equal (distributed metadata case) then there needs to be a mapping from filesystem semantics to the underlying consistency model implemented by the storage system e.g., if filesystems guarantee strict write ordering (reads always return the latest write) then the storage system must implement equally strong cache coherency between all nodes, which doesn't scale. So there's a tradeoff between performance and utilization. I was just wondering how modern DFSs deal with this fundamental limitation. SMPs have hardware support for cache management; do we need similar functionality in the SAN? 'Distributed metadata' means different things in different contexts. On VMS clusters, it refers to distributed management of metadata stored on shared devices by cooperating hosts. But even there only a single host manages a given metadatum at any given time (using the distributed locking facility). This approach scales to hundreds of host nodes (the nominal supported limit on cluster size is 96, but sizes close to 200 have been used in practice), though contention varies with the degree of access locality to a given datum (the worst case obviously being if all hosts access it equally, with intense update activity). In implementations like Lustre, it refers to low-level local metadata stored on all storage units, plus file-level metadata stored in a coordinating central metadata server (or metadata server cluster). The low-level local metadata is completely partitioned (i.e., of only local significance) and hence has no scaling problem - but by offloading that portion of metadata management from the central metadata server helps it scale too. Another alternative is to partition higher-level metadata as well (e.g., localizing the management of metadata for any given file - or even directory - to a single server (plus perhaps a mirror partner), even if the file data may be spread more widely). 
This certainly qualifies as distributed metadata, even though management of any single metadatum is not distributed, and scales well at the file system level as long as the partitioning granularity manages to spread out metadata loads reasonably (e.g., you'd have a problem if all system activity concentrated on one humongous database file - though with some ingenuity subdividing at least some of the metadata management even within a single file is possible). Until SAN bandwidths and latencies approach those of local RAM much more closely than they do today, there will likely be little reason to use hardware management of distributed file caches - especially since the most effective location for such caching is very often (save in cases of intense contention) at the client itself, where special hardware is least likely to be found. - bill |
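A minimal Python sketch of that partitioning alternative (the server
names and hash scheme are hypothetical, not any product's layout): each
file's parent directory maps to exactly one metadata server, so no
cross-server locking is needed for a single metadatum, even though the
file data may be striped anywhere.

import hashlib
import posixpath

METADATA_SERVERS = ["mds0", "mds1", "mds2", "mds3"]  # hypothetical servers

def metadata_owner(path: str) -> str:
    """Map a file's parent directory to the single server that manages
    its metadata, localizing directory operations to one node."""
    parent = posixpath.dirname(path)
    digest = hashlib.sha1(parent.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(METADATA_SERVERS)
    return METADATA_SERVERS[index]

print(metadata_owner("/home/alice/results.dat"))  # one owner per directory
print(metadata_owner("/home/alice/notes.txt"))    # same directory, same owner
print(metadata_owner("/scratch/job42/out.bin"))   # usually a different owner

This scales only as well as the hash spreads the load, which is exactly
the caveat above: one humongous database file (or one hot directory)
concentrates all of its metadata traffic on a single owner.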
#9
On Fri, 3 Dec 2004 14:29:21 -0500, "Bill Todd" wrote:

> In implementations like Lustre, it refers to low-level local metadata
> stored on all storage units, plus file-level metadata stored in a
> coordinating central metadata server (or metadata server cluster). The
> low-level local metadata is completely partitioned (i.e., of only local
> significance) and hence has no scaling problem - and offloading that
> portion of metadata management from the central metadata server helps
> it scale too.

Help me understand this. If file-level metadata is coordinated by a
single server, how is that not a large bottleneck? For reads I can see
why it's not, but for writes (in the 10,000-node clusters they talk
about) this would be like trying to stuff the Atlantic through a hose. At
least in my understanding.

> Until SAN bandwidths and latencies approach those of local RAM much
> more closely than they do today, there will likely be little reason to
> use hardware management of distributed file caches - especially since
> the most effective location for such caching is very often (save in
> cases of intense contention) at the client itself, where special
> hardware is least likely to be found.

My guess is we will see something like NVRAM cards inside the hosts, all
connected via IB or some other high-bandwidth, low-latency interconnect.
For now, though, gigabit seems to do the trick, primarily for the reason
you mentioned. Although I gotta say, it is extremely appealing to use
NVRAM in those hosts anyway just for the added write-performance boost.
If only they could be clustered by themselves and not require host-based
clustering.... Ah well, someday.

~F
#10
In article ..., "Bill Todd" writes:

| 'Distributed metadata' means different things in different contexts.
| ...
| Until SAN bandwidths and latencies approach those of local RAM much
| more closely than they do today, there will likely be little reason to
| use hardware management of distributed file caches...

Thanks Bill, that's a very interesting assessment. It seems to me that
the promise of grid-based/object storage is scalability; if we lose that,
we're back where we started. Lustre cleverly exploits the difference
between local and global attributes, but in the end all of these DFSs
lose scalability when trying to emulate strictly ordered filesystem
semantics and centralized resource management without making assumptions
about sharing and access patterns. Maybe there's a perfect technology
coming that will do it all; in the meantime, what is the business case
for DFS deployment? Maybe a better question is: how do we move beyond the
filesystem model so that these technologies fit our applications better?