#11
In article ,
Faeandar wrote:

>> In implementations like Lustre, it refers to low-level local metadata
>> stored on all storage units, plus file-level metadata stored in a
>> coordinating central metadata server (or metadata server cluster). The
>> low-level local metadata is completely partitioned (i.e., of only local
>> significance) and hence has no scaling problem - and offloading that
>> portion of metadata management from the central metadata server helps it
>> scale too.
>
> Help me understand this. If file-level metadata is coordinated by a
> single server, how is that not a large bottleneck? For reads I can see
> why it's not, but for writes (in the 10,000-node cluster they talk about)
> this would be like trying to stuff the Atlantic through a hose. At least
> in my understanding.

Depends. On the definition of "metadata", on the exact mode of caching, on
how the consistency locking is implemented, on the workload, and most
importantly, on whether the actual workload matches the expectations of the
implementors of the shared file system. There are existence proofs of
clustered / distributed file systems that centralize metadata handling, yet
work extremely well, even when writing.

This topic is very well studied. There are at least dozens of papers on it
in the open research literature. The number of published patents about it
must be in the hundreds. Many of the experts in the field could hold talks
lasting several hours apiece on the subject. I suggest that you review the
literature. I don't offhand know a good overview or introductory textbook;
just do a Google search for things like GFS (Minnesota), Lustre, Panasas or
PanFS (also look for CMU NASD, its predecessor), IBM StorageTank or SAN-FS,
GFS (the Google file system), IBM GPFS, Digital Petal/Frangipani, Berkeley
xFS, SGI CXFS, parallel NFSv4, and many academic predecessors (Sprite,
Slice, and so on).

> My guess is we will see something like NVRAM cards inside the hosts, all
> connected via IB or some other high-bandwidth, low-latency interconnect.
> For now though, gigabit seems to do the trick, primarily for the reason
> you mentioned. Although I gotta say, it is extremely appealing to use
> NVRAM in those hosts anyway just for the added write-performance boost.
> If only they could be clustered by themselves and not require host-based
> clustering.... Ah well, someday.

If you had NVRAM, and IB (or even better, something like SCI, which has
cache coherency built in), and the NVRAM were fault-tolerant (meaning: it
remains accessible even if the host has failed), then implementing a
clustered file system would become trivial. Or to put it differently: in an
environment where the hardware is plentiful, even really bad architectural
choices will perform adequately.

But today we don't have hardware NVRAM in every host, much less
fault-tolerant NVRAM, and we have high-latency networking. Or to be exact:
cost considerations prevent us from deploying NVRAM/IB/... universally
(they certainly exist, but they are expensive and not available in
commodity computers). This means that implementing a cluster or distributed
file system becomes an art form, and architectural choices must be well
thought out and matched to the intended usage pattern and to the customers'
expectations.

--
Ralph Becker-Szendy
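To make the partitioning concrete: a minimal sketch in Python (the names
are hypothetical, not Lustre's actual interfaces) of the division of labor
described above. Block-level bookkeeping stays on each storage target,
while the central server keeps only one small file-level record per file.

```python
class StorageTarget:
    """One storage unit; its allocation metadata is purely local."""

    def __init__(self, ost_id):
        self.ost_id = ost_id
        self.next_block = 0   # local allocator: needs no coordination
        self.objects = {}     # object_id -> list of local block numbers

    def allocate(self, object_id, nblocks):
        # Block-level bookkeeping never leaves this node, so adding more
        # targets adds allocation capacity with zero central overhead.
        blocks = list(range(self.next_block, self.next_block + nblocks))
        self.next_block += nblocks
        self.objects.setdefault(object_id, []).extend(blocks)
        return blocks


class MetadataServer:
    """Central server: knows only which objects make up which file."""

    def __init__(self):
        self.namespace = {}   # path -> [(ost_id, object_id), ...]

    def create(self, path, layout):
        self.namespace[path] = layout   # one small record per file

    def lookup(self, path):
        return self.namespace[path]


# Usage: the central server is touched once per file, not once per block.
osts = [StorageTarget(i) for i in range(4)]
mds = MetadataServer()
mds.create("/data/run42", [(ost.ost_id, 0) for ost in osts])
for ost in osts:
    ost.allocate(object_id=0, nblocks=16)   # purely local bookkeeping
```

This is the scaling argument in miniature: the per-block work grows with
the number of targets, but none of it lands on the central server.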
#14
"Faeandar" wrote in message ... On Fri, 3 Dec 2004 14:29:21 -0500, "Bill Todd" wrote: .... In implementations like Lustre, it refers to low-level local metadata stored on all storage units, plus file-level metadata stored in a coordinating central metadata server (or metadata server cluster). The low-level local metadata is completely partitioned (i.e., of only local significance) and hence has no scaling problem - but by offloading that portion of metadata management from the central metadata server helps it scale too. Help me understand this. If file level metadata is coordinated by a single server how is that not a large bottleneck? By making the server suitably hefty. Or, as noted above, by using a metadata server cluster to divvy up the load (though even a relatively modest SMP system can handle impressive levels of metadata management before clustering it is necessary for anything save to increase availability via fail-over mechanisms). For reads I can see why it's not, but for writes (in the 10,00 node cluster they talk about) this would be like trying to stuff the Atlantic through a hose. At least in my understanding. The actual user data doesn't pass through the metadata server, just the metadata operations. So in a product like Lustre (or, IIUC, Storage Tank), the metadata server merely need be made aware of the write operation to update things like last-modified date and EOF mark, while the grunt-level data servers handle the actual movement to disk (and at least potentially any interlocking involved, though the interaction of byte-range locks with full-file locks on a file large enough to span many data servers becomes an interesting challenge then). Directory operations may often place a heavier load on metadata servers than even write-intensive file activity. .... Until SAN bandwidths and latencies approach those of local RAM much more closely than they do today, there will likely be little reason to use hardware management of distributed file caches - especially since the most effective location for such caching is very often (save in cases of intense contention) at the client itself, where special hardware is least likely to be found. My guess is we will see something like NVRAM cards inside the hosts all connected via IB or some other high bandwidth low latency interconnect. For now though gigabit seems to do the trick primarily for the reason you mentioned. Although I gotta say, it is extremely appealing to use NVRAM in thos hosts anyway just for the added write performance boost. If only they could be clustered by themselves and not require host-based clustering.... Ah well, someday. Use of NVRAM is eminently feasible today, and has (or at least with suitable design need have) relatively little connection to distributed caching (let alone with hardware support for same) - especially if clients aren't completely trustworthy. - bill |
#16
On Fri, 3 Dec 2004 18:25:44 -0500, "Bill Todd" wrote:

...

> The actual user data doesn't pass through the metadata server, just the
> metadata operations. So in a product like Lustre (or, IIUC, Storage
> Tank), the metadata server merely needs to be made aware of the write
> operation to update things like the last-modified date and EOF mark,
> while the grunt-level data servers handle the actual movement to disk.
> Directory operations may often place a heavier load on metadata servers
> than even write-intensive file activity.

I understand only metadata ops are passed through a metadata server and
not the data itself, but in the case of high-performance computing that
load is sizeable. And the more node servers you add to the cluster, the
more that load increases. And locking, though handled at the node server,
still has to be distributed so the other nodes know what's locked and
what's not. This also can be large for compute farms.

I see how you thought I was an idiot with the Atlantic analogy. I was just
illustrating a point, exaggerated though it was.

...

> Use of NVRAM is eminently feasible today, and has (or at least with
> suitable design need have) relatively little connection to distributed
> caching (let alone to hardware support for same) - especially if clients
> aren't completely trustworthy.

It certainly is feasible, but having battery-backed NVRAM does little good
if it isn't sync'd with a failover partner or pool. In the case of a
critical host failure lasting a long period of time, that data is
essentially lost.

~F
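A minimal sketch of the sync'd-NVRAM idea (a hypothetical design, not any
particular product): a write is acknowledged only once its log record also
exists on a failover partner, so a surviving partner can replay a dead
host's dirty data instead of it being stranded in the failed chassis.

```python
class NVLog:
    """Stand-in for one host's battery-backed write log."""

    def __init__(self):
        self.records = []   # pretend appends here survive power loss

    def append(self, record):
        self.records.append(record)


class MirroredWriteCache:
    def __init__(self, local: NVLog, partner: NVLog):
        self.local = local
        self.partner = partner   # reached over the interconnect in reality

    def write(self, block, data):
        record = (block, data)
        self.local.append(record)
        self.partner.append(record)   # mirror BEFORE acknowledging
        return "ack"                  # now safe: two copies exist

    @staticmethod
    def replay(surviving_log: NVLog, disk):
        # After a host failure, the partner flushes the mirrored records
        # to stable storage, so no acknowledged write is lost.
        for block, data in surviving_log.records:
            disk[block] = data
```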
#17
"Faeandar" wrote in message ... .... I understand only metadata ops are passed through a metadata server and not the data itself, but in the case of high performance computing that load is sizeable. My impression is that HPC applications don't typically care all that much about consistent last-modified dates. And for files too large to be handled locally at a single data-server node, it's certainly possible for the metadata server just to track which data server currently contains the file's EOF and off-load end-of-file-mark maintenance to that server (thus the metadata server is only bothered each time the file grows or shrinks enough for its EOF mark to move to another data server - or not at all if the file is confined to a single data server, if that's where individual file metadata is managed). The secret of off-loading centralized metadata servers is to minimize the need to talk with them. 'Object-oriented' storage takes the first step by isolating the metadata servers from the details of file placement below the node level (even if the metadata server tracks placement explicitly at the node level rather than uses some algorithm to determine distribution across the data servers, that's still a major reduction in metadata server workload). The next step is to isolate the metadata servers from metadata management associated with individual files, farming that off to the data server on which the file resides (or the first data server of a set over which the file is spread). The final step is to distribute directories as well, eliminating the metadata servers entirely and making the system a group of peers, but this requires some ingenuity to avoid significantly increasing the overhead of path look-ups. And the more node servers you add to the cluster the more that load increases. And locking, though handled at the node server, still has to be distributed so the other nodes know what's locked and what's not. The only node that needs to know what's locked is the node controlling the data that's locked (and perhaps a mirroring partner, for availability): everyone has to go there for that portion of the file anyway, in the kind of system I was describing (where clients have sufficient intelligence to target data servers directly in most cases after acquiring the file map from the metadata server). It's also nice for that data server to be able to delegate some locking to a specific client in the absense of contention, but that doesn't change the basic approach. .... Use of NVRAM is eminently feasible today, and has (or at least with suitable design need have) relatively little connection to distributed caching (let alone with hardware support for same) - especially if clients aren't completely trustworthy. It certainly is feasible, but having battery backed NVRAM does little good if it isn't sync'd with a failover partner or pool. In the case of a critical host failure for long periods of time that data is essentially lost. NVRAM works most efficiently when in the service of a single server. Since you have to allow for the failure of that server anyway, you just arrange partnering at the server level without any special considerations for NVRAM that may be present for performance reasons. 
In fact, if you're sufficiently confident in your UPSs and have the two partners on separate ones, you can dispense with NVRAM entirely (the UPS-backed system RAM will suffice, as long as you can be sure that at least one of the mirror partners will continue running long enough to dump its contents to disk and revert to more conservative strategies if its partner fails). In general, when trying to avoid single points of failure, using coarse-granularity partnering with near-commodity-level components is far less expensive than any attempt to avoid failure points within any single node. The only real exception to this is when you're not shooting merely for availability but absolute reliability, and must therefore use redundant everything (with hardware comparators) within each single server to guard against the possibility that it will fail without noticing the problem and proceed to corrupt its portion of the database (even then, you still need geographically-separated mirror partners to guard against site-level failures, so the cost can easily reach an order of magnitude greater than simple commodity server mirroring). - bill |
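A sketch of the lock-locality point above (hypothetical names): the
byte-range lock table lives only on the data server that owns the range,
and absent contention the server can delegate the range to the one active
client, whose subsequent lock traffic then stays entirely local.

```python
class RangeLockServer:
    """Runs on the data server that stores the range - nowhere else."""

    def __init__(self):
        self.locks = []        # [(start, end, owner), ...]
        self.delegations = {}  # client -> (start, end) handed out

    def _conflicts(self, start, end, owner):
        return any(s < end and start < e and o != owner
                   for s, e, o in self.locks)

    def lock(self, client, start, end):
        if self._conflicts(start, end, client):
            # Contention: recall any outstanding delegation before the
            # competing request can be considered again.
            for other in list(self.delegations):
                if other != client:
                    self.recall(other)
            return False   # caller waits and retries
        self.locks.append((start, end, client))
        if {o for _, _, o in self.locks} == {client}:
            # Sole user: hand the range to the client, which can then
            # grant itself sub-locks without any server round-trips.
            self.delegations[client] = (start, end)
        return True

    def recall(self, client):
        # Take back the delegation; the client must flush and re-request.
        self.delegations.pop(client, None)
```

No other node in the cluster ever hears about these locks, which is why
adding data servers adds lock-handling capacity rather than lock traffic.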
#18
In article ,
Faeandar wrote:

...

> Can it scale to 10,000 nodes? Well, not currently (read-only
> notwithstanding),

I've heard of 1024-node clusters today. Ask some of the 1st-tier vendors.
And everyone in the field is working on improving scalability. Clearly,
object storage is one of the tools in the arsenal. As is NVRAM, but that
has problems (which Bill explained well).

> but 10 years ago scaling this type of file system to anything beyond 1
> server was voodoo in the open-systems world.

What was there in the open-systems world 10 years ago? 10 years ago you
could run Linux version 0.99.14 (which IMHO had ext but not yet ext2 file
systems; it was my first Linux) if you wanted open source, or dream of
Linux V1, or you could buy BSDI. I think none of the fully free *BSDs was
available yet: I remember considering the Jolitzes' 386BSD in 1995 (or am
I one year late?). In the commercial world, there was Ultrix on Alpha,
SunOS, HP-UX V8, AIX 3, and some IRIX version (and NeXTSTEP, if you were
into masochism). If you wanted cluster computing, you could buy a big SGI,
or an IBM SP2 (were they available in 1994? I don't remember). I refuse to
count Windows as "open systems", even though MSCS = Wolfpack (the cluster
version of WinNT) was already on people's minds then.

On the other hand, 10 years ago a very-high-quality cluster file system
had already been available for a decade, in the form of VAXcluster. I've
personally used VAXclusters of up to 90 machines, in the late 80s. And if
you wanted multi-node fault-tolerant NVRAM, it was already available in
the form of "ServerNet", Tandem's backplane technology.

It's actually very sad how little we have accomplished since then. The
destruction of Digital and of Tandem contributed much to the lack of
progress.

--
Ralph Becker-Szendy
#19
wrote in message
news:1102137471.806300@smirk...

> I've heard of 1024-node clusters today. Ask some of the 1st-tier
> vendors. And everyone in the field is working on improving scalability.
> Clearly, object storage is one of the tools in the arsenal. As is NVRAM,
> but that has problems (which Bill explained well).

We have ~1800 compute nodes and 150 TB of disk space. Scalability is a
major problem. For instance, we push NFS further than it was probably
intended to go. So I see a GFS as a solution for helping not only our
performance, but our scalability also.

> In the commercial world, there was Ultrix on Alpha, SunOS, HP-UX V8, AIX
> 3, and some IRIX version (and NeXTSTEP, if you were into masochism). If
> you wanted cluster computing, you could buy a big SGI, or an IBM SP2
> (were they available in 1994? I don't remember).

Don't forget Cray and a bunch of multi-proc proprietary solutions. But
they were all very expensive.