#21
EMC to IBM SAN LUN replication
Bill Todd wrote:
Perhaps you should actually learn something about VMS clustering before presuming to expound on the deficiencies that you imagine it has. VMS's distributed file system has, since the mid-'80s, provided exactly what you describe above: an environment in which multiple instances of dumb applications executing at different sites can concurrently share (and update) the same file(s), with the same integrity guarantees that they'd enjoy if they were all running on a single machine.

Jeez Bill, are you trying to play the old geezer claiming anything new is at best a poor imitation of something they've been using for 20 years :-) You are right, I know nothing about VMS. But it makes me think of mainframes and a proprietary file system, which would mean you can't just take your existing SAP, Exchange and StorNext filesystem services and make them run on a VMS cluster, am I right? If a storage appliance does all this, you can connect the WAN link and the SAN at each site to the storage appliances, and every service can run on its own server with whatever OS it prefers, connecting to the local SAN, using whatever block access it prefers (raw block device, file system, ...). That's what I meant by the app doesn't need to know anything about the distributed storage, nor is there any requirement for server hardware, OS or communication middle-ware.

And exactly how do you think that the secondary site's portion of your hypothetical distributed cache would *know* that its local data was stale, without some form of synchronous communication with the primary site? Your hands are waving pretty fast, but I suspect you really don't know much about this subject at the level of detail that you're attempting to discuss.

I think you know the answer as well as I do: by having a distributed cache manager, one that tries to put the cache directory for a given block range at the site with the most intensive I/O pattern.
If you make those directory entries large enough, you can reduce the overhead of having to ask the cache manager for every I/O. Yes, there will of course be some synchronous traffic between the sites, for communication between the parts of the distributed cache manager. I find it curious you suspect I don't know anything about the subject. Do you think it's impossible or impractical, and therefore anyone claiming the opposite must be full of it?

If one assumes a major bandwidth bottleneck between the sites, such that small communications about data currency can be synchronous but large updates cannot easily be, there might be *some* rationale for the kind of system that you seem to be describing. But in that case, secondary-site accessors aren't going to get very good service when they want something that has changed recently.

If they want something that has changed recently *by another site* they won't get good service, that's right. There is no way around the fact that if you want access to the data just written at the other site, you need to pull it over the WAN and it will take a while.

.... There is no lock when the storage appliance maintains the inter-site coherence;

It sounds as if you may be confused again: applications don't *see* any locking using inter-site VMS distributed file access any more than they see it doing local access: the 'locking' involved is strictly internal to the file system (just as it is for a local-only file system), and is largely involved in guaranteeing the atomicity of data updates (if two accessors try to update the same byte range, for example, the result really ought to be what one or the other wrote, rather than some mixture of the two) or internal file system updates (say, an Open and a Delete operation racing each other, where they really need to be serialized rather than mixed together).

Well, like I said, I know squat about VMS.
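The cache-directory scheme sketched in this exchange - directory entries covering block *ranges* rather than single blocks, with ownership of each entry migrating to the site generating the heaviest I/O - could look roughly like the following. This is a toy illustration; all class and variable names are invented, not taken from any real product.

```python
# Hypothetical sketch of the distributed cache directory described above.
# Larger RANGE_SIZE means fewer directory entries and fewer manager lookups.
RANGE_SIZE = 1024  # blocks per directory entry

class CacheDirectory:
    def __init__(self, sites):
        self.sites = sites
        self.owner = {}        # range index -> site holding that directory entry
        self.io_count = {}     # (range index, site) -> I/O tally

    def _range(self, block):
        return block // RANGE_SIZE

    def record_io(self, block, site):
        r = self._range(block)
        self.io_count[(r, site)] = self.io_count.get((r, site), 0) + 1
        # Migrate the directory entry to the site with the most intensive
        # I/O pattern, so most lookups there avoid a WAN round trip.
        busiest = max(self.sites, key=lambda s: self.io_count.get((r, s), 0))
        self.owner[r] = busiest

    def lookup_is_local(self, block, site):
        # True when this site can answer the directory query locally.
        return self.owner.get(self._range(block)) == site

d = CacheDirectory(["site1", "site2"])
for b in range(100):          # site1 hammers the first block range
    d.record_io(b, "site1")
d.record_io(5, "site2")       # one stray I/O from site2 doesn't move ownership
print(d.lookup_is_local(5, "site1"))   # True
```

The point of the sketch is only the placement policy: the synchronous inter-site traffic is confined to directory migration and misses, not every I/O.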
I wouldn't want to argue VMS with someone who's clearly a VMS guru, with such ability to produce page-long technical arguments at the slightest hint of ignorance :-) Let me retract my earlier statement about high-end services needing to be designed to run on VMS; I have offended you.

Or perhaps you're suggesting that with block-level inter-site replication no inter-site locking is required to support inter-site file-level access. If so, you're simply wrong: even if the remote site applies updates in precisely the same order in which they're applied at the primary site, lack of access to the primary site's in-memory file system context remains a problem (i.e., the secondary site is still stale in that sense unless the primary site's file system does something special to ensure that it isn't, outside the confines of your 'storage appliance').

No, I thought I had said that applications requiring a file system need to run a distributed file system on top of the block service. Some distributed file systems have their own distributed locking mechanism to lock blocks or files; others use a single server at the primary site. But Oracle, for example, can use raw block devices and does its own locking and synchronisation (with RAC), so there's not always a need to provide one as part of the storage system. If applications need a lock manager (like distributed file systems, for example), they'll need to provide their own.

What's this "if"? The only instances in which they will *not* require something like a distributed lock manager are exactly those which Nick described: using snapshot-style facilities at the secondary site to create an effectively separate (usually read-only) environment which can then be operated upon in whatever manner one wants.

How about the migrating services that can move between the sites but will only access data at a single site at any time?
There is no data locking required; the service can simply assume its data is always there and it can always access it. And like I said, most applications requiring locks already provide them themselves. You can only avoid latency penalties by not accessing the same data at both sites within the async replication window.

Same problem I noted above: unless you've got synchronous oversight to catch such issues (even if the actual updates can be somewhat asynchronous), it's all too easy to get stale data as input to some operation that will then use it to modify other parts of the system.

But that is exactly what a distributed cache does, keeping track of dirty blocks and fetching them automatically across sites when required! The cache flushes to an asynchronous data replication service, so that if the cache doesn't hold the blocks you want to do I/O to, the data is guaranteed to be on your local storage. Luckily there are many I/O patterns that can avoid this. My example of agent migration, for example, would have a piece of data being accessed from a single site only, until the agent moves to the other side.

That may work OK for read-only access, but then so does a block-level snapshot (just take one when you move the agent). Where data gets updated (or, perhaps worse, appended to), you've got allocation activity at the new site which must be carefully coordinated with the primary site.

Creating a snapshot and making it available on the other site requires the migration service to integrate with the storage at both sites, which is not just a trivial matter! With a distributed block-level cache it needs to know nothing about the storage, and in fact the storage can be from different vendors at both sites (what started this thread!), something that wouldn't work with your snapshot example.
For any application where data is produced only once, rarely modified but read often, the latency hits are minimal, and the ease of just being able to read up-to-date data at all sites at any moment is a big benefit.

If the updates really are that rare, then synchronous replication and VMS-style site-local access will work fine.

That all depends on your definition of rare, and the latencies you are talking about. If you have, say, 2000 km worth of latency, synchronous is not an option even for very modest updates. The apps at all sites can be completely unaware of synchronisation windows and do not need to be made aware of each other.

Just as has been the case with VMS for decades - as I said.

Alright, VMS rocks and so do distributed cache appliances, but VMS has been rocking for 20 years longer :-) Sort of a Neil Young versus Green Day argument we're having here.

Arne
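The "2000 km worth of latency" claim is easy to check with back-of-the-envelope arithmetic: light in fibre travels at roughly two thirds of c, so distance alone puts a hard floor under every synchronous round trip, before any switch or array latency is added.

```python
# Minimum round-trip time imposed by 2000 km of fibre, speed-of-light only.
distance_km = 2000
fiber_speed_km_per_s = 200_000           # light in fibre is roughly 2/3 of c
one_way_s = distance_km / fiber_speed_km_per_s
round_trip_ms = 2 * one_way_s * 1000
print(round_trip_ms)                      # 20.0 ms per synchronous write ACK

# A strictly serial writer that waits for each remote ACK therefore tops out
# around 50 writes per second, regardless of how fast the storage is:
print(1000 / round_trip_ms)               # 50.0
```

That floor is why the thread treats synchronous replication as a non-starter at continental distances even for modest update rates.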
#22
EMC to IBM SAN LUN replication
Nik Simpson wrote:
Arne Joris wrote: Nik Simpson wrote: If you want to do processing of data for applications like data mining or backup, then you can take a snapshot of the remote mirror (both have the ability to do that) and serve the snapshot up to an application server for processing. What else did you have in mind, and if "there are more options" then why not share your knowledge with us?

Well, for example, if your secondary site is not just a remote data vault but an actual production site where people need access to the data, it is pretty lame to have servers at the secondary site go over the WAN to read the data from the primary storage when they have a copy of the data right there! With a distributed block cache on top of asynchronous data replication, you could have both sites do I/O to the same volumes and access their local storage.

If that's what you want to do, then it makes a hell of a lot more sense to do the replication at the file system level, not at the LUN, because you need something in the replication layer that understands synchronization issues at the file system.

If you assume every data block in every file could be being updated at any moment, you are right. But take for example seismic data files, which get produced as an hours-long sequential dump of data. Why should a processing app at site 2 need to wait until the entire file has been written at site 1? It could simply start reading at block 0, and perhaps even do some modifications to the data, as long as it hasn't crossed site 1's writing barrier. Or take the example of web servers running at both sites, where at each site there could be people writing new content (think a slashdot-like situation) into their own data region. The web servers might not get notified every time a new document is added at the remote site, but an occasional directory refresh will show the new content, and that is more than enough for a lot of applications.
Obviously you'll need some form of agreement between the sites to use only certain block ranges for new I/O, so they won't all start writing to the same regions.

Arne
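The block-range agreement Arne describes could be as simple as statically partitioning the block space, so each site allocates new blocks only from its own disjoint region and new writes can never collide. A minimal sketch, with invented names and an arbitrary volume size:

```python
# Static partitioning of a shared volume's block space between sites, so
# each site's new allocations stay inside its own half-open range.
TOTAL_BLOCKS = 1_000_000

def region_for(site_index, n_sites, total=TOTAL_BLOCKS):
    """Half-open block range [start, end) reserved for one site."""
    size = total // n_sites
    start = site_index * size
    end = total if site_index == n_sites - 1 else start + size
    return start, end

def allocate(site_index, n_sites, next_free):
    """Hand out the next block, refusing to stray outside this site's region."""
    start, end = region_for(site_index, n_sites)
    block = max(next_free, start)
    if block >= end:
        raise RuntimeError("site's region exhausted; renegotiate the split")
    return block

s1 = region_for(0, 2)
s2 = region_for(1, 2)
print(s1, s2)          # (0, 500000) (500000, 1000000)
print(s1[1] <= s2[0])  # True -- disjoint, so no per-write coordination needed
```

The trade-off, of course, is exactly the one Nik pokes at later in the thread: this only avoids allocation races for *new* data, and the split has to be renegotiated (synchronously) when a region fills up.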
#23
EMC to IBM SAN LUN replication
Arne Joris wrote:
Bill Todd wrote: Perhaps you should actually learn something about VMS clustering before presuming to expound on the deficiencies that you imagine it has. VMS's distributed file system has, since the mid-'80s, provided exactly what you describe above: an environment in which multiple instances of dumb applications executing at different sites can concurrently share (and update) the same file(s), with the same integrity guarantees that they'd enjoy if they were all running on a single machine. Jeez Bill, are you trying to play the old geezer claiming anything new is at best a poor imitation of something they've been using for 20 years :-) No: I'm an old geezer observing that every single damn thing that you've characterized as 'emerging technology' is in fact very old hat: VMS had it over two decades ago, IBM had it a decade ago in Parallel Sysplex (and to a lesser extent in HACMP on AIX), other Unixes have been developing it more recently, as well as third-parties (Mercury's SANergy being one early example, a shared-storage/central metadata server implementation supporting both Windows and Unix later bought by IBM for Tivoli which could be used to achieve most of what you've described, though not as flexibly as VMS facilities can): it's Windows that's the real laggard (Microsoft was working with DEC in the '90s to try to move in this direction, but came up far short and never rectified that: they appear to have decided that concurrently-shared-storage architectures, whether real or virtual, were not the way they wished to go). You are right, I know nothing about VMS. Then why in hell did you presume to try to characterize its facilities as anything other than precisely what I told you they were? When I observed "Since VMS has been doing exactly these kinds of things since the mid-'80s, calling it 'emerging technology' (rather than, say, catch-up implementations) seems a bit of a stretch." you responded "The difference is..." 
without knowing a damn thing about whether any difference of the form that you went on to describe actually existed (and, of course, it didn't).

But it makes me think of mainframes and a proprietary file system, which would mean you can't just take your existing SAP, Exchange and StorNext filesystem services and make them run on a VMS cluster, am I right?

Well, SAP of course *used* to run natively on VMS, until it decided that cHumPaq was insufficiently committed to the platform to make continuing to do so worthwhile. As for the rest, you can use the inter-site VMS cluster as a distributed CIFS file server to serve Windows clients - in the manner that you suggested using a distributed 'storage appliance'. But none of them would run (whether on the 'storage appliance' that you're imagining or on a VMS cluster) using multiple concurrent instances unless they were designed to do so at the application level: while the 'dumb Web server' that you originally mentioned might not require any coordination between instances, things like SAP and Exchange most definitely would.

If a storage appliance does all this, you can connect the WAN link and the SAN at each site to the storage appliances, and every service can run on its own server with whatever OS it prefers, connecting to the local SAN, using whatever block access it prefers (raw block device, file system, ...).

Only if a) that storage appliance is interlocking raw block access synchronously (some VMS storage hardware does this and VMS cluster software can also do it at a low level when handling the replication on dumb hardware; I don't specifically remember whether they export raw block-level inter-site access for application use, but see no reason why they wouldn't) and b) you're using higher-level shared-storage distributed file system software (executing by definition *outside* the 'storage appliance' if it indeed is exporting only block-level access, as you state above) for file-level accesses.
That's what I meant by the app doesn't need to know anything about the distributed storage, Nor (once again) does it with the VMS facilities that I've described. nor is there any requirement for server hardware, OS All you've done is substitute this mythical 'storage appliance' for the server hardware and OS. or communication middle-ware. *What* 'communication middle-ware', pray tell? Unless you're referring to using something like CIFS to link Windows systems to the distributed VMS servers, in which case if you want file-level access you'll instead have to use shared-storage distributed file system 'middle-ware' on those Windows servers to link to your block-level 'storage appliance'. And exactly how do you think that the secondary site's portion of your hypothetical distributed cache would *know* that its local data was stale, without some form of synchronous communication with the primary site? Your hands are waving pretty fast, but I suspect you really don't know much about this subject at the level of detail that you're attempting to discuss. I think you know the answer as well as I do : I suspect somewhat better, since I've actually designed and implemented systems of this ilk a couple of times rather than just bloviated about them. by having a distributed cache manager, one that tries to put the cache directory for a given block range at the site with the most intensive I/O pattern. Ah, you're learning something from VMS already, I see - save that the block level (rather than the file level) isn't the optimal place to do it, and (if site-local access to data is as important as you seem to think it is) using a distributed cache makes a lot less sense than using distributed locks (since you'd rather access data locally - even if you have to go to disk for it - as long as that's safe, regardless of whether some other site might have it cached). 
Now, if you're using 'distributed cache' (something I tend to use in the context of allowing one system to benefit from the data in another's cache rather than having to go to disk for it) to mean something much more like 'distributed locking mechanism' (which tracks potential synchronization issues such that they can be properly addressed should they occur), then we're just using different terminology to describe the same thing.

If you make those directory entries large enough you can reduce the overhead of having to ask the cache manager for every I/O.

Of course you have to interrogate the cache (or other inter-site synchronization facility) on every I/O - the idea is that you only have to interrogate the *site-local* portion of it, because it's being kept synchronously up to date about any temporary inconsistencies.

Yes, there will of course be some synchronous traffic between the sites, for communication between the parts of the distributed cache manager.

Duh - you mean like the synchronous communication required between the parts of VMS's distributed lock manager? Yup.

I find it curious you suspect I don't know anything about the subject.

Why would you find it curious that your earlier incompetent observations about it would lead someone to that conclusion? You do seem willing to learn, though - but you still appear a bit confused about the level of inter-site coordination required by concurrent distributed file-level access.

Do you think it's impossible or impractical, and therefore anyone claiming the opposite must be full of it?

Au contraire, it's eminently possible and practical (as I've noted, VMS has been doing it for over 20 years, just somewhat better than the approach which you sketched out would).

.... But Oracle, for example, can use raw block devices and does its own locking and synchronisation (with RAC), so there's not always a need to provide one as part of the storage system.
If the storage system firmware (rather than the Oracle software) is handling the inter-site block-level replication (as you seem to be suggesting), then either that firmware needs to implement at least short-term inter-site interlocks to ensure that regardless of which site Oracle elects to obtain block-level data from, the copy obtained is up to date, or the Oracle code must ensure that until a write-complete ACK has been received from the local storage hardware no remote access to that block can occur (and the local hardware must not return completion status until all copies have been updated). By the way, the Oracle RAC DLM implementation is based on the VMS DLM design given to them by DECpaq (the earlier Oracle Parallel Server product originated on VMS, since that was the only cluster environment that provided the required distributed lock management), and the AIX HACMP DLM was a clone based on careful study of the VMS DLM design (not surprising, since one of its main purposes was to allow OPS - with its VMS DLM-based locking interface - to run in the HACMP environment). Imitation often *is* the sincerest form of flattery.

If applications need a lock manager (like distributed file systems, for example), they'll need to provide their own.

What's this "if"? The only instances in which they will *not* require something like a distributed lock manager are exactly those which Nick described: using snapshot-style facilities at the secondary site to create an effectively separate (usually read-only) environment which can then be operated upon in whatever manner one wants.

How about the migrating services that can move between the sites but will only access data at a single site at any time?
No competent engineer would design a storage system claiming to be multi-site accessible without appropriate interlocks (even supporting reliable snapshot-style access at the remote site requires that remote updates occur in the same *order* that primary-site updates do): at most, they'd *allow* multi-site access with vehement and copious "on your head be it" warnings about the potential perils of using it - which would only be exacerbated (in terms of race windows) if the asynchronous replication you've talked about were used. As a specific example of a potential problem *above* the hardware level, if your hypothetical service migrates at any speed exceeding that of sneakernet, dirty data that it just wrote to the OS file system cache on one site may not have been flushed to the underlying (distributed) storage media by the time the service pops up on the other site (delays of up to 30 seconds are common in Unix environments, for example).

There is no data locking required; the service can simply assume its data is always there and it can always access it. And like I said, most applications requiring locks already provide them themselves.

Horse****: most applications requiring locks don't need to think about them at all, because the underlying file system is doing all that transparently for them. Just *getting to* the data involves following a multi-link path through the file system directory structure, an operation that can't occur reliably on a 'secondary' site without some degree of inter-site update coordination (for God's sake, if you're using an optimized journaling file system the secondary site may have the pertinent *log* entries but won't have the associated in-memory update context: you can't use the file system there in anything resembling up-to-date fashion without first performing a recovery from the on-disk log).

.... Luckily there are many I/O patterns that can avoid this.
My example of agent migration for example, would have a piece of data being accessed from a single site only, until the agent moves to the other side. That may work OK for read-only access, but then so does a block-level snapshot (just take one when you move the agent). Where data gets updated (or, perhaps worse, appended to), you've got allocation activity at the new site which must be carefully coordinated with the primary site. Creating a snapshot and making it available on the other site requires the migration service to integrate with the storage at both sites, which is not just a trivial matter ! *None* of this is as trivial a matter as you imagine it to be, as I hope you're starting to learn (if you respond again, we should get a pretty good idea of just how educable you are). .... For any application where data is produced only once, rarely modified but read often, the latency hits are minimal, and the ease of just being able to read up-to-date data at all sites at any moment is a big benefit. If the updates really are that rare, then synchronous replication and VMS-style site-local access will work fine. That all depends on your definition of rare, and the latencies you are talking about. If you have say 2000 km worth of latency, synchronous is not an option even for very modest updates. If update latency makes synchronous replication prohibitively slow (VMS-style distributed lock management can allow the *reads* to proceed at all sites without delay, as long as no nearby updates are occurring at the time), then your only real option is to use ordered asynchronous replication plus snapshots upon which any required recovery operations are then performed before use (if you still don't understand why, reread the existing material until the light dawns). 
Unless, of course, you are using a 'careful update' file system such as VMS's (Berkeley's 'soft update' mechanism may also qualify), which should avoid the need for the recovery part (desirable in cases where writable snapshots are not supported) - though you'll still need to use a snapshot based on ordered underlying storage updates, not just wing it with the underlying storage contents continuing to change beneath you. - bill |
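The one hard requirement Bill keeps returning to - that the remote site must apply updates in exactly the order the primary issued them, or even snapshot-style access to the replica is unreliable - can be sketched with a sequence-numbered apply loop. This is a toy illustration with invented names, not any vendor's actual replication protocol:

```python
# Ordered asynchronous replication: updates may *arrive* out of order over
# the WAN, but are only *applied* in strict primary-site sequence.
class OrderedReplica:
    def __init__(self):
        self.blocks = {}
        self.next_seq = 0
        self.pending = {}      # out-of-order arrivals parked by sequence number

    def receive(self, seq, block, data):
        self.pending[seq] = (block, data)
        # Apply only the contiguous prefix; a gap means an earlier update is
        # still in flight, and applying past it would reorder writes.
        while self.next_seq in self.pending:
            b, d = self.pending.pop(self.next_seq)
            self.blocks[b] = d
            self.next_seq += 1

r = OrderedReplica()
r.receive(1, 7, "new")         # arrives first, but is sequence #1 ...
print(r.blocks)                # {} -- parked, nothing applied yet
r.receive(0, 7, "old")         # ... so the replica waits for sequence #0
print(r.blocks[7])             # "new" -- both applied, in primary order
```

A snapshot taken at any point of such a replica reflects some consistent prefix of the primary's history, which is what makes the recovery-then-use approach in the post above workable at all.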
#24
EMC to IBM SAN LUN replication
Arne Joris wrote:
Nik Simpson wrote: Arne Joris wrote: Nik Simpson wrote: If you want to do processing of data for applications like data mining or backup, then you can take a snapshot of the remote mirror (both have the ability to do that) and serve the snapshot up to an application server for processing. What else did you have in mind, and if "there are more options" then why not share your knowledge with us?

Well, for example, if your secondary site is not just a remote data vault but an actual production site where people need access to the data, it is pretty lame to have servers at the secondary site go over the WAN to read the data from the primary storage when they have a copy of the data right there! With a distributed block cache on top of asynchronous data replication, you could have both sites do I/O to the same volumes and access their local storage.

If that's what you want to do, then it makes a hell of a lot more sense to do the replication at the file system level, not at the LUN, because you need something in the replication layer that understands synchronization issues at the file system.

If you assume every data block in every file could be being updated at any moment, you are right. Thanks ;-)

But take for example seismic data files, which get produced as an hours-long sequential dump of data. Why should a processing app at site 2 need to wait until the entire file has been written at site 1? It could simply start reading at block 0, and perhaps even do some modifications to the data, as long as it hasn't crossed site 1's writing barrier.

Sounds good in theory, but if the replication is asynchronous with no synchronous communication or lock manager, how does the application at the remote site know how far the application at the local site's "writing barrier" has moved at any given moment? You also have to assume that the writing is sequential, i.e.
at its absolute best this can only work for purely sequential I/O streams, and as such is somewhat similar to a streaming media server. As soon as any element of randomness is added to the I/O, the whole thing goes to hell in a handbasket, because neither end knows what the hell is going on at any given point without maintaining some sort of synchronous communication.

Or take the example of web servers running at both sites, where at each site there could be people writing new content (think a slashdot-like situation) into their own data region. The web servers might not get notified every time a new document is added at the remote site, but an occasional directory refresh will show the new content, and that is more than enough for a lot of applications.

So, we are replicating at the block level, right? And now I have application #1 starting a new file, allocating an inode and starting to allocate blocks from the free list; simultaneously, or at around the same time, application #2 at the remote site also opens a new file, gets an inode and some free blocks - problem is, they both got the same inode. You'll note the obvious race conditions with two independent writers allocating blocks on separate and asynchronous copies of the same filesystem. I'd be interested to hear how you plan to get around this with your appliances and distributed lock manager.

Obviously you'll need some form of agreement between the sites to only use certain block ranges for new I/O so they won't all start writing to the same regions.

A nice little hand wave there - sort of two separate file systems' free lists and inode maps; perhaps one site uses even numbers and the other one sticks to odd numbers ;-)

BTW, in your original response to my post about DataCore/FalconStor you said "there are more options" and then to Bill you claimed that this was still an emerging market.
To be an emerging market, I'd have to assume that there is somebody somewhere that you think is doing something like this, if so, please share. -- Nik Simpson |
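The inode race Nik describes is easy to demonstrate in miniature: two independent, asynchronously replicated copies of "the same" filesystem each hand out the next free inode, with no shared lock manager between them. All structures here are invented for illustration.

```python
# Toy model of two sites allocating inodes on separate async copies of one
# filesystem, with no inter-site coordination at all.
class FsCopy:
    def __init__(self):
        self.used_inodes = set()
        self.files = {}

    def create(self, name):
        inode = 0
        while inode in self.used_inodes:   # naive "next free inode" scan
            inode += 1
        self.used_inodes.add(inode)
        self.files[name] = inode
        return inode

site1, site2 = FsCopy(), FsCopy()          # async copies, not yet merged
i1 = site1.create("report.doc")
i2 = site2.create("photo.jpg")
print(i1, i2)          # both sites handed out the same inode number
# When the asynchronous replication streams meet, two different files claim
# one inode -- exactly the collision a distributed lock manager exists to prevent.
print(i1 == i2)        # True
```

Even/odd partitioning of the inode space (Nik's sarcastic suggestion) would indeed avoid this particular collision, but only by reinventing a crude, static form of the lock manager it was meant to replace.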
#25
EMC to IBM SAN LUN replication
Nik Simpson wrote:
Sounds good in theory, but if the replication is asynchronous with no synchronous communication or lock manager, how does the application at the remote site know how far the application at the local site's "writing barrier" has moved at any given moment?

There is synchronous messaging between the parts of the distributed cache living at each site. The cache provides a way to transparently share the data that hasn't been replicated yet between the two sites. All writes go through the cache first, where the data stays local unless the other site needs it. Then the cache drains to the asynchronous replication, which will write it to storage at both sites.

You also have to assume that the writing is sequential, i.e. at its absolute best this can only work for purely sequential I/O streams and as such is somewhat similar to a streaming media server. As soon as any element of randomness is added to the I/O, the whole thing goes to hell in a handbasket, because neither end knows what the hell is going on at any given point without maintaining some sort of synchronous communication.

Yeah, this example for seismic data uses the knowledge of the purely sequential stream to be able to start processing data at the other site. For general access not based on any of these 'tricks', you'll either need a distributed filesystem on top of the distributed cache if you want files, or have the applications at both sites provide some locking mechanisms themselves.

Or take the example of web servers running at both sites, where at each site there could be people writing new content (think a slashdot-like situation) into their own data region. The web servers might not get notified every time a new document is added at the remote site, but an occasional directory refresh will show the new content and that is more than enough for a lot of applications.

So, we are replicating at the block level, right?
And now I have application #1 starting a new file, allocating an inode and starting to allocate blocks from the free list; simultaneously, or at around the same time, application #2 at the remote site also opens a new file, gets an inode and some free blocks - problem is, they both got the same inode. You'll note the obvious race conditions with two independent writers allocating blocks on separate and asynchronous copies of the same filesystem. I'd be interested to hear how you plan to get around this with your appliances and distributed lock manager.

I never said anything about a file system, but if you insist on using files and not blocks, you could pre-allocate a file for each site, and have each site use a database using its own file to manage your content. Both files are available at both sites, and a web server at any site can extract the index from both files by starting a new database instance on the remote file and querying against that.

BTW, in your original response to my post about DataCore/FalconStor you said "there are more options" and then to Bill you claimed that this was still an emerging market. To be an emerging market, I'd have to assume that there is somebody somewhere that you think is doing something like this, if so, please share.

I thought I already did, but perhaps it wasn't explicit enough: http://www.yottayotta.com/clusteredfilesys.html

Arne
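The write path Arne describes - writes land in the site-local cache, the cache synchronously tracks which blocks are dirty and where, and a read at the other site pulls the dirty block across the WAN before the asynchronous replicator has caught up - can be sketched as follows. Everything here (class names, the `drain` step) is invented for illustration, not any real appliance's design.

```python
# Minimal model of a distributed block cache layered over async replication:
# only the dirty-block *metadata* is kept synchronously coherent; bulk data
# moves either on demand (WAN fetch) or lazily (the async drain).
class SiteCache:
    def __init__(self, name):
        self.name = name
        self.local = {}        # block -> data held in this site's cache

class DistributedCache:
    def __init__(self):
        self.sites = {}
        self.dirty_at = {}     # block -> site holding the newest unreplicated copy
        self.storage = {}      # what asynchronous replication has flushed so far

    def add_site(self, name):
        self.sites[name] = SiteCache(name)

    def write(self, site, block, data):
        self.sites[site].local[block] = data
        self.dirty_at[block] = site        # small synchronous metadata update

    def read(self, site, block):
        owner = self.dirty_at.get(block)
        if owner and owner != site:        # stale locally: fetch over the WAN
            data = self.sites[owner].local[block]
            self.sites[site].local[block] = data
        return self.sites[site].local.get(block, self.storage.get(block))

    def drain(self):                       # the async replicator catching up
        for block, site in list(self.dirty_at.items()):
            self.storage[block] = self.sites[site].local[block]
            del self.dirty_at[block]

dc = DistributedCache()
dc.add_site("site1"); dc.add_site("site2")
dc.write("site1", 42, "fresh")
print(dc.read("site2", 42))    # "fresh" -- pulled from site1's cache, not stale disk
dc.drain()
print(dc.storage[42])          # "fresh" -- now on stable storage for both sites
```

Note what the sketch does and doesn't claim: it keeps block-level reads coherent, but it provides none of the file-level coordination (inode allocation, directory updates) that Bill and Nik argue still requires a distributed file system or lock manager on top.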
#26
EMC to IBM SAN LUN replication
Bill Todd wrote:
No: I'm an old geezer observing that every single damn thing that you've characterized as 'emerging technology' is in fact very old hat: VMS had it over two decades ago, IBM had it a decade ago in Parallel Sysplex (and to a lesser extent in HACMP on AIX), other Unixes have been developing it more recently, as well as third-parties (Mercury's SANergy being one early example, a shared-storage/central metadata server implementation supporting both Windows and Unix later bought by IBM for Tivoli which could be used to achieve most of what you've described, though not as flexibly as VMS facilities can): it's Windows that's the real laggard (Microsoft was working with DEC in the '90s to try to move in this direction, but came up far short and never rectified that: they appear to have decided that concurrently-shared-storage architectures, whether real or virtual, were not the way they wished to go).

These are all server-based solutions. Putting it inside the SAN on a storage appliance has benefits, but you don't seem to believe them. That is the emerging technology part: not so much the code running on those appliances or the ideas behind them as the new solutions for multi-site problems they offer.

[rant about your offense taken at VMS ignorance deleted]

Alright, let's get over this VMS thing and move on...

But it makes me think of mainframes and a proprietary file system, which would mean you can't just take your existing SAP, Exchange and StorNext filesystem services, and make them run on a VMS cluster, am I right?

Well, SAP of course *used* to run natively on VMS, until it decided that cHumPaq was insufficiently committed to the platform to make continuing to do so worthwhile. As for the rest, you can use the inter-site VMS cluster as a distributed CIFS file server to serve Windows clients - in the manner that you suggested using a distributed 'storage appliance'. 
But none of them would run (whether on the 'storage appliance' that you're imagining or on a VMS cluster) using multiple concurrent instances unless they were designed to do so at the application level: while the 'dumb Web server' that you originally mentioned might not require any coordination between instances, things like SAP and Exchange most definitely would.

Yup, they do need it, and in fact they already have their own coordination mechanisms. So why not use the servers these services run on natively and connect them to a SAN that provides a global cache mechanism, so they don't have to worry about moving data between the sites? What would VMS offer a service that was re-designed to run on a VMS cluster that this native solution wouldn't?

... All you've done is substitute this mythical 'storage appliance' for the server hardware and OS.

Well, a lot of people would rather buy some appliance boxes than re-design their software to run on a new platform. If you can't see the benefit of that, I don't know what to say anymore.

... I think you know the answer as well as I do :

I suspect somewhat better, since I've actually designed and implemented systems of this ilk a couple of times rather than just bloviated about them.

Oh, now there's the grumpy old geezer again :-)

... Now, if you're using 'distributed cache' (something I tend to use in the context of allowing one system to benefit from the data in another's cache rather than having to go to disk for it) to mean something much more like 'distributed locking mechanism' (which tracks potential synchronization issues such that they can be properly addressed should they occur), then we're just using different terminology to describe the same thing.

A distributed cache is a collection of caches on all the storage appliances at all the sites. 
Perhaps I misunderstand your 'locking'; the pieces of cache on all the appliances use synchronous messages to keep coherency among them, and will send a chunk of dirty data at site 1 to site 2 if a host at site 2 asks for it. Is that locking a cache region? Yes, something is keeping track of which appliance owns dirty data for a given block range.

Of course you have to interrogate the cache (or other inter-site synchronization facility) on every I/O - the idea is that you only have to interrogate the *site-local* portion of it, because it's being kept synchronously up to date about any temporary inconsistencies.

Right, all I/O goes through the distributed cache. If it knows a remote appliance has an entry for a block you're doing I/O to, it will go get it for you. If it knows there is no dirty data for a block and no local read cache either, it will read it from local storage for you.

I find it curious you suspect I don't know anything about the subject.

Why would you find it curious that your earlier incompetent observations about it would lead someone to that conclusion?

Incompetent how? By stating that not just any app on any platform can run on VMS? By failing to convince you I'm not an idiot?

You do seem willing to learn, though - but you still appear a bit confused about the level of inter-site coordination required by concurrent distributed file-level access.

Thank you, oh storage guru, I shall try to be worthy of your time. These posts feel a lot like the wax-on/wax-off routine :-)

Seriously though, my main point has been that there are two levels of inter-site coordination: one at the app level and one at the storage level. ... But Oracle for example can use raw block devices and does its own locking and synchronisation (with RAC), so there's not always a need to provide one as part of the storage system. 
If the storage system firmware (rather than the Oracle software) is handling the inter-site block-level replication (as you seem to be suggesting), then either that firmware needs to implement at least short-term inter-site interlocks to ensure that regardless of which site Oracle elects to obtain block-level data from the copy obtained is up to date, or the Oracle code must ensure that until a write-complete ACK has been received from the local storage hardware no remote access to that block can occur (and the local hardware must not return completion status until all copies have been updated).

Indeed. Until the storage says "yes, I've accepted your I/O" (into distributed cache in this case), Oracle must not allow other I/O to the same block, or else there are no guarantees about what the final data in there will be. But you see, the storage appliances don't lock anything; it's Oracle that has to implement it this way.

... As a specific example of a potential problem *above* the hardware level, if your hypothetical service migrates at any speed exceeding that of sneakernet, dirty data that it just wrote to the OS file system cache on one site may not have been flushed to the underlying (distributed) storage media by the time the service pops up on the other site (delays of up to 30 seconds are common in Unix environments, for example).

Yes, host caches are bad; if you want file systems you need a distributed file system like CXFS, StorNext, PolyServe, ... They handle host cache issues.

There is no data locking required, the service can simply assume its data is always there and it can always access it. And like I said, most applications requiring locks already provide them themselves.

Horse****: most applications requiring locks don't need to think about them at all, because the underlying file system is doing all that transparently for them. 
Just *getting to* the data involves following a multi-link path through the file system directory structure, an operation that can't occur reliably on a 'secondary' site without some degree of inter-site update coordination (for God's sake, if you're using an optimized journaling file system the secondary site may have the pertinent *log* entries but won't have the associated in-memory update context: you can't use the file system there in anything resembling up-to-date fashion without first performing a recovery from the on-disk log).

You are thinking about file systems and I was not. At the block level, the most recent data is always present from all sites. No data locking required, at least not at the block level. Apps need to implement locking themselves.

.... Creating a snapshot and making it available on the other site requires the migration service to integrate with the storage at both sites, which is not just a trivial matter!

*None* of this is as trivial a matter as you imagine it to be, as I hope you're starting to learn (if you respond again, we should get a pretty good idea of just how educable you are).

Well, I hope I'm not disappointing! Did you think you scared me into hiding with your biting sarcasm :-) It is that trivial in this particular example of migrating agents. The migration service only needs to ensure an agent is halted at site 1 before getting started at site 2 (and of course save any of the agent's state it needs to persist to resume service), and doesn't need to bother with getting the data there.

That all depends on your definition of rare, and the latencies you are talking about. If you have say 2000 km worth of latency, synchronous is not an option even for very modest updates. 
If update latency makes synchronous replication prohibitively slow (VMS-style distributed lock management can allow the *reads* to proceed at all sites without delay, as long as no nearby updates are occurring at the time), then your only real option is to use ordered asynchronous replication plus snapshots upon which any required recovery operations are then performed before use (if you still don't understand why, reread the existing material until the light dawns).

Are we talking about recovery now? Yes, you need write order coherency in your asynchronous replication, and your app still needs to be able to recover from partial but ordered loss of data.

Arne |
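One point the exchange above actually settles: until the storage (here, the distributed cache) has acknowledged a write, the application must not issue another I/O to the same block, and the discipline lives in the application, not the appliance. A minimal sketch of that app-level gate; all names are invented for illustration, and this is in no way Oracle's actual mechanism:

```python
class BlockWriteGate:
    """App-level serialization: no second write to a block until the
    storage layer has acknowledged the first. The storage itself locks
    nothing; the ordering guarantee comes entirely from the application."""

    def __init__(self):
        self.awaiting_ack = set()   # blocks with an unacknowledged write

    def begin_write(self, block):
        if block in self.awaiting_ack:
            raise RuntimeError("write to block %d still unacknowledged" % block)
        self.awaiting_ack.add(block)

    def storage_ack(self, block):
        # Called when the storage reports the I/O as accepted.
        self.awaiting_ack.discard(block)

gate = BlockWriteGate()
gate.begin_write(5)             # first write to block 5 is in flight
try:
    gate.begin_write(5)         # overlapping write to the same block
    overlapped = True
except RuntimeError:
    overlapped = False          # correctly rejected until the ACK arrives
gate.storage_ack(5)
gate.begin_write(5)             # allowed again after the ACK
```

A real application would block and retry instead of raising, but the invariant is the same: at most one unacknowledged write per block.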
#27
EMC to IBM SAN LUN replication
Arne Joris wrote:
Nik Simpson wrote:

Sounds good in theory, but if the replication is asynchronous with no synchronous communication or lock manager, how does the application at the remote site know how far the application at the local site's "writing barrier" has moved at any given moment?

There is synchronous messaging between the parts of the distributed cache living at each site.

So the cache is synchronously mirrored between the two sites? If so, I don't see how this is asynchronous in any conventional sense of the word, and it will have the performance problems of any synchronous system when the site-to-site latency is significant. Also, your example of having a large sequential data set being written at the source site and a reader at the remote site can be handled so much more easily with a TCP/IP socket that I'm not sure what you think this approach would buy given its undoubted cost and complexity.

The cache is providing a way to transparently share the data that hasn't been replicated yet between the two sites.

But in order for it to be seen at the remote site, there must be a copy of the changed data at the remote site, so how does it allow for sharing of "data that hasn't been replicated yet between the two sites"?

I thought I already did but perhaps it wasn't explicit enough : http://www.yottayotta.com/clusteredfilesys.html

No, you never mentioned Yotta in any post that I can recall. So now we've got one clustered file system, from a vendor with a history of not delivering anything other than marketing materials and of implementing new (and apparently efficient) ways of blowing through VC money at high speed. This is actually a completely new incarnation of Yotta, who managed to spend the best part of $100M on a high-end array that never shipped; apparently they're selling a different variety of snake oil now. Based on what I can tell from the website, what they have appears to be a distributed synchronous storage appliance, i.e. 
blocks are replicated synchronously, with users accessing a local filesystem. If so, that's really not that special (as has been pointed out, it's nothing new). Problems will be latency for writes as the various site lock-managers negotiate for access, which will be in addition to the "write" being mirrored to each cache. -- Nik Simpson |
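The sequential seismic-stream case the two posters keep returning to can be sketched as a writer that advances a barrier and a remote reader allowed to touch any block strictly behind it; the barrier position is the one piece of state that would have to be communicated synchronously. All names here are illustrative, not from any product in the thread:

```python
class SequentialStream:
    """Writer at site 1 appends and advances a barrier; reader at site 2
    may read any block strictly behind the barrier without further
    coordination. Only the barrier index itself needs synchronous messaging."""

    def __init__(self):
        self.blocks = []
        self.barrier = 0              # index of the first uncommitted block

    def append(self, data):           # writer side
        self.blocks.append(data)
        self.barrier += 1             # advance only after the block commits

    def read(self, i):                # reader side
        if i >= self.barrier:
            raise BlockingIOError("block %d is past the writer's barrier" % i)
        return self.blocks[i]

stream = SequentialStream()
stream.append(b"trace-0")
stream.append(b"trace-1")
# read(0) and read(1) succeed; read(2) would cross the barrier and fail.
```

This also shows why the scheme collapses under random I/O: with non-sequential writes there is no single barrier index that cheaply summarizes what is safe to read.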
#28
EMC to IBM SAN LUN replication
Arne Joris wrote:
Bill Todd wrote: No: I'm an old geezer observing that every single damn thing that you've characterized as 'emerging technology' is in fact very old hat: VMS had it over two decades ago, IBM had it a decade ago in Parallel Sysplex (and to a lesser extent in HACMP on AIX), other Unixes have been developing it more recently, as well as third-parties (Mercury's SANergy being one early example, a shared-storage/central metadata server implementation supporting both Windows and Unix later bought by IBM for Tivoli which could be used to achieve most of what you've described, though not as flexibly as VMS facilities can): it's Windows that's the real laggard (Microsoft was working with DEC in the '90s to try to move in this direction, but came up far short and never rectified that: they appear to have decided that concurrently-shared-storage architectures, whether real or virtual, were not the way they wished to go). These are all server-based solutions. Putting it inside the SAN on a storage appliance has benefits, but you don't seem to believe them. Rather, trying to 'put it inside the SAN on a storage appliance' has severe limitations, but you don't seem to understand them. That is the emerging technology part, No, it's the bull**** part. Try to follow along this time: I won't bother attempting to educate you again. .... Yup they do need it and in fact they already have their own coordination mechanisms. So why not use the servers these services run on natively and connect them to a SAN that provides a global cache mechanism so they don't have to worry about moving data between the sites ? Because when you limit yourself to raw block-level access (as you claim is your intent later on in your post) 'moving data between the sites' is the *easy* part, and these coordinating instances of a block-level-access application are in a better position to do so intelligently than some generic (and somewhat mis-labeled, as should become obvious) 'caching' mechanism in the hardware. 
If any real intelligence is required due to significant geographical site separation, that is. If not, then the application just synchronously mirrors between sites - even easier. If you posit a shared-storage file system to allow your applications transparent file-level access, then the observations above about applications apply equally to the file system's internal operation (and since you already needed special shared-storage file system facilities, having the file system handle the replication in software is far less expensive than using custom hardware). What would VMS offer a service that was re-desgined to run on a VMS cluster that this native solution wouldn't ? The point, which you seem to have forgotten, is exactly the opposite: what would this so-called 'emerging technology' offer that VMS didn't offer two decades ago - at the block level as well as at the file level (I think that Oracle Parallel Server used the former)? The only remotely novel feature appears to be the lazy replication with synchronous cache-coherence (itself of somewhat debatable merit, unless you're replicating remotely solely for remote access performance rather than for guaranteed availability: letting the application use data that has not yet been replicated entails some danger if the only copy is then lost) - a rather narrow market niche upon which to base a product, and (as should also become evident) definitely not the best way to handle bandwidth-constrained long links (they have to be bandwidth-constrained, since if they were only latency-constrained then the synchronous communication required just by the cache-coherence mechanisms that you describe wouldn't be feasible). ... All you've done is substitute this mythical 'storage appliance' for the server hardware and OS. Well a lot of people would rather buy some appliance boxes than re-design their software to run on a new platform. 
You really do seem intent on forgetting that you were not talking about anything platform-specific in your initial drivel but rather about 'emerging technology'. So I'll remind you once again of why it made you appear incompetent (and also suggest that the longer you continue to try to bluster your way out of it, the more incompetent you appear). If you can't see the benefit of that, I don't know what to say anymore. You haven't known what to say all along, but that hasn't kept you from saying it. .. I think you know the answer as well as I do : I suspect somewhat better, since I've actually designed and implemented systems of this ilk a couple of times rather than just bloviated about them. Oh now there's the grumpy old geezer again :-) Nah - I'm just not very tolerant of people who are as unaware of the limits of their knowledge as you are, yet insist on arguing about things they don't understand rather than concentrating on remedying that deficiency. And this is in no way age-related: I've lacked tolerance for incompetent and ineducable blowhards since I was young. Now, 'incompetent and ineducable' are characterizations that are indeed relative to the subject under discussion: if you were not bloviating at such a detailed level, they might be less applicable. Unfortunately, you've chosen to try to argue about details that you're just not equipped to address (or, apparently, even understand - though you've now got another shot at that here). ... Now, if you're using 'distributed cache' (something I tend to use in the context of allowing one system to benefit from the data in another's cache rather than having to go to disk for it) to mean something much more like 'distributed locking mechanism' (which tracks potential synchronization issues such that they can be properly addressed should they occur), then we're just using different terminology to describe the same thing. A distributed cache is a collection of caches on all the storage appliance at all the sites. 
'Fraid not: that's just a dumb bag of unconnected caches. Perhaps I misunderstand your 'locking'; Seems likely. the pieces of cache on all the appliances use sychronous messages to keep coherency among them, That's a cache-coherency mechanism, rather than a 'distributed cache' per se. VMS uses its distributed lock manager to (among many other things) create such a distributed coherency mechanism for its distributed storage. and will send a chunk of dirty data at site 1 to site 2 if a host at site 2 asks for it. Now, *that's* starting to resemble a real distributed cache, rather than just a bag o' caches connected by a coherence mechanism (e.g., of the 'invalidate' variety that passes updates only through the underlying storage layer). Of course, what you've described isn't a very broadly-useful cache, but just a means of supporting lazy inter-site replication (more on that later). .... my main point has been that there are two levels of inter-site coordination : one at the app level and one at the storage level. Well, if you limit the discussion to apps that don't use a file system, I suppose. But as I already noted above, not only is that a rather narrow market niche upon which to base a product, but the amount of value that storage-level cache-coherence adds is at best debatable (in fact, there's a specific example just coming up below). ... But Oracle for example can use raw block devices and does it's own locking and synchronisation (with RAC), so there's not always a need to provide one as part of the storage system. 
If the storage system firmware (rather than the Oracle software) is handling the inter-site block-level replication (as you seem to be suggesting), then either that firmware needs to implement at least short-term inter-site interlocks to ensure that regardless of which site Oracle elects to obtain block-level data from the copy obtained is up to date, or the Oracle code must ensure that until a write-complete ACK has been received from the local storage hardware no remote access to that block can occur (and the local hardware must not return completion status until all copies have been updated). Indeed. Until the storage says "yes I've accepted your I/O" (into distributed cache in this case), Oracle must not allow other I/O to the same block, or else there are no guarantees about what will be the final data in there. But you see, the storage appliances don't lock anything, it's Oracle that has to implement it this way. And (along the lines that I mentioned above) it's trivial (and likely necessary, for other reasons) for Oracle to handle this itself (since it already has to coordinate synchronously between instances about any data that may be in the process of being updated), rather than depend upon some specialized, proprietary storage-level inter-site caching mechanism (Oracle has always preferred to provide such facilities itself, precisely so that they will be available in *all* the environments it runs on). In fact, if Oracle handles it itself it need not even wait for the update to occur at all (locally or remotely), but can just send the up-to-date in-memory copy to the other site to use there. And that, perhaps more than anything else, illustrates why trying to shove inter-site cache coherence (you're actually talking more about underlying data coherence than 'cache coherence', since any caching is strictly short-term to cover things until the replication has completed) down into the storage layer is ill-conceived and decidedly sub-optimal. 
Oracle (using block-level access) and distributed file systems (using block-level access underneath to support distributed file-level access above) *know what they can afford to cache internally rather than force immediately (even if lazily) to disk*. They can coordinate use of such non-persistent data between sites, while the underlying storage (including your proposed storage appliance 'caching' mechanism) *can't even see it yet*. When they use transaction logs, they can capture small logical updates synchronously, propagate these small log updates to remote sites synchronously (with negligibly greater latency than it takes just to propagate the information that the log has been persistently updated - if log persistence was required for that log record), and propagate the larger related block updates lazily (since as long as the log information has been made persistent, there's no rush about making the related block updates persistent - they can always be redone from the log data if necessary; for that matter, if inter-site bandwidth is a major constraint, the remote block content can be reconstructed from the log data there rather than have to be sent at all). Implementations that aren't journaled can also send small updates (or other coordinating information) synchronously with approximately the same cost as the synchronous propagation of cache-coherence data that you're advocating - and with far greater control over the inter-site sharing semantics than a dumb lower-level persistent data-coherence protocol allows. If you want to optimize lazy replication to a distant site where concurrent access to the data is permitted, *that's* the way to go about it. 
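Bill's log-shipping alternative can be sketched as follows: ship the small logical log record synchronously and let the remote site reconstruct the full block by replaying the record against its own base copy, so the bulk block data never crosses the bandwidth-constrained link. The record format here is invented purely for illustration:

```python
def replay(block, record):
    # A logical update: (offset, payload) overwrites a byte range in a block.
    offset, payload = record
    return block[:offset] + payload + block[offset + len(payload):]

BLOCK_SIZE = 4096
base = bytes(BLOCK_SIZE)             # identical base block at both sites

# The tiny log record is what gets shipped synchronously; the 4 KiB
# block update it implies is never sent over the WAN at all.
record = (100, b"updated row")

primary = replay(base, record)       # site 1 applies the update locally
remote = replay(base, record)        # site 2 replays the same record
```

The cost of the synchronous step is proportional to the log record (a dozen bytes here), not to the block, which is the whole argument for doing this above the storage layer rather than below it.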
And with reasonably fine-grained distributed locking it supports your example of reading gargantuan amounts of seismic data behind the writer as well - even if it's done through a file system instead of with raw blocks: while the writer (and the lazy remote updating facility) will temporarily lock the end of the file while appending to it, the rest of the file (and the path to it) can be accessed directly at the remote site as long as it's not also being actively changed at the originating site. The bottom line is that there's no free lunch: if you want remote-site access to its local copies of data at other than the snapshot level, you need synchronous inter-site coordination of some kind for anything save read-only access at *all* sites. You can use fine-grained revocable distributed locking (coherence) mechanisms to minimize the need to check with other sites on accesses (synchronous VMS clusters have been successfully used at separations of up to 500 miles that I know of, and that's certainly not a hard limit - in particular, special-case situations such as your seismic data example might well tolerate much larger separations), and where bulk access to data is concerned you can (in the absence of updating conflict) gain access to a large amount of it with only a single inter-site coherence check even without distributed coherence locks (i.e., the latency of the inter-site permission check can be small compared to the local bulk-transfer time); in either case, the facilities are in most cases better implemented in the inter-site software coordination layer than at a lower level - only when they have *not* been flexibly implemented at that higher level would less comprehensive lower-level facilities have any value. Which is what YottaYotta (which you mentioned in your original post) provides. 
Last I knew, they weren't doing all that well, which suggests that their product (while it may admirably do what it says on the tin) may be as limited in *general* applicability as I suggested above. I do recall that they are (or at least began as) a Canadian company, and you seem to have a Canadian email address - but wouldn't want to jump to any conclusion based on that... ... As a specific example of a potential problem *above* the hardware level, if your hypothetical service migrates at any speed exceeding that of sneakernet, dirty data that it just wrote to the OS file system cache on one site may not have been flushed to the underlying (distributed) storage media by the time the service pops up on the other site (delays of up to 30 seconds are common in Unix environments, for example). Yes host caches are bad, if you want file systems you need a distributed file system like CXFS, StorNext, PloyServe, ... They handle host cache issues. And while doing so can handle the kinds of issues you'd like to stick down in the hardware better (and less expensively) than they can be handled there, as described above. There is no data locking required, the service can simply assume it's data is always there and it can always access it. And like I said, most applications requiring locks already provide them themselves. Horse****: most applications requiring locks don't need to think about them at all, because the underlying file system is doing all that transparently for them. 
Just *getting to* the data involves following a multi-link path through the file system directory structure, an operation that can't occur reliably on a 'secondary' site without some degree of inter-site update coordination (for God's sake, if you're using an optimized journaling file system the secondary site may have the pertinent *log* entries but won't have the associated in-memory update context: you can't use the file system there in anything resembling up-to-date fashion without first performing a recovery from the on-disk log). You are thinking about file systems and I was not. Then what, exactly, were you referring to when you said "If a storage appliance does all this, you can connect the WAN link and the SAN at each site to the storage appliances and every service can run on it's own server with whatever OS it prefers, connecting to the local SAN, using whatever block access it prefers (raw block device, file system,...)" It certainly *sounded* as if you thought that your magical inter-site block replication mechanism would allow use of any OS (and any related file system - see your last words above) to access data on both ends of the inter-site link. And, of course, that's dead wrong. Or when Nick said "you need something in the replication layer that understands synchronization issues at the file system" and you responded "If you assume every data block in every file could be being updated at any moment, you are right. But take for example seismic data files, which get produced as an hours-long sequential dump of data. Why should a processing app at site 2 need to wait until the entire file had been written at site 1 ? It could simply start reading at block 0, and perhaps even do some modifications to the data, as long as it hadn't crossed site 1's writing barrier." 
and "The web servers might not get notified every time a new document is added at the remote site, but an occasional directory refresh will show the new content and that is more than enough for a lot of applications." That certainly *sounded* as if you a) were talking about file access and b) didn't have a clue about other file system synchronization dependencies (e.g., as in the mere path *to* the file that I mentioned above) that were a problem. In any event, if you are not *now* talking about file systems, then (as I'm getting tired of observing) you're talking about a rather specialized product niche - and still one where whatever the application is doing internally to make up for the lack of file-level access facilities may be in a much better position to manage inter-site replication (as well as the other inter-site issues it must be aware of) than your proposed box would be. .... Creating a snapshot and making it available on the other site requires the migration service to integrate with the storage at both sites, which is not just a trivial matter ! *None* of this is as trivial a matter as you imagine it to be, as I hope you're starting to learn (if you respond again, we should get a pretty good idea of just how educable you are). Well I hope I'm not disappointing ! Since I've encountered similar people frequently over the years, I'm neither disappointed nor surprised (though I'm always somewhat hopeful - guess I'm just an optimist at heart). But you now have another opportunity to mitigate the impression that you've built up. .... That all depends on your definition of rare, and the latencies you are talking about. If you have say 2000 km worth of latency, synchronous is not an option even for very modest updates. 
If update latency makes synchronous replication prohibitively slow (VMS-style distributed lock management can allow the *reads* to proceed at all sites without delay, as long as no nearby updates are occurring at the time), then your only real option is to use ordered asynchronous replication plus snapshots upon which any required recovery operations are then performed before use (if you still don't understand why, reread the existing material until the light dawns).

Are we talking about recovery now ?

Perhaps you're not very familiar with how storage-level crash-consistent snapshots work either (when you're using a *planned* snapshot you have the option, if the higher layers support it, of orchestrating things such that recovery is not required to use said snapshot, but if you want to be able to take ad hoc remote snapshots at any point in time - since you seemed to be saying that you didn't want to have to involve the application in any explicit coordination - they'll only be crash-consistent).

Yes you need write order coherency in your asynchronous replication, and your app still needs to be able to recover from partial but ordered loss of data.

As does any underlying file system: that's what 'recovery' above referred to.

- bill |
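The write-order coherency both posters end on can be sketched as a replicator that applies updates at the remote site strictly in original order: the remote copy may lag arbitrarily, but at every instant it equals some past state of the primary, which is exactly what makes crash-style recovery possible there. Illustrative names only; this is a toy, not any vendor's replication engine:

```python
from collections import deque

class OrderedReplicator:
    def __init__(self):
        self.queue = deque()          # in-flight updates, in write order
        self.remote = {}              # the remote site's copy

    def write(self, block, data):
        self.queue.append((block, data))   # capture order at the primary

    def drain(self, n):
        # Apply at most n updates remotely, never reordering them: after
        # any partial drain the remote copy matches some earlier primary
        # state, so it is always crash-consistent.
        for _ in range(min(n, len(self.queue))):
            block, data = self.queue.popleft()
            self.remote[block] = data

rep = OrderedReplicator()
rep.write(1, "a")
rep.write(2, "b")
rep.write(1, "c")
rep.drain(2)     # partial drain: remote lags, but in a consistent state
```

If `drain` applied the queue out of order, the remote copy could show the effect of the third write without the first, a state the primary never passed through and one no recovery procedure is prepared for.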
#29
EMC to IBM SAN LUN replication
Nik Simpson wrote:
So the cache is synchronously mirrored between the two sites? If so, I don't see how this is asynchronous in any conventional sense of the word and will have the performance problems of any synchronous system when the site-to-site latency is significant.

The cache is not mirrored; dirty data is only present in a single appliance. The messages between the cache components at both sites are synchronous. When an appliance receives an I/O request for a given block, the distributed cache uses synchronous messages to find out if there is cached data on any appliance, be it at the same site or a remote site.

Also, your example of having a large sequential data set being written at the source site and a reader at the remote site can be handled so much more easily with a TCP/IP socket that I'm not sure what you think this approach would buy given its undoubted cost and complexity.

Both sites need the file at the end of the day: site 1 produces it, site 2 processes it and modifies it, site 1 might then be required to visualize the data, etc. If you use a TCP socket to push data back and forth over a WAN, your apps will need to be in lockstep with each other (with perhaps a buffer at each end), which seems a whole lot harder to implement than having both apps simply write and read a set of blocks. You could take the same apps that were used to produce, process and visualize the data in a single site using their local SAN, and move them out to different sites without having to modify them.

The cache is providing a way to transparently share the data that hasn't been replicated yet between the two sites.

But in order for it to be seen at the remote site, there must be a copy of the changed data at the remote site, so how does it allow for sharing of "data that hasn't been replicated yet between the two sites"?

No, you don't need to copy all the cache to the other site; all you need is a cache coherency protocol that can fetch dirty data at the remote site on demand. 
And keep in mind that only the dirty cached data needs to be fetched remotely; any clean remote caches can be ignored, and you can go straight to local disk to get the data. The distributed cache can hide all of these details from the initiators.

> Based on what I can tell from the website, what they have appears to be a
> distributed synchronous storage appliance, i.e. blocks are replicated
> synchronously, with users accessing a local filesystem. If so, that's
> really not that special (as has been pointed out, it's nothing new).

Not according to this article:
http://www.byteandswitch.com/document.asp?doc_id=94013

Arne
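To make the read path concrete, here is a minimal sketch of the kind of directory-based coherency lookup being described: a local cache hit costs nothing, a dirty block held by the remote site is fetched on demand, and anything else is served from the local replica. All names (`Directory`, `Site`, `dirty_owner`, etc.) are hypothetical illustrations, not taken from any vendor's product, and a real protocol would of course also invalidate stale cached copies on subsequent writes.

```python
class Directory:
    """Tracks, per block, which site (if any) currently holds a dirty copy.

    In a real appliance this lookup would be a synchronous message to
    whichever site hosts the directory for this block range.
    """
    def __init__(self):
        self.dirty_owner = {}            # block_id -> site_id holding dirty data

    def owner_of(self, block_id):
        return self.dirty_owner.get(block_id)   # None => no dirty copy anywhere


class Site:
    def __init__(self, site_id, directory, disk):
        self.site_id = site_id
        self.directory = directory
        self.disk = disk                 # local replica (async replication target)
        self.cache = {}                  # blocks cached in this appliance

    def write(self, block_id, data):
        # Dirty data lives only in the writing appliance's cache;
        # nothing is synchronously mirrored to the other site.
        self.cache[block_id] = data
        self.directory.dirty_owner[block_id] = self.site_id

    def read(self, block_id, peers):
        # 1. Local cache hit: no remote traffic at all.
        if block_id in self.cache:
            return self.cache[block_id]
        # 2. Synchronous directory lookup: does some other site hold dirty data?
        owner = self.directory.owner_of(block_id)
        if owner is not None and owner != self.site_id:
            data = peers[owner].cache[block_id]   # fetch the dirty block on demand
            self.cache[block_id] = data           # keep a clean local copy
            return data
        # 3. No dirty copy anywhere: the local disk is current, read locally.
        return self.disk.get(block_id)


# Site 1 writes a block; site 2 reads it before async replication catches up.
directory = Directory()
site1 = Site(1, directory, disk={})
site2 = Site(2, directory, disk={10: b"old"})
peers = {1: site1, 2: site2}
site1.write(10, b"new")
print(site2.read(10, peers))   # dirty block fetched from site 1, not local disk
```

Note that only the one dirty block crosses the WAN; once replication has caught up and the dirty entry is cleared, site 2's reads never leave the local SAN.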
#30
EMC to IBM SAN LUN replication
Good grief Bill, are you one of those people who type more as they grow angrier? Thank you for all your replies, but I'm starting to get a bit ticked off by your rather abrasive style.

If we can get back to what started this whole discussion, I stated that if your secondary site is not just a remote data vault but an actual production site where people need access to the data, it is pretty lame to have servers at the secondary site go over the WAN to read the data from the primary storage when they have a copy of the data right there! With a distributed block cache on top of asynchronous data replication, you could have both sites do I/O to the same volumes and access their local storage.

I never heard you make any successful argument against this; your only beef with me is your claim that it would be stupid to use a distributed block cache appliance (yes, I know you don't like that wording) to do this, seeing how the only useful way for apps at the two sites to use the data requires synchronous messaging between them anyway. Then there was a whole lot of talk about VMS; I'm not sure why that was relevant.

Despite increasing levels of toxicity (unilaterally from your part, I might add), I have kept trying to argue that for certain solutions the appliance approach makes sense. I think you have been thinking about clustered applications with tight coherency requirements and can't, or won't, see beyond them. I'm fine with the fact that you think I deserve a newsgroup lashing for arguing for something you are convinced is impossible or impractical, and I'll ignore your claims about stupidity needing to be punished, as I'm sure you have your own personal reasons for those.

Keep on VMSing Bill!

Arne