#1
SAN filesystem uses local storage for reads with synchronous replication
"George Orwell" wrote in message ... I have servers and storage at two sites, with an FC bridge over an ip WAN in between them to create a SAN. Now I would like to run a distributed filesystem on servers at both sites, accessing the same storage. For redundancy reasons, I would also like to replicate all the storage so both sites have a copy of the data at all times. If I chose synchronous replication, I would expect each site to be able to use the local copy for reads, giving me local read speeds. Does anyone know of a filesystem/SAN appliance combination that can give me this ? The short answer is, I don't. Most storage-level 'synchronous replication' mechanisms were not set up to allow any (coordinated) access at all to the 'remote' replica, because they were created back when storage-level replication was usually done with a 'passive' remote copy meant only to take over if the primary copy became inaccessible. Some distributed systems (VMS's comes to mind) will cheerfully optimize reads to target the nearest replica if you make them aware that this is desired. But that facility is part of VMS's system-software-controlled mirroring facilities rather than of an underlying hardware array doing its own remote mirroring (though its HSx controllers may offer similar facilities nowadays). Finally, even if you find some hardware set up to allow concurrent read activity at both replicas, unless you are using a distributed file system that optimizes its coordination (locking) mechanisms across both sites so that at least when possible locks are managed at the end that's actually using the associated data (again, VMS comes to mind here), you still may have inter-site lock traffic slowing things down even if the actual read accesses are performed locally (though that will only affect latency, rather than create the serious inter-site bandwidth demands that reading only at the 'primary' site might). Take a look at Lust it created a clone of the VMS distributed lock manager, and might (or might not) offer what you're looking for (though in a distributed-file-server-style environment rather than a conventional SAN). AIX's GPFS might too (IBM cloned the VMS lock manager about 10 years ago, primarily to support Oracle Parallel Server if I understand correctly, and may later have used it with GPFS in a manner you might find useful). But last I knew (and things might have changed since), most SAN file systems thought they were doing pretty well just coordinating shared access to storage from multiple hosts, without getting into finer points such as optimizing disaster-tolerant configurations where both sites were active. So if you find out differently, please let us know. - bill |
#2
"Tarapia Tapioco" wrote in message x.it... .... It depends on the type of traffic; for large files (hundreds of megabytes) the locking isn't a big issue if it's done at the file level. Indeed - if you're not doing concurrent distributed writes to the same file. But I'm interested in the locking mechanism you descibed; is this a truly distributed meta data mechanism ? Well, it's certainly distributed *management* of concurrently-shared metadata, rather than the more common central-metadata-server approach. The metadata itself may be distributed across sites, or centralized. I have heard that the overhead to distribute meta data operations isn't worth it unless you intend to do I/O to many small files (so the ratio of meta data operations to I/O is high). I'm not sure how that makes sense: the less metadata is used, the less impact it will have on overall activity regardless of how it may be managed. The main reason metadata management is distributed in VMS is likely to make the distributed file system resilient to single-node failures and peer-to-peer rather than client/server in nature, but in practice the overheads do not seem significant. Does VMS use a directory server to show which server "owns" a certain range of files or directories ? If the accesses to a file are dominated by a single system, all lock management for that file migrates automatically to that system (so the relatively infrequent accesses by other systems take a minor hit by having to send lock activity there, but most activity executes at local-host speeds). A new accessor queries one of the cluster's lock directory servers (nothing special about them - they just avoid the need for every system in the cluster to maintain high-level lock directory information) to find out where a given file's locks are 'mastered'. Take a look at Lust it created a clone of the VMS distributed lock manager, and might (or might not) offer what you're looking for (though in a distributed-file-server-style environment rather than a conventional SAN). AIX's GPFS might too (IBM cloned the VMS lock manager about 10 years ago, primarily to support Oracle Parallel Server if I understand correctly, and may later have used it with GPFS in a manner you might find useful). I am looking at CXFS (SGI's DFS). It doesn't do distributed locking, but that is not a big issue for me. I think all I need is a replicating appliance that allows I/O to the replica; if I zone my sites so each can only see the local copy, CXFS has no choice but to use the local one. As long as absolutely no writes are going to disk, that should work fine. As soon as any write activity occurs (even background writes such as updating 'last access times', where no application-level writes exist), the local file system's cached data (which it believes it can cache safely because it 'owns' that file system exclusively) will start to get stale, and, worse, may then be written back out to disk stale. The safest way to ensure that no writes are occurring is to write-lock the disks. And you'll then soon find out whether the file system is OK with that (early NTFS, for example, was not, IIRC). Unless I'm misunderstanding your intent, and the instances of CXFS on the various systems *will* in fact be aware of each other (you said no distributed locking, though, so unless CXFS operates as a client/server architecture - at least as far as locking goes, and my vague recollection now is that it may - I'm not sure how they could coordinate with each other). 
Interesting thought about making different (local) mirror-disks appear to be the same to all instances and letting a transparent disk-level replicator keep them in synch: as long as CXFS does effectively lock anything being updated such that no one else can look at it until the update completes (at *all* copies), it might work - but you'd still lose a site if its local mirror copy failed unless some provision for automatic revectoring to a surviving remote disk existed. - bill |
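[For illustration, a loose sketch of the lock-mastery scheme described above, modeled on the VMS distributed lock manager as characterized in this thread. All names and the migration threshold are hypothetical, not an actual VMS, Lustre, or GPFS interface.]

    # Sketch: lock directory lookup plus mastery migration to the
    # dominant accessor. Hypothetical names and threshold throughout.

    class LockDirectory:
        """Maps a resource (file) name to the node currently mastering
        its locks, so no node has to track every resource in the cluster."""
        def __init__(self):
            self.master_of = {}

        def lookup_or_assign(self, resource, requester):
            # First accessor becomes the master; later accessors are
            # directed to it (one directory round trip per file, not per lock).
            return self.master_of.setdefault(resource, requester)

    class Node:
        def __init__(self, name, directory):
            self.name = name
            self.directory = directory
            self.remote_requests = {}   # resource -> count of off-node lock ops

        def lock(self, resource):
            master = self.directory.lookup_or_assign(resource, self.name)
            if master == self.name:
                return "local"          # lock managed at local-host speed
            # Remote lock traffic; if this node dominates, remaster here.
            n = self.remote_requests.get(resource, 0) + 1
            self.remote_requests[resource] = n
            if n > 10:                  # arbitrary threshold for the sketch
                self.directory.master_of[resource] = self.name
                return "migrated"
            return "remote"

    directory = LockDirectory()
    site1 = Node("site1", directory)
    site2 = Node("site2", directory)

    site1.lock("/data/run42")            # site1 becomes the lock master
    for _ in range(12):                  # site2 becomes the dominant accessor,
        result = site2.lock("/data/run42")
    print(result)                        # so mastery migrates: prints 'local'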
#3
Just use a pair of NetApp Filers (one at each site) running SnapMirror - very easy. Since NetApp's file system is not Unix or Windows, you are less susceptible to virus propagation. Most synchronous replication software replicates viruses in real time - this erodes both the data and the OS.

"George Orwell" wrote in message ...

> I have servers and storage at two sites, with an FC bridge over an IP WAN in between them to create a SAN. Now I would like to run a distributed filesystem on servers at both sites, accessing the same storage. For redundancy reasons, I would also like to replicate all the storage so both sites have a copy of the data at all times. If I choose synchronous replication, I would expect each site to be able to use the local copy for reads, giving me local read speeds. Does anyone know of a filesystem/SAN appliance combination that can give me this?
>
> Arne Joris
#4
Bill Todd wrote:
>> It depends on the type of traffic; for large files (hundreds of megabytes) the locking isn't a big issue if it's done at the file level.
>
> Indeed - if you're not doing concurrent distributed writes to the same file.

Yeah, you are right, concurrent writes will hammer the locks. But my traffic consists of a single writer and multiple readers.

>> I have heard that the overhead to distribute metadata operations isn't worth it unless you intend to do I/O to many small files (so the ratio of metadata operations to I/O is high).
>
> I'm not sure how that makes sense: the less metadata is used, the less impact it will have on overall activity, regardless of how it may be managed. The main reason metadata management is distributed in VMS is likely to make the distributed file system resilient to single-node failures and peer-to-peer rather than client/server in nature; in practice the overheads do not seem significant.

I imagine the indirection of going to a directory server to look up the server owning the metadata *can* introduce extra latency (over the WAN), plus CPU cycles, for non-metadata-intensive workloads.

> If the accesses to a file are dominated by a single system, all lock management for that file migrates automatically to that system (so the relatively infrequent accesses by other systems take a minor hit by having to send lock activity there, but most activity executes at local-host speeds). A new accessor queries one of the cluster's lock directory servers (nothing special about them - they just avoid the need for every system in the cluster to maintain high-level lock directory information) to find out where a given file's locks are 'mastered'.

Yeah, it sounds like the distributed metadata service would create a more site-aware solution than the single-metadata-server model, just by virtue of doing all metadata operations at the site where the data is produced or consumed. Now for my traffic, I have a writer at site 1 and several readers at both sites 1 and 2. So really the data is being accessed at both sites, and we'll have to go over the WAN regardless of where the metadata is kept.

> As long as absolutely no writes are going to disk, that should work fine. As soon as any write activity occurs (even background writes such as updating 'last access times', where no application-level writes exist), the local file system's cached data (which it believes it can cache safely because it 'owns' that file system exclusively) will start to get stale and, worse, may then be written back out to disk stale.

No, CXFS should maintain host cache coherency; that's one of the main tasks a distributed file system should perform (along with file locking).

> Interesting thought about making different (local) mirror disks appear to be the same to all instances and letting a transparent disk-level replicator keep them in sync: as long as CXFS effectively locks anything being updated such that no one else can look at it until the update completes (at *all* copies), it might work - but you'd still lose a site if its local mirror copy failed, unless some provision for automatic revectoring to a surviving remote disk existed.

My applications do not require a very strict data coherency model; the readers are processing data in the files that is well behind the data the writer is appending (think seismic data being written, with several readers doing different kinds of processing on data that is at least 20 minutes old).

The only requirement is that when the readers keep reading the same file without opening or closing it, they should eventually see the new data the writer has been appending. So even if the writer's very latest data is still in the host cache, it should eventually be flushed, and the synchronous mirroring ought to make it show up at the other site. I hope CXFS's host cache coherency will cause the writer to flush within guaranteed timelines so the readers can see the data (I'm not sure what those timelines are, but I have a big margin).

Arne Joris
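[Assuming appended data does eventually become visible to readers - writer cache flush plus synchronous mirroring, as described above - the reader side of this workload could look something like the following sketch. This is plain POSIX-style file I/O, not a CXFS API, and the path is hypothetical.]

    # Sketch: follow a growing file without reopening it, tolerating
    # visibility lag, since readers only touch data minutes old anyway.
    import time

    def follow(path, poll_interval=5.0):
        """Yield newly appended data whenever the file system makes it
        visible; sleep when nothing new has appeared yet."""
        with open(path, "rb") as f:
            offset = 0
            while True:
                f.seek(offset)
                chunk = f.read()
                if chunk:
                    offset += len(chunk)
                    yield chunk               # hand new data to processing
                else:
                    time.sleep(poll_interval) # nothing new visible yet

    # Usage (at either site, reading its local replica):
    # for block in follow("/san/seismic/run42.dat"):
    #     process(block)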
#5
"Monte Oates" wrote in message news:dhmdd.157595$a41.79236@pd7tw2no... Just use a pair of NetApp Filers (one at each site) running SnapMirror - very easy. Since NetApp's file system is not unix or Windows you are less susceptible to virus propagation. Most synchronous replication software replicates real-time the viruses - this erodes both data and the OS. Regardless of the merit of such a suggestion for some people, the original poster specified the behavior he wants - and being an asynchronous replication mechanism SnapMirror won't give it to him. - bill |
#6
"Arne Joris" wrote in message news:v5Qdd.790510$M95.165193@pd7tw1no... Bill Todd wrote: .... I have heard that the overhead to distribute meta data operations isn't worth it unless you intend to do I/O to many small files (so the ratio of meta data operations to I/O is high). I'm not sure how that makes sense: the less metadata is used, the less impact it will have on overall activity regardless of how it may be managed. The main reason metadata management is distributed in VMS is likely to make the distributed file system resilient to single-node failures and peer-to-peer rather than client/server in nature, but in practice the overheads do not seem significant. I imagine the indirection of going to a directory server to look up the server owning the meta data *can* introduce extra latency (over the WAN) plus cpu cycles for non-metadata intensive workloads. Since in the absence of active contention it goes to the directory server only once per file or directory accessed, that latency (which should be less than a single disk access) should usually be negligible. .... it sounds like the distributed meta data service would create a more site aware solution than the single meta data server model, just by virtue of doing all meta data operations at the site where the data is produced or consumed. Not really: where the metadata is processed has relatively little to do with what disk it's obtained from, unless the system goes to some essentially orthogonal effort to access the most local copy available. Now for my traffic, I have a writer at site 1, and several readers at both sites 1 and 2. So really the data is being accessed at both sites, so we'll have to go over the WAN regardless of where the metadata is kept. It doesn't sound as if you really care much about remote metadata access, at least as long as the amount of metadata processed is negligible compared with the amount of data fetched. Inter-site latencies tend not to become *really* noticeable until distances in the hundreds of miles are involved (100 miles being on the order of a millisecond, one-way). And if you're mostly reading large files sequentially, it doesn't sound as if latency should be any real concern there, either. But unless you've got unlimited bandwidth between the two sites, your desire to do reads locally is understandable. - bill |
#7
Bill Todd wrote:
>> it sounds like the distributed metadata service would create a more site-aware solution than the single-metadata-server model, just by virtue of doing all metadata operations at the site where the data is produced or consumed.
>
> Not really: where the metadata is processed has relatively little to do with what disk it's obtained from, unless the system goes to some essentially orthogonal effort to access the most local copy available.

But my metadata would also be replicated to both sites, so disk access is *always* local (though writes, being replicated synchronously, would be a bit slow). In that case, having metadata operations handled at the site that is about to do I/O would not introduce *any* cross-WAN operations at all.

>> Now for my traffic, I have a writer at site 1 and several readers at both sites 1 and 2. So really the data is being accessed at both sites, and we'll have to go over the WAN regardless of where the metadata is kept.
>
> It doesn't sound as if you really care much about remote metadata access, at least as long as the amount of metadata processed is negligible compared with the amount of data fetched. Inter-site latencies tend not to become *really* noticeable until distances in the hundreds of miles are involved (100 miles being on the order of a millisecond, one-way).

I have latencies of 100 ms round trip (thousands of miles, plus the WAN is routed, which introduces switching latencies).

> And if you're mostly reading large files sequentially, it doesn't sound as if latency should be any real concern there, either. But unless you've got unlimited bandwidth between the two sites, your desire to do reads locally is understandable.

Yeah, my main concern is not metadata access; I care mostly about the data being accessed locally.

Arne Joris
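[A rough model of the bandwidth asymmetry driving that preference: with synchronous mirroring, written data crosses the WAN once, while remote reads from the primary site would move it once per remote reader. The rates below are illustrative assumptions, not figures from the thread.]

    # WAN bandwidth: local reads vs. reading from the primary site.
    write_rate_mb_s = 10     # hypothetical writer throughput at site 1
    remote_readers = 3       # hypothetical readers at site 2, each assumed
                             # to read everything the writer produces

    wan_local_reads = write_rate_mb_s                                   # mirror traffic only
    wan_remote_reads = write_rate_mb_s + remote_readers * write_rate_mb_s

    print(f"reads served locally : {wan_local_reads} MB/s over the WAN")   # 10
    print(f"reads served remotely: {wan_remote_reads} MB/s over the WAN")  # 40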