Bill Todd - September 14th 03, 03:02 AM

"Leon Woestenberg" wrote in message
...

"Nik Simpson" wrote in message
...
Leon Woestenberg wrote:


....

The SAN is written to and read from through a cluster of, say, N Linux
servers. The cluster processes 200 datastreams coming in at a steady
1 Mbit/second each, of which the results (also about 1 Mbit/second) are
stored.

As a result of processing, some very low-bitrate metadata about the
streams is entered into a database, which is stored on the SAN as well.


So far (25 MB/sec in, 25 MB/sec out, plus a bit of metadata) that doesn't
sound beyond the capacity of a single, not even very muscular, commodity IA32
box to handle - unless the required processing is prodigious (or you need
TCP/IP offload help and don't have it).
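
Just to spell out the arithmetic behind those figures (a quick Python
check, using only the numbers from your post):

    streams = 200                      # concurrent incoming data streams
    mbit_each = 1.0                    # each stream arrives at ~1 Mbit/s
    in_mb_per_s = streams * mbit_each / 8    # 200 Mbit/s = 25 MB/s inbound
    out_mb_per_s = in_mb_per_s               # results are ~1 Mbit/s per stream too
    daily_tb = out_mb_per_s * 86400 / 1e6    # ~2.16 TB of results per day
    print(in_mb_per_s, out_mb_per_s, daily_tb)   # 25.0  25.0  2.16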


We need redundancy throughout, i.e. no single point of failure. The
cluster servers will have to fail over their processing applications.


That would seem to be a possible sticking point for the kind of NAS solution
Doug proposed: the NAS box itself is a single point of failure, unless it
employs strict synchronous replication to a partner box (I think that, e.g.,
NetApp has this now) which could take over close to instantaneously on the
primary NAS box's failure. Even many 'SAN file systems' may have a single
point of failure in their central metadata server unless it's implemented
with fail-over capability (GFS is a rare exception, having distributed
metadata management).


Every data stream is received by two servers out of the cluster, where the
secondary acts as a hot spare, processing and storing the data stream if
the primary server fails to do so.


This should make possible 'N+1' redundancy, where you really only need about
one extra server (beyond what's required just to handle the overall load):
when a server fails, processing of its multiple streams is divided fairly
equally among the survivors.
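
Roughly what I mean by that redistribution, as a Python sketch (the
server names and stream counts below are made up for illustration):

    # Minimal sketch: divide a failed server's streams among the survivors.
    def redistribute(assignments, failed):
        """assignments: dict of server name -> list of stream ids it handles."""
        orphans = assignments.pop(failed)
        for stream in orphans:
            # hand each orphaned stream to the currently least-loaded survivor
            target = min(assignments, key=lambda s: len(assignments[s]))
            assignments[target].append(stream)
        return assignments

    cluster = {"srv1": list(range(0, 50)),    "srv2": list(range(50, 100)),
               "srv3": list(range(100, 150)), "srv4": list(range(150, 200))}
    redistribute(cluster, "srv2")   # srv2's 50 streams end up spread over srv1/3/4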

Of course, given that the overall system shouldn't be all that expensive
anyway, the added complexity of such an optimization may not be warranted.
If you simply paired each server with an otherwise redundant partner, you
could not only avoid such a load-division strategy but potentially could
dispense with SAN and NAS entirely and just replicate the entire operation
(data receipt, data processing, data storage on server-attached disks). If
nothing fails, use either server's copy; if a server fails, just have its
partner continue doing what it was doing all along.


The hardest part (IMHO) will be to make sure that, in case of network
or server failure, the secondary will notice exactly where the primary
server was in the data stream and take over from there.


It really shouldn't matter exactly where the primary was when it failed:
what matters is how much of the output stream it had written to disk, and
the secondary should just be able to interrogate that file (assuming it's
something like a file) to find out.
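
A rough sketch of that takeover step, assuming the output really is a
plain append-only file of fixed-size records (the record size here is
just an assumption for illustration):

    import os

    RECORD_SIZE = 4096    # assumed fixed output record size

    def resume_offset(output_path):
        """How much *complete* output did the failed primary get to disk?"""
        written = os.path.getsize(output_path)
        return written - (written % RECORD_SIZE)   # drop any torn tail record

    # The secondary would then resume processing its buffered input from the
    # stream position that corresponds to resume_offset(...) onward.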

Unless (as described above) the secondary was already performing its own
redundant processing and storage, in which case it doesn't even have to
notice that the primary died (though whoever eventually needs the data needs
to know which copy to use). In such a case, it might be desirable to
incorporate a mechanism whereby a spare server could be pressed into
service, 'catch up' to the current state of the remaining partner's disk
state (possibly taking over the original - failed - primary server's disks
to help in this), and then become the new secondary - just in case over time
the original partner decided to fail too (but *any* storage system you use
should have similar provisions to restore redundancy after a failure).
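
Something like the following, purely as an illustration of the catch-up
idea (the directory layout and helper name are hypothetical):

    import os, shutil

    # Hypothetical catch-up for a spare being promoted to new secondary:
    # bulk-copy what the surviving partner (or the failed primary's disks,
    # if taken over) already holds, then join live processing.
    def catch_up(partner_dir, local_dir):
        for name in os.listdir(partner_dir):
            src = os.path.join(partner_dir, name)
            dst = os.path.join(local_dir, name)
            if not os.path.exists(dst) or os.path.getsize(dst) < os.path.getsize(src):
                shutil.copyfile(src, dst)      # copy missing or shorter files
        # once caught up, start receiving the live streams as the new secondary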

However, at an overall storage rate of over 2 TB daily you're going to need
far more disks to reach your 120 TB total than you need servers to do the
processing (assuming a maximum of, say, eight 250 GB disks per server, you'd
need about 60 servers just to hold the disk space required for a *single*
copy of the data, but only a handful of servers to handle the processing).
This makes your original proposal for something like a shared-SAN file
system more understandable: not for reasons of concurrent sharing, but
simply for bulk storage. However, if all you need is bulk, you can obtain
that by carving out a private portion of the SAN storage for each server
pair, with no need for a SAN file system at all (think of each such private
chunk as a single file system that fails over to the partner on primary
failure - though since the partner is already active it will need to
understand how to perform a virtual file system restart to pick up the
current state and move on from there).
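
Spelling out that disk arithmetic (decimal GB/TB, same assumptions as
above):

    total_tb = 120                    # total online storage you mentioned
    disk_gb = 250                     # per-drive capacity assumed above
    disks_per_server = 8              # assumed practical maximum per box
    per_server_tb = disks_per_server * disk_gb / 1000.0   # 2 TB per server
    servers_for_one_copy = total_tb / per_server_tb       # = 60 servers
    print(per_server_tb, servers_for_one_copy)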

Or, if you don't need fast access to older data you've acquired, you could
just use attached server disks as staging areas for relatively quick dumps
to tape. This reverts to the idea of having paired servers that accumulate
data to attached disks: if the primary dies, the secondary picks up the
process of dumping to tape (restarting from the last tape started, perhaps,
to avoid the 'exactly where were we?' issue you noted). The mechanics of
your bulk storage and how it's used *after* you've gathered the data seem
more important here than the mechanics of gathering and processing it.
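
One simple way to do that restart bookkeeping, sketched in Python (the
journal format and tape labels are invented for illustration):

    # Hypothetical restart-from-last-tape logic for the staging-to-tape idea.
    # 'journal' is a tiny replicated record of which tapes were started/finished.
    def tape_to_redo(journal):
        """journal: list of (tape_label, status) in write order,
        where status is 'started' or 'done'."""
        label, status = journal[-1]
        # if the primary died mid-tape, rewrite that tape from the staged files
        return label if status == "started" else None

    print(tape_to_redo([("T0041", "done"), ("T0042", "started")]))   # -> T0042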


2. Application for the SAN: is it one application with different OS
platforms, or different OS platforms supporting different applications?
What I'm really driving at here is the need for a shared filesystem,
which will inevitably slow things down, limit the OS choices
(potentially) and make things more complex.

I thought that having a shared filesystem would limit the complexity by
offering a distributed locking mechanism, which is used on the cluster
to manage the application I mentioned above?


Possibly, but this particular application doesn't actually *share* any of
its data among processors (at least not concurrently, in the shared-update
manner that distributed locking typically facilitates), so something
considerably simpler might suffice.

- bill