#1
Long-term archival storage
I have around 20 TB of data that I want both to store for a very long time (50 years) and to have available for search and download. The data consists of two types: (a) the preservation masters, which are the data we want to keep, in TIFF and BWF formats among others; (b) the viewing copies, which are in derived formats such as PNG and MP3. I am coming down to an HSM-type solution with a front-end cache large enough to keep the viewing copies online at all times, but which allows the archival copies to disappear off to tape to be cloned and duplicated, etc. Anyone else doing this? Anyone got a better idea? -dgm
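A minimal sketch of the tiering policy being described, assuming a simple rule of thumb that derived viewing formats stay on the disk cache while preservation masters migrate to tape; the paths and format lists here are hypothetical:

```python
from pathlib import Path

# Hypothetical mapping: derived/viewing formats stay on the disk cache,
# preservation masters are candidates for migration to tape.
VIEWING_FORMATS = {".png", ".mp3"}
MASTER_FORMATS = {".tif", ".tiff", ".bwf", ".wav"}

def tier_for(path: Path) -> str:
    """Decide which storage tier a file belongs to."""
    ext = path.suffix.lower()
    if ext in VIEWING_FORMATS:
        return "online-cache"   # always available for search/download
    if ext in MASTER_FORMATS:
        return "tape-archive"   # cloned, duplicated, stored offsite
    return "review"             # unknown format: classify by hand

# Hypothetical ingest directory.
for f in Path("/archive/incoming").rglob("*"):
    if f.is_file():
        print(f, "->", tier_for(f))
```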
#2
#3
It depends; a lot of companies have regulatory requirements that data be kept for a long time. If it needs to be accessible quickly, the ATA drive solution is advisable. I think the cost is around $7 per GB for products like the NetApp NearStore line. The particular feature that simulates a WORM drive is called SnapLock.
#4
Paul Rubin wrote:
[...] Sounds kind of complicated. Where's this data now, how is it stored, and how fast are you adding to it, and through what kind of system? 20 TB isn't really big storage these days. You could have a small tape library online and move incoming raw data to tape immediately while also making the online viewing copies on disk. HSM systems with automatic migration and retrieval are probably overkill.

It is kind of complicated. Currently we have 6 TB digitised and are adding 0.1 TB/week. This is data that needs to be kept forever - the audio material is world-heritage stuff. The driver for using HSM is twofold: 1) keeping multiple copies securely, including offsite; 2) we know we have a 900 kg gorilla called video waiting in the wings...
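For a sense of scale, those growth figures are easy to project; a back-of-the-envelope sketch (the per-hour video size is a placeholder assumption, not a number from the thread):

```python
# Current collection and ingest rate, from the post above.
current_tb = 6.0
weekly_tb = 0.1

# At 0.1 TB/week the audio/image masters grow ~5.2 TB/year,
# so the quoted 20 TB is roughly three more years of digitisation.
print((20 - current_tb) / (weekly_tb * 52), "years to reach 20 TB")

# The video "gorilla": assuming (placeholder) ~25 GB per hour of
# preservation-quality video, a modest 1000-hour collection already
# dwarfs everything digitised so far.
print(1000 * 25 / 1000, "TB for 1000 hours of video")
```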
#5
50 years is a long time by archival standards. Most media/drive manufacturers guarantee 30 years under nominal conditions. You'll have to store your masters under optimal conditions and check them regularly (read them every couple of years, check the error rate, copy if needed, etc.) to go beyond that.

Which raises the question of where you'll find the equipment (spares, service) a couple of decades down the road. An example from my perspective: if you have DLTtape III tapes lying around from the early nineties, you'd better do something about them now, since Quantum has EOL'd the DLT8000, which is the last generation that will read those tapes. Service will be available for another 5 years. That's 20 of the 50 years. Basically you'll have to copy the data to state-of-the-art media about every decade. Don't try to store spare drives for the future; that doesn't usually work - electromechanical devices age when they're not in use, too.

There have been numerous stories about the problems NASA has had retrieving old data recordings. Your project will face the same. Fortunately 20 TB isn't a big deal any more and will be even less so in the future. The front end doesn't really matter, but the archive will need a lot of thought and care. Think what the state of the art was 50 years ago.
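The read-and-check cycle described above is essentially a fixity check; a minimal sketch using SHA-256 checksums against a stored manifest (the manifest layout and paths are assumptions, not anything specified in the thread):

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("manifest.json")  # hypothetical: {relative_path: sha256_hex}

def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(root: Path) -> None:
    """Re-read every archived file and compare against the manifest."""
    manifest = json.loads(MANIFEST.read_text())
    for rel, expected in manifest.items():
        if sha256(root / rel) != expected:
            # Flag for restore from a clone before the last good copy rots.
            print(f"MISMATCH: {rel}")

verify(Path("/archive/masters"))
```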
#6
"RPR" wrote in message
oups.com... Basically you'll have to copy the data to state of the art media about every decade. Don't try to store spare drives for the future, that doesn't usually work - electromechanical devices age when they're not in use too. In addition to Ralf-Peter's comment, you better think long and hard about how you will be accessing that data 50 years from now, from an application point of view. 50 years from now, the computing devices will be radically different from today's PC's. Unless you have documented every bit about the format of the files you stored and the environment you need to recreate the information, even migration to state of the art media will not help. Consider a Word Perfect 4.2 file from 20 years ago. You'll need some effort today to open and read such a file. Because the format is relatively simple, you can still read the text using any hex editor. But recreating the page formatting maybe harder already. Now consider your MP3 and picture files which are heavily encoded en compressed, and fast forward to the year 2055. Unless you know exactly how they are recreated, all you'll have 50 years from now is a bunch of zeroes and ones. This is scary for single files, but things are even worse when multipple files form a single context. Think databases with external pointers. Think HTML files with web links. How much of that will exist 50 years from now? For permanent long-term records, store the information on a medium that can be interpreted by the most universal and long-term computer you have - the one between your ears -. Microfiche and dead trees aren't obsolete just yet... Rob |
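Rob's hex-editor fallback can be automated; a small sketch that pulls runs of printable ASCII out of an otherwise opaque file, much like the Unix strings(1) tool:

```python
import re
import sys
from pathlib import Path

def extract_text(path: str, min_len: int = 4) -> None:
    """Print runs of printable ASCII from a binary file (a poor man's strings(1))."""
    data = Path(path).read_bytes()
    # Any run of at least min_len printable ASCII bytes is probably text.
    for run in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data):
        print(run.decode("ascii"))

if __name__ == "__main__":
    extract_text(sys.argv[1])
```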
#7
On Tue, 29 Mar 2005 23:36:05 +0200, "Rob Turk" wrote:
Consider a WordPerfect 4.2 file from 20 years ago. You'll need some effort today to open and read such a file. Because the format is relatively simple, you can still read the text using any hex editor, but recreating the page formatting may already be harder.

OK, so a lot of converters do an incomplete job, but is this really so complicated? Save a copy of the application(s) and maybe the OS that ran it along with the data. Between backwards compatibility and improving emulation technology, it might be more doable than you think.

Also, keeping data for 50 years doesn't necessarily imply keeping storage devices for 50 years. Periodic upgrades of the storage, and maybe even of the file format of the data, might be what needs to happen to realistically keep usable information for many decades. A major overhaul like this around every 10 years seems to be working pretty well for me. Waiting 15 years or more tends to be problematic. Your mileage may vary and, well, the past is not always a good indicator of the future.
#8
Curious George writes:
OK, so a lot of converters do an incomplete job, but is this really so complicated? Save a copy of the application(s) and maybe the OS that ran it along with the data. Between backwards compatibility and improving emulation technology, it might be more doable than you think.

I would say that most of these conversion problems have stemmed from secret, undocumented formats. Formats like JPEG and MP3, which are well documented and have reference implementations available as free source code, should be pretty well immune to those problems.
#9
The point about file formats is well made, but we've been through the same argument in detail already. We're choosing file formats which are publicly described and for which there are multiple (open-source) clients. The idea is to ensure that we have the format description and enough example code to be able to recreate viewers in the future. That's why we're using TIFF and BWF as the archival masters. I don't care about the MP3s, as they are *derived* copies - we can just as easily use Ogg Vorbis, or whatever we're using in 2055, as long as we can parse the original compression-free datastream.
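Since the viewing copies are disposable, they can be regenerated from the masters at any time. A minimal sketch of the TIFF-to-PNG case using Pillow, with the BWF-to-Ogg case shelling out to ffmpeg (the tool choices and paths are assumptions, not what the poster actually runs):

```python
import subprocess
from pathlib import Path
from PIL import Image  # pip install Pillow

def derive_png(master: Path, out_dir: Path) -> Path:
    """Render a PNG viewing copy from a TIFF preservation master."""
    out = out_dir / master.with_suffix(".png").name
    with Image.open(master) as im:
        im.save(out, format="PNG")
    return out

def derive_ogg(master: Path, out_dir: Path) -> Path:
    """Render an Ogg Vorbis viewing copy from a BWF master via ffmpeg."""
    out = out_dir / master.with_suffix(".ogg").name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(master), "-codec:a", "libvorbis", str(out)],
        check=True,
    )
    return out
```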
#10