Old March 30th 05, 07:12 PM
_firstname_@lr_dot_los-gatos_dot_ca.us

Imagine that today (2004) you need to read 20-year-old data.
Say it is the contents of a hierarchical database (not a relational
database). The source code of the database still exists, but it is
written in IBM 360 assembly and only runs under OS/VSE; 20 years ago
it ran on a 3081 under VM. The last guy who maintained it died of
cancer 10 years back; his widow threw out his files.

Or the data was written 20 years ago on a CP/M machine, in binary
format using Borland dBase. Say for grins the CP/M program was doing
data acquisition on custom-built hardware (this was very common back
then), and requires special hardware interfaces to external sensors
and actuators to run.

In the former case, you have to deal with a huge problem: The data is
probably not in 512-byte blocks, but is written in IBM CKD
(count-key-data) format, on special disks (probably 3350 or 3380); a
sensible database application on the 370 would use CKD search
instructions for performance. Fortunately, IBM will today still sell
you a mainframe that is instruction-set compatible with the 360, and
disk arrays that can still execute CKD searches. And you can still
get an OS that somewhat resembles OS/VSE and VM. So for the next few
years, a few million $ and several months of hard work would recover
the information.

Or you could read 50000 lines of IBM assembly code to determine what
the exact data format really is, and write a converter. Enjoy.

The second case is even worse. Most likely, the old CP/M hardware
(even if you have managed to preserve it) will no longer boot, because
the EPROMs and boot floppies have decayed. You can no longer buy a
legal copy of CP/M or dBase. Running an illegal copy on a CP/M
emulator on a modern computer won't work, because the program requires
custom-built hardware sensors and actuators (I carefully constructed
the problem to maximally inconvenience you). Finding dBase manuals
today to decode what the dBase code was doing and understand the data
format will be very time-consuming.
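That said, the on-disk .dbf layout itself is well documented today.
A minimal sketch of decoding the fixed header, assuming a dBase
III-style file (the CP/M-era dBase II layout differs in detail, but
the idea is the same; this is my illustration, not anything from the
scenario above):

```python
import struct

def read_dbf_header(path):
    """Parse the fixed header of a dBase III-style .dbf file.

    Offsets per the well-documented .dbf layout: record count is a
    little-endian uint32 at offset 4, header and record sizes are
    uint16s at offsets 8 and 10, and 32-byte field descriptors
    follow from offset 32 until a 0x0D terminator byte.
    """
    with open(path, "rb") as f:
        hdr = f.read(32)
        n_records, hdr_size, rec_size = struct.unpack("<IHH", hdr[4:12])
        fields = []
        while True:
            desc = f.read(32)
            if desc[:1] == b"\r":       # 0x0D terminates the descriptor list
                break
            name = desc[:11].split(b"\x00")[0].decode("ascii")
            ftype = chr(desc[11])       # C=char, N=numeric, D=date, L=logical
            length = desc[16]
            fields.append((name, ftype, length))
        return n_records, rec_size, fields
```

Of course, knowing the field names and types still tells you nothing
about what the numbers in those fields meant to the lab that recorded
them; that is exactly the semantics problem below.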

What I'm trying to say: The problem of preserving the bit pattern of
the raw data is the absolute least of the issues. It can be solved
trivially: Write the data to good-quality rewriteable CDs, make a few
copies of each, and every few years read all of them back, and write
them to newly current media. Done. The real problem is documenting
the semantics of the bits. The easy way out is to preserve the
complete computing environment used to read the data (including all
hardware, documentation, and wetware that is required to operate it).
This is hard, because hardware, paperware and wetware don't preserve
very well. The second-best way is to convert the data to formats that
are well documented (say plain text files, or formats that are
enshrined in standards, like ISO 9660 or JPEG), and also preserve a
human-readable description of that format in a universally readable
way (like enclosing a copy of the standard that defines ISO 9660 in an
ASCII text file with the data).
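The copy-and-refresh cycle is easy to mechanize: record a digest for
every file once, then periodically re-check every replica and re-write
any that fail. A minimal sketch (the SHA-256 choice and the
manifest/replica naming are my own, not part of the argument above):

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copies(manifest, copies):
    """Check every replica against the recorded digests.

    manifest: dict mapping file name -> expected hex digest
    copies:   list of directories, each holding one full replica
    Returns (copy, name) pairs that failed, i.e. the replicas that
    should be re-written to fresh media from a good copy now.
    """
    bad = []
    for root in copies:
        for name, digest in manifest.items():
            p = Path(root) / name
            if not p.exists() or sha256_of(p) != digest:
                bad.append((root, name))
    return bad
```

Run it every few years, and a decayed copy shows up while the other
copies can still regenerate it.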

I'm not saying that preserving the raw bits should be abandoned. This
is absolutely the most important step; if you fail at that, all other
problems become irrelevant. But please don't believe that it solves
the problem.

The long-term preservation of data is a huge research topic. Please
read the abundant literature on it, to get a flavor of the difficulty.
The real issue you need to think about is this: How valuable is this
data really? How valuable will it be in 20 years? What is the
expected cost of recovering it in 20 years (above I budgeted millions
of dollars for
buying the hardware for reading CKD data)? How much do you need to
invest now to minimize the expected data recovery cost in 20 years?
Is your CEO cool with investing this much money now, given that in 20
years he will no longer be the CEO? Will it be economically viable to
use the old data in 20 years? Wouldn't it be easier to print it all
on acid-free paper now, store it in a mine or an old railroad tunnel,
and scan the printout in 20 years?

As an example: I used to be an astrophysicist. I happen to have the
original data tape of the 8 neutrinos from the 1987 supernova that hit
the particle detector in the northern US at home. The tape is 6250
bpi open reel, with a highly complex data format on it; fortunately,
the data format was described on paper in people's PhD theses, but
finding the old decoding software and getting it to run would be very
hard (anyone got a VAX with VMS 4?). Reading it and decoding the data
would take several months of work. At this point, the tape has only
emotional value.

--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy