#1
Long term archival storage
I have around 20 TB of data that I want to store for a very long time (50 years) and also have available for search and download. The data consists of two types: (a) the preservation masters, which are the data we want to keep, in TIFF and BWF formats among others; (b) the viewing copies, which are in derived formats such as PNG and MP3. I am coming down to an HSM type of solution with a front-end cache large enough to keep the viewing copies online at all times, but which allows the archival copies to disappear off to tape to be cloned and duplicated etc. Anyone else doing this? Anyone got a better idea? -dgm
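A minimal sketch of the tiering split described above, assuming masters and viewing copies can be told apart by file extension; the extension sets and tier names are illustrative, not part of any particular HSM product:

    import os

    # Hypothetical extension sets based on the formats named above.
    MASTER_EXTS = {".tif", ".tiff", ".wav", ".bwf"}  # preservation masters
    VIEWING_EXTS = {".png", ".mp3"}                  # derived viewing copies

    def tier_for(path):
        """Decide which tier a file belongs in under this simple policy."""
        ext = os.path.splitext(path)[1].lower()
        if ext in VIEWING_EXTS:
            return "disk-cache"    # pinned online at all times
        return "tape-archive"      # masters (and anything unknown) may go to tape

Real HSM products typically decide on age, size and last-access time rather than extension alone, but the split above matches the masters/viewing-copies distinction the post makes.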
#2
#3
It depends; a lot of companies have regulatory requirements that data be kept for a long time. If it needs to be accessible quickly, the ATA drive solution is advisable. I think the cost is around $7 per GB for something like NetApp NearStore solutions. The particular product which simulates a WORM drive is called SnapLock.
#4
50 years is a long time by archival standards. Most media/drive manufacturers guarantee 30 years under nominal conditions. You'll have to store your masters under optimal conditions and check them regularly (read them every couple of years, check the error rate, copy if needed, etc.) to go beyond that. Which raises the question of where you'll find the equipment (spares, service) a couple of decades down the road.

An example from my perspective: if you have DLTtape III tapes lying around from the early nineties, you'd better do something about them now, since Quantum has EOL'd the DLT8000, the last generation that will read those tapes. Service will be available for another 5 years. That's 20 of the 50 years.

Basically you'll have to copy the data to state-of-the-art media about every decade. Don't try to store spare drives for the future; that doesn't usually work, since electromechanical devices age even when they're not in use. There have been numerous stories about the problems NASA has had retrieving old data recordings. Your project will face the same.

Fortunately 20 TB isn't a big deal any more, and will be less so in the future. The front end doesn't really matter, but the archive will need a lot of thought and care. Think what the state of the art was 50 years ago.
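A minimal sketch of the periodic read-back check suggested above, assuming a SHA-256 manifest was written when the archive was created; the manifest format ('<digest>  <relative path>' per line, as produced by sha256sum) is this sketch's assumption:

    import hashlib
    import sys

    def sha256_of(path, bufsize=1 << 20):
        """Stream the file through SHA-256 so large masters need not fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def verify(manifest_path):
        """Re-read every file and compare against its recorded digest.

        Paths in the manifest are taken relative to the current directory.
        """
        failures = 0
        with open(manifest_path) as manifest:
            for line in manifest:
                digest, path = line.rstrip("\n").split("  ", 1)
                try:
                    ok = sha256_of(path) == digest
                except OSError as err:      # decaying media tends to show up here
                    print(f"UNREADABLE {path}: {err}")
                    failures += 1
                    continue
                if not ok:
                    print(f"MISMATCH   {path}")
                    failures += 1
        return failures

    if __name__ == "__main__":
        sys.exit(1 if verify(sys.argv[1]) else 0)

Run it against the manifest every couple of years; a rising count of mismatched or unreadable files is the signal to copy to fresh media.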
#5
"RPR" wrote in message
oups.com... Basically you'll have to copy the data to state of the art media about every decade. Don't try to store spare drives for the future, that doesn't usually work - electromechanical devices age when they're not in use too. In addition to Ralf-Peter's comment, you better think long and hard about how you will be accessing that data 50 years from now, from an application point of view. 50 years from now, the computing devices will be radically different from today's PC's. Unless you have documented every bit about the format of the files you stored and the environment you need to recreate the information, even migration to state of the art media will not help. Consider a Word Perfect 4.2 file from 20 years ago. You'll need some effort today to open and read such a file. Because the format is relatively simple, you can still read the text using any hex editor. But recreating the page formatting maybe harder already. Now consider your MP3 and picture files which are heavily encoded en compressed, and fast forward to the year 2055. Unless you know exactly how they are recreated, all you'll have 50 years from now is a bunch of zeroes and ones. This is scary for single files, but things are even worse when multipple files form a single context. Think databases with external pointers. Think HTML files with web links. How much of that will exist 50 years from now? For permanent long-term records, store the information on a medium that can be interpreted by the most universal and long-term computer you have - the one between your ears -. Microfiche and dead trees aren't obsolete just yet... Rob |
#6
#7
On Tue, 29 Mar 2005 23:36:05 +0200, "Rob Turk" wrote:
> Consider a WordPerfect 4.2 file from 20 years ago. You'll need some
> effort today to open and read such a file. Because the format is
> relatively simple, you can still read the text using any hex editor,
> but recreating the page formatting may already be harder.

OK, so a lot of converters do an incomplete job, but is this really so complicated? Save a copy of the application(s), and maybe the OS that ran it, with the data. Between backwards compatibility and improving emulation technology it might be more doable than you think.

Also, keeping data for 50 years doesn't necessarily imply keeping storage devices for 50 years. Periodic upgrades of the storage, and maybe even of the file format of the data, might be what needs to happen to realistically keep usable information for many decades. A major overhaul like this around every 10 years has been working pretty well for me; waiting 15 years or more tends to be problematic. Your mileage may vary and, well, the past is not always a good indicator of the future.
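A minimal sketch of one such decade-scale overhaul at the file level, reusing the checksum habit from the earlier reply; the mount points are illustrative, and the idea is to keep the old copy until every file on the new media has been read back and verified:

    import hashlib
    import shutil
    from pathlib import Path

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                h.update(chunk)
        return h.hexdigest()

    def migrate(old_root, new_root):
        """Copy an archive tree to new media, verifying each file end to end."""
        old_root, new_root = Path(old_root), Path(new_root)
        for src in old_root.rglob("*"):
            if not src.is_file():
                continue
            dst = new_root / src.relative_to(old_root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)                # copy2 preserves timestamps too
            if sha256_of(src) != sha256_of(dst):  # read BOTH copies back in full
                raise IOError(f"verification failed for {dst}")

    # Illustrative invocation; paths are assumptions:
    # migrate("/mnt/old_media", "/mnt/new_media")

Converting the file formats during the same overhaul is the separate, harder step; this only moves and verifies the bits.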
#8
Faeandar writes:
> All archived data was stored on NetApp NearStore (any cheap disk will
> do though). No ifs, ands, or buts. The reason being, whenever the next
> disk upgrade comes in, the data is migrated along with it. No issue of
> recovery or media type not being available; the data set follows the
> technology.

You seriously think you'll still be using that NetApp stuff in 2055?
#9
Curious George writes:
>> Consider a WordPerfect 4.2 file from 20 years ago. You'll need some
>> effort today to open and read such a file. Because the format is
>> relatively simple, you can still read the text using any hex editor,
>> but recreating the page formatting may already be harder.
>
> OK, so a lot of converters do an incomplete job, but is this really so
> complicated? Save a copy of the application(s), and maybe the OS that
> ran it, with the data. Between backwards compatibility and improving
> emulation technology it might be more doable than you think.

I would say that most of these conversion problems have stemmed from secret, undocumented formats. Formats like JPEG and MP3, which are well documented and have reference implementations available as free source code, should be pretty well immune to the problems.
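A minimal illustration of leaning on such open implementations for routine checks, assuming the Pillow imaging library is installed (Pillow and the paths here are this sketch's assumptions, not something from the thread); it asks the decoder itself whether each image still parses:

    from pathlib import Path

    from PIL import Image  # Pillow: open-source image decoders

    def check_images(root):
        """Try to decode every TIFF/PNG under root; report files that fail."""
        bad = []
        for path in Path(root).rglob("*"):
            if path.suffix.lower() not in {".tif", ".tiff", ".png"}:
                continue
            try:
                with Image.open(path) as im:
                    im.verify()  # integrity check without a full decode
            except Exception as err:
                bad.append((path, err))
                print(f"FAILED {path}: {err}")
        return bad

    # Illustrative: check_images("/archive/viewing_copies")

A file that still hashes correctly but no longer decodes points at a format problem rather than a media problem, which is exactly the distinction this subthread is drawing.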
#10
Imagine that today (2004) you needed to read 20-year-old data. Say it is the content of a hierarchical database (not a relational database). The source code of the database still exists, but it is written in IBM 360 assembly and only runs under OS/VSE, having been run 20 years ago on a 3081 under VM. The last guy who maintained it died of cancer 10 years back; his widow threw out his files. Or the data was written 20 years ago on a CP/M machine, in binary format, using Borland dBase. Say for grins the CP/M program was doing data acquisition on custom-built hardware (this was very common back then), and requires special hardware interfaces to external sensors and actuators to run.

In the former case, you have to deal with a huge problem: the data is probably not in 512-byte blocks, but is written in IBM CKD (count-key-data) format, on special disks (probably 3350 or 3380); a sensible database application on the 370 would use CKD search instructions for performance. Fortunately, IBM will today still sell you a mainframe that is instruction-set compatible with the 360, and disk arrays that can still execute CKD searches. And you can still get an OS that somewhat resembles OS/VSE and VM. So for the next few years, a few million dollars and several months of hard work would recover the information. Or you could read 50,000 lines of IBM assembly code to determine what the exact data format really is, and write a converter. Enjoy.

The second case is even worse. Most likely the old CP/M hardware (even if you have managed to preserve it) will no longer boot, because the EPROMs and boot floppies have decayed. You can no longer buy a legal copy of CP/M or dBase. Running an illegal copy on a CP/M emulator on a modern computer won't work, because the program requires custom-built hardware sensors and actuators (I carefully constructed the problem to maximally inconvenience you). Finding dBase manuals today to decode what the dBase code was doing and understand the data format will be very time-consuming.

What I'm trying to say: the problem of preserving the bit pattern of the raw data is the absolute least of the issues. It can be solved trivially: write the data to good-quality rewritable CDs, make a few copies of each, every few years read all of them back, and write them to newly current media. Done. The real problem is documenting the semantics of the bits.

The easy way out is to preserve the complete computing environment used to read the data (including all hardware, documentation, and wetware required to operate it). This is hard, because hardware, paperware and wetware don't preserve very well. The second-best way is to convert the data to formats that are well documented (say plain text files, or formats that are enshrined in standards, like ISO 9660 or JPEG), and also preserve a human-readable description of that format in a universally readable way (for example, enclose a copy of the standard that defines ISO 9660 in an ASCII text file with the data).

I'm not saying that preserving the raw bits should be abandoned. It is absolutely the most important step; if you fail at that, all other problems become irrelevant. But please don't believe that it solves the problem. The long-term preservation of data is a huge research topic; please read the abundant literature on it to get a flavor of the difficulty.

The real issue you need to think about is this: how valuable is this data, really? How valuable will it be in 20 years? What is the expected cost of recovering it in 20 years (above I budgeted millions of dollars for buying the hardware for reading CKD data)? How much do you need to invest now to minimize the expected recovery cost in 20 years? Is your CEO cool with investing this much money now, given that in 20 years he will no longer be the CEO? Will it be economically viable to use the old data in 20 years? Wouldn't it be easier to print it all on acid-free paper now, store it in a mine or an old railroad tunnel, and scan the printout in 20 years?

As an example: I used to be an astrophysicist. I happen to have at home the original data tape of the 8 neutrinos from the 1987 supernova that hit the particle detector in the northern US. The tape is 6250 bpi open reel, with a highly complex data format on it; fortunately, the data format was described on paper in people's PhD theses, but finding the old decoding software and getting it to run would be very hard (anyone got a VAX with VMS 4?). Reading the tape and decoding the data would take several months of work. At this point, the tape has only emotional value.

--
The address in the header is invalid for obvious reasons. Please reconstruct the address from the information below (look for _).
Ralph Becker-Szendy
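A minimal sketch of the "human-readable description alongside the bits" advice above, assuming one plain-ASCII README per archive directory; every field name and value below is illustrative:

    from pathlib import Path

    README_TEMPLATE = """\
    ARCHIVE DESCRIPTION (plain ASCII on purpose: readable without special software)

    Created:     {created}
    Contents:    {contents}
    File format: {file_format}
    Format spec: {format_spec}
    Checksums:   SHA-256 digests in MANIFEST.sha256 ('<digest>  <path>' per line)
    Care plan:   read back and verify every 2 years; migrate media every 10
    """

    def write_readme(directory, **fields):
        """Drop a self-describing README next to the archived files."""
        Path(directory, "README.txt").write_text(
            README_TEMPLATE.format(**fields), encoding="ascii")

    # Illustrative values only, not a real archive:
    write_readme(
        "/archive/masters/audio",
        created="2005-03-29",
        contents="Preservation masters: oral-history interviews",
        file_format="Broadcast Wave (BWF), 24-bit/96 kHz PCM",
        format_spec="EBU Tech 3285; a copy of the spec is stored alongside",
    )

The point, per the post above, is that the description survives in the most durable encoding available: plain text a human can read without any of today's software.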