Sunday, June 10, 2012

Going Gentle Into That Good Night

My master's thesis was published in 1983. The digital version of the text was written using WordStar, a word processing program that ran on a microcomputer with a Motorola 6800 processor under the CP/M operating system, and was stored on eight-inch floppy disks. The source code for the software I was writing about was stored on eight-inch floppies in RT-11 format and ran on a Digital Equipment Corporation PDP-11 running the RSX-11M operating system. I'm pretty sure none of that exists now.

If you absolutely had to recover that digital information, it would require tens or even hundreds of thousands of dollars, in terms of hardware, software, and time, if it could be done at all. You'd have to find eight-inch floppy drives, devise a way to hook them up to some modern computer, track down the description of the CP/M and RT-11 file system formats and the WordStar and RSX-11M file formats, implement both their documented behavior and the undocumented behavior, pray that the floppies were still readable nearly thirty years later, and hope that I didn't use some compression and/or encryption algorithm that was in vogue at the time.

The effort would be made somewhat more complicated by the fact that I threw away those eight-inch floppy disks years ago. But I do have a hardcopy of my thesis, which includes the text and the complete source listings, printed on acid free paper, and hardbound. If this tome were to survive fires, floods, and the inevitable collapse of civilization, it would likely still be readable a thousand years from now. If anyone cared. Which is highly unlikely.

So it is with preservation of digital information. While I can actually read an original book by Galileo, and make out the icons on a shard of four thousand year old Greek pottery, digital data are ephemeral. My enormous collection of motion picture and television soundtrack music on compact disk, about two thousand of them, has a shelf life. Estimates range from 100 years to as few as two years. And what are my chances of buying a new CD player twenty years from now? Let's ask the people who invested in eight-track tapes, cassette tapes, or any number of other media technologies that have gone by the wayside. It is a similar story for DVDs, nine-track magnetic tape, flash memory, and anything else that stores bits magnetically, optically, or electronically.

What's worse, it's not just the physical bits, it's what the bits mean: encryption, compression, application file format, operating system file system format, all have something to do with how the ones and zeros are interpreted. So the problem just gets exponentially complicated, because information about all that stuff is stored digitally too. How much of it do you need to read the digital copy of my thesis? All of it. 
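To make the point about interpretation concrete, here is a minimal Python sketch: the same five bytes read under two different character encodings. The byte values and the choice of EBCDIC code page 037 (Python's `cp037` codec) are assumptions made for illustration only; they are not taken from any of the files described above.

```python
# The same five bytes, interpreted under two different conventions.
raw = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])

# Read as EBCDIC (code page 037, an assumption for this illustration),
# the bytes spell a word.
as_ebcdic = raw.decode("cp037")

# Read as Latin-1, the very same bytes decode without error, but to
# an accented letter followed by control characters.
as_latin1 = raw.decode("latin-1")

print(repr(as_ebcdic))
print(repr(as_latin1))
```

Neither decoding raises an error. Nothing in the bits themselves says which reading is the right one; that knowledge has to be preserved somewhere else.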

The general consensus is that there are two approaches to solving the long-term digital preservation problem: migration and emulation.

Migration is the act of more or less continually copying all archived digital information to the latest and greatest formats and technologies. This can work. Sorta. I once worked at a place that had to purchase every single used tape drive of the discontinued model they were using in order to keep enough working for long enough so that they could move their enormous tape archive containing the sole copies of historical climate data and archived output from climate simulations to a format and technology that might buy them another decade before panic set in again. When you do this you hope you can get the migration completed before the new technology is obsolete.
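A single migration step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how that tape archive was actually moved: the helper names `sha256_of` and `migrate` are hypothetical, the use of SHA-256 as the integrity check is an assumption, and temporary files stand in for the old and new media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, destination: Path) -> str:
    """Copy source to destination and verify that the bits survived."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"migration corrupted {source}")
    return after


# Demonstration with throwaway files standing in for old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old_media.dat"
    new = Path(tmp) / "new_media.dat"
    old.write_bytes(b"historical climate data")
    checksum = migrate(old, new)
    migrated = new.read_bytes()
```

A matching checksum only shows that the bits arrived intact. It says nothing about whether anyone will still be able to interpret them.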

Emulation is building a system using new technology that acts like old technology. This is non-trivial, for the reasons stated above: the documentation you need, if it exists at all, is also stored digitally. And there's the matter of the physical media and the devices to read it. This approach is the reason I was once tasked with writing UNIX code that grokked files stored in a variety of IBM VM/370 file formats. My software, which I designed by reverse engineering the original VM/370 system software written in IBM assembler, was used to access hundreds of thousands of files. Successfully, I'm told.

Barry Karafin, formerly the head of Bell Labs, once observed that most high technologies have a half-life of about five years. Some technologies, like C and TCP/IP, have done better. But for most of it, there is no long term. The need to migrate or emulate is, like adaptive maintenance, a continuous process that never ends.

The definitive work on this topic in my opinion is

Jeff Rothenberg, "Ensuring the Longevity of Digital Documents", Scientific American, vol. 272, no. 1, January 1995

a digital copy of which, so far, can be found here, provided it's still readable and you have software that understands Adobe's Portable Document Format. Just today I read another article on this topic written more recently

David Anderson, "Historical Reflections: The Future of the Past", Communications of the ACM, vol. 55, no. 5, May 2012

that I also recommend, and which, at least for now, can be found here. If the links are broken, or your browser can't render the text, well, then, welcome to the information age.

This is serious stuff. I'm not kidding. The value of information often can only be ascertained in hindsight once its historical context is known. We are contemporaneously the worst judges of the value of the data that we produce. We can still read the original letters John Adams wrote to his wife. But the chances that two hundred years from now anything that any of us may have written, important or not, will still exist and be readable are vanishingly slim.
