Open Computing ``Hands-On'': ``PC-Unix Connection'' Column: February 94

Nurturing Paranoia

Trusting your system makes sense when you have a disaster-recovery plan to back you up.

By Tom Yager

A long-distance friend of mine doesn't share my affection for technology. In a recent conversation, she laid it out plain: She hates computers. When I asked her why, she said, ``I don't trust them.'' I was about to reprimand her for her backward attitude when an odd thought hit me: What reason should I have for trusting the nasty beasts? After all, I've got a big box in my garage that holds the dusty remains of past unpleasant experiences: hard drives, circuit cards, modems. In my 15-year experience, I have had at least one of every class of hardware fail me at the least convenient moment.

It's my belief that every responsible system administrator, whether you operate only the system on your desk or a network of thousands, should always be thinking about disaster. Of all the idle thoughts that roam your mind during periods of calm, ``what would happen if this or that went wrong'' is among the most productive. You can shape such postulation into the heart of a viable disaster plan.

You can't be everywhere at once, so you need to do a little triage on your list of fantasy calamities. I recommend prioritizing your list according to three criteria: those most likely to occur, those that could destroy the most data, and those that would take the most time to repair. I'll use a simple example of each of these to illustrate how disaster strategies work; however, this list is by no means complete.
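One rough way to picture that triage is to score each imagined calamity on the three criteria and sort by the total. The sketch below is only an illustration: the calamity names, the 1-to-5 scores, and the idea of simply adding the scores are all assumptions, not a method the column prescribes.

```python
# Hypothetical calamities, scored 1 (low) to 5 (high) on each criterion.
# Names and scores are invented for illustration.
calamities = {
    "power failure": {"likelihood": 5, "data_loss": 2, "repair_time": 1},
    "disk crash": {"likelihood": 2, "data_loss": 5, "repair_time": 4},
    "root file system corruption": {"likelihood": 1, "data_loss": 3, "repair_time": 5},
}


def triage(items):
    """Return calamity names sorted by total score, highest first."""
    return sorted(items, key=lambda name: sum(items[name].values()), reverse=True)


ranking = triage(calamities)
for name in ranking:
    print(name, sum(calamities[name].values()))
```

A weighted sum, or sorting on one criterion at a time, would work just as well; the point is only that the list gets an explicit order.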

Gimme Power

In many parts of the country, power failures (fluctuations or complete outages) occur several times a year. Whether the fault lies with nature, the power company, or the yutz who keeps plugging the coffee-maker into the same circuit as your system, computers are not equipped to ride out power problems. That you should have a surge protector on your computer goes without saying. What too many administrators overlook is an uninterruptible power supply (UPS). Five years ago, when UPS units were noisy, hot, and expensive, there was ample reason to leave your system at the mercy of its AC jack. Now, with a 600VA UPS going for about $300 and running silent and cool, no reasonable excuse remains. The technology has advanced, too: many affordable UPSes now have the smarts to tell the system they're protecting when battery juice is about to run out.

Whether you have your system's power protected by an intelligent or dumb UPS, be sure you match the unit to the job. First, consider the power-failure patterns for your area. When the lights go out, do they generally stay out for long periods (five minutes or more) or come back within a few seconds? Make a list of essential components, those devices you feel must be preserved during a power outage. Keep the list small: the fewer the devices, the smaller the UPS. I chose to keep my primary system, the console monitor, and three Telebit modems power protected. I then selected a rating of UPS that could power a load of that size for 5 to 10 minutes. Computer stores selling UPSes should have a selection guide. You may also find a quick reference on the back of the UPS box.
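The load-matching step can be sketched as simple arithmetic: add up the draw of the essential devices and compare it with the UPS rating. Everything specific here is an assumption: the per-device VA figures are invented (read the real ones off the nameplates), and the 80 percent headroom is an arbitrary safety margin. The sketch says nothing about run time, which depends on the battery and still has to come from the vendor's selection guide.

```python
# Hypothetical VA draw for each protected device; real figures come from nameplates.
essential_load_va = {
    "primary system": 250,
    "console monitor": 100,
    "modem 1": 15,
    "modem 2": 15,
    "modem 3": 15,
}


def ups_is_big_enough(load_va, ups_rating_va, headroom=0.8):
    """True if the total load stays within `headroom` (a fraction) of the UPS rating."""
    return sum(load_va.values()) <= ups_rating_va * headroom


total_va = sum(essential_load_va.values())
print(total_va, ups_is_big_enough(essential_load_va, 600))
```

Dropping a device from the dictionary shows directly how a shorter list allows a smaller UPS.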

It's important to test your UPS under full load. Plug the UPS in overnight, or long enough for a complete battery charge. The next day, install the intelligent power-failure software, if you have it, and bring your system down. Insert an alternate boot floppy (DOS is good enough for this) and power up the system with all other protected components. Once all the drives are spun up, either yank the power cord or push the ``test'' button on the UPS. Your UPS's battery should take over, and there should be no visible change in activity in the battery-powered equipment. If the UPS is too wimpy for the load you have attached to it, it will probably shut down immediately or give you only a few seconds' protection.

If it survives this test, move on to a test of the intelligent power-failure software. Boot Unix and make sure the software is installed and running correctly. Yank the plug and watch. The UPS's alarm should make a more insistent racket when the battery runs low, and at about that time, it should signal your computer that doom is nigh. The power-management software should kick in, with the console showing signs of shutdown. If the battery gives out before shutdown completes, you either need a bigger UPS or you need to change the notification period. On the APC unit I use, a switch determines whether the system gets warned at two or five minutes before battery failure.
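Choosing between the two warning settings comes down to comparing the warning period with how long shutdown actually takes. The following is a minimal sketch under stated assumptions: the shutdown time is something you measure yourself, the function name is made up, and the 1.5x safety margin is arbitrary. Only the two- and five-minute settings come from the text.

```python
WARNING_SETTINGS_MIN = (2, 5)  # the two warning settings mentioned for the APC unit


def pick_warning(shutdown_seconds, settings_min=WARNING_SETTINGS_MIN, margin=1.5):
    """Return the shortest warning (in minutes) that covers the measured
    shutdown time multiplied by `margin`, or None if no setting is long enough."""
    needed = shutdown_seconds * margin
    for minutes in sorted(settings_min):
        if minutes * 60 >= needed:
            return minutes
    return None  # neither setting suffices; shutdown itself must get faster


print(pick_warning(60))   # one-minute shutdown
print(pick_warning(150))  # two-and-a-half-minute shutdown
```

A result of None means neither switch position leaves enough time, which is the case where the remaining options are a bigger UPS or a lighter load.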

Unhappy Campers

If you serve groups of users, you're already accustomed to having your phone ring off the hook every time the system goes down. ``I was in the middle of something,'' they'll cry. ``Did I lose my work?'' Grumble, as you've a right to, that users never save their work as often as they should. But be understanding, because data loss is every user's worst nightmare.

Even with a UPS, systems can crash from operating-system and device-driver bugs, errant programs that suck up too much memory or disk, and configuration problems. My old Maxx, running The Santa Cruz Operation Inc.'s Unix, used to take periodic kernel panics as a sort of catharsis. The trouble with that setup was that the standard System V file system would just blow out whatever hadn't been written to disk. Now the default vxfs file system in USG's Unixware, licensed from Veritas, uses journaling. This practice keeps pending file-system changes in a reserved area on disk, so that after a system crash the system need only replay the journal to bring the file system, pending changes included, up to date.
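The journaling idea can be shown with a toy model. This is emphatically not how vxfs is implemented; it is a sketch in which a dictionary stands in for the file system, a list stands in for the reserved journal area, and the class and method names are invented.

```python
class ToyJournaledStore:
    """A toy key-value 'file system' with a journal; not the vxfs design."""

    def __init__(self):
        self.disk = {}     # changes that have reached their final place
        self.journal = []  # pending changes; stands in for the reserved disk area

    def write(self, name, data):
        # Log the change first; it is applied to `disk` later.
        self.journal.append((name, data))

    def flush(self):
        # Apply every logged change in order, then clear the journal.
        for name, data in self.journal:
            self.disk[name] = data
        self.journal.clear()

    def recover(self):
        # After a crash, replaying the journal brings `disk` up to date.
        self.flush()


store = ToyJournaledStore()
store.write("report.txt", "draft 1")
store.flush()
store.write("report.txt", "draft 2")  # logged but not yet applied
# ... imagine the crash happening here ...
store.recover()
print(store.disk["report.txt"])
```

In the toy, the pending second write survives because it was logged before the imagined crash, which is the property the journal is there to provide.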

While vxfs offers one type of protection, you can keep your system safe through other means. Mirroring, whether automatic or manual, keeps two copies of vital data constantly online. Automatic mirroring will write all data sent to mirrored file systems twice: once to the primary (mounted) file system and once to the unmounted copy. If the mounted drive fails, the most you'll have to do is remove the dead drive and possibly change the SCSI ID of the mirror to take its place.

You can create your own mirroring-like scheme in a number of ways. You can add a cron-table entry that uses dd to copy a crucial disk to an identical alternative during off-hours. If you don't want to tie up a full-sized drive, you might have a half-sized spare to which you can use cpio to make periodic incremental copies of files that have changed since the last complete tape or disk backup. The cpio method is less draining of system resources, and it is easier to run at a lowered priority with nice during busy hours.
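The incremental idea (copy only what has changed since the last full backup) can be sketched in Python rather than with cpio. This is a stand-in for the technique, not the cpio command itself; the function name and arguments are hypothetical, and the sketch assumes the spare directory lives outside the source tree.

```python
import os
import shutil


def copy_changed_files(source_dir, spare_dir, since):
    """Copy files under source_dir modified after `since` (epoch seconds)
    into spare_dir, keeping the directory layout. Returns the sorted
    relative paths that were copied."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            if os.path.getmtime(src) > since:
                rel = os.path.relpath(src, source_dir)
                dst = os.path.join(spare_dir, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copy2 also preserves modification times
                copied.append(rel)
    return sorted(copied)
```

Passing the timestamp of the last complete backup as `since` gives the behavior described above. On Unix, calling `os.nice(10)` at the start of such a script lowers its priority, much as running a command under nice would.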

Remember to consider all the resources in your network when you make your disaster plan. If you don't have spare disk space in your local system, no problem; find a box or combination of boxes on your network that does have the space. Copying over the network is slower than copying to a local drive, but it offers the same degree of protection. If disk space is at a premium, automatic incremental backups to tape are a viable alternative.

Build It Once

The best protection starts the day you install your system. Before you do anything else, make copies of all your system's boot diskettes. DOS diskcopy may work in case you lack another Unix box on which to run dd. These boot diskettes will be your lifeline if anything happens to your root file system. As soon as you have done the bulk of the installation, including the installation of optional packages, make a backup. Each time you make significant changes to the system's configuration and have proven it to be stable, make a backup. The reason is simple: It's always quicker to reload your system from tape than to reinstall it from scratch. Tape outstrips even CD-ROM for restore speed.

When you create these safety backups, use cpio or some other tool that archives device nodes as well as regular files. If your system dies on you, reinstall only enough of the operating system to get the tape device working. Then do a complete restore from tape, remembering to specify the overwrite parameter to your restore tool. Otherwise, the kernel, along with files like /etc/passwd and /usr/lib/uucp/Systems, won't reflect your post-installation changes.

Paranoid system administrators frequently draw chuckles from their more laid-back colleagues. But the sysadmin who invests the time to create a workable disaster plan will reap the rewards of that effort. Those rewards, in the shape of greater up-time ratios, less lost data, and quicker recoveries, bring direct bottom-line benefits that far outstrip the costs involved. Keep your plans fluid enough to adapt to changes in technology. Consider, for example, that manufacturers will soon release multigigabyte hard drives that cost about 50 cents a megabyte. More affordable multisession writable CD-ROMs and other removable random-access media will break new ground for those maintaining safe systems. Keep your eyes peeled.


Copyright © 1995 The McGraw-Hill Companies, Inc. All Rights Reserved.
Edited by Becca Thomas / Online Editor / UnixWorld Online / beccat@wcmh.com

Last Modified: Tuesday, 22-Aug-95 15:49:03 PDT