Startup Problems When You Boot the System

There are many things that can go wrong and keep a Unix system from successfully booting. The most common problems are the following:

Hardware problems

Hardware problems are a fact of life, and there is not much you can do to circumvent them. That said, it is always in your best interest to examine the symptoms rationally and document exactly what is (or isn't) happening. Most maintenance services carry a limited number of spare parts, so you can greatly improve your service time by doing some simple diagnosis first. In addition, many groups have maintenance contracts with different vendors for different products, and calling your system vendor when the problem is with a different vendor's disk drive doesn't help. The goal of diagnosis should be to eliminate components known to be working and isolate the problem to one area. Things to examine when booting:

Defective boot media

Most modern workstations now boot initially from ROM or PROM; however, some machines still boot from floppy or tape. If yours does, insist on having a backup copy of the boot media. Most systems that support this method of booting have utilities for creating duplicate copies of the boot media. Insist that new backup copies are made whenever the boot media changes. When you have down time, test your backup copies and verify that they work!
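As a first sanity check on a duplicate, you can compare a checksum of the original media image against the copy. The sketch below is a minimal illustration in Python and rests on several assumptions: the boot media can be read as an ordinary file or device node, and the names sha256_of, copies_match, boot.img and boot-backup.img are hypothetical, not part of any vendor utility. A matching checksum only shows that the copy is byte-identical; it does not replace actually booting from the backup.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, backup):
    """True when both files have identical contents (by SHA-256 digest)."""
    return sha256_of(original) == sha256_of(backup)


# Demonstration with scratch files standing in for media images (hypothetical names).
with tempfile.TemporaryDirectory() as scratch:
    original = Path(scratch) / "boot.img"
    backup = Path(scratch) / "boot-backup.img"
    original.write_bytes(b"boot blocks" * 1000)

    backup.write_bytes(original.read_bytes())
    good_copy = copies_match(original, backup)

    backup.write_bytes(b"damaged" + original.read_bytes()[7:])
    bad_copy = copies_match(original, backup)

print(good_copy, bad_copy)
```

The file is read in chunks so that a large image does not have to fit in memory.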

With ROM- or PROM-based systems, sit down and read the manual so that you understand what the ROM monitor is capable of doing. Many of these monitors have functions for performing hardware diagnostics.

Damaged filesystems

This has historically been the most common type of problem. However, the reliability of modern disk drives has improved dramatically over the last few years, and many disk manufacturers now give a five-year warranty on SCSI disk drives. Still, the mechanical nature of disk drives makes them more susceptible to failure than components with no moving parts.

Under Unix, the fsck command performs a file system consistency check. The manual page for fsck describes the options and functions of the command. The fsck command checks the following:

  1. Inode block addressing checks: Too many direct or indirect extents, extents out of order, bad magic number in extents, blocks that are not in a legal data area of the file system, blocks that are claimed by more than one inode.
  2. Size checks: Number of blocks claimed by inode inconsistent with inode size, directory size not block aligned.
  3. Directory checks: Illegal number of entries in a directory block, bad freespace pointer in directory block, entry pointing to unallocated or out-of-range inode, overlapping entries, missing or incorrect dot and dotdot entries.
  4. Pathname checks: Files or directories not referenced by a pathname starting from the file system root.
  5. Link count checks: Link counts that do not agree with the number of directory references to the inode.
  6. Freemap checks: Blocks claimed free by the freemap but also claimed by an inode, blocks unclaimed by any inode but not appearing in the freemap.
  7. Super Block checks: Total free block and/or free inode count incorrect.

Orphaned files and directories (allocated but unreferenced) are, with the user's concurrence, reconnected by placing them in the lost+found directory if they are nonempty. The user is told whether the file or directory is empty or not. Empty files and directories are removed, as long as the -n option is not specified. fsck will force the reconnection of nonempty directories. The name assigned is the inode number. The directory lost+found must already exist in the root of the file system being checked and must have empty slots in which entries can be made. This directory is always created by mkfs(1m) when a file system is first created.
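To make the link count check and the idea of an orphan concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the file system is represented by two plain dictionaries rather than on-disk structures, the function name check_link_counts and the sample inode numbers are made up for illustration, and it does not describe how fsck is actually implemented. It also treats "no directory entry points here" as the orphan test, which is simpler than the real pathname check that walks down from the file system root.

```python
from collections import Counter


def check_link_counts(inodes, directories):
    """Compare stored link counts with actual directory references.

    inodes:      {inode_number: stored_link_count} for every allocated inode
    directories: {directory_inode: {entry_name: target_inode}}
    Returns (mismatches, orphans):
      mismatches: {inode_number: (stored_count, found_count)}
      orphans:    allocated inodes that no directory entry refers to
    """
    # Count how many directory entries point at each inode.
    references = Counter()
    for entries in directories.values():
        for target in entries.values():
            references[target] += 1

    mismatches = {}
    orphans = []
    for number, stored in sorted(inodes.items()):
        found = references[number]
        if found == 0:
            orphans.append(number)          # allocated but unreferenced
        elif found != stored:
            mismatches[number] = (stored, found)
    return mismatches, orphans


# Toy file system: inode 2 is the root directory, inode 3 a subdirectory.
inodes = {2: 3, 3: 2, 4: 1, 5: 2, 6: 1}
directories = {
    2: {".": 2, "..": 2, "etc": 3, "notes": 5},
    3: {".": 3, "..": 2, "passwd": 4},
}

mismatches, orphans = check_link_counts(inodes, directories)
print(mismatches)  # {5: (2, 1)}
print(orphans)     # [6]
```

In this model inode 5 claims two links but only one directory entry refers to it, and inode 6 is allocated with no references at all. Following the naming rule described above, an orphan such as inode 6 would be reconnected in lost+found under the name 6.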

The fsck command can usually recover from minor errors. There may be some file system corruption, but the file system will usually be usable.

For file systems other than root, if fsck cannot repair your disk or file system, you can generally boot in single-user mode and recover the file system from archival backups. To do this, use whatever restore command belongs to the backup software you used. For new system administrators it is a very good idea to practice this during test time, preferably on a new disk without any critical data. It is much better to become comfortable with file system restoration when it is not a crisis situation.

The root file system is obviously the most critical file system; without it you cannot boot the machine. If the root file system is damaged, the fsck command itself cannot be loaded to correct it, so other measures are needed. One option, available on many workstations, is to boot from a CD-ROM or some alternate device (such as tape) and then use the utilities available there to recover your file system. It is critical that, as a system manager, you understand what utilities are available on the alternate device and use one of those utilities to back up your root file system. For example, on SGI systems we can boot from CD-ROM and have access to the utility bru. Because of this, we use the bru command to back up our root and usr file systems. If a problem occurs, we can recover from backup tapes. If you don't have a means of booting from alternate media, it is probably a very good idea to create an alternate root partition on a different disk. Then, if your primary root disk is corrupted, you can boot from the alternate drive and restore the true root partition from tape. As with all of this, it is a very good idea for system administrators to try this ahead of time.

Improperly configured kernels

We will spend more time discussing kernels later in the semester. However, it is possible to build a kernel that won't boot. For example, we worked with SGI to build a kernel containing a very large number of streams buffers. When we went to load the kernel, we found that it was too large and conflicted with some limitations imposed by SGI's SASH utility. The key thing when building a kernel is to have a consistent mechanism for naming the new and old kernels. If a new kernel won't boot, it is often impossible at that point to go back and see what the previous kernel was named. Given a kernel that won't boot, each OS has some means of specifying an alternate kernel file to load.
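One way to get a consistent naming mechanism is to let a small script do the install, so the previous kernel always ends up under the same predictable name. The Python sketch below is one possible convention, not a vendor procedure: the function install_kernel, the file names unix and unix.new, and the .old suffix are all assumptions chosen for illustration, and the demonstration runs in a scratch directory so that no real kernel is touched.

```python
import shutil
import tempfile
from pathlib import Path


def install_kernel(new_kernel: Path, active_kernel: Path) -> Path:
    """Install new_kernel as active_kernel, keeping the previous one as <name>.old.

    Returns the path of the fallback kernel, which is the name to give the
    boot loader if the new kernel will not boot.
    """
    previous = active_kernel.with_name(active_kernel.name + ".old")
    if active_kernel.exists():
        # Copy instead of rename so the active kernel is never missing.
        shutil.copy2(active_kernel, previous)
    shutil.copy2(new_kernel, active_kernel)
    return previous


# Demonstration in a scratch directory with stand-in files (hypothetical names).
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)
    (root / "unix").write_text("old kernel")
    (root / "unix.new").write_text("new kernel")

    fallback = install_kernel(root / "unix.new", root / "unix")
    active_contents = (root / "unix").read_text()
    fallback_contents = fallback.read_text()

print(fallback.name, active_contents, fallback_contents, sep=" | ")
# unix.old | new kernel | old kernel
```

Because the fallback name never changes, you know what to type at the boot prompt without having to look at the disk first. Note that each install overwrites the earlier .old copy; a site that wants more history could add a date to the name instead.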

Errors in the system startup scripts

Errors in the startup scripts (either /etc/rc or the scripts in /etc/init.d) are usually not fatal. Sometimes the error is minor and can be fixed after the machine has come up by rerunning a set of commands as root. If the problem is more severe, you will have to boot the system in single-user mode and fix the problem by editing the startup scripts. Remember that usually only the vi or ed editors are available in single-user mode, so make sure you have a basic understanding of those editors.

Jack Suess/jack@umbc.edu